The Challenge
How do you evaluate a large language model when static benchmarks become stale, contaminated, or simply can't keep up with real-world deployment conditions? Traditional evaluation relies on human-authored tasks, reference answers, and human judgments: approaches that scale poorly, quickly become outdated, and are poorly matched to open-world deployments that depend on web retrieval and synthesis.
Why This Matters to Caura
At Caura, we believe organizations shouldn't be locked into a single AI provider. Our LLM Router enables intelligent model selection across GPT, Claude, Gemini, and emerging models—routing each task to the optimal model based on quality, speed, and cost. But intelligent routing requires intelligent evaluation: how do we know which model performs best for which task, especially as models evolve weekly and benchmarks lag months behind?
This question led us to PeerRank.
Introducing PeerRank
We introduce PeerRank, a fully autonomous end-to-end evaluation framework in which models generate evaluation tasks, answer them with category-scoped live web grounding, judge peer responses, and aggregate dense peer assessments into relative performance estimates—all without human supervision or gold references.
The Key Insight: Let Models Evaluate Each Other
PeerRank treats evaluation as a multi-agent process in which each model participates symmetrically as task designer, respondent, and evaluator, with systematic controls to detect and mitigate biased judgments. Instead of a single privileged judge or human oracle, PeerRank distributes judging across all participating models and explicitly measures the biases that emerge.
How PeerRank Works
The PeerRank pipeline operates in four distinct phases, each designed to maintain fairness and eliminate confounds:
Phase 1: Endogenous Question Generation
Each model independently generates 35 questions across five categories: factual knowledge, reasoning/logic, current events, creative/open-ended, and practical how-to. Questions are used as-generated without filtering, deduplication, or human editing—ensuring the task distribution is defined endogenously by participating models.
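A minimal sketch of how Phase 1 could be orchestrated is shown below. The even 7-per-category split, the `client.generate` call, and the prompt wording are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch of Phase 1: each model authors its own share of the evaluation set.
# Assumption: 35 questions split evenly across the five categories (7 each).
CATEGORIES = {
    "factual": 7,
    "reasoning": 7,
    "current_events": 7,
    "creative": 7,
    "practical": 7,
}

def generate_questions(model, client):
    """Ask one model to write its questions; outputs are used as-generated."""
    questions = []
    for category, n in CATEGORIES.items():
        prompt = (
            f"Write {n} standalone evaluation questions in the "
            f"'{category}' category. Return one question per line."
        )
        reply = client.generate(model=model, prompt=prompt)  # hypothetical API
        # No filtering, deduplication, or human editing is applied.
        questions += [
            {"model": model, "category": category, "text": line.strip()}
            for line in reply.splitlines() if line.strip()
        ]
    return questions
```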
Phase 2: Web-Grounded Answer Generation
All models answer all questions. Web grounding is enabled only for current events questions, isolating the effect of live information retrieval; for consistency, a single external retrieval provider (e.g., Tavily or SerpAPI) is used uniformly across all models, eliminating provider-specific tool differences as a confound.
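The answering loop might look like the following sketch, where `search_web` stands in for the shared retrieval provider and `client.generate` for a provider-agnostic completion call (both are hypothetical names).

```python
# Sketch of Phase 2: every model answers every question; only current-events
# items receive web context, and all models share the same retrieval provider.

def answer_question(model, question, client, search_web):
    context = ""
    if question["category"] == "current_events":
        # Same retrieval provider for all models, so tool quality is not a confound.
        hits = search_web(question["text"], top_k=5)  # hypothetical helper
        context = "\n".join(h["snippet"] for h in hits)
    prompt = (
        (f"Context from web search:\n{context}\n\n" if context else "")
        + f"Question: {question['text']}\nAnswer concisely and accurately."
    )
    return client.generate(model=model, prompt=prompt)
```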
Phase 3: Bias-Controlled Peer Evaluation
Each model evaluates all answers using a standardized 1-10 rubric emphasizing correctness, completeness, clarity, and usefulness. Critically, web grounding is disabled during evaluation—judges score only the submitted answer, not extra evidence retrieved at scoring time.
To control systematic judge effects, evaluations are conducted under three regimes:
Self Bias
Do models rate their own answers higher than peers rate them? We measure this by comparing self-scores to peer-assigned scores.
Name (Identity) Bias
Does knowing which model produced an answer affect scores? We compare scores when identities are visible vs. hidden.
Position Bias
Does the order in which answers appear affect scores? We shuffle answer order and measure the effect.
The shuffle+blind regime (randomized order, hidden identities) serves as the least-confounded baseline for final rankings.
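As an illustration, the shuffle+blind regime could be implemented along these lines. The rubric wording, the one-answer-per-prompt presentation, and the numeric-only reply format are assumptions rather than the paper's exact prompts.

```python
import random

RUBRIC = (
    "Score the answer from 1-10 for correctness, completeness, "
    "clarity, and usefulness. Reply with the number only."
)

def judge_blind_shuffled(judge_model, question, answers, client, rng=random):
    """Shuffle+blind regime: randomized presentation order, authors hidden.

    answers: dict mapping author model -> answer text.
    """
    order = list(answers)
    rng.shuffle(order)  # randomize presentation order (position-bias control)
    scores = {}
    for author in order:
        prompt = (
            f"{RUBRIC}\n\nQuestion: {question['text']}\n"
            f"Answer (author hidden):\n{answers[author]}"
        )
        # Judges see only the submitted answer; no web access at scoring time.
        reply = client.generate(model=judge_model, prompt=prompt)  # hypothetical API
        scores[author] = float(reply.strip())  # simplification: assumes a bare number
    return scores
```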
Results: Stable, Discriminative Rankings
In our large-scale study of 12 commercially available models (including GPT-5.2, Claude Opus 4.5, Gemini 3 Pro, and others), PeerRank produces stable and discriminative rankings:
| Rank | Model | Peer Score (1-10) |
|---|---|---|
| 1 | claude-opus-4-5 | 8.69 |
| 2 | gpt-5.2 | 8.66 |
| 3 | gpt-5-mini | 8.62 |
| 4 | claude-sonnet-4-5 | 8.53 |
| 5 | kimi-k2.5 | 8.50 |
| 6 | gemini-3-pro-preview | 8.37 |
| 7 | deepseek-chat | 8.35 |
| 8 | mistral-large | 8.34 |
| 9 | gemini-3-flash-preview | 8.01 |
| 10 | grok-4-1-fast | 7.66 |
| 11 | sonar-pro | 7.23 |
| 12 | llama-4-maverick | 7.21 |
Rankings show strong agreement with Elo-based aggregation (Pearson r = 0.844, Spearman ρ = 0.755), suggesting findings are stable across aggregation schemes.
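For readers curious how an Elo-style aggregation can be derived from dense peer scores, here is one hedged sketch that treats every scored pair of answers to the same question as a "match". The K-factor, base rating, and pairing scheme are assumptions, not the paper's exact procedure.

```python
from itertools import combinations

def elo_from_peer_scores(score_rows, k=16, base=1000.0):
    """score_rows: iterable of dicts {model: peer_score}, one per judged question.

    Each pair of models within a row is treated as one Elo match, with the
    higher-scored answer winning and equal scores counted as a draw.
    """
    rating = {}
    for row in score_rows:
        for a, b in combinations(row, 2):
            ra = rating.setdefault(a, base)
            rb = rating.setdefault(b, base)
            expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
            outcome_a = 0.5 if row[a] == row[b] else float(row[a] > row[b])
            rating[a] = ra + k * (outcome_a - expected_a)
            rating[b] = rb + k * ((1.0 - outcome_a) - (1.0 - expected_a))
    return rating
```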
External Validation: Peer Scores Track Ground Truth
A critical question for any peer-based evaluation system: do peer scores actually reflect objective correctness, or could a cohort converge on preferences that don't track reality?
We validated PeerRank against two external benchmarks with ground truth:
TruthfulQA Validation
Peer scores correlate strongly with ground-truth accuracy on TruthfulQA multiple-choice questions: Pearson r = 0.904 (p = 0.0004) and Spearman ρ = 0.881. This confirms that blind peer evaluation reliably distinguishes truthful from hallucinated responses.
GSM8K Math Validation
On 611 medium and hard math problems from GSM8K, peer scores correlate with exact-match accuracy (Pearson r = 0.873, p = 0.0002; Spearman ρ = 0.763), indicating that external validity extends beyond factual QA to structured mathematical reasoning.
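Both validations reduce to a model-level correlation between peer scores and external accuracy. A minimal sketch using SciPy follows; the dict-based inputs are illustrative.

```python
from scipy.stats import pearsonr, spearmanr

def external_agreement(peer_scores, accuracy):
    """Correlate model-level peer scores with ground-truth accuracy.

    peer_scores, accuracy: dicts {model_name: value} over the same models.
    Returns (PearsonRResult, SignificanceResult) with statistics and p-values.
    """
    models = sorted(peer_scores)
    x = [peer_scores[m] for m in models]
    y = [accuracy[m] for m in models]
    return pearsonr(x, y), spearmanr(x, y)
```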
Key Finding: Peer Evaluation Outperforms Self-Evaluation
An important ablation study reveals that models cannot reliably judge their own quality. When comparing correlation with TruthfulQA ground truth:
- Peer cross-evaluation: Pearson r = 0.905, Spearman ρ = 0.881
- Self-evaluation: Pearson r = 0.538
The gap of +0.37 in correlation shows that models are substantially better at judging their peers than at judging themselves. This validates the fundamental design choice of PeerRank: distribute evaluation across heterogeneous judges rather than relying on any single evaluator.
Bias as a First-Class Measurement
PeerRank doesn't just produce rankings—it explicitly measures and reports evaluation biases:
- Self bias is typically positive (most models overrate their own answers), but heterogeneous—some models show near-zero or even negative self-preference
- Name bias is non-trivial: more recognizable model identities receive higher scores when visible
- Position bias is measurable: answers shown first receive a +0.39 score lift on average
By treating bias as an explicit measurement target rather than a hidden confounder, PeerRank enables transparent, reproducible evaluation.
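One simple way to estimate the self and position effects from a table of judgment records is sketched below. The record schema and the difference-of-means estimators are assumptions; the paper's exact estimators may differ.

```python
import numpy as np

def bias_report(records):
    """records: list of dicts with keys 'judge', 'author', 'score', and
    'position' (0 = shown first), collected under the shuffle regime and
    including self-judgments."""
    scores = np.array([r["score"] for r in records], dtype=float)
    is_self = np.array([r["judge"] == r["author"] for r in records])
    first = np.array([r["position"] == 0 for r in records])

    # Positive self_bias means judges score their own answers above peers' answers.
    self_bias = scores[is_self].mean() - scores[~is_self].mean()
    # Positive position_bias means answers shown first receive a score lift.
    position_bias = scores[first].mean() - scores[~first].mean()
    return {"self_bias": self_bias, "position_bias": position_bias}
```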
Stronger Models Judge More Harshly
An intriguing finding: judge generosity is negatively correlated with peer performance (Pearson r = -0.755). Higher-ranked models tend to assign lower scores on average, suggesting they apply tighter standards for correctness and rigor when judging peers.
This heterogeneity in judging style underscores the importance of aggregating across diverse judges rather than privileging any single evaluator model.
Implications for LLM Evaluation
PeerRank suggests several implications for how we evaluate AI systems:
- Peer-based evaluation can scale: With proper bias controls, distributed peer evaluation serves as a viable complement—or alternative—to human-anchored benchmarks
- Bias must be measured, not ignored: Naive LLM-as-a-judge pipelines risk systematic distortion from position, identity, and self-preference effects
- Open-world evaluation matters: Web-grounded evaluation introduces realistic variance that static benchmarks suppress—current events questions drove the most disagreement in our study
- No single judge suffices: Aggregation across heterogeneous judges yields stable rankings; relying on a single evaluator model introduces uncontrolled bias
Limitations and Future Work
PeerRank evaluates models relative to the participating population—scores are not absolute and shouldn't be compared across disjoint runs without calibration. Other limitations include:
- Task distribution reflects generator capabilities and may underrepresent certain domains
- API latency reflects server load, not purely model computation
- Judges may weight rubric criteria differently despite standardization
Future work will study sensitivity to prompt design, quantify the impact of different web grounding tools, and analyze how rankings differ between reasoning and non-reasoning models.
Open Source
Code, prompts, and the full dataset are available at github.com/caura-ai/caura-PeerRank. We encourage the community to reproduce, extend, and build upon this work.
Explore PeerRank
PeerRank represents a step toward scalable, bias-aware, open-world LLM evaluation. Read the full paper for detailed methodology, additional results, and complete prompts.
Read the paper on arXiv · View the code on GitHub