The Challenge
How do you evaluate a large language model when static benchmarks become stale, contaminated, or simply can't keep up with real-world deployment conditions? Traditional evaluation relies on human-authored tasks, reference answers, and human judgments: approaches that scale poorly, quickly become outdated, and are poorly matched to open-world deployments that depend on web retrieval and synthesis.
Why This Matters to Caura
At Caura, we believe organizations shouldn't be locked into a single AI provider. Our LLM Router enables intelligent model selection across GPT, Claude, Gemini, and emerging models—routing each task to the optimal model based on quality, speed, and cost. But intelligent routing requires intelligent evaluation: how do we know which model performs best for which task, especially as models evolve weekly and benchmarks lag months behind?
This question led us to PeerRank.
Introducing PeerRank
We introduce PeerRank, a fully autonomous end-to-end evaluation framework in which models generate evaluation tasks, answer them with category-scoped live web grounding, judge peer responses, and aggregate dense peer assessments into relative performance estimates—all without human supervision or gold references.
The Key Insight: Let Models Evaluate Each Other
PeerRank treats evaluation as a multi-agent process in which each model participates symmetrically as task designer, respondent, and evaluator, with systematic controls to detect and mitigate biased judgments. Instead of a single privileged judge or human oracle, PeerRank distributes judging across all participating models and explicitly measures the biases that emerge.
How PeerRank Works
The PeerRank pipeline operates in four distinct phases, each designed to maintain fairness and eliminate confounds:
Phase 1: Endogenous Question Generation
Each model independently generates 35 questions across five categories: factual knowledge, reasoning/logic, current events, creative/open-ended, and practical how-to. Questions are used as-generated without filtering, deduplication, or human editing—ensuring the task distribution is defined endogenously by participating models.
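A minimal sketch of how Phase 1 could be orchestrated is shown below. The even 7-per-category split, the `client.generate` call, and the prompt wording are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch of Phase 1: each model authors its own share of the evaluation set.
# Assumption: 35 questions split evenly across the five categories (7 each).
CATEGORIES = {
    "factual": 7,
    "reasoning": 7,
    "current_events": 7,
    "creative": 7,
    "practical": 7,
}

def generate_questions(model, client):
    """Ask one model to write its questions; outputs are used as-generated."""
    questions = []
    for category, n in CATEGORIES.items():
        prompt = (
            f"Write {n} standalone evaluation questions in the "
            f"'{category}' category. Return one question per line."
        )
        reply = client.generate(model=model, prompt=prompt)  # hypothetical API
        # No filtering, deduplication, or human editing is applied.
        questions += [
            {"model": model, "category": category, "text": line.strip()}
            for line in reply.splitlines() if line.strip()
        ]
    return questions
```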
Phase 2: Web-Grounded Answer Generation
All models answer all questions. Web grounding is enabled only for current events questions, isolating the effect of live information retrieval; for consistency, a single external retrieval provider (e.g., Tavily or SerpAPI) is used uniformly across all models, eliminating provider-specific tool differences as a confound.
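The answering loop might look like the following sketch, where `search_web` stands in for the shared retrieval provider and `client.generate` for a provider-agnostic completion call (both are hypothetical names).

```python
# Sketch of Phase 2: every model answers every question; only current-events
# items receive web context, and all models share the same retrieval provider.

def answer_question(model, question, client, search_web):
    context = ""
    if question["category"] == "current_events":
        # Same retrieval provider for all models, so tool quality is not a confound.
        hits = search_web(question["text"], top_k=5)  # hypothetical helper
        context = "\n".join(h["snippet"] for h in hits)
    prompt = (
        (f"Context from web search:\n{context}\n\n" if context else "")
        + f"Question: {question['text']}\nAnswer concisely and accurately."
    )
    return client.generate(model=model, prompt=prompt)
```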
Phase 3: Bias-Controlled Peer Evaluation
Each model evaluates all answers using a standardized 1-10 rubric emphasizing correctness, completeness, clarity, and usefulness. Critically, web grounding is disabled during evaluation—judges score only the submitted answer, not extra evidence retrieved at scoring time.
To control systematic judge effects, evaluations are conducted under three regimes:
Self Bias
Do models rate their own answers higher than peers rate them? We measure this by comparing self-scores to peer-assigned scores.
Name (Identity) Bias
Does knowing which model produced an answer affect scores? We compare scores when identities are visible vs. hidden.
Position Bias
Does the order in which answers appear affect scores? We shuffle answer order and measure the effect.
The shuffle+blind regime (randomized order, hidden identities) serves as the least-confounded baseline for final rankings.
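As an illustration, the shuffle+blind regime could be implemented along these lines. The rubric wording, the one-answer-per-prompt presentation, and the numeric-only reply format are assumptions rather than the paper's exact prompts.

```python
import random

RUBRIC = (
    "Score the answer from 1-10 for correctness, completeness, "
    "clarity, and usefulness. Reply with the number only."
)

def judge_blind_shuffled(judge_model, question, answers, client, rng=random):
    """Shuffle+blind regime: randomized presentation order, authors hidden.

    answers: dict mapping author model -> answer text.
    """
    order = list(answers)
    rng.shuffle(order)  # randomize presentation order (position-bias control)
    scores = {}
    for author in order:
        prompt = (
            f"{RUBRIC}\n\nQuestion: {question['text']}\n"
            f"Answer (author hidden):\n{answers[author]}"
        )
        # Judges see only the submitted answer; no web access at scoring time.
        reply = client.generate(model=judge_model, prompt=prompt)  # hypothetical API
        scores[author] = float(reply.strip())  # simplification: assumes a bare number
    return scores
```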
Results: Stable, Discriminative Rankings
In our large-scale study of 12 commercially available models (including GPT-5.2, Claude Opus 4.5, Gemini 3 Pro, and others), PeerRank produces stable and discriminative rankings:
| Rank | Model | Peer Score (1-10) |
|---|---|---|
| 1 | claude-opus-4-5 | 8.69 |
| 2 | gpt-5.2 | 8.66 |
| 3 | gpt-5-mini | 8.62 |
| 4 | claude-sonnet-4-5 | 8.53 |
| 5 | kimi-k2.5 | 8.50 |
| 6 | gemini-3-pro-preview | 8.37 |
| 7 | deepseek-chat | 8.35 |
| 8 | mistral-large | 8.34 |
| 9 | gemini-3-flash-preview | 8.01 |
| 10 | grok-4-1-fast | 7.66 |
| 11 | sonar-pro | 7.23 |
| 12 | llama-4-maverick | 7.21 |
Rankings show strong agreement with Elo-based aggregation (Pearson r = 0.844, Spearman ρ = 0.755), suggesting findings are stable across aggregation schemes.
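For readers curious how an Elo-style aggregation can be derived from dense peer scores, here is one hedged sketch that treats every scored pair of answers to the same question as a "match". The K-factor, base rating, and pairing scheme are assumptions, not the paper's exact procedure.

```python
from itertools import combinations

def elo_from_peer_scores(score_rows, k=16, base=1000.0):
    """score_rows: iterable of dicts {model: peer_score}, one per judged question.

    Each pair of models within a row is treated as one Elo match, with the
    higher-scored answer winning and equal scores counted as a draw.
    """
    rating = {}
    for row in score_rows:
        for a, b in combinations(row, 2):
            ra = rating.setdefault(a, base)
            rb = rating.setdefault(b, base)
            expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
            outcome_a = 0.5 if row[a] == row[b] else float(row[a] > row[b])
            rating[a] = ra + k * (outcome_a - expected_a)
            rating[b] = rb + k * ((1.0 - outcome_a) - (1.0 - expected_a))
    return rating
```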
External Validation: Peer Scores Track Ground Truth
A critical question for any peer-based evaluation system: do peer scores actually reflect objective correctness, or could a cohort converge on preferences that don't track reality?
We validated PeerRank against two external benchmarks with ground truth:
TruthfulQA Validation
Peer scores correlate strongly with ground-truth accuracy on TruthfulQA multiple-choice questions: Pearson r = 0.904 (p = 0.0004) and Spearman ρ = 0.881. This confirms that blind peer evaluation reliably distinguishes truthful from hallucinated responses.
GSM8K Math Validation
On 611 medium and hard math problems from GSM8K, peer scores correlate with exact-match accuracy (Pearson r = 0.873, p = 0.0002; Spearman ρ = 0.763), indicating that external validity extends beyond factual QA to structured mathematical reasoning.
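Both validations reduce to a model-level correlation between peer scores and external accuracy. A minimal sketch using SciPy follows; the dict-based inputs are illustrative.

```python
from scipy.stats import pearsonr, spearmanr

def external_agreement(peer_scores, accuracy):
    """Correlate model-level peer scores with ground-truth accuracy.

    peer_scores, accuracy: dicts {model_name: value} over the same models.
    Returns (PearsonRResult, SignificanceResult) with statistics and p-values.
    """
    models = sorted(peer_scores)
    x = [peer_scores[m] for m in models]
    y = [accuracy[m] for m in models]
    return pearsonr(x, y), spearmanr(x, y)
```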
Key Finding: Peer Evaluation Outperforms Self-Evaluation
An important ablation study reveals that models cannot reliably judge their own quality. When comparing correlation with TruthfulQA ground truth:
- Peer cross-evaluation: Pearson r = 0.905, Spearman ρ = 0.881
- Self-evaluation: Pearson r = 0.538
The gap of +0.37 in correlation shows that models are substantially better at judging their peers than at judging themselves. This validates the fundamental design choice of PeerRank: distribute evaluation across heterogeneous judges rather than relying on any single evaluator.
Bias as a First-Class Measurement
PeerRank doesn't just produce rankings—it explicitly measures and reports evaluation biases:
- Self bias is typically positive (most models overrate their own answers), but heterogeneous—some models show near-zero or even negative self-preference
- Name bias is non-trivial: more recognizable model identities receive higher scores when visible
- Position bias is measurable: answers shown first receive a +0.39 score lift on average
By treating bias as an explicit measurement target rather than a hidden confounder, PeerRank enables transparent, reproducible evaluation.
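One simple way to estimate the self and position effects from a table of judgment records is sketched below. The record schema and the difference-of-means estimators are assumptions; the paper's exact estimators may differ.

```python
import numpy as np

def bias_report(records):
    """records: list of dicts with keys 'judge', 'author', 'score', and
    'position' (0 = shown first), collected under the shuffle regime and
    including self-judgments."""
    scores = np.array([r["score"] for r in records], dtype=float)
    is_self = np.array([r["judge"] == r["author"] for r in records])
    first = np.array([r["position"] == 0 for r in records])

    # Positive self_bias means judges score their own answers above peers' answers.
    self_bias = scores[is_self].mean() - scores[~is_self].mean()
    # Positive position_bias means answers shown first receive a score lift.
    position_bias = scores[first].mean() - scores[~first].mean()
    return {"self_bias": self_bias, "position_bias": position_bias}
```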
Stronger Models Judge More Harshly
An intriguing finding: judge generosity is negatively correlated with peer performance (Pearson r = -0.755). Higher-ranked models tend to assign lower scores on average, suggesting they apply tighter standards for correctness and rigor when judging peers.
This heterogeneity in judging style underscores the importance of aggregating across diverse judges rather than privileging any single evaluator model.
Implications for LLM Evaluation
PeerRank suggests several implications for how we evaluate AI systems:
- Peer-based evaluation can scale: With proper bias controls, distributed peer evaluation serves as a viable complement—or alternative—to human-anchored benchmarks
- Bias must be measured, not ignored: Naive LLM-as-a-judge pipelines risk systematic distortion from position, identity, and self-preference effects
- Open-world evaluation matters: Web-grounded evaluation introduces realistic variance that static benchmarks suppress—current events questions drove the most disagreement in our study
- No single judge suffices: Aggregation across heterogeneous judges yields stable rankings; relying on a single evaluator model introduces uncontrolled bias
Limitations and Future Work
PeerRank evaluates models relative to the participating population—scores are not absolute and shouldn't be compared across disjoint runs without calibration. Other limitations include:
- Task distribution reflects generator capabilities and may underrepresent certain domains
- API latency reflects server load, not purely model computation
- Judges may weight rubric criteria differently despite standardization
Future work will study sensitivity to prompt design, quantify the impact of different web grounding tools, and analyze how rankings differ between reasoning and non-reasoning models.
Open Source
Code, prompts, and the full dataset are available at github.com/caura-ai/caura-PeerRank. We encourage the community to reproduce, extend, and build upon this work.
Explore PeerRank
PeerRank represents a step toward scalable, bias-aware, open-world LLM evaluation. Read the full paper for detailed methodology, additional results, and complete prompts.
Read the paper on arXiv · View the code on GitHub