The whole experiment is two equations and one footnote.
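Per-variant overall score, the mean over all $S$ scorer rubrics applied to all $N$ scored items:

$$\text{overall}(v) = \frac{1}{N S} \sum_{i=1}^{N} \sum_{s=1}^{S} \text{score}_s(v, i)$$

where $\text{score}_s(v, i)$ is scorer $s$’s score for variant $v$ on item $i$.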
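Inter-judge agreement, the rank correlation (Spearman ρ) between two LLM judges $j_a$ and $j_b$ scoring the same $n$ items under the same rubric:

$$\rho(j_a, j_b) = \frac{\operatorname{cov}(R_a, R_b)}{\sigma_{R_a}\,\sigma_{R_b}}$$

where $R_a$ and $R_b$ are the ranks of the $n$ items under each judge’s scores, with tied scores sharing their average rank.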
The footnote: we measured ρ for two of the judges in our default rotation and got 0.552. That is below the 0.85 threshold we use for human-anchored calibration (covered in a prior post) and below the 0.70 floor we’d treat as “moderate” agreement. The two judges aren’t measuring the same underlying quality.
That doesn’t make the headline arena score wrong. It makes it conditional. v2-atd 0.585 vs Llama 4 Scout 0.582 is a real measurement of how gemini-3.1-flash-lite-preview ranks the variants on the AskTheDoctor corpus with EXIT 4× compression. The corrected interpretation isn’t “v2-atd is tied with Llama 4 Scout.” It’s “they’re tied under this judge, and we now know we don’t know what they look like under a different one.”
This is the post.
The setup
We wanted one number. The product question we were trying to answer:
For the AskTheDoctor (ATD) medical Q&A corpus, with the production RAG vector group already deployed, which model + retrieval configuration produces the best answer for a typical patient question?
The arena:
- Suite: 200 held-out ATD validation items (`69ec98bcc410b34ce668849a`)
- Three models: v2-atd (our Modal QLoRA Gemma 4 31B fine-tune), Llama 4 Scout, Opus 4.7
- Two RAG configurations: uncompressed retrieval (Arena A) and EXIT 4× compressed retrieval (Arena B), with the prod RAG vector group at top-K=5, min-score=0.62
- Persona prefix attached at the release level (Release A — Fuhrman persona)
- Four scorers at equal weight = 1.0:
  - `llm-correctness`: does the response match the gold facts (legacy, strict)
  - `llm-completeness-coverage`: covers the gold’s key points, length-tolerant (PR #1001)
  - `llm-relevance`: does the response address the question
  - `reference-perplexity:v2-atd-ppl`: forced-decoding PPL of the response under v2-atd, length-normalized (PR #1000)
- Judge for the LLM-judged scorers: `gemini-3.1-flash-lite-preview` (the default; calibration pending at the time)
EXIT 4× was added because v2-atd’s effective context window for our prod-shaped prompts is ~1600 tokens once the persona prefix and chat scaffolding are subtracted. Uncompressed top-5 retrieval blew past it on most items. The hypothesis we wanted to test was whether this was a v2-atd quality problem or a context-fit problem. If EXIT 4× compression brought retrieval inside the window without dropping accuracy, the answer was the latter.
EXIT 4× compression, simplified: keep the top-K retrieved chunks but compress each chunk by a factor of 4 using an extractive transformer (the EXIT model) before injecting into the prompt. Per the prior internal V&V, the expected accuracy delta is ~0pp at the 4× level for retrieval-augmented Q&A.
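The context-fit argument above can be sketched as a budget check. This is a rough illustration, not the production retriever: `Chunk`, `count_tokens`, `select_chunks`, and `fits_budget` are hypothetical names, whitespace splitting stands in for the real tokenizer, the chunk sizes are made up, and compression is modeled as simply dividing the token count by 4 rather than running the EXIT model. It also treats the ~1600-token figure as if the whole window were available for retrieved context, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # retrieval similarity score


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens as a rough stand-in for the model tokenizer.
    return len(text.split())


def select_chunks(chunks, top_k=5, min_score=0.62):
    # Keep chunks at or above the score floor, then take the best top_k.
    kept = [c for c in chunks if c.score >= min_score]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:top_k]


def fits_budget(chunks, budget_tokens=1600, compression=1.0):
    # Estimate the retrieved-context size after (idealized) compression.
    total = sum(count_tokens(c.text) for c in chunks)
    estimated = total / compression
    return estimated <= budget_tokens, estimated


# Illustrative chunk sizes only; not measured from the ATD corpus.
chunks = [
    Chunk("word " * 900, 0.91),
    Chunk("word " * 800, 0.80),
    Chunk("word " * 700, 0.70),
    Chunk("word " * 600, 0.65),
    Chunk("word " * 500, 0.63),
    Chunk("word " * 400, 0.40),  # dropped by the min-score filter
]

selected = select_chunks(chunks)
fits_raw, raw_tokens = fits_budget(selected)
fits_exit, exit_tokens = fits_budget(selected, compression=4.0)
print(f"uncompressed: {raw_tokens:.0f} tokens, fits={fits_raw}")
print(f"4x compressed: {exit_tokens:.0f} tokens, fits={fits_exit}")
```

With these made-up sizes the uncompressed top-5 overshoots the window and the 4× estimate fits, which is the shape of the problem the two arenas were built to separate.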
The headline result (and the per-scorer breakdown)
Arena B — EXIT 4× compressed RAG, error-filtered:
| Variant | n_total | n_errors | n_valid | mean (4-scorer overall) |
|---|---|---|---|---|
| v2-atd (Modal QLoRA Gemma 4 31B) | 212 | 28 | 212 | 0.585 |
| Llama 4 Scout | 203 | 17 | 203 | 0.582 |
Per-scorer breakdown — the actually-interesting result:
| Scorer | v2-atd | Llama 4 Scout | Δ (v2-atd − Llama) |
|---|---|---|---|
| llm-completeness-coverage (PR #1001, length-tolerant) | 0.565 | 0.534 | +0.031 |
| llm-correctness (legacy strict) | 0.443 | 0.458 | -0.015 |
| llm-relevance | 0.635 | 0.635 | 0.000 |
| reference-perplexity:v2-atd-ppl (PR #1000, forced-decoding PPL) | 0.700 | 0.699 | +0.001 |
The story the legacy 3-scorer headline missed: when we measure with the length-tolerant completeness scorer (the one specifically designed to not punish thoroughness), v2-atd wins by +3.1pp on completeness-coverage. The legacy llm-correctness scorer (which is strict and brevity-biased) gives Llama a 1.5pp edge on facts-match. The two nearly cancel in the unweighted overall mean (0.585 vs 0.582), but the underlying signal isn’t “tied, so they’re equivalent.” The signal is “v2-atd is more thorough per-response; Llama is marginally more strictly accurate on individual facts.”
Reference-perplexity is essentially tied (0.700 vs 0.699) — both models produce text in-distribution under v2-atd’s own decoder. Expected: Llama’s outputs aren’t dramatically different from v2-atd’s stylometrically when both are constrained by the same RAG context.
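As a quick arithmetic check on the breakdown above, the unweighted mean can be recomputed from the four per-scorer values in the table. The variable names are illustrative. Because the inputs are already rounded to three decimals, the recomputed means can differ from the reported headline in the third decimal; attributing that to rounding is our assumption.

```python
# Per-scorer means copied from the table above: (v2-atd, Llama 4 Scout).
scores = {
    "llm-completeness-coverage": (0.565, 0.534),
    "llm-correctness": (0.443, 0.458),
    "llm-relevance": (0.635, 0.635),
    "reference-perplexity:v2-atd-ppl": (0.700, 0.699),
}

# Equal weights, so the overall score is the plain average of the four scorers.
v2_mean = sum(a for a, _ in scores.values()) / len(scores)
llama_mean = sum(b for _, b in scores.values()) / len(scores)

# Per-scorer deltas (v2-atd minus Llama), rounded to the table's precision.
deltas = {name: round(a - b, 3) for name, (a, b) in scores.items()}

for name, delta in deltas.items():
    print(f"{name:35s} {delta:+.3f}")
print(f"v2-atd {v2_mean:.4f}  Llama {llama_mean:.4f}  gap {v2_mean - llama_mean:+.4f}")
```

The two large deltas have opposite signs and the other two are near zero, which is why the headline gap is so much smaller than the completeness-coverage gap.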
Compression sensitivity: the EXIT 4× claim, validated on a customer corpus
Arena C measures the same model under both retrieval configurations:
| Variant | Arena A mean (uncompressed) | Arena B mean (EXIT 4×) | Δ |
|---|---|---|---|
| Llama 4 Scout | 0.578 | 0.582 | +0.004 |
Essentially zero. The prior internal “0pp loss at 4×” V&V on synthetic data holds on a real customer corpus. EXIT 4× is doing what it’s supposed to: shrinking the retrieved context to fit downstream models without measurably degrading the answer the next stage produces.
That was the variable that moved v2-atd from “looks like it can’t do RAG” to “looks roughly tied with Llama 4 Scout.” The gap wasn’t the model; it was the prompt budget.
Silent failure modes in LLM-as-judge pipelines
The arena pipeline initially produced a plausible-looking score for a third variant whose responses were not, in fact, real responses — they were upstream API error strings captured into the response field. The test runner counted non-exception calls as “passed” regardless of whether the body was a generated answer or an error blob.
The error strings then scored coherently on the rubrics. The relevance scorer correctly returned ~0 (the blob doesn’t address the question). The reference-perplexity scorer returned ~0.79 on the same content because short well-formed English has low perplexity regardless of whether it answers anything. The averaged headline mean landed in the “weak but plausible” zone instead of the “obvious failure” zone.
The generic lesson: an evaluation pipeline that doesn’t loudly distinguish “the model produced a bad answer” from “the model wasn’t called and the error was captured as if it were an answer” is producing numbers, not measurements. Any LLM-as-judge pipeline that scores response text without first validating that the response is a response is exposed to this class of artifact. The fix is a pre-scoring validator that moves error-pattern responses out of the scored pool entirely and into a failure bucket. Cheap to implement; impossible to skip once you’ve seen what happens without it.
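A minimal sketch of such a pre-scoring validator follows. It is not the filter we shipped: the regular expressions are hypothetical error signatures, and `Partition`, `is_error_response`, and `partition_responses` are illustrative names. A real deployment would tune the patterns to the error shapes its upstream APIs actually emit, and would ideally also check transport-level status rather than relying on text alone.

```python
import re
from dataclasses import dataclass, field

# Hypothetical error signatures (assumptions, not our production list).
ERROR_PATTERNS = [
    re.compile(r'^\s*\{?\s*"?error"?\s*[:=]', re.IGNORECASE),
    re.compile(r"\b(rate limit|quota) exceeded\b", re.IGNORECASE),
    re.compile(r"\bHTTP(?:/\d(?:\.\d)?)?\s+(4\d\d|5\d\d)\b", re.IGNORECASE),
    re.compile(r"\b(internal server error|service unavailable)\b", re.IGNORECASE),
]


@dataclass
class Partition:
    scored: list = field(default_factory=list)  # goes on to the judges
    failed: list = field(default_factory=list)  # counted as errors, never scored


def is_error_response(text):
    # Empty or missing bodies are failures too, not zero-quality answers.
    if text is None or not text.strip():
        return True
    return any(p.search(text) for p in ERROR_PATTERNS)


def partition_responses(responses):
    out = Partition()
    for response in responses:
        bucket = out.failed if is_error_response(response) else out.scored
        bucket.append(response)
    return out


result = partition_responses([
    "Leafy greens are a good source of folate.",
    '{"error": "upstream unavailable"}',
    "HTTP 429 Too Many Requests",
    "",
])
print(len(result.scored), "scored;", len(result.failed), "moved to the failure bucket")
```

The design point is that the failure bucket is a separate count reported next to the mean, so an error blob can never be averaged in as a weak answer.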
The judge calibration finding: ρ = 0.552
After the arena, we cross-checked ourselves. Sampled 30 (response, gold) pairs stratified across low/mid/high prior-score bins. Re-scored each pair with two judges using the exact llm-completeness-coverage prompt:
- Trusted baseline: `gemini-2.5-flash`, the well-calibrated default per our prior judge-calibration work
- Candidate upgrade: `gemini-3.1-pro-preview`, a thinking model with ~282 thoughts tokens/call and meaningful instruction-following capacity for nuanced rubrics
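The stratified draw described above might look like the following sketch. The bin edges (thirds of the score range), the equal 10/10/10 split, and the function name `stratified_sample` are assumptions made for illustration; the post does not specify how the low/mid/high bins were cut.

```python
import random
from collections import defaultdict


def stratified_sample(items, n_total=30, seed=0):
    """items: list of (item_id, prior_score) with prior_score in [0, 1]."""

    def bin_of(score):
        # Assumed bin edges: equal thirds of the score range.
        if score < 1 / 3:
            return "low"
        if score < 2 / 3:
            return "mid"
        return "high"

    bins = defaultdict(list)
    for item_id, score in items:
        bins[bin_of(score)].append(item_id)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    per_bin = n_total // 3
    sample = {}
    for name in ("low", "mid", "high"):
        pool = bins[name]
        # Sample without replacement; take the whole bin if it is too small.
        sample[name] = rng.sample(pool, min(per_bin, len(pool)))
    return sample


# Synthetic prior scores, evenly spread, purely for demonstration.
items = [(i, i / 199) for i in range(200)]
sample = stratified_sample(items)
print({name: len(ids) for name, ids in sample.items()})
```

Stratifying this way keeps the calibration sample from being dominated by whichever score range happens to be most common in the arena output.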
Then we did the inter-judge ρ matrix on a larger 415-item pool:
| | gemini-2.5-flash | gemini-3.1-pro |
|---|---|---|
| gemini-2.5-flash | 1.000 (n=415) | 0.552 (n=413) |
| gemini-3.1-pro | 0.552 (n=413) | 1.000 (n=413) |
ρ = 0.552 between two flagship Gemini judges scoring the same 413 answers under the same rubric.
For comparison, the threshold we use to trust a judge against a human anchor is ρ ≥ 0.85 (prior post on this). The floor we’d accept as “moderate consensus” is around 0.70. Two judges that nominally do the same job, given the same prompts and the same answers, are landing at 0.552.
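For readers who want to reproduce this kind of check, here is a dependency-free sketch of Spearman ρ with average ranks for ties, which matters when judges score on a coarse scale and many items share a value. The function names and the band labels in `agreement_band` are illustrative; the 0.85 and 0.70 cut-offs are the ones quoted above. The example scores are made up and are not our arena data.

```python
from math import sqrt


def average_ranks(values):
    # 1-based ranks; tied values share the average of the ranks they span.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(a, b):
    # Spearman rho is the Pearson correlation of the two rank vectors.
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length score lists with at least 2 items")
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0 or vb == 0:
        return float("nan")  # a judge that gives every item the same score
    return cov / sqrt(va * vb)


def agreement_band(rho):
    if rho >= 0.85:
        return "meets human-anchor threshold"
    if rho >= 0.70:
        return "moderate"
    return "below moderate"


# Made-up scores on the 5-point scale, for illustration only.
judge_a = [0.0, 0.25, 0.25, 0.5, 0.75, 1.0]
judge_b = [0.25, 0.0, 0.5, 0.25, 1.0, 0.75]
rho = spearman_rho(judge_a, judge_b)
print(f"rho = {rho:.3f} -> {agreement_band(rho)}")
print(agreement_band(0.552))
```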
The direction of disagreement is consistent though:
- `flash-lite-preview` is the most lenient (mean ~1.0 on items where both newer judges scored 0.25-0.5).
- `2.5-flash` is moderate (mean 0.268 on the calibration sample).
- `3.1-pro-preview` is the strictest (mean 0.214).
So the “right answer” is somewhere across all three — but we don’t have it without a human anchor.
What this means for the arena number
It does not invalidate the arena number. It contextualizes it. The 0.585 vs 0.582 headline is real — but the interpretation is:
Under `gemini-3.1-flash-lite-preview` as the judge for the LLM-judged scorers, on the AskTheDoctor 200-item held-out validation set with EXIT 4× compression on the prod RAG vector group, v2-atd scores 0.585 and Llama 4 Scout scores 0.582 on the unweighted 4-scorer mean. Re-running the same evaluation with `gemini-2.5-flash` would shift both numbers by an unknown amount (likely several percentage points each) and the gap may flip sign.
That’s a defensible measurement. It’s not a judgment-free measurement, and we now have to write that down every time we report it.
The decision we made: don’t re-run the arena under a different judge until we have a human-anchored calibration set. Switching judges between arena runs is methodologically invalid: you’d be changing two variables at once (model under test and yardstick), and any difference could come from either. Stay on gemini-3.1-flash-lite-preview as the established (if imperfect) baseline. Use the corrected aggregator (error filter + dynamic scorer enumeration) for all future runs. Add the methods caveat to every published headline.
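To make “error filter + dynamic scorer enumeration” concrete, here is a minimal sketch of an aggregator in that spirit. It is not the platform’s code: the record shape (`is_error`, `scores`), the function name `aggregate`, and the returned field names are assumptions. The scorer list is discovered from the data instead of being hard-coded, errored records are excluded before averaging, and the judge identity travels with the number so the methods caveat cannot be dropped.

```python
from collections import defaultdict


def aggregate(records, judge):
    """records: dicts with 'is_error' (bool) and 'scores' ({scorer_name: float})."""
    # Error filter: failed calls never reach the averages.
    valid = [r for r in records if not r["is_error"]]

    # Dynamic scorer enumeration: use whichever scorers appear in the data.
    per_scorer = defaultdict(list)
    for record in valid:
        for name, value in record["scores"].items():
            per_scorer[name].append(value)

    scorer_means = {name: sum(vals) / len(vals) for name, vals in per_scorer.items()}
    all_values = [v for vals in per_scorer.values() for v in vals]
    overall = sum(all_values) / len(all_values) if all_values else None

    return {
        "judge": judge,  # reported alongside every headline number
        "n_total": len(records),
        "n_errors": len(records) - len(valid),
        "n_valid": len(valid),
        "scorer_means": scorer_means,
        "overall": overall,
    }


# Toy records with hypothetical scorer names, for illustration only.
records = [
    {"is_error": False, "scores": {"scorer-a": 1.0, "scorer-b": 0.5}},
    {"is_error": False, "scores": {"scorer-a": 0.5, "scorer-b": 0.0}},
    {"is_error": True, "scores": {"scorer-a": 0.0, "scorer-b": 0.79}},
]
summary = aggregate(records, judge="gemini-3.1-flash-lite-preview")
print(summary)
```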
The sustainable fix is a human-labeled gold benchmark for the completeness-coverage rubric. ~50 (response, gold) pairs with a domain expert’s ratings on a 5-point scale ($\{0, 0.25, 0.5, 0.75, 1.0\}$). Each candidate judge’s ρ vs the human gold gives an actual truth-anchor instead of judge-vs-judge noise. The expert in our case is Dr. Joel Fuhrman; the calibration session is queued; the result will land as an addendum here and in the Calibrating the Judge post.
The routing layer: why the calibration session pays twice
Calibration sessions do two things at once. The first is the one above — they tell us which judge to trust. The second is the part that ships into production: each calibration question carries up to $V$ variants, and the rater picks one as best overall. That winner pick records an asynchronous routing preference — the rater’s choice of RAG vector group for the question class that question belongs to. The next user query the platform classifies into the same question class biases toward the winning vector group.
Five winner picks later, the routing layer has learned the rater’s preference for that class. Fifty winner picks later, it covers the major question classes in the suite. The same human, the same 50 answers, two production systems improved.
Concretely, the loop:
- The rater answers the same question against (e.g.) three RAG vector groups: “core nutrition corpus” / “user-submitted Q&A” / “video transcripts.”
- They pick the answer from the “core nutrition corpus” as best overall.
- The next user asks a similar question → the platform routes there first.
- Five ratings later, the routing has learned the rater’s preference for that class.
The upper-bound cost of this is the same 30 minutes of expert time the calibration session already takes. The marginal cost of adding the routing-update side effect is one extra column in the calibration UI and one async write to the routing-preference store per rated answer.
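The loop above can be sketched as a small preference store. This is an in-memory illustration under stated assumptions, not the production routing layer: `RoutingPreferences`, its methods, the `min_picks` parameter, and the question-class label are hypothetical, and the write here is synchronous where the real one is described as async.

```python
from collections import Counter, defaultdict


class RoutingPreferences:
    """Counts winner picks per question class and routes to the most-picked group."""

    def __init__(self, default_group):
        self.default_group = default_group
        self._wins = defaultdict(Counter)  # question_class -> Counter of vector groups

    def record_winner(self, question_class, vector_group):
        # One write per rated answer: the rater's "best overall" pick.
        self._wins[question_class][vector_group] += 1

    def preferred_group(self, question_class, min_picks=1):
        wins = self._wins.get(question_class)
        if not wins:
            return self.default_group  # no ratings yet for this class
        group, count = wins.most_common(1)[0]
        return group if count >= min_picks else self.default_group


prefs = RoutingPreferences(default_group="core nutrition corpus")

# Hypothetical question class; vector group names are the examples from the list above.
for _ in range(4):
    prefs.record_winner("example-class", "video transcripts")
prefs.record_winner("example-class", "user-submitted Q&A")

print(prefs.preferred_group("example-class"))               # most-picked group for this class
print(prefs.preferred_group("unrated-class"))               # falls back to the default
print(prefs.preferred_group("example-class", min_picks=5))  # not enough picks yet
```

A `min_picks` floor is one way to express the “five winner picks later” idea: routing only changes once a class has accumulated enough ratings to be worth trusting.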
The five-LLM scoreboard
A follow-on arena extended the comparison across five LLMs on the same suite, scored with a different three-scorer matrix optimized for the calibration page workflow (50-item suite; llm-similarity-to-expected, llm-factual-consistency-vs-reference, llm-question-addressed instead of the 4-scorer matrix above). The numbers below are not directly comparable to the 4-scorer Arena B means — different scorers, different item count — but they show the broader competitive shape on this corpus.
Sorted by overall mean, descending:
| LLM | Pass rate | Overall mean | Median latency |
|---|---|---|---|
| GPT-OSS-120B | 50/50 | 0.555 | 5.4 s |
| Kimi K2.6 | 49/50 | 0.553 | 16.2 s |
| Gemini 3.1 Pro | 49/50 | 0.530 | 18.9 s |
| Claude Opus 4.7 (Vertex) | 50/50 | 0.473 | 3.3 s |
| DFO QLoRA / DFlash baseline | 50/50 | 0.398 | 3.0 s |
Per-scorer breakdown for Claude Opus 4.7 (the only model with the new-matrix per-scorer numbers landed so far): llm-similarity-to-expected = 0.380, llm-factual-consistency-vs-reference = 0.403, llm-question-addressed = 0.600. Comparable per-scorer breakdowns for the other four models are pending the next sweep — that’s the chart this section is missing.
The pattern that’s visible even without per-scorer detail: a frontier model that hedges (answers cautiously without a persona prefix to anchor on) reads as moderate on similarity-to-expected but high on question-addressed. The 0.600 vs 0.380 spread inside a single model is the signal that human-anchored calibration is what’s needed to resolve “is this answer good, or just safe?”
In summary
The arena answered the product question we asked. v2-atd is essentially tied with Llama 4 Scout on the AskTheDoctor corpus when both have RAG access at a configuration that fits v2-atd’s prompt budget. The “v2-atd can’t do RAG” hypothesis was a context-window blocker; EXIT 4× resolves it with measurably zero accuracy cost.
The arena also answered a question we didn’t ask and would have preferred to keep tacit: two flagship LLM judges scoring the same answers under the same rubric land at ρ = 0.552. That’s not a defect of either judge. It’s a measurement of how much of the score depends on the choice of judge, and it’s the case for every arena report carrying its judge identity in the methods caveat until a human anchor exists.
When that anchor lands, the same calibration session pays for itself twice: the LLM judges get blessed (or not) for scoring the rest of the suite, and the production routing layer learns which RAG vector group the human prefers for each question class. One human session, two systems improved.
The scaffolding to make this work is shipped: a unified calibration session that accepts ratings from web, CLI, MCP, or agent surfaces; the multi-scorer arena; the dynamic aggregator; the error-pattern filter. The remaining piece is the human anchor.
References
- RAGAS — automated RAG evaluation. Es, James, Espinosa-Anke, Schockaert, RAGAS: Automated Evaluation of Retrieval Augmented Generation (arXiv:2309.15217). The reference-free RAG eval framework; the arena described in this post is a corpus-specific, judge-pluggable extension that measures the same kinds of axes (faithfulness, answer relevance, context relevance) plus a final-answer Spearman ρ between judges.
- BEIR — heterogeneous IR benchmark. Thakur, Reimers, Rücklé, Srivastava, Gurevych, BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of IR Models (NeurIPS Datasets & Benchmarks 2021, arXiv:2104.08663). The convention this arena follows for splitting evaluation by retrieval style and document type.
- MT-Bench & LLM-as-judge agreement. Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (NeurIPS 2023, arXiv:2306.05685). Reports >80% overall GPT-4-vs-human agreement, with wide per-category variance. The ρ = 0.552 between two judges in this post is consistent with that variance once you slice by domain.
- Spearman ρ — definition. Spearman's rank correlation coefficient. The arena's per-scorer agreement chart and the headline ρ = 0.552 finding both use Spearman because routing decisions consume rank order, not absolute scores.
- Internal arena data — the three charts above. The AskTheDoctor 200-item medical corpus, the v2-atd / Llama 4 Scout / EXIT-4× tie outcome, the per-scorer disagreement table, and the ρ = 0.552 inter-judge measurement are all from the Divinci-AI ScoredQA platform. The arena and its routing layer are described in the companion Calibrating the AI Judge post; the slice-aware Spearman gate that this calibration feeds is documented in the release-pipeline post.
