Your Eval Is Broken for the Same Reason Your LLM Is
"If you can't measure it, you can't improve it"
We didn't build RCE because we had a clever idea. We built it because we finally learned to measure what was actually broken.
For months, we knew something was wrong. Users told us. But we couldn't point at what. Once we learned to measure context utilization and reasoning fidelity, the problems became visible. And visible problems have solutions.
This is the story of learning what to measure. The interventions came after.
Every problem with LLMs boils down to two things:
- The right prompt
- The right context
This is true for your product. It's true for your agents. And it's true for evaluation itself.
We learned this the hard way - months spent on evaluation approaches that didn't work before we realized that the same principles that make LLMs work also make LLM evaluation work.
The Core Insight
When your LLM gives a bad response, it's one of two problems:
| Problem | What It Looks Like |
|---|---|
| Wrong prompt | Model misunderstands the task, wrong format, ignores constraints |
| Wrong context | Model lacks information, uses stale data, hallucinates to fill gaps |
Evaluation is no different. When your evals give misleading results, it's the same two problems:
| Eval Problem | Root Cause |
|---|---|
| Inconsistent scores | Prompt doesn't constrain the judge well enough |
| Scores don't match human judgment | Context missing - judge doesn't know what "good" means for your domain |
| High scores but users still complain | Evaluating the wrong thing - prompt asks the wrong question |
Fix the prompt. Fix the context. The rest follows.
LLM as a Judge
Using an LLM to evaluate LLM outputs is powerful - but not universally applicable.
When LLM-as-Judge Works
| Use Case | Why It Works |
|---|---|
| Style and tone | Subjective but pattern-matchable |
| Instruction following | Clear criteria, binary-ish |
| Coherence and fluency | LLMs are trained on this |
| Relative comparison | "Which is better: A or B?" |
| Reasoning chain validity | Can trace logic steps |
When LLM-as-Judge Fails
| Use Case | Why It Fails |
|---|---|
| Factual accuracy | Judge hallucinates same as subject |
| Domain expertise | Judge doesn't know your domain better than the model being judged |
| Numerical precision | LLMs are weak at arithmetic - including simply counting the errors in a response |
| Subtle correctness | "Almost right" looks right to the judge |
| Novel failure modes | Judge has same blind spots |
The Judge's Blind Spot
An LLM judge will confidently score a response as correct when:
- It sounds authoritative
- It matches the judge's training distribution
- The error requires domain expertise to catch
This is why domain-specific eval needs domain-specific ground truth - not another LLM.
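What that looks like in practice: check claims against a small table of domain facts in code, instead of asking another model to vouch for them. A minimal sketch - the function name and the example fact are hypothetical placeholders:

```python
def check_against_ground_truth(response: str, facts: dict[str, str]) -> dict[str, bool]:
    """For each known fact, report whether the response states the expected value."""
    return {name: expected in response for name, expected in facts.items()}

# Illustrative domain fact (placeholder value):
facts = {"contribution_limit": "$7,000"}
print(check_against_ground_truth(
    "You can contribute up to $7,000 to this account per year.", facts
))  # {'contribution_limit': True}
```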
Use a Different Model Family
If your product uses GPT-4, don't use GPT-4 as the judge.
Same model family means:
- Same training data → same biases
- Same blind spots → same errors go unnoticed
- Same "sounds right" patterns → mutual agreement on wrong answers
When your production model makes a subtle error, a judge from the same family will likely think it's correct - they learned the same patterns of what "sounds authoritative."
| Setup | Problem |
|---|---|
| GPT-4 in prod, GPT-4 as judge | Shared blind spots, errors look correct to both |
| Claude in prod, Claude as judge | Same issue - correlated failures |
| GPT-4 in prod, Claude as judge | Different training → independent evaluation |
Different model families have different:
- Training data distributions
- Failure modes
- "Sounds right" heuristics
This independence is what makes cross-family evaluation valuable. The judge catches errors the production model is blind to - because they don't share the same blindness.
Caveat: This doesn't guarantee correctness. Both could be wrong in different ways. But it reduces correlated failures, which is the real enemy of evaluation.
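One way to make cross-family judging hard to skip is to encode it in config. A minimal sketch, assuming a generic `call_llm` callable - the model names are placeholders, not a specific vendor API:

```python
PRODUCTION = {"model": "gpt-4o", "family": "openai"}
JUDGE = {"model": "claude-sonnet", "family": "anthropic"}

def judged_score(response: str, rubric: str, call_llm):
    """Score a production response with a judge from a different model family."""
    if JUDGE["family"] == PRODUCTION["family"]:
        raise ValueError("Judge shares a model family with production; "
                         "correlated blind spots will inflate scores.")
    prompt = f"{rubric}\n\nResponse to evaluate:\n{response}\n\nScore (1-5):\nReasoning:"
    return call_llm(model=JUDGE["model"], prompt=prompt)
```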
Choosing the Right Units
Not everything is a 1-10 scale. Not everything is pass/fail.
Binary Metrics
Best for clear-cut criteria:
| Metric | Binary? | Why |
|---|---|---|
| Did it use the user's name correctly? | Yes | Either right or wrong |
| Did it follow the format constraint? | Yes | JSON or not JSON |
| Did it refuse when it should have? | Yes | Refusal happened or didn't |
| Is the response "good"? | No | Too subjective for binary |
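Binary checks like the first three are usually cheaper and more reliable as code than as judge calls. A sketch, assuming you have the raw response text - the refusal heuristic is deliberately crude and only illustrative:

```python
import json

def used_name_correctly(response: str, user_name: str) -> bool:
    return user_name in response

def follows_json_format(response: str) -> bool:
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def refused(response: str) -> bool:
    # Crude illustrative heuristic; a real refusal check would be more robust.
    return any(p in response.lower() for p in ("i can't help", "i cannot help"))
```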
Scaled Metrics
Best for quality gradients:
| Metric | Scale | Anchors |
|---|---|---|
| Helpfulness | 1-5 | 1 = useless, 3 = acceptable, 5 = exceptional |
| Factual accuracy | 1-5 | 1 = wrong, 3 = mostly correct, 5 = fully accurate |
| Reasoning quality | 1-5 | 1 = incoherent, 3 = sound, 5 = insightful |
Critical: Define anchors. "4 out of 5" means nothing without calibration.
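One way to keep anchors calibrated is to store them as data and render them into every judge prompt. A sketch - the anchor texts mirror the factual-accuracy rubric used later in this post; the names are otherwise illustrative:

```python
FACTUAL_ACCURACY_ANCHORS = {
    1: "Contains major factual errors",
    2: "Contains minor factual errors",
    3: "Factually correct but generic (ignores the user's specific profile)",
    4: "Factually correct and uses the user's profile data",
    5: "Factually correct, uses profile data, and makes accurate inferences",
}

def render_anchors(anchors: dict[int, str]) -> str:
    """Render anchors as the '1 = ...' lines that go into the judge prompt."""
    return "\n".join(f"{score} = {text}" for score, text in sorted(anchors.items()))
```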
Decomposed Metrics
When a single score hides too much:
Overall Response Quality: 3.5/5
Breakdown:
- Factual accuracy: 4/5 (one minor error)
- Relevance: 5/5 (directly addressed question)
- Completeness: 2/5 (missed key aspect)
- Tone: 4/5 (slightly formal for context)
The aggregate hides the completeness problem. Decomposition reveals it.
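A sketch of keeping per-dimension scores and deriving the aggregate (here a plain mean, which is an assumption), so the weak dimension never disappears into one number:

```python
from dataclasses import dataclass

@dataclass
class DecomposedScore:
    factual_accuracy: int
    relevance: int
    completeness: int
    tone: int

    def overall(self) -> float:
        values = list(self.__dict__.values())
        return sum(values) / len(values)

    def weakest(self) -> tuple[str, int]:
        return min(self.__dict__.items(), key=lambda kv: kv[1])

score = DecomposedScore(factual_accuracy=4, relevance=5, completeness=2, tone=4)
print(score.overall())  # 3.75 - looks acceptable in aggregate
print(score.weakest())  # ('completeness', 2) - the real problem
```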
Evaluating Conversations
Single-turn eval is easy. Conversation eval is where it gets hard.
The Session Problem
A conversation isn't just N independent turns. It has:
- State accumulation - context builds over turns
- Goal progression - user is trying to accomplish something
- Recovery opportunities - bad turn can be fixed later
- Compounding errors - bad turn can derail everything
What to Measure at Session Level
| Metric | What It Captures |
|---|---|
| Task completion | Did the user accomplish their goal? |
| Turns to completion | Efficiency - fewer is usually better |
| Recovery rate | When it went wrong, did it recover? |
| Drop-off point | Where do users abandon? |
| Context coherence | Did the agent maintain accurate state? |
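Most of these fall out of turn-level records. A minimal sketch - the `Turn` fields are assumptions about what your conversation logs contain:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    ok: bool               # turn-level judgment: was this response good?
    goal_completed: bool   # was the user's goal met as of this turn?

def session_metrics(turns: list[Turn]) -> dict:
    completed_at = next((i for i, t in enumerate(turns) if t.goal_completed), None)
    failures = [i for i, t in enumerate(turns) if not t.ok]
    return {
        "task_completed": completed_at is not None,
        "turns_to_completion": None if completed_at is None else completed_at + 1,
        "had_bad_turn": bool(failures),
        "recovered": bool(failures) and completed_at is not None
                     and completed_at > failures[0],
    }
```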
Turn-Level vs Session-Level
| Turn-Level | Session-Level |
|---|---|
| "Was this response good?" | "Was this conversation successful?" |
| High scores possible with bad outcome | Captures actual user success |
| Easier to measure | Harder to define |
| Useful for debugging | Useful for product metrics |
You need both. Turn-level tells you where it broke. Session-level tells you whether it broke.
Getting the Prompt Right (For Evaluation)
The eval prompt is your rubric. A vague prompt gives inconsistent results.
Bad Eval Prompt
Rate this response from 1-5 for quality.
Problems:
- What is "quality"?
- No anchors for scores
- No context about domain/task
- Different judges will interpret differently
Better Eval Prompt
You are evaluating a response from a financial planning AI assistant.
Context:
- The user asked about retirement savings options
- The user's profile shows: [age, income, risk tolerance, current savings]
- The assistant should reference specific tax-advantaged accounts
Evaluate the response on FACTUAL ACCURACY only:
1 = Contains major factual errors about finance or the user's situation
2 = Contains minor factual errors
3 = Factually correct but generic (doesn't use user's specific profile)
4 = Factually correct and uses user's profile data
5 = Factually correct, uses profile data, and makes accurate inferences
Response to evaluate:
[response]
Score (1-5):
Reasoning:
This works because:
- Single dimension (factual accuracy)
- Clear anchors with examples
- Domain context provided
- Requires reasoning (catches lazy scoring)
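In practice it helps to assemble that prompt from parts rather than hand-writing it per experiment. A sketch whose template mirrors the prompt above - the function and argument names are placeholders:

```python
JUDGE_TEMPLATE = """You are evaluating a response from a {domain} assistant.

Context:
{context}

Evaluate the response on {dimension} only:
{anchors}

Response to evaluate:
{response}

Score (1-5):
Reasoning:"""

def build_judge_prompt(domain: str, context: str, dimension: str,
                       anchors: str, response: str) -> str:
    """Render a single-dimension, anchored judge prompt."""
    return JUDGE_TEMPLATE.format(domain=domain, context=context,
                                 dimension=dimension, anchors=anchors,
                                 response=response)
```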
Tuning for False Positives vs False Negatives
Your eval prompt implicitly sets a threshold. You can tune it to be strict or lenient - and the right choice depends on what you're optimizing for.
| Tuning direction | What happens | Prompt language |
|---|---|---|
| More false positives (strict) | Flags good responses as bad | "If there's ANY doubt, mark incorrect" |
| More false negatives (lenient) | Lets bad responses through | "Only mark incorrect if clearly, obviously wrong" |
When to be strict (prefer false positives):
| Domain | Why |
|---|---|
| Safety / compliance | Missing harmful content is worse than over-flagging |
| Medical / financial accuracy | Wrong info has real consequences |
| Legal / regulatory | Conservative is safer |
When you're strict, you accept that some good responses get flagged. That's fine - you'd rather manually review false alarms than miss real problems.
When to be lenient (tolerate false negatives):
| Domain | Why |
|---|---|
| Style / tone | Subjective, over-flagging kills creativity |
| Helpfulness | Don't penalize unconventional but correct answers |
| Exploratory responses | Let the model take risks |
When you're lenient, you accept that some bad responses slip through. That's fine - you'd rather not suppress good responses in the name of catching every edge case.
The key insight: There's no "correct" threshold. It depends on the cost of each error type:
Cost of false positive = good response marked bad = user sees rejection, or you waste time reviewing
Cost of false negative = bad response marked good = user sees bad output, trust erodes
Tune your prompt language to reflect which cost you're optimizing against.
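Mechanically, that can be as small as swapping one instruction in the judge prompt. A sketch - the suffix texts come from the table above; the helper is hypothetical:

```python
STRICT_SUFFIX = "If there's ANY doubt, mark the response incorrect."
LENIENT_SUFFIX = "Only mark the response incorrect if it is clearly, obviously wrong."

def with_threshold(judge_prompt: str, prefer_false_positives: bool) -> str:
    """Strict for safety/medical/legal work; lenient for style and exploration."""
    suffix = STRICT_SUFFIX if prefer_false_positives else LENIENT_SUFFIX
    return f"{judge_prompt}\n\n{suffix}"
```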
Getting the Context Right (For Evaluation)
The judge needs the same context the model had - plus ground truth.
What the Judge Needs
| Context | Why |
|---|---|
| Original user input | What was the task? |
| System prompt | What were the constraints? |
| Retrieved/injected context | What information was available? |
| Ground truth (if available) | What's the right answer? |
| Domain rubric | What does "good" mean here? |
Common Context Mistakes
| Mistake | Result |
|---|---|
| Judge doesn't see system prompt | Penalizes responses that followed instructions |
| Judge doesn't see retrieved context | Can't assess context utilization |
| No ground truth for factual claims | Judge hallucinates correctness |
| Generic rubric for domain task | Scores don't reflect domain quality |
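A sketch of the context bundle the judge should see - everything the production model saw, plus ground truth and the domain rubric. The field names are assumptions about how you store these:

```python
def build_judge_context(user_input: str, system_prompt: str,
                        retrieved_context: str, rubric: str,
                        ground_truth: str | None = None) -> str:
    parts = [
        f"Original user input:\n{user_input}",
        f"System prompt the assistant was given:\n{system_prompt}",
        f"Context available to the assistant:\n{retrieved_context}",
        f"Domain rubric (what 'good' means here):\n{rubric}",
    ]
    if ground_truth:
        parts.append(f"Ground truth:\n{ground_truth}")
    return "\n\n".join(parts)
```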
The Evaluation Stack
Putting it together:
┌─────────────────────────────────────────────────┐
│ What to Evaluate │
├─────────────────────────────────────────────────┤
│ Turn-level: individual response quality │
│ Session-level: conversation success │
│ Dimension-level: specific quality aspects │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ How to Evaluate │
├─────────────────────────────────────────────────┤
│ LLM-as-judge: style, coherence, relative rank │
│ Programmatic: format, constraints, keywords │
│ Human expert: domain accuracy, subtle quality │
│ User signal: completion, retention, feedback │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Evaluation Quality │
├─────────────────────────────────────────────────┤
│ Right prompt: specific, anchored, reasoned │
│ Right context: full info, ground truth, rubric │
│ Right units: binary/scaled/decomposed │
└─────────────────────────────────────────────────┘
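One way to operationalize the stack is a small registry that routes each dimension to the evaluator type suited to it. A sketch - the registry entries are illustrative:

```python
EVALUATOR_FOR = {
    "format_constraints": "programmatic",   # format, constraints, keywords
    "style_and_tone":     "llm_judge",      # cross-family judge
    "coherence":          "llm_judge",
    "domain_accuracy":    "human_expert",   # needs ground truth / expertise
    "task_completion":    "user_signal",    # session-level outcome
}

def evaluators_needed(dimensions: list[str]) -> set[str]:
    """Which evaluator types a given eval run has to provision."""
    return {EVALUATOR_FOR[d] for d in dimensions if d in EVALUATOR_FOR}
```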
What We Learned
Going from 60% to 90% wasn't about finding a better model. It was about:
- Recognizing that eval is the same problem - prompt and context
- Knowing when LLM-as-judge works - and when it doesn't
- Choosing the right units - binary, scaled, or decomposed
- Evaluating sessions, not just turns - because users have goals
- Treating eval prompts as rubrics - specific, anchored, domain-aware
The 30% gap was hiding in vague prompts, missing context, and aggregate scores that hid failure modes.
Built at Anuvaya. We're building AI in India - if this resonates, we'd love to hear from you.