Your Eval Is Broken for the Same Reason Your LLM Is

"If you can't measure it, you can't improve it"

We didn't build RCE because we had a clever idea. We built it because we finally learned to measure what was actually broken.

For months, we knew something was wrong. Users told us. But we couldn't point at what. Once we learned to measure context utilization and reasoning fidelity, the problems became visible. And visible problems have solutions.

This is the story of learning what to measure. The interventions came after.


Every problem with LLMs boils down to two things:

  1. The right prompt
  2. The right context

This is true for your product. It's true for your agents. And it's true for evaluation itself.

We learned this the hard way - spending months on evaluation approaches that didn't work before realizing: the same principles that make LLMs work also make LLM evaluation work.


The Core Insight

When your LLM gives a bad response, it's one of two problems:

| Problem | What It Looks Like |
|---|---|
| Wrong prompt | Model misunderstands the task, wrong format, ignores constraints |
| Wrong context | Model lacks information, uses stale data, hallucinates to fill gaps |

Evaluation is no different. When your evals give misleading results, it's the same two problems:

| Eval Problem | Root Cause |
|---|---|
| Inconsistent scores | Prompt doesn't constrain the judge well enough |
| Scores don't match human judgment | Context missing - judge doesn't know what "good" means for your domain |
| High scores but users still complain | Evaluating the wrong thing - prompt asks the wrong question |

Fix the prompt. Fix the context. The rest follows.


LLM as a Judge

Using an LLM to evaluate LLM outputs is powerful - but not universally applicable.

When LLM-as-Judge Works

| Use Case | Why It Works |
|---|---|
| Style and tone | Subjective but pattern-matchable |
| Instruction following | Clear criteria, binary-ish |
| Coherence and fluency | LLMs are trained on this |
| Relative comparison | "Which is better: A or B?" |
| Reasoning chain validity | Can trace logic steps |

When LLM-as-Judge Fails

| Use Case | Why It Fails |
|---|---|
| Factual accuracy | Judge hallucinates same as subject |
| Domain expertise | Judge doesn't know your domain better than the model being judged |
| Numerical precision | LLMs are bad at math, including counting errors |
| Subtle correctness | "Almost right" looks right to the judge |
| Novel failure modes | Judge has same blind spots |

The Judge's Blind Spot

An LLM judge will confidently score a response as correct when:

  • It sounds authoritative
  • It matches the judge's training distribution
  • The error requires domain expertise to catch

This is why domain-specific eval needs domain-specific ground truth - not another LLM.

Use a Different Model Family

If your product uses GPT-4, don't use GPT-4 as the judge.

Same model family means:

  • Same training data → same biases
  • Same blind spots → same errors go unnoticed
  • Same "sounds right" patterns → mutual agreement on wrong answers

When your production model makes a subtle error, a judge from the same family will likely think it's correct - they learned the same patterns of what "sounds authoritative."

| Setup | Problem |
|---|---|
| GPT-4 in prod, GPT-4 as judge | Shared blind spots, errors look correct to both |
| Claude in prod, Claude as judge | Same issue - correlated failures |
| GPT-4 in prod, Claude as judge | Different training → independent evaluation |

Different model families have different:

  • Training data distributions
  • Failure modes
  • "Sounds right" heuristics

This independence is what makes cross-family evaluation valuable. The judge catches errors the production model is blind to - because they don't share the same blindness.

Caveat: This doesn't guarantee correctness. Both could be wrong in different ways. But it reduces correlated failures, which is the real enemy of evaluation.
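
In code, the rule is simple to enforce. Here's a minimal sketch - the model names, family mapping, and pick_judge helper are illustrative placeholders, not a prescription for any particular provider:

```python
# Sketch: never draw the judge from the production model's family.
# Model names and the family mapping are placeholders - swap in whatever
# your stack actually runs.

MODEL_FAMILY = {
    "gpt-4o": "openai",
    "gpt-4.1": "openai",
    "claude-sonnet-4": "anthropic",
    "gemini-2.5-pro": "google",
}

def pick_judge(production_model: str, candidates: list[str]) -> str:
    """Return the first candidate judge from a different family than production."""
    prod_family = MODEL_FAMILY[production_model]
    for judge in candidates:
        if MODEL_FAMILY[judge] != prod_family:
            return judge
    raise ValueError(
        "No cross-family judge available - add a candidate from another "
        "provider to avoid correlated failures."
    )

# GPT-4-class model in production -> the judge comes from another family.
judge_model = pick_judge("gpt-4o", ["gpt-4.1", "claude-sonnet-4"])
```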


Choosing the Right Units

Not everything is a 1-10 scale. Not everything is pass/fail.

Binary Metrics

Best for clear-cut criteria:

| Metric | Binary? | Why |
|---|---|---|
| Did it use the user's name correctly? | Yes | Either right or wrong |
| Did it follow the format constraint? | Yes | JSON or not JSON |
| Did it refuse when it should have? | Yes | Refusal happened or didn't |
| Is the response "good"? | No | Too subjective for binary |
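
These binary checks usually don't need an LLM at all. A minimal programmatic sketch, assuming the response is a plain string and the expected user name comes from the test case (the refusal markers are a rough heuristic you'd tune to your own product):

```python
import json

def is_valid_json(response: str) -> bool:
    """Binary: did the model honor the JSON format constraint?"""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def uses_name(response: str, user_name: str) -> bool:
    """Binary: did the response address the user by name?"""
    return user_name.lower() in response.lower()

def refused(response: str) -> bool:
    """Binary: did the model refuse? Markers are a placeholder heuristic."""
    markers = ("i can't help", "i cannot help", "i'm not able to")
    return any(marker in response.lower() for marker in markers)
```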

Scaled Metrics

Best for quality gradients:

| Metric | Scale | Anchors |
|---|---|---|
| Helpfulness | 1-5 | 1 = useless, 3 = acceptable, 5 = exceptional |
| Factual accuracy | 1-5 | 1 = wrong, 3 = mostly correct, 5 = fully accurate |
| Reasoning quality | 1-5 | 1 = incoherent, 3 = sound, 5 = insightful |

Critical: Define anchors. "4 out of 5" means nothing without calibration.

Decomposed Metrics

When a single score hides too much:

Overall Response Quality: 3.75/5

Breakdown:
- Factual accuracy: 4/5 (one minor error)
- Relevance: 5/5 (directly addressed question)
- Completeness: 2/5 (missed key aspect)
- Tone: 4/5 (slightly formal for context)

The aggregate hides the completeness problem. Decomposition reveals it.
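
A minimal sketch of what decomposed scoring can look like in code - the dimensions mirror the example above, and weak_dimensions() surfaces exactly what the aggregate hides:

```python
from dataclasses import dataclass

@dataclass
class DecomposedScore:
    factual_accuracy: int
    relevance: int
    completeness: int
    tone: int

    def aggregate(self) -> float:
        """Unweighted mean - convenient, but it buries low sub-scores."""
        values = list(vars(self).values())
        return sum(values) / len(values)

    def weak_dimensions(self, threshold: int = 3) -> list[str]:
        """Every dimension scoring below the threshold."""
        return [name for name, value in vars(self).items() if value < threshold]

score = DecomposedScore(factual_accuracy=4, relevance=5, completeness=2, tone=4)
print(score.aggregate())        # 3.75 - looks acceptable in isolation
print(score.weak_dimensions())  # ['completeness'] - the real problem
```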


Evaluating Conversations

Single-turn eval is easy. Conversation eval is where it gets hard.

The Session Problem

A conversation isn't just N independent turns. It has:

  • State accumulation - context builds over turns
  • Goal progression - user is trying to accomplish something
  • Recovery opportunities - bad turn can be fixed later
  • Compounding errors - bad turn can derail everything

What to Measure at Session Level

| Metric | What It Captures |
|---|---|
| Task completion | Did the user accomplish their goal? |
| Turns to completion | Efficiency - fewer is usually better |
| Recovery rate | When it went wrong, did it recover? |
| Drop-off point | Where do users abandon? |
| Context coherence | Did the agent maintain accurate state? |

Turn-Level vs Session-Level

| Turn-Level | Session-Level |
|---|---|
| "Was this response good?" | "Was this conversation successful?" |
| High scores possible with bad outcome | Captures actual user success |
| Easier to measure | Harder to define |
| Useful for debugging | Useful for product metrics |

You need both. Turn-level tells you where it broke. Session-level tells you if it broke.
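
A minimal sketch of session-level metrics, assuming each turn already carries a turn-level pass/fail judgment and each session records whether the user's goal was met (from user signal or annotation). Field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    passed: bool  # turn-level judgment

@dataclass
class Session:
    turns: list[Turn]
    goal_completed: bool  # session-level outcome

def session_metrics(sessions: list[Session]) -> dict[str, float]:
    completed = [s for s in sessions if s.goal_completed]
    # Sessions with at least one bad turn - candidates for recovery or derailment.
    troubled = [s for s in sessions if any(not t.passed for t in s.turns)]
    recovered = [s for s in troubled if s.goal_completed]
    return {
        "task_completion_rate": len(completed) / len(sessions),
        "avg_turns_to_completion": (
            sum(len(s.turns) for s in completed) / len(completed) if completed else 0.0
        ),
        "recovery_rate": len(recovered) / len(troubled) if troubled else 1.0,
    }
```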


Getting the Prompt Right (For Evaluation)

The eval prompt is your rubric. A vague prompt gives inconsistent results.

Bad Eval Prompt

Rate this response from 1-5 for quality.

Problems:

  • What is "quality"?
  • No anchors for scores
  • No context about domain/task
  • Different judges will interpret differently

Better Eval Prompt

You are evaluating a response from a financial planning AI assistant.

Context:
- The user asked about retirement savings options
- The user's profile shows: [age, income, risk tolerance, current savings]
- The assistant should reference specific tax-advantaged accounts

Evaluate the response on FACTUAL ACCURACY only:
1 = Contains major factual errors about finance or the user's situation
2 = Contains minor factual errors
3 = Factually correct but generic (doesn't use user's specific profile)
4 = Factually correct and uses user's profile data
5 = Factually correct, uses profile data, and makes accurate inferences

Response to evaluate:
[response]

Score (1-5):
Reasoning:

This works because:

  • Single dimension (factual accuracy)
  • Clear anchors with examples
  • Domain context provided
  • Requires reasoning (catches lazy scoring)
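
A minimal sketch of running a rubric like this, assuming the judge call itself lives elsewhere - the value is keeping the rubric versioned next to code and forcing the score and reasoning into a parseable shape:

```python
import re

RUBRIC_TEMPLATE = """You are evaluating a response from a financial planning AI assistant.

Context:
{context}

Evaluate the response on FACTUAL ACCURACY only:
1 = Contains major factual errors about finance or the user's situation
2 = Contains minor factual errors
3 = Factually correct but generic (doesn't use user's specific profile)
4 = Factually correct and uses user's profile data
5 = Factually correct, uses profile data, and makes accurate inferences

Response to evaluate:
{response}

Score (1-5):
Reasoning:"""

def render_judge_prompt(context: str, response: str) -> str:
    return RUBRIC_TEMPLATE.format(context=context, response=response)

def parse_judgment(raw: str) -> tuple[int, str]:
    """Extract the numeric score and the reasoning from the judge's reply."""
    score = re.search(r"Score\s*(?:\(1-5\))?\s*:\s*([1-5])", raw)
    reasoning = re.search(r"Reasoning:\s*(.+)", raw, re.DOTALL)
    if not score:
        raise ValueError("Judge did not return a parseable score")
    return int(score.group(1)), (reasoning.group(1).strip() if reasoning else "")
```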

Tuning for False Positives vs False Negatives

Your eval prompt implicitly sets a threshold. You can tune it to be strict or lenient - and the right choice depends on what you're optimizing for.

| Tuning direction | What happens | Prompt language |
|---|---|---|
| More false positives (strict) | Flags good responses as bad | "If there's ANY doubt, mark incorrect" |
| More false negatives (lenient) | Lets bad responses through | "Only mark incorrect if clearly, obviously wrong" |

When to be strict (prefer false positives):

| Domain | Why |
|---|---|
| Safety / compliance | Missing harmful content is worse than over-flagging |
| Medical / financial accuracy | Wrong info has real consequences |
| Legal / regulatory | Conservative is safer |

When you're strict, you accept that some good responses get flagged. That's fine - you'd rather manually review false alarms than miss real problems.

When to be lenient (tolerate false negatives):

| Domain | Why |
|---|---|
| Style / tone | Subjective, over-flagging kills creativity |
| Helpfulness | Don't penalize unconventional but correct answers |
| Exploratory responses | Let the model take risks |

When you're lenient, you accept that some bad responses slip through. That's fine - you'd rather not suppress good responses in the name of catching every edge case.

The key insight: There's no "correct" threshold. It depends on the cost of each error type:

Cost of false positive = good response marked bad = user sees rejection, or you waste time reviewing
Cost of false negative = bad response marked good = user sees bad output, trust erodes

Tune your prompt language to reflect which cost you're optimizing against.
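
One way to make that trade-off concrete: score a small human-labeled set under both prompt variants and compare expected cost. A minimal sketch, with illustrative cost numbers:

```python
def expected_cost(judgments: list[tuple[bool, bool]],
                  cost_fp: float, cost_fn: float) -> float:
    """judgments: (human_says_bad, judge_says_bad) pairs from a labeled set."""
    false_positives = sum(1 for human_bad, judge_bad in judgments
                          if judge_bad and not human_bad)
    false_negatives = sum(1 for human_bad, judge_bad in judgments
                          if human_bad and not judge_bad)
    return (false_positives * cost_fp + false_negatives * cost_fn) / len(judgments)

# Safety-critical domain: a missed bad response costs far more than a false alarm,
# so the strict prompt variant usually wins even though it over-flags.
# strict = expected_cost(strict_judgments, cost_fp=1.0, cost_fn=20.0)
# lenient = expected_cost(lenient_judgments, cost_fp=1.0, cost_fn=20.0)
```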


Getting the Context Right (For Evaluation)

The judge needs the same context the model had - plus ground truth.

What the Judge Needs

| Context | Why |
|---|---|
| Original user input | What was the task? |
| System prompt | What were the constraints? |
| Retrieved/injected context | What information was available? |
| Ground truth (if available) | What's the right answer? |
| Domain rubric | What does "good" mean here? |

Common Context Mistakes

| Mistake | Result |
|---|---|
| Judge doesn't see system prompt | Penalizes responses that followed instructions |
| Judge doesn't see retrieved context | Can't assess context utilization |
| No ground truth for factual claims | Judge hallucinates correctness |
| Generic rubric for domain task | Scores don't reflect domain quality |
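
A minimal sketch of the context bundle handed to the judge, mirroring the tables above - field names are illustrative, and the point is simply that nothing the production model saw gets dropped:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JudgeContext:
    user_input: str                      # what was the task?
    system_prompt: str                   # what were the constraints?
    retrieved_context: str               # what information was available?
    rubric: str                          # what does "good" mean here?
    ground_truth: Optional[str] = None   # the right answer, when one exists

    def render(self) -> str:
        parts = [
            f"User input:\n{self.user_input}",
            f"System prompt the model was given:\n{self.system_prompt}",
            f"Context the model had available:\n{self.retrieved_context}",
        ]
        if self.ground_truth:
            parts.append(f"Ground truth:\n{self.ground_truth}")
        parts.append(f"Evaluation rubric:\n{self.rubric}")
        return "\n\n".join(parts)
```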

The Evaluation Stack

Putting it together:

┌─────────────────────────────────────────────────┐
│              What to Evaluate                    │
├─────────────────────────────────────────────────┤
│  Turn-level: individual response quality         │
│  Session-level: conversation success             │
│  Dimension-level: specific quality aspects       │
└─────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────┐
│              How to Evaluate                     │
├─────────────────────────────────────────────────┤
│  LLM-as-judge: style, coherence, relative rank  │
│  Programmatic: format, constraints, keywords     │
│  Human expert: domain accuracy, subtle quality   │
│  User signal: completion, retention, feedback    │
└─────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────┐
│              Evaluation Quality                  │
├─────────────────────────────────────────────────┤
│  Right prompt: specific, anchored, reasoned      │
│  Right context: full info, ground truth, rubric  │
│  Right units: binary/scaled/decomposed           │
└─────────────────────────────────────────────────┘

What We Learned

Going from 60% to 90% wasn't about finding a better model. It was about:

  1. Recognizing that eval is the same problem - prompt and context
  2. Knowing when LLM-as-judge works - and when it doesn't
  3. Choosing the right units - binary, scaled, or decomposed
  4. Evaluating sessions, not just turns - because users have goals
  5. Treating eval prompts as rubrics - specific, anchored, domain-aware

The 30% gap was hiding in vague prompts, missing context, and aggregate scores that hid failure modes.


Built at Anuvaya. We're building AI in India - if this resonates, we'd love to hear from you.

Jashn Maloo
@jashnm