Your Eval Is Broken for the Same Reason Your LLM Is
"If you can't measure it, you can't improve it"
We didn't build RCE because we had a clever idea. We built it because we finally learned to measure what was actually broken.
For months, we knew something was wrong. Users told us. But we couldn't point at what. Once we learned to measure context utilization and reasoning fidelity, the problems became visible. And visible problems have solutions.
This is the story of learning what to measure. The interventions came after.
Every problem with LLMs boils down to two things:
- The right prompt
- The right context
This is true for your product. It's true for your agents. And it's true for evaluation itself.
We learned this the hard way - months spent on evaluation approaches that didn't work before we realized that the same principles that make LLMs work also make LLM evaluation work.
The Core Insight
When your LLM gives a bad response, it's one of two problems:
| Problem | What It Looks Like |
|---|---|
| Wrong prompt | Model misunderstands the task, wrong format, ignores constraints |
| Wrong context | Model lacks information, uses stale data, hallucinates to fill gaps |
Evaluation is no different. When your evals give misleading results, it's the same two problems:
| Eval Problem | Root Cause |
|---|---|
| Inconsistent scores | Prompt doesn't constrain the judge well enough |
| Scores don't match human judgment | Context missing - judge doesn't know what "good" means for your domain |
| High scores but users still complain | Evaluating the wrong thing - prompt asks the wrong question |
Fix the prompt. Fix the context. The rest follows.
LLM as a Judge
Using an LLM to evaluate LLM outputs is powerful - but not universally applicable.
When LLM-as-Judge Works
| Use Case | Why It Works |
|---|---|
| Style and tone | Subjective but pattern-matchable |
| Instruction following | Clear criteria, binary-ish |
| Coherence and fluency | LLMs are trained on this |
| Relative comparison | "Which is better: A or B?" |
| Reasoning chain validity | Can trace logic steps |
When LLM-as-Judge Fails
| Use Case | Why It Fails |
|---|---|
| Factual accuracy | Judge hallucinates same as subject |
| Domain expertise | Judge doesn't know your domain better than the model being judged |
| Numerical precision | LLMs are weak at arithmetic - including simply counting the errors in a response |
| Subtle correctness | "Almost right" looks right to the judge |
| Novel failure modes | Judge has same blind spots |
The Judge's Blind Spot
An LLM judge will confidently score a response as correct when:
- It sounds authoritative
- It matches the judge's training distribution
- The error requires domain expertise to catch
This is why domain-specific eval needs domain-specific ground truth - not another LLM.
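What that looks like in practice: check claims against a small table of domain facts in code, instead of asking another model to vouch for them. A minimal sketch - the function name and the example fact are hypothetical placeholders:

```python
def check_against_ground_truth(response: str, facts: dict[str, str]) -> dict[str, bool]:
    """For each known fact, report whether the response states the expected value."""
    return {name: expected in response for name, expected in facts.items()}

# Illustrative domain fact (placeholder value):
facts = {"contribution_limit": "$7,000"}
print(check_against_ground_truth(
    "You can contribute up to $7,000 to this account per year.", facts
))  # {'contribution_limit': True}
```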
Use a Different Model Family
If your product uses GPT-4, don't use GPT-4 as the judge.
Same model family means:
- Same training data → same biases
- Same blind spots → same errors go unnoticed
- Same "sounds right" patterns → mutual agreement on wrong answers
When your production model makes a subtle error, a judge from the same family will likely think it's correct - they learned the same patterns of what "sounds authoritative."
| Setup | Problem |
|---|---|
| GPT-4 in prod, GPT-4 as judge | Shared blind spots, errors look correct to both |
| Claude in prod, Claude as judge | Same issue - correlated failures |
| GPT-4 in prod, Claude as judge | Different training → independent evaluation |
Different model families have different:
- Training data distributions
- Failure modes
- "Sounds right" heuristics
This independence is what makes cross-family evaluation valuable. The judge catches errors the production model is blind to - because they don't share the same blindness.
Caveat: This doesn't guarantee correctness. Both could be wrong in different ways. But it reduces correlated failures, which is the real enemy of evaluation.
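One way to make cross-family judging hard to skip is to encode it in config. A minimal sketch, assuming a generic `call_llm` callable - the model names are placeholders, not a specific vendor API:

```python
PRODUCTION = {"model": "gpt-4o", "family": "openai"}
JUDGE = {"model": "claude-sonnet", "family": "anthropic"}

def judged_score(response: str, rubric: str, call_llm):
    """Score a production response with a judge from a different model family."""
    if JUDGE["family"] == PRODUCTION["family"]:
        raise ValueError("Judge shares a model family with production; "
                         "correlated blind spots will inflate scores.")
    prompt = f"{rubric}\n\nResponse to evaluate:\n{response}\n\nScore (1-5):\nReasoning:"
    return call_llm(model=JUDGE["model"], prompt=prompt)
```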
Choosing the Right Units
Not everything is a 1-10 scale. Not everything is pass/fail.
Binary Metrics
Best for clear-cut criteria:
| Metric | Binary? | Why |
|---|---|---|
| Did it use the user's name correctly? | Yes | Either right or wrong |
| Did it follow the format constraint? | Yes | JSON or not JSON |
| Did it refuse when it should have? | Yes | Refusal happened or didn't |
| Is the response "good"? | No | Too subjective for binary |
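Binary checks like the first three are usually cheaper and more reliable as code than as judge calls. A sketch, assuming you have the raw response text - the refusal heuristic is deliberately crude and only illustrative:

```python
import json

def used_name_correctly(response: str, user_name: str) -> bool:
    return user_name in response

def follows_json_format(response: str) -> bool:
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def refused(response: str) -> bool:
    # Crude illustrative heuristic; a real refusal check would be more robust.
    return any(p in response.lower() for p in ("i can't help", "i cannot help"))
```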
Scaled Metrics
Best for quality gradients:
| Metric | Scale | Anchors |
|---|---|---|
| Helpfulness | 1-5 | 1 = useless, 3 = acceptable, 5 = exceptional |
| Factual accuracy | 1-5 | 1 = wrong, 3 = mostly correct, 5 = fully accurate |
| Reasoning quality | 1-5 | 1 = incoherent, 3 = sound, 5 = insightful |
Critical: Define anchors. "4 out of 5" means nothing without calibration.
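One way to keep anchors calibrated is to store them as data and render them into every judge prompt. A sketch - the anchor texts mirror the factual-accuracy rubric used later in this post; the names are otherwise illustrative:

```python
FACTUAL_ACCURACY_ANCHORS = {
    1: "Contains major factual errors",
    2: "Contains minor factual errors",
    3: "Factually correct but generic (ignores the user's specific profile)",
    4: "Factually correct and uses the user's profile data",
    5: "Factually correct, uses profile data, and makes accurate inferences",
}

def render_anchors(anchors: dict[int, str]) -> str:
    """Render anchors as the '1 = ...' lines that go into the judge prompt."""
    return "\n".join(f"{score} = {text}" for score, text in sorted(anchors.items()))
```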
Decomposed Metrics
When a single score hides too much:
Overall Response Quality: 3.5/5
Breakdown:
- Factual accuracy: 4/5 (one minor error)
- Relevance: 5/5 (directly addressed question)
- Completeness: 2/5 (missed key aspect)
- Tone: 4/5 (slightly formal for context)
The aggregate hides the completeness problem. Decomposition reveals it.
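A sketch of keeping per-dimension scores and deriving the aggregate (here a plain mean, which is an assumption), so the weak dimension never disappears into one number:

```python
from dataclasses import dataclass

@dataclass
class DecomposedScore:
    factual_accuracy: int
    relevance: int
    completeness: int
    tone: int

    def overall(self) -> float:
        values = list(self.__dict__.values())
        return sum(values) / len(values)

    def weakest(self) -> tuple[str, int]:
        return min(self.__dict__.items(), key=lambda kv: kv[1])

score = DecomposedScore(factual_accuracy=4, relevance=5, completeness=2, tone=4)
print(score.overall())  # 3.75 - looks acceptable in aggregate
print(score.weakest())  # ('completeness', 2) - the real problem
```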
Evaluating Conversations
Single-turn eval is easy. Conversation eval is where it gets hard.
The Session Problem
A conversation isn't just N independent turns. It has:
- State accumulation - context builds over turns
- Goal progression - user is trying to accomplish something
- Recovery opportunities - bad turn can be fixed later
- Compounding errors - bad turn can derail everything
What to Measure at Session Level
| Metric | What It Captures |
|---|---|
| Task completion | Did the user accomplish their goal? |
| Turns to completion | Efficiency - fewer is usually better |
| Recovery rate | When it went wrong, did it recover? |
| Drop-off point | Where do users abandon? |
| Context coherence | Did the agent maintain accurate state? |
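Most of these fall out of turn-level records. A minimal sketch - the `Turn` fields are assumptions about what your conversation logs contain:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    ok: bool               # turn-level judgment: was this response good?
    goal_completed: bool   # was the user's goal met as of this turn?

def session_metrics(turns: list[Turn]) -> dict:
    completed_at = next((i for i, t in enumerate(turns) if t.goal_completed), None)
    failures = [i for i, t in enumerate(turns) if not t.ok]
    return {
        "task_completed": completed_at is not None,
        "turns_to_completion": None if completed_at is None else completed_at + 1,
        "had_bad_turn": bool(failures),
        "recovered": bool(failures) and completed_at is not None
                     and completed_at > failures[0],
    }
```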
Turn-Level vs Session-Level
| Turn-Level | Session-Level |
|---|---|
| "Was this response good?" | "Was this conversation successful?" |
| High scores possible with bad outcome | Captures actual user success |
| Easier to measure | Harder to define |
| Useful for debugging | Useful for product metrics |
You need both. Turn-level tells you where it broke. Session-level tells you whether it broke.
Getting the Prompt Right (For Evaluation)
The eval prompt is your rubric. A vague prompt gives inconsistent results.
Bad Eval Prompt
Rate this response from 1-5 for quality.
Problems:
- What is "quality"?
- No anchors for scores
- No context about domain/task
- Different judges will interpret differently
Better Eval Prompt
You are evaluating a response from a financial planning AI assistant.
Context:
- The user asked about retirement savings options
- The user's profile shows: [age, income, risk tolerance, current savings]
- The assistant should reference specific tax-advantaged accounts
Evaluate the response on FACTUAL ACCURACY only:
1 = Contains major factual errors about finance or the user's situation
2 = Contains minor factual errors
3 = Factually correct but generic (doesn't use user's specific profile)
4 = Factually correct and uses user's profile data
5 = Factually correct, uses profile data, and makes accurate inferences
Response to evaluate:
[response]
Score (1-5):
Reasoning:
This works because:
- Single dimension (factual accuracy)
- Clear anchors with examples
- Domain context provided
- Requires reasoning (catches lazy scoring)
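In practice it helps to assemble that prompt from parts rather than hand-writing it per experiment. A sketch whose template mirrors the prompt above - the function and argument names are placeholders:

```python
JUDGE_TEMPLATE = """You are evaluating a response from a {domain} assistant.

Context:
{context}

Evaluate the response on {dimension} only:
{anchors}

Response to evaluate:
{response}

Score (1-5):
Reasoning:"""

def build_judge_prompt(domain: str, context: str, dimension: str,
                       anchors: str, response: str) -> str:
    """Render a single-dimension, anchored judge prompt."""
    return JUDGE_TEMPLATE.format(domain=domain, context=context,
                                 dimension=dimension, anchors=anchors,
                                 response=response)
```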
Tuning for False Positives vs False Negatives
Your eval prompt implicitly sets a threshold. You can tune it to be strict or lenient - and the right choice depends on what you're optimizing for.
| Tuning direction | What happens | Prompt language |
|---|---|---|
| More false positives (strict) | Flags good responses as bad | "If there's ANY doubt, mark incorrect" |
| More false negatives (lenient) | Lets bad responses through | "Only mark incorrect if clearly, obviously wrong" |
When to be strict (prefer false positives):
| Domain | Why |
|---|---|
| Safety / compliance | Missing harmful content is worse than over-flagging |
| Medical / financial accuracy | Wrong info has real consequences |
| Legal / regulatory | Conservative is safer |
When you're strict, you accept that some good responses get flagged. That's fine - you'd rather manually review false alarms than miss real problems.
When to be lenient (tolerate false negatives):
| Domain | Why |
|---|---|
| Style / tone | Subjective, over-flagging kills creativity |
| Helpfulness | Don't penalize unconventional but correct answers |
| Exploratory responses | Let the model take risks |
When you're lenient, you accept that some bad responses slip through. That's fine - you'd rather not suppress good responses in the name of catching every edge case.
The key insight: There's no "correct" threshold. It depends on the cost of each error type:
Cost of false positive = good response marked bad = user sees rejection, or you waste time reviewing
Cost of false negative = bad response marked good = user sees bad output, trust erodes
Tune your prompt language to reflect which cost you're optimizing against.
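Mechanically, that can be as small as swapping one instruction in the judge prompt. A sketch - the suffix texts come from the table above; the helper is hypothetical:

```python
STRICT_SUFFIX = "If there's ANY doubt, mark the response incorrect."
LENIENT_SUFFIX = "Only mark the response incorrect if it is clearly, obviously wrong."

def with_threshold(judge_prompt: str, prefer_false_positives: bool) -> str:
    """Strict for safety/medical/legal work; lenient for style and exploration."""
    suffix = STRICT_SUFFIX if prefer_false_positives else LENIENT_SUFFIX
    return f"{judge_prompt}\n\n{suffix}"
```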
Getting the Context Right (For Evaluation)
The judge needs the same context the model had - plus ground truth.
What the Judge Needs
| Context | Why |
|---|---|
| Original user input | What was the task? |
| System prompt | What were the constraints? |
| Retrieved/injected context | What information was available? |
| Ground truth (if available) | What's the right answer? |
| Domain rubric | What does "good" mean here? |
Common Context Mistakes
| Mistake | Result |
|---|---|
| Judge doesn't see system prompt | Penalizes responses that followed instructions |
| Judge doesn't see retrieved context | Can't assess context utilization |
| No ground truth for factual claims | Judge hallucinates correctness |
| Generic rubric for domain task | Scores don't reflect domain quality |
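A sketch of the context bundle the judge should see - everything the production model saw, plus ground truth and the domain rubric. The field names are assumptions about how you store these:

```python
def build_judge_context(user_input: str, system_prompt: str,
                        retrieved_context: str, rubric: str,
                        ground_truth: str | None = None) -> str:
    parts = [
        f"Original user input:\n{user_input}",
        f"System prompt the assistant was given:\n{system_prompt}",
        f"Context available to the assistant:\n{retrieved_context}",
        f"Domain rubric (what 'good' means here):\n{rubric}",
    ]
    if ground_truth:
        parts.append(f"Ground truth:\n{ground_truth}")
    return "\n\n".join(parts)
```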
The Evaluation Stack
Putting it together:
┌─────────────────────────────────────────────────┐
│ What to Evaluate │
├─────────────────────────────────────────────────┤
│ Turn-level: individual response quality │
│ Session-level: conversation success │
│ Dimension-level: specific quality aspects │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ How to Evaluate │
├─────────────────────────────────────────────────┤
│ LLM-as-judge: style, coherence, relative rank │
│ Programmatic: format, constraints, keywords │
│ Human expert: domain accuracy, subtle quality │
│ User signal: completion, retention, feedback │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Evaluation Quality │
├─────────────────────────────────────────────────┤
│ Right prompt: specific, anchored, reasoned │
│ Right context: full info, ground truth, rubric │
│ Right units: binary/scaled/decomposed │
└─────────────────────────────────────────────────┘
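One way to operationalize the stack is a small registry that routes each dimension to the evaluator type suited to it. A sketch - the registry entries are illustrative:

```python
EVALUATOR_FOR = {
    "format_constraints": "programmatic",   # format, constraints, keywords
    "style_and_tone":     "llm_judge",      # cross-family judge
    "coherence":          "llm_judge",
    "domain_accuracy":    "human_expert",   # needs ground truth / expertise
    "task_completion":    "user_signal",    # session-level outcome
}

def evaluators_needed(dimensions: list[str]) -> set[str]:
    """Which evaluator types a given eval run has to provision."""
    return {EVALUATOR_FOR[d] for d in dimensions if d in EVALUATOR_FOR}
```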
What We Learned
Going from 60% to 90% wasn't about finding a better model. It was about:
- Recognizing that eval is the same problem - prompt and context
- Knowing when LLM-as-judge works - and when it doesn't
- Choosing the right units - binary, scaled, or decomposed
- Evaluating sessions, not just turns - because users have goals
- Treating eval prompts as rubrics - specific, anchored, domain-aware
The 30% gap was hiding in vague prompts, missing context, and aggregate scores that hid failure modes.
Built at Anuvaya. We're building AI in India - if this resonates, we'd love to hear from you.