# Inside Anuvaya

Technical notes, engineering writeups, and design writing from Anuvaya Labs.

## Multi-Agent Conversational AI

URL: https://inside.anuvaya.com/notes/multi-agent-orchestration
Author: Nitesh Kumar Niranjan
Published: 2026-06-03


Most AI products are a search box that learned to type. You send a message, you wait, a wall of text appears. That's a transaction. People feel the difference from a conversation in the first ten seconds.

Watch someone talk to an agent they actually trust. They interrupt. They split one thought across five messages. They ask a question, then add the context that mattered a beat later. They go quiet for three minutes and come back mid-sentence. In our product, a third of all user messages arrive before the agent has finished answering the one before. People don't wait their turn.

The tail is what breaks naive systems. Most people fire two or three messages in a row, plenty fire five or more, and one fired thirty-six before the agent could get a word in. Request-response holds none of it. We published the [principles](/notes/stateful-agent-orchestration) a while back. This is what survived contact with real users, across tens of thousands of conversations and sessions that run a full hour.

## Talking and thinking are different jobs

The useful work is slow. A real answer can mean several tool calls, several model turns, cross-referencing everything you know about someone. Done inside the conversation, that's a minute of the user watching a typing indicator. Nobody waits a minute.

So the part that talks and the part that thinks are different processes. One stays present: it acknowledges, handles the back-and-forth, and never blocks. The other does the slow work beside it and hands the result over when it's ready. The user feels a fast, attentive conversation with a depth that arrives on its own.

They don't talk to each other directly. They share a [Terra](/notes/terra) Kernel, a small in-session document store: the analysis process writes its result into a slot, the conversation process subscribes and picks it up on its next turn. No messages passed between agents, no choreography to fall out of sync, just a shared document with pub/sub.

```mermaid
flowchart TD
  U([user]) --> C
  subgraph S["Terra Session · supervisor"]
    C["Conversation agent"]:::accent
    A["Analysis agent"]
    K["Kernel · shared store"]
  end
  C -.->|"triggers"| A
  A -->|"writes result"| K
  K -->|"subscribe"| C
  classDef accent fill:#eb6623,stroke:#eb6623,color:#ffffff
```

That was a deliberate choice. Two agents coordinating by sending each other messages is a distributed-systems problem you do not want inside a live conversation: ordering, retries, partial failure, all of it in the hot path. A shared store collapses that into a single question the conversation process can answer instantly: what's in the slot right now.

The store is small and strict, and the strictness is the point. Reads never block: an agent asks for a slot and gets the current document or nothing, never a wait. A writer locks the slot, does its work, and commits in one atomic step, and because the lock is tied to the process that took it, a crash mid-write releases it on the way down. No deadlock, no half-written state, no cleanup code. The whole session sits under one supervisor: kill the store and every agent restarts with it, kill an agent and the store keeps running while that agent comes back. None of that is code we wrote. It's what running agents as real processes buys you, and most of why we do.

How a slot reads back is the elegant part. The kernel wraps its content with the slot's state, locked or idle, and the time it was last written. So when the conversation agent reads the analysis slot, the model can see for itself whether the work is still in flight or already settled. Coordination that usually hides under the runtime becomes something the model reasons about directly.
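To make the slot semantics concrete, here is a minimal sketch in plain OTP. It is not Terra's Kernel: the module name `SlotStore`, the function names, and the envelope shape are all assumptions made for illustration. It shows the two properties described above: a read returns the content wrapped with its state and last-write time, and a lock held by a crashed writer is released through a process monitor.

```elixir
defmodule SlotStore do
  # Hypothetical stand-in for the Kernel; names and return shapes are assumptions.
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Never waits on a writer: returns the envelope, or nil for an unknown slot.
  def read(store, slot), do: GenServer.call(store, {:read, slot})

  # The calling process becomes the lock owner.
  def lock(store, slot), do: GenServer.call(store, {:lock, slot, self()})

  # Swap in the new content and release the lock in a single step.
  def commit(store, slot, content), do: GenServer.call(store, {:commit, slot, content, self()})

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_call({:read, slot}, _from, slots) do
    envelope =
      case Map.get(slots, slot) do
        nil ->
          nil

        entry ->
          state = if entry.owner, do: :locked, else: :idle
          %{content: entry.content, state: state, written_at: entry.written_at}
      end

    {:reply, envelope, slots}
  end

  def handle_call({:lock, slot, pid}, _from, slots) do
    case Map.get(slots, slot) do
      %{owner: owner} when is_pid(owner) ->
        {:reply, {:error, :locked}, slots}

      existing ->
        base = existing || %{content: nil, written_at: nil}
        entry = Map.merge(base, %{owner: pid, ref: Process.monitor(pid)})
        {:reply, :ok, Map.put(slots, slot, entry)}
    end
  end

  def handle_call({:commit, slot, content, pid}, _from, slots) do
    case Map.get(slots, slot) do
      %{owner: ^pid, ref: ref} ->
        Process.demonitor(ref, [:flush])
        entry = %{content: content, written_at: DateTime.utc_now(), owner: nil, ref: nil}
        {:reply, :ok, Map.put(slots, slot, entry)}

      _ ->
        {:reply, {:error, :not_owner}, slots}
    end
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, slots) do
    # The writer died mid-write: drop its lock, keep the last committed content.
    released =
      Map.new(slots, fn
        {slot, %{ref: ^ref} = entry} -> {slot, %{entry | owner: nil, ref: nil}}
        other -> other
      end)

    {:noreply, released}
  end
end
```

A reader that calls `SlotStore.read(store, :analysis)` while the writer is working sees `state: :locked` alongside the previous content, which is the signal a model could use to tell "still in flight" from "settled".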

The split pays off again on interruption. Most of what people send mid-thought is more context, not a new question. If the slow work lived inside the conversation, every "oh, and also" would cancel it and start over. Two processes means the thinking ignores the chatter and finishes.

## One invariant carries the whole system

Three things have to stay identical, always: what the user saw, what you persisted, and what the model sees on the next turn.

Break it in either direction and the illusion collapses. Save more than the user saw, and the agent references a sentence they never read. Save less, and it forgets something they're certain they said. Add streaming, interruptions, and that background work all landing at once, and keeping the three in lockstep becomes most of the engineering. It sounds trivial written down. It is the hardest thing in the system, and most of what makes an interruption feel seamless instead of broken is holding this line under pressure.
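One way to picture the invariant is a single log that is the only source for both persistence and the next model context, with an interrupted reply truncated to what was actually delivered. The sketch below is a simplification under that assumption; `Transcript` and its functions are hypothetical names, not our production code.

```elixir
defmodule Transcript do
  # Hypothetical sketch: one log feeds both persistence and the next model context.
  defstruct entries: []

  def new, do: %__MODULE__{}

  def user(%__MODULE__{} = t, text), do: append(t, :user, text)

  # `chunks` is everything the model generated; `delivered` is how many chunks
  # reached the user before the stream finished or was interrupted.
  def assistant(%__MODULE__{} = t, chunks, delivered) do
    seen = chunks |> Enum.take(delivered) |> Enum.join()
    append(t, :assistant, seen)
  end

  # What gets persisted and what the model sees next are the same list.
  def to_persist(%__MODULE__{entries: entries}), do: Enum.reverse(entries)
  def to_context(%__MODULE__{} = t), do: to_persist(t)

  # An assistant turn the user never saw any part of leaves no trace.
  defp append(t, _role, ""), do: t
  defp append(t, role, text), do: %{t | entries: [%{role: role, text: text} | t.entries]}
end
```

Because `to_context/1` is defined in terms of `to_persist/1`, the two cannot drift apart; the hard part in a live system is computing `delivered` correctly while streaming, interruptions, and background results overlap.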

## Impatient users get answered, not dropped

When a third of messages arrive before the last answer is done, the naive design (cancel the response, start over, cancel again) burns a minute producing nothing. Five messages, five false starts, zero replies.

So a burst isn't five interruptions, it's one input. Let the messages land, read them together, answer all of it at once. The person who fired off five quick lines gets a single reply that addresses every one, which is what they wanted when they sent them.
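As a rough illustration of treating a burst as one input, the sketch below buffers incoming messages and hands them over together once the user has been quiet for a short window. The quiet-window trigger, the `quiet_ms` default, the `BurstBuffer` name, and the `{:burst, messages}` shape are assumptions for the example; a real system might instead flush when the agent finishes its current turn.

```elixir
defmodule BurstBuffer do
  # Hypothetical sketch: coalesce rapid-fire messages into a single input.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def push(buffer, text), do: GenServer.cast(buffer, {:push, text})

  @impl true
  def init(opts) do
    state = %{
      sink: Keyword.fetch!(opts, :sink),
      quiet_ms: Keyword.get(opts, :quiet_ms, 800),
      pending: []
    }

    {:ok, state}
  end

  @impl true
  def handle_cast({:push, text}, state) do
    # Returning a timeout re-arms it on every message, so :timeout only
    # arrives after `quiet_ms` of silence.
    {:noreply, %{state | pending: [text | state.pending]}, state.quiet_ms}
  end

  @impl true
  def handle_info(:timeout, state) do
    # Deliver the whole burst, oldest first, as one input.
    send(state.sink, {:burst, Enum.reverse(state.pending)})
    {:noreply, %{state | pending: []}}
  end
end
```

The sink receives one `{:burst, [...]}` message per burst, so whatever answers it sees every line the user sent before it writes a word.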

## The patterns are easy to say and brutal to ship

You can write the whole approach in three lines. Split talking from thinking. Hold one invariant. Batch the impatient. The sentences are simple.

Everything underneath them is not. What counts as a complete thought worth delivering. When silence means "keep going" and when it means "stop." How to recover when an interruption, a background result, and a five-message burst all land in the same second. Those thousand judgments are where the years went, and they don't fit in an article or a framework. That part is the product.

## Watching it run

A system this concurrent is only as trustworthy as your ability to see inside it. Terra emits a telemetry event for everything an agent does: state changes, invocations, tool calls, kernel reads and writes. We batch those events, stream them over NATS, and land them in ClickHouse, where a conversation from three months ago can be pulled apart event by event. When something looks wrong in production, you don't guess. You query.

<ObservabilityPipeline client:visible />

## The full stack

[Rune](/notes/rune) is the memory: an append-only graph of how a user's life evolves across sessions. [Terra](/notes/terra) is the runtime: stateful agent processes, context aging, shared state. This is the layer that makes them feel like a person: instant, interruptible, and patient with the way people actually type.

Three problems, three layers. Rune remembers. Terra runs. This orchestrates. Together they're a conversation worth coming back to a hundred times.

---

_Built at Anuvaya. We're building AI in India._

_Third of three. Previously: [Rune, long-term memory for AI agents](/notes/rune) and [Terra, an agent framework extracted from production](/notes/terra)._

## Terra: An Agent Framework Extracted from Production

URL: https://inside.anuvaya.com/notes/terra
Author: Nitesh Kumar Niranjan
Published: 2026-06-03


On December 3, 2024, we opened a private beta and one of our agents had its first real conversation. It's been in production ever since: more than a million tool calls across tens of thousands of people.

LangChain and AutoGPT made agents look easy. Production made them look like toys: the moment you need an agent to survive a crash mid-response, age tool results out of context, or coordinate with another agent in real time, the five-line demo is gone. The JS/TS ecosystem is home for us, but for this problem Elixir was the better fit. Agents are stateful, long-running processes that need to handle concurrent inputs, recover from crashes, and coordinate with each other. That's what OTP was built for. The tradeoff: Elixir had no AI agent tooling, so we built our own.

Terra is what we extracted from eighteen months of production. After enough iterations on different agent architectures, patterns emerged that were reusable: a lifecycle model, a context aging pipeline, a multi-agent coordination layer. We pulled them into a framework so we could spin up new agent structures quickly instead of rewiring the same plumbing each time.

We're open-sourcing it: [github.com/anuvaya/terra](https://github.com/anuvaya/terra).

<ConversationGrowth client:visible />

## Agents are processes

A conversational AI agent is a stateful, long-running process. It starts, transitions through states (greeting, active, idle), handles concurrent inputs (user messages arriving while the LLM is mid-response), needs to survive crashes without losing conversation state, and eventually terminates. Two agents might need to coordinate through shared state in real time.

Thousands of our sessions run long, some a full hour of back-and-forth: exactly what a process model is for.

`gen_statem` is OTP's state machine behaviour. Together with the rest of OTP, it gives us explicit state transitions, process supervision (automatic restart on crash), and transparent distribution (agents addressable across nodes). We didn't pick it to be contrarian. We picked it because the problem description and the tool description are the same sentence.

```mermaid
flowchart TD
  A([send_input]) --> B[handle_input]
  B -->|invoke| C[context/2]:::accent
  C --> D[Provider.stream]
  D --> E[handle_stream_event]
  E -->|tool_use · eager| F[ToolRegistry.execute]
  E -->|message_stop| G[handle_response/2]
  G --> H{"consumer decides"}
  H -->|transition / re-invoke| B
  H -->|timeout / stop| X([terminate])
  classDef accent fill:#eb6623,stroke:#eb6623,color:#ffffff
```

You implement five callbacks: `init`, `context`, `handle_input`, `handle_stream_event`, `handle_response`. Terra handles the runtime: streaming, tool execution, state transitions, crash recovery. The framework disappears. You think about your agent's logic, not about plumbing.

## Tools

An agent's tools are JSON Schema, and writing JSON Schema by hand is misery. Terra gives you a pipeline builder instead:

```elixir
tool("get_weather")
|> desc("Get current weather for a city")
|> param(:city, :string, required: true, desc: "City name")
```

That produces the nested schema the provider expects. Objects, enums, and arrays read the same way, top to bottom, no braces to balance.
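To show what "the nested schema" means, here is a hypothetical re-implementation of a builder with the same surface. It is not Terra's code: `ToolSketch` and `to_schema/1` are invented for the example, and the output follows the shape of Anthropic's tool definitions (`name`, `description`, `input_schema`), which is only one of the formats a provider might expect.

```elixir
defmodule ToolSketch do
  # Hypothetical builder for illustration; Terra's real implementation may differ.
  def tool(name), do: %{name: name, description: nil, params: []}

  def desc(tool, text), do: %{tool | description: text}

  def param(tool, name, type, opts \\ []) do
    %{tool | params: tool.params ++ [{name, type, opts}]}
  end

  # Render the accumulated definition as a JSON-Schema-style map.
  def to_schema(tool) do
    properties =
      Map.new(tool.params, fn {name, type, opts} ->
        prop = %{"type" => Atom.to_string(type)}
        prop = if opts[:desc], do: Map.put(prop, "description", opts[:desc]), else: prop
        {Atom.to_string(name), prop}
      end)

    required =
      for {name, _type, opts} <- tool.params, opts[:required], do: Atom.to_string(name)

    %{
      "name" => tool.name,
      "description" => tool.description,
      "input_schema" => %{
        "type" => "object",
        "properties" => properties,
        "required" => required
      }
    }
  end
end

schema =
  ToolSketch.tool("get_weather")
  |> ToolSketch.desc("Get current weather for a city")
  |> ToolSketch.param(:city, :string, required: true, desc: "City name")
  |> ToolSketch.to_schema()
```

Three piped lines expand into a three-level map, which is the bookkeeping the builder is there to hide.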

Tools live in a registry that both declares them and runs them:

```elixir
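defmodule WeatherTools do
  use Terra.ToolRegistry
  import Terra.Tool

  # Declares the tools on offer; receives the agent's state (unused here).
  def tools(_state), do: [tool("get_weather") |> param(:city, :string, required: true)]

  # Runs a call by tool name; the temperature is a hard-coded stub for illustration.
  def execute("get_weather", %{city: city}, state) do
    {:ok, %{temp: 72, city: city}, state}
  end
end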
```

`tools/1` receives the agent's state, so the toolset can shift with the conversation: a tool can appear, disappear, or change shape based on where the agent is. `execute/3` runs the call. You list the registries in the agent's `init`, and Terra handles serialization, dispatch, and results.

Execution is eager. The moment a tool call finishes streaming, Terra runs it, so the result is in hand by the time the model's turn ends. The agent never makes a second round-trip to fetch what the model just asked for.

## Context aging

A tool result from eight turns ago is noise, not signal. Most agent frameworks leave this to the developer. Terra treats it as a first-class concern.

Each tool defines its own aging configuration:

```elixir
tool("get_weather")
|> aging(expiry: 4, pruning: 8)
|> result_template("Weather in <%= @input[:city] %>: <%= @result %>")
|> expiry_message("Weather for <%= @input[:city] %> expired. Re-fetch if needed.")
```

As turn distance grows, each tool result ages through three states:

```mermaid
flowchart LR
  A["active · turns 0–3 · full result"]:::accent --> E["expired · turns 4–7 · short summary"] --> P["pruned · turns 8+ · removed"]
  classDef accent fill:#eb6623,stroke:#eb6623,color:#ffffff
```

Before every LLM invocation, `age_tools` walks the conversation history in reverse. Each tool result transitions through active (full result via EEx template), expired (short summary), and pruned (removed entirely). The corresponding `tool_use` blocks in assistant messages are cleaned up too, so the model never sees a tool call without its result.
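The state a result lands in depends only on its turn distance and the tool's two thresholds. The sketch below shows that rule in isolation; `AgingSketch`, the record shape, and the `summary` field are assumptions for the example, and it leaves out the EEx rendering and the `tool_use` cleanup that the real pipeline performs.

```elixir
defmodule AgingSketch do
  # Hypothetical sketch of the aging rule; not Terra's `age_tools`.

  # Classify a result by how many turns ago it was produced.
  def state(distance, opts) do
    expiry = Keyword.fetch!(opts, :expiry)
    pruning = Keyword.fetch!(opts, :pruning)

    cond do
      distance >= pruning -> :pruned
      distance >= expiry -> :expired
      true -> :active
    end
  end

  # `results` is a list of %{tool: name, turn: integer, result: text, summary: text}.
  # Active results keep the full text, expired ones keep the summary,
  # pruned ones are dropped.
  def age(results, current_turn, opts) do
    Enum.flat_map(results, fn r ->
      case state(current_turn - r.turn, opts) do
        :active -> [%{tool: r.tool, text: r.result}]
        :expired -> [%{tool: r.tool, text: r.summary}]
        :pruned -> []
      end
    end)
  end
end
```

With `expiry: 4, pruning: 8`, distances 0 through 3 are active, 4 through 7 expired, and 8 or more pruned, matching the diagram above.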

Thinking blocks get the same treatment. Extended thinking output is large and only useful for recent reasoning. `prune_thinking(1)` keeps only the most recent turn's thinking, stripping the rest. Context stays focused.

## Multi-agent sessions

```mermaid
flowchart TD
  S(["Terra.Session · rest_for_one"]) --> K[Kernel]:::accent
  S --> A["Agent A"]
  S --> B["Agent B"]
  A <-->|"read / write / subscribe"| K
  B <-->|"read / write / subscribe"| K
  classDef accent fill:#eb6623,stroke:#eb6623,color:#ffffff
```

A Session is a supervisor that starts a Kernel (shared document store) and one or more agents. The Kernel holds named slots that any agent can read, write, or subscribe to. When one agent writes to a slot, every subscriber is notified.

This is how we wire a conversation agent and an analysis agent together: the analysis agent writes its results to a slot, the conversation agent subscribes and picks them up on its next turn. No state shared outside the Kernel. No message-passing choreography. Just a document store with pub/sub.

The pattern generalizes. A critic reviews the conversation agent's drafts before they ship. A coach feeds tone guidance back in real time. A reflector keeps a running digest of the session. Anything that has to share state with the live conversation, turn by turn, fits the same shape.

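The supervisor is `rest_for_one`, with the Kernel started first: if the Kernel crashes, all agents restart. If an agent crashes, the Kernel keeps running and keeps its slots; only that agent, plus any child started after it, restarts. OTP handles this. We didn't have to build it.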
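The restart behaviour can be reproduced with nothing but the standard library. In the sketch below, plain `Agent` processes stand in for the Kernel and the two agents; the module name and child ids are made up for the example, and this is not how `Terra.Session` is implemented.

```elixir
defmodule SessionSketch do
  # Hypothetical stand-in for a session supervisor, using plain Agents as children.
  use Supervisor

  def start_link(opts \\ []), do: Supervisor.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    children = [
      # Start order matters under :rest_for_one, so the shared store goes first.
      Supervisor.child_spec({Agent, fn -> %{} end}, id: :kernel),
      Supervisor.child_spec({Agent, fn -> :conversation end}, id: :agent_a),
      Supervisor.child_spec({Agent, fn -> :analysis end}, id: :agent_b)
    ]

    # When a child dies, it and every child started after it are restarted.
    Supervisor.init(children, strategy: :rest_for_one)
  end

  # Map of child id to current pid, handy for watching restarts.
  def pids(sup) do
    Map.new(Supervisor.which_children(sup), fn {id, pid, _type, _mods} -> {id, pid} end)
  end
end
```

Killing `:kernel` gives all three children new pids. Killing `:agent_b`, the last child, replaces only that one and leaves the store and its contents alone.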

## Bring your own provider

Terra ships with Anthropic, OpenAI, and Google. But the provider boundary is one function: `stream/2`. Implement it and Terra drives any model, hosted or local, with no other code changes. Switching providers is a config change, not a rewrite.
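A one-function boundary might look like the behaviour below. Only the name and arity of `stream/2` come from the text; the argument types, the `{:text_delta, text}` event, and the module names are assumptions for the sketch, and Terra's actual contract may differ.

```elixir
defmodule ProviderSketch do
  # Hypothetical contract: take the messages and options, return a lazy
  # stream of normalized events.
  @callback stream(messages :: [map()], opts :: keyword()) :: Enumerable.t()
end

defmodule EchoProvider do
  # A toy "model" that streams the last message back word by word.
  @behaviour ProviderSketch

  @impl true
  def stream(messages, _opts) do
    last = List.last(messages)

    last.content
    |> String.split(" ")
    |> Stream.map(&{:text_delta, &1 <> " "})
    |> Stream.concat([:message_stop])
  end
end

defmodule Driver do
  # The caller depends only on the event shapes, never on the provider module.
  def collect(provider, messages, opts \\ []) do
    provider.stream(messages, opts)
    |> Enum.reduce("", fn
      {:text_delta, text}, acc -> acc <> text
      :message_stop, acc -> acc
    end)
  end
end
```

Swapping `EchoProvider` for another module that implements the same callback changes nothing in `Driver`, which is the property the provider boundary is meant to protect.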

The messy parts stay contained per provider: SSE parsing, error handling, retries, response normalization. The agent layer never sees a provider-specific type.

## What we kept out

Terra has opinions about what's not its job:

- **Persistence.** Terra manages in-memory conversation state. How you persist it is your problem. We use Postgres. You might use something else.
- **Transport.** Terra doesn't know about HTTP, WebSockets, or NATS. It's a process you send messages to and receive callbacks from. Your transport layer wraps it.
- **Prompt engineering.** The `context/2` callback builds the window. What goes in it is your decision. Terra provides the pipeline (aging, documents, system prompt), not the content.

The framework should be less interesting than what you build with it.

---

_Built at Anuvaya. We're open-sourcing Terra because the infrastructure layer shouldn't be proprietary. The agent logic on top of it should be._

_Second of three. Previously: [Rune, long-term memory for AI agents](/notes/rune). Next: [multi-agent conversational AI](/notes/multi-agent-orchestration)._

## Rune - Long-Term Memory for AI Agents

URL: https://inside.anuvaya.com/notes/rune
Author: Nitesh Kumar Niranjan
Published: 2026-06-02


The simplest memory an LLM has is its context window. We [solved that layer](https://inside.anuvaya.com/notes/realtime-context-engine): a context engine that keeps it fresh and consistent within a session.

But what does the agent know about you on session 98?

The industry answer is extraction. Pull facts from conversations ("user is planning a trip," "user left a job under bad circumstances"), turn them into memory units, and retrieve them next session. As the category matured, the stack got better: rerankers, entity linking, profiles, graph relationships, version chains, automatic forgetting. Useful improvements. But the primitive stayed the same: extract a memory now, decide how it should evolve, and trust retrieval to reconstruct the right story later.

```mermaid
flowchart TD
  C["conversation"] --> E["extract facts"] --> M["memory units"] --> S["store / index"]
  S --> R1["rerank"]
  S --> R2["link / version"]
  S --> R3["profile / expire"]
  R1 --> RT["retrieve"]
  R2 --> RT
  R3 --> RT
  RT --> I["inject into context"]
```

Some systems compress aggressively. Some version. Some forget. Some build graphs around the extracted facts. All of them are trying to solve the same problem: make remembered information manageable at read time.

These systems solve real problems well. If your agent mainly needs preference recall, profile building, or cheap personalization, this stack is sensible.

Our objection is narrower. For long-lived advisory relationships, continuity lives in the path between facts.

And compression is exactly where that path starts disappearing. Once a history is flattened into summaries, merged memories, or a current profile, the connective tissue is gone.

If you've used Claude Code or Codex, you've watched one version of that tradeoff in real time. The conversation compacts. Earlier work flattens into a summary. Details vanish. Fine for a coding session. For a user who comes back hundreds of times over years, each compression pass bets that what it discards won't matter later.

We made that bet for a year. Five versions. Lost it often enough to stop.

## The wrong question

Our first attempts tried to decide what mattered at write time. But importance isn't a property of information in isolation. It's a property of how information connects to everything else. A user mentions leaving a job in session 3. Twelve sessions later, they're hesitant about a new opportunity. You understand why without being told. Compressed memory misses that connection. Write-time selection has false negatives, and in conversation, false negatives compound.

Later versions tried the inverse: keep everything, discard what seems irrelevant. That broke across users. The discard logic that's right for one profile corrupts memory for another.

Same destination every time: retrieval had to become precise enough that keeping everything remained viable.

That is the boundary of our claim. Factual memory is the wrong atom for the kind of continuity we need.

## Structure, not compression

Rune is an append-only knowledge graph. Five node types: entities (people, places, things), states (life domains like career, marriage, health), relationships (how two entities connect), events (something that happened), and assertions (exact words worth preserving).

The topology does the heavy lifting. States and relationships are hubs. Events are joints between them. Assertions are leaves: nothing ever hangs off a leaf.

This is the key difference. Rune stores states and transitions, with causality as a first-class property.

<RuneGraph client:visible />

Walk any chain and you get a coherent causal story: a new job sits inside a sequence that includes the previous role, the exit, and whatever else shaped the transition. Compare this to a star topology:

<StarVsChain client:visible />

Stars are easy to build. They're also flat. You can't traverse them and reconstruct a narrative. Chains can.

A fact graph tells you what is true, what was updated, and what extends what. A causal graph tells you what changed, what it changed from, what else it touched, and which exact words anchored the transition. For us, that distinction is the whole system.

Preferences live here too. How someone wants to be engaged (direct answers, no sugarcoating) is stored as assertions on the agent-user relationship chain. The same structure that tracks "got promoted" tracks "prefers bluntness."

## LLM reads, deterministic code writes

The LLM is powerful but unreliable. So we split the pipeline.

```mermaid
flowchart TD
  C["conversation"] --> R["reader agent · LLM"]:::accent
  R -->|"queries graph · reasons about what's new"| P["extraction plan"]
  TC["typed contract"] -.-> P
  P --> B["builder · deterministic"]
  B -->|"validates · rolls back on failure"| G["graph"]
  G -->|"error → feed back for retry"| R
  classDef accent fill:#eb6623,stroke:#eb6623,color:#ffffff
```

A reader agent examines the conversation and proposes a structured extraction plan. A deterministic builder executes it against hard constraints. The LLM never touches the graph.

The constraint system is the fence. A static edge-validity table defines which node types can connect: assertions can never source edges, entities go through relationships, states need events as joints. Semantic rules add what topology alone can't enforce: one root state per topic, no duplicate relationships between the same pair. When a single event affects multiple domains, shadow nodes keep the DAG intact. If any constraint fails mid-operation, everything rolls back. No half-written graphs.

The plan is the contract between the two sides. The LLM proposes operations using identifiers it invented. The builder maps those to sequence numbers. Bad references get rejected; the error feeds back for self-correction. The LLM proposes. Deterministic code enforces. The graph stays clean.
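The propose-then-enforce split can be sketched in a few dozen lines. Everything here is a simplification under stated assumptions: the plan format, the `BuilderSketch` name, and above all the contents of the edge-validity table are invented for the example and are not Rune's actual rules.

```elixir
defmodule BuilderSketch do
  # Hypothetical deterministic builder; the edge table below is illustrative only.
  @node_types [:entity, :state, :relationship, :event, :assertion]
  @valid_edges [
    {:state, :event},
    {:event, :state},
    {:entity, :relationship},
    {:relationship, :event},
    {:event, :relationship},
    {:event, :assertion}
  ]

  def new, do: %{next_seq: 1, nodes: %{}, edges: []}

  # Apply every operation or none: on failure the original graph comes back.
  def apply_plan(graph, plan) do
    result =
      Enum.reduce_while(plan, {graph, %{}}, fn op, {g, ids} ->
        case step(op, g, ids) do
          {:ok, g2, ids2} -> {:cont, {g2, ids2}}
          {:error, reason} -> {:halt, {:error, reason}}
        end
      end)

    case result do
      {:error, reason} -> {:error, reason, graph}
      {built, _ids} -> {:ok, built}
    end
  end

  # New node: map the identifier the LLM invented to the next sequence number.
  defp step({:node, temp_id, type}, g, ids) when type in @node_types do
    seq = g.next_seq
    g = %{g | next_seq: seq + 1, nodes: Map.put(g.nodes, seq, type)}
    {:ok, g, Map.put(ids, temp_id, seq)}
  end

  # New edge: both ends must resolve and the type pair must be allowed.
  defp step({:edge, from, to}, g, ids) do
    with {:ok, f} <- resolve(from, ids, g),
         {:ok, t} <- resolve(to, ids, g),
         :ok <- check_edge(g.nodes[f], g.nodes[t]) do
      {:ok, %{g | edges: [{f, t} | g.edges]}, ids}
    end
  end

  defp step(op, _g, _ids), do: {:error, {:bad_op, op}}

  # Integers refer to existing nodes; anything else is a plan-local identifier.
  defp resolve(ref, _ids, g) when is_integer(ref) do
    if Map.has_key?(g.nodes, ref), do: {:ok, ref}, else: {:error, {:unknown_ref, ref}}
  end

  defp resolve(ref, ids, _g) do
    case Map.fetch(ids, ref) do
      {:ok, seq} -> {:ok, seq}
      :error -> {:error, {:unknown_ref, ref}}
    end
  end

  defp check_edge(from_type, to_type) do
    if {from_type, to_type} in @valid_edges,
      do: :ok,
      else: {:error, {:invalid_edge, from_type, to_type}}
  end
end
```

The `{:error, reason, graph}` return carries the untouched graph plus a reason, and that reason is what gets fed back to the reader agent for a retry.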

Any AI system where an LLM mutates structured state benefits from this separation.

## Append-only

You never update or delete nodes. Events evolve states. History is always there. The LLM can't corrupt what exists, only add to it.

This handles a tension most memory systems ignore: discovery order versus causal order. Information arrives piecemeal, out of sequence, across sessions. Writes happen when we learn something. Reads reconstruct what actually happened and when. A user reveals in session twelve that they hadn't spoken to their partner for six months after a career setback. Session three stays intact. New paths connect nodes that already exist.

Current state is read from the graph. The past stays exactly where it happened.

Retrieval is graph traversal, not search. It narrows deterministically, entity then domain, then walks the relevant chain. Reads land in around 20 milliseconds. That speed is the whole reason we built a graph: the structure answers most questions outright, so we almost never fall back to vector search.

<ReadLatency client:visible />

## So what does this enable?

Session 98 carries the full structured history of sessions 1 through 97. Every relationship, event, preference: navigable. No nightly compression deciding what you're allowed to remember. In production our power users are already there: relationships that cross 100 sessions, the deepest past 850.

A human professional who has advised thousands of people carries pattern recognition no single client could produce. They've seen this before. They know what to ask next. Rune is built for the same arc. Each user's graph stays their own; the system's read on the domain compounds with every conversation it has ever had.

## How we measure this

Rune runs in production today, with over 147,000 graph records built from real conversations.

<GrowthCurve client:visible />

We don't chase fact-recall benchmarks. They reward remembering that a user likes coffee. Our problem is harder and differently shaped: can the system reconstruct why a career change three months ago connects to a relationship that started unraveling a year before that?

So we hold Rune to three questions. Did extraction build the right structure from the conversation? Can retrieval reconstruct the causal chain later? Does that hold as the sessions pile up?

We answer them by replay against real conversation data: build memory through session n-1, replay session n, measure whether what matters actually surfaces. The target is the messy, nonlinear, contradictory way people talk about their lives across months, not a synthetic dataset.

Rune is built for the moment where one missed session distorts the entire present.

The previous five versions failed that test in different ways. Rune cleared it.

## Your Eval Is Broken for the Same Reason Your LLM Is

URL: https://inside.anuvaya.com/notes/llm-evaluation
Author: Jashn Maloo
Published: 2025-12-22

> "If you can't measure it, you can't improve it"

We didn't build [RCE](/notes/realtime-context-engine) because we had a clever idea. We built it because we finally learned to measure what was actually broken.

For months, we knew something was wrong. Users told us. But we couldn't point at *what*. Once we learned to measure context utilization and reasoning fidelity, the problems became visible. And visible problems have solutions.

This is the story of learning what to measure. The interventions came after.

---

Every problem with LLMs boils down to two things:

1. **The right prompt**
2. **The right context**

This is true for your product. It's true for your agents. And it's true for evaluation itself.

We learned this the hard way - spending months on evaluation approaches that didn't work before realizing: the same principles that make LLMs work also make LLM evaluation work.

---

## The Core Insight

When your LLM gives a bad response, it's one of two problems:

| Problem | What It Looks Like |
|---------|-------------------|
| **Wrong prompt** | Model misunderstands the task, wrong format, ignores constraints |
| **Wrong context** | Model lacks information, uses stale data, hallucinates to fill gaps |

Evaluation is no different. When your evals give misleading results, it's the same two problems:

| Eval Problem | Root Cause |
|--------------|------------|
| Inconsistent scores | Prompt doesn't constrain the judge well enough |
| Scores don't match human judgment | Context missing - judge doesn't know what "good" means for your domain |
| High scores but users still complain | Evaluating the wrong thing - prompt asks the wrong question |

Fix the prompt. Fix the context. The rest follows.

---

## LLM as a Judge

Using an LLM to evaluate LLM outputs is powerful - but not universally applicable.

### When LLM-as-Judge Works

| Use Case | Why It Works |
|----------|--------------|
| **Style and tone** | Subjective but pattern-matchable |
| **Instruction following** | Clear criteria, binary-ish |
| **Coherence and fluency** | LLMs are trained on this |
| **Relative comparison** | "Which is better: A or B?" |
| **Reasoning chain validity** | Can trace logic steps |

### When LLM-as-Judge Fails

| Use Case | Why It Fails |
|----------|--------------|
| **Factual accuracy** | Judge hallucinates same as subject |
| **Domain expertise** | Judge doesn't know your domain better than the model being judged |
| **Numerical precision** | LLMs are unreliable at arithmetic, and that includes tallying the errors in a response |
| **Subtle correctness** | "Almost right" looks right to the judge |
| **Novel failure modes** | Judge has same blind spots |

### The Judge's Blind Spot

An LLM judge will confidently score a response as correct when:
- It *sounds* authoritative
- It matches the judge's training distribution
- The error requires domain expertise to catch

This is why domain-specific eval needs domain-specific ground truth - not another LLM.

### Use a Different Model Family

If your product uses GPT-4, don't use GPT-4 as the judge.

Same model family means:
- Same training data → same biases
- Same blind spots → same errors go unnoticed
- Same "sounds right" patterns → mutual agreement on wrong answers

When your production model makes a subtle error, a judge from the same family will likely think it's correct - they learned the same patterns of what "sounds authoritative."

| Setup | Problem |
|-------|---------|
| GPT-4 in prod, GPT-4 as judge | Shared blind spots, errors look correct to both |
| Claude in prod, Claude as judge | Same issue - correlated failures |
| GPT-4 in prod, Claude as judge | Different training → independent evaluation |

Different model families have different:
- Training data distributions
- Failure modes
- "Sounds right" heuristics

This independence is what makes cross-family evaluation valuable. The judge catches errors the production model is blind to - because they don't share the same blindness.

**Caveat**: This doesn't guarantee correctness. Both could be wrong in different ways. But it reduces correlated failures, which is the real enemy of evaluation.

---

## Choosing the Right Units

Not everything is a 1-10 scale. Not everything is pass/fail.

### Binary Metrics

Best for clear-cut criteria:

| Metric | Binary? | Why |
|--------|---------|-----|
| Did it use the user's name correctly? | Yes | Either right or wrong |
| Did it follow the format constraint? | Yes | JSON or not JSON |
| Did it refuse when it should have? | Yes | Refusal happened or didn't |
| Is the response "good"? | **No** | Too subjective for binary |

### Scaled Metrics

Best for quality gradients:

| Metric | Scale | Anchors |
|--------|-------|---------|
| Helpfulness | 1-5 | 1 = useless, 3 = acceptable, 5 = exceptional |
| Factual accuracy | 1-5 | 1 = wrong, 3 = mostly correct, 5 = fully accurate |
| Reasoning quality | 1-5 | 1 = incoherent, 3 = sound, 5 = insightful |

**Critical**: Define anchors. "4 out of 5" means nothing without calibration.

### Decomposed Metrics

When a single score hides too much:

```
Overall Response Quality: 3.75/5

Breakdown:
- Factual accuracy: 4/5 (one minor error)
- Relevance: 5/5 (directly addressed question)
- Completeness: 2/5 (missed key aspect)
- Tone: 4/5 (slightly formal for context)
```

The aggregate hides the completeness problem. Decomposition reveals it.
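A small sketch of the same idea in code: report the aggregate, but also surface any dimension that falls below a floor. The unweighted mean, the `floor` default of 3, and the `Decomposed` module name are assumptions for the example.

```elixir
defmodule Decomposed do
  # Hypothetical helper: an unweighted mean plus the dimensions below the floor.
  def summarize(scores, floor \\ 3) do
    mean = Enum.sum(Map.values(scores)) / map_size(scores)
    failing = for {dimension, score} <- scores, score < floor, do: dimension
    %{mean: mean, failing: Enum.sort(failing)}
  end
end

summary =
  Decomposed.summarize(%{factual_accuracy: 4, relevance: 5, completeness: 2, tone: 4})
```

Alerting on `summary.failing` rather than on `summary.mean` keeps a weak dimension from being averaged away by strong ones.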

---

## Evaluating Conversations

Single-turn eval is easy. Conversation eval is where it gets hard.

### The Session Problem

A conversation isn't just N independent turns. It has:
- **State accumulation** - context builds over turns
- **Goal progression** - user is trying to accomplish something
- **Recovery opportunities** - bad turn can be fixed later
- **Compounding errors** - bad turn can derail everything

### What to Measure at Session Level

| Metric | What It Captures |
|--------|------------------|
| **Task completion** | Did the user accomplish their goal? |
| **Turns to completion** | Efficiency - fewer is usually better |
| **Recovery rate** | When it went wrong, did it recover? |
| **Drop-off point** | Where do users abandon? |
| **Context coherence** | Did the agent maintain accurate state? |

### Turn-Level vs Session-Level

| Turn-Level | Session-Level |
|------------|---------------|
| "Was this response good?" | "Was this conversation successful?" |
| High scores possible with bad outcome | Captures actual user success |
| Easier to measure | Harder to define |
| Useful for debugging | Useful for product metrics |

You need both. Turn-level tells you *where* it broke. Session-level tells you *if* it broke.

---

## Getting the Prompt Right (For Evaluation)

The eval prompt *is* your rubric. A vague prompt gives inconsistent results.

### Bad Eval Prompt

```
Rate this response from 1-5 for quality.
```

Problems:
- What is "quality"?
- No anchors for scores
- No context about domain/task
- Different judges will interpret differently

### Better Eval Prompt

```
You are evaluating a response from a financial planning AI assistant.

Context:
- The user asked about retirement savings options
- The user's profile shows: [age, income, risk tolerance, current savings]
- The assistant should reference specific tax-advantaged accounts

Evaluate the response on FACTUAL ACCURACY only:
1 = Contains major factual errors about finance or the user's situation
2 = Contains minor factual errors
3 = Factually correct but generic (doesn't use user's specific profile)
4 = Factually correct and uses user's profile data
5 = Factually correct, uses profile data, and makes accurate inferences

Response to evaluate:
[response]

Score (1-5):
Reasoning:
```

This works because:
- Single dimension (factual accuracy)
- Clear anchors for every score
- Domain context provided
- Requires reasoning (catches lazy scoring)
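
One way to keep rubrics consistent is to build the prompt from structured parts and reject judge output that skips the reasoning. This is a minimal sketch: `build_eval_prompt` and `parse_verdict` are hypothetical helpers, and the parser assumes the judge follows the `Score` / `Reasoning` template above.

```python
import re
from typing import Optional

def build_eval_prompt(domain: str, context: list[str], dimension: str,
                      anchors: dict[int, str], response: str) -> str:
    """Assemble a single-dimension rubric prompt with one anchor per score."""
    lines = [f"You are evaluating a response from {domain}.", "", "Context:"]
    lines += [f"- {item}" for item in context]
    lines += ["", f"Evaluate the response on {dimension.upper()} only:"]
    lines += [f"{score} = {text}" for score, text in sorted(anchors.items())]
    lines += ["", "Response to evaluate:", response, "",
              f"Score ({min(anchors)}-{max(anchors)}):", "Reasoning:"]
    return "\n".join(lines)

def parse_verdict(text: str, lo: int = 1, hi: int = 5) -> Optional[tuple[int, str]]:
    """Return (score, reasoning), or None if the judge output is unusable."""
    score = re.search(r"Score[^:\n]*:\s*(\d+)", text)
    reasoning = re.search(r"Reasoning:\s*(\S.*)", text, re.DOTALL)
    if not score or not reasoning:
        return None  # a score without reasoning is treated as lazy scoring
    value = int(score.group(1))
    if not lo <= value <= hi:
        return None
    return value, reasoning.group(1).strip()
```

Returning `None` instead of a default score keeps malformed verdicts out of the aggregate, so they can be retried or reviewed by hand.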

### Tuning for False Positives vs False Negatives

Your eval prompt implicitly sets a threshold. You can tune it to be strict or lenient - and the right choice depends on what you're optimizing for.

| Tuning direction | What happens | Prompt language |
|------------------|--------------|-----------------|
| **More false positives** (strict) | Flags good responses as bad | "If there's ANY doubt, mark incorrect" |
| **More false negatives** (lenient) | Lets bad responses through | "Only mark incorrect if clearly, obviously wrong" |

**When to be strict (prefer false positives):**

| Domain | Why |
|--------|-----|
| Safety / compliance | Missing harmful content is worse than over-flagging |
| Medical / financial accuracy | Wrong info has real consequences |
| Legal / regulatory | Conservative is safer |

When you're strict, you accept that some good responses get flagged. That's fine - you'd rather manually review false alarms than miss real problems.

**When to be lenient (tolerate false negatives):**

| Domain | Why |
|--------|-----|
| Style / tone | Subjective, over-flagging kills creativity |
| Helpfulness | Don't penalize unconventional but correct answers |
| Exploratory responses | Let the model take risks |

When you're lenient, you accept that some bad responses slip through. That's fine - you'd rather not suppress good responses in the name of catching every edge case.

**The key insight**: There's no "correct" threshold. It depends on the cost of each error type:

```
Cost of false positive = good response marked bad = user sees rejection, or you waste time reviewing
Cost of false negative = bad response marked good = user sees bad output, trust erodes
```

Tune your prompt language to reflect which cost you're optimizing against.
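
To make the trade-off concrete, here is a small sketch of the expected cost per evaluated response. Every number in the usage note is invented for illustration; only the structure of the calculation matters.

```python
def expected_cost(fp_rate: float, fn_rate: float, bad_share: float,
                  cost_fp: float, cost_fn: float) -> float:
    """Expected cost per evaluated response.

    fp_rate:   P(judge flags it | response is good)
    fn_rate:   P(judge passes it | response is bad)
    bad_share: fraction of responses that are actually bad
    """
    good_share = 1.0 - bad_share
    return good_share * fp_rate * cost_fp + bad_share * fn_rate * cost_fn

# Hypothetical judges: the strict prompt over-flags, the lenient one under-flags.
STRICT = {"fp_rate": 0.20, "fn_rate": 0.02}
LENIENT = {"fp_rate": 0.03, "fn_rate": 0.15}

def cheaper_judge(bad_share: float, cost_fp: float, cost_fn: float) -> str:
    strict = expected_cost(bad_share=bad_share, cost_fp=cost_fp, cost_fn=cost_fn, **STRICT)
    lenient = expected_cost(bad_share=bad_share, cost_fp=cost_fp, cost_fn=cost_fn, **LENIENT)
    return "strict" if strict < lenient else "lenient"
```

With 10% bad responses, a compliance-style setting (`cost_fp=1`, `cost_fn=50`) favors the strict prompt, while a style-check setting (`cost_fp=5`, `cost_fn=1`) favors the lenient one.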

---

## Getting the Context Right (For Evaluation)

The judge needs the same context the model had - plus ground truth.

### What the Judge Needs

| Context | Why |
|---------|-----|
| **Original user input** | What was the task? |
| **System prompt** | What were the constraints? |
| **Retrieved/injected context** | What information was available? |
| **Ground truth (if available)** | What's the right answer? |
| **Domain rubric** | What does "good" mean here? |

### Common Context Mistakes

| Mistake | Result |
|---------|--------|
| Judge doesn't see system prompt | Penalizes responses that followed instructions |
| Judge doesn't see retrieved context | Can't assess context utilization |
| No ground truth for factual claims | Judge hallucinates correctness |
| Generic rubric for domain task | Scores don't reflect domain quality |
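
These mistakes are cheap to catch before the judge ever runs. The sketch below assumes a hypothetical `JudgeContext` container; the field names mirror the table above and are not a real API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JudgeContext:
    user_input: str
    system_prompt: str
    retrieved_context: str
    rubric: str
    ground_truth: Optional[str] = None  # not every task has one

def context_gaps(ctx: JudgeContext, checks_facts: bool) -> list[str]:
    """Return the names of inputs the judge is missing for this evaluation."""
    required = ["user_input", "system_prompt", "retrieved_context", "rubric"]
    if checks_facts:
        # Factual scoring without ground truth invites the judge to guess.
        required.append("ground_truth")
    return [name for name in required if not (getattr(ctx, name) or "").strip()]
```

A non-empty result is a reason to skip or re-route the evaluation rather than record a score the judge had no basis for.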

---

## The Evaluation Stack

Putting it together:

```mermaid
flowchart TD
  W["What to evaluate"] --> H["How to evaluate"] --> Q["Evaluation quality"]:::accent
  W -.- WN["turn · session · dimension"]
  H -.- HN["judge · programmatic · human · user signal"]
  Q -.- QN["anchored rubric + ground truth"]
  classDef accent fill:#eb6623,stroke:#eb6623,color:#ffffff
  classDef note fill:none,stroke:none,color:#8a9099
  class WN,HN,QN note
```

---

## What We Learned

Going from 60% to 90% wasn't about finding a better model. It was about:

1. **Recognizing that eval is the same problem** - prompt and context
2. **Knowing when LLM-as-judge works** - and when it doesn't
3. **Choosing the right units** - binary, scaled, or decomposed
4. **Evaluating sessions, not just turns** - because users have goals
5. **Treating eval prompts as rubrics** - specific, anchored, domain-aware

The 30-point gap was hiding in vague prompts, missing context, and aggregate scores that masked failure modes.

---

*Built at [Anuvaya](https://anuvaya.com). We're building AI in India - if this resonates, we'd love to hear from you.*

## RCE - Realtime Context Engine

URL: https://inside.anuvaya.com/notes/realtime-context-engine
Author: Nitesh Kumar Niranjan
Published: 2025-11-28

## Abstract

Large language models excel at reasoning but struggle with **contextual accuracy** in production systems. We introduce the Realtime Context Engine (RCE), a framework for managing dynamic knowledge bases, temporal data hierarchies, and externalized reasoning chains within conversational AI sessions.

---

## 1. Problem Statement

Production AI agents face four fundamental challenges:

### 1.1 Context Amnesia

As conversations extend, tool results and retrieved data accumulate without intelligent decay. Models either hit context limits or lose signal in noise.

### 1.2 Knowledge Base Accuracy

When users have structured profiles with hundreds of variables, the agent must reason against **current state** - not stale snapshots. Mutations mid-session create consistency gaps between what the agent "knows" and what's true.

### 1.3 Temporal Data Accuracy

Many domains involve data with **temporal depth** - layered time-series information where granularity matters. A query about "this week" requires different precision than "this hour." Agents frequently:

- Over-fetch (pulling fine-grained data when coarse is sufficient)
- Under-refresh (using stale temporal snapshots when conditions have shifted)
- Lose temporal anchoring (confusing which data applies to which time window)

### 1.4 Reasoning Opacity

Complex analysis requiring multiple inferential steps happens in a single turn. The model attempts to hold intermediate conclusions in working memory, leading to:

- Dropped threads (forgetting to follow up on raised hypotheses)
- Conflation errors (mixing conclusions across different entities/timeframes)
- Shallow reasoning (surface-level answers to avoid cognitive overload)

---

## 2. The RCE Architecture

RCE treats context not as a static retrieval problem but as a **living system** with freshness, hierarchy, and explicit reasoning state.

### 2.1 Dynamic Knowledge Base Layer

Each session maintains a structured knowledge base comprising:

| Component | Characteristics |
|-----------|-----------------|
| **Primary Profile** | User's core structured data - mutable mid-session |
| **Related Entities** | Connected profiles (relationships, dependencies) |
| **Computed Artifacts** | Derived data from tool executions |

Key principle: **The KB is not retrieved once - it's injected as a living document each turn**, reflecting any mutations from previous tool calls.

```mermaid
flowchart TB
  subgraph KB["Session Knowledge Base"]
    direction TB
    P["Primary profile · mutable"]
    R["Related entities · mutable"]
    C["Computed artifacts · session-scoped"]
    S["Reasoning state · ephemeral"]
    P ~~~ R ~~~ C ~~~ S
  end
```

### 2.2 Temporal Data Hierarchy

Temporal data isn't flat - it has **depth levels** representing different granularities:

```
Level 1: Macro periods (years, phases)
Level 2: Standard periods (months, quarters)
Level 3: Sub-periods (weeks, cycles)
Level 4: Micro periods (days, specific windows)
Level 5: Precision windows (hours, exact moments)
```

RCE manages temporal queries through:

1. **Depth-aware fetching**: Tools specify granularity requirements; system fetches appropriate depth
2. **Window anchoring**: All temporal results tagged with their applicable time window
3. **Freshness signals**: Temporal data carries metadata about when it becomes stale relative to the query context
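
The three mechanisms above can be sketched as follows. The period lengths, the selection rule (coarsest level whose period fits inside the query window), and the `TemporalResult` fields are all assumptions made for illustration, not RCE's actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# (depth level, nominal period length); the lengths are illustrative.
LEVELS = [
    (1, timedelta(days=365)),  # macro periods
    (2, timedelta(days=30)),   # standard periods
    (3, timedelta(days=7)),    # sub-periods
    (4, timedelta(days=1)),    # micro periods
    (5, timedelta(hours=1)),   # precision windows
]

def depth_for_window(window: timedelta) -> int:
    """Depth-aware fetching: pick the coarsest level whose period fits the window."""
    for level, period in LEVELS:
        if period <= window:
            return level
    return LEVELS[-1][0]  # shorter than an hour: use the finest level available

@dataclass
class TemporalResult:
    level: int
    window_start: datetime   # window anchoring: which time span this applies to
    window_end: datetime
    generated_at: datetime
    stale_after: timedelta   # freshness signal carried with the data

    def is_stale(self, now: datetime) -> bool:
        return now - self.generated_at > self.stale_after
```

Under these assumptions a "this week" query resolves to level 3 and a "this hour" query to level 5, which avoids pulling hourly data for a weekly question.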

### 2.3 Context Freshness Management

Not all context ages equally. RCE implements **freshness-aware context management**:

| Freshness State | System Behavior |
|-----------------|-----------------|
| **Current** | Full context available, no modification |
| **Aging** | Context annotated with refresh guidance |
| **Stale** | Context summarized or flagged for replacement |
| **Expired** | Context removed, placeholder indicates prior existence |

The transition between states is domain-configurable and accounts for:

- Data volatility (how quickly this type of information changes)
- Query recency (how recently the user referenced this data)
- Dependency chains (whether other active context relies on this data)
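
A minimal sketch of how the three factors might combine into a state. The thresholds (1x, 2x, 4x of a per-type TTL) and the halving for recently referenced data are invented for the example; the real transitions are domain-configurable.

```python
from enum import Enum

class Freshness(Enum):
    CURRENT = "current"
    AGING = "aging"
    STALE = "stale"
    EXPIRED = "expired"

def classify(age_s: float, ttl_s: float, since_referenced_s: float,
             has_dependents: bool) -> Freshness:
    """Classify one context item.

    ttl_s encodes data volatility: volatile data gets a short TTL.
    """
    ratio = age_s / ttl_s
    if since_referenced_s < ttl_s:
        ratio *= 0.5  # query recency: recently referenced data ages more slowly
    if ratio < 1:
        return Freshness.CURRENT
    if ratio < 2:
        return Freshness.AGING
    if ratio < 4 or has_dependents:
        # Dependency chains: never expire data that active context relies on.
        return Freshness.STALE
    return Freshness.EXPIRED
```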

### 2.4 Externalized Reasoning: The Scratch Pad

Instead of forcing complex reasoning into a single inferential pass, RCE provides an **explicit reasoning externalization layer**:

**Supported Reasoning Primitives:**

| Primitive | Purpose |
|-----------|---------|
| `Explore` | Flag an element for deeper investigation |
| `Weigh` | Register competing hypotheses for evaluation |
| `Tension` | Capture contradictions requiring resolution |
| `Validate` | Queue a hypothesis for verification against incoming data |
| `Uncertain` | Mark low-confidence observations needing corroboration |

**Lifecycle:**

1. Agent adds reasoning items during analysis
2. Items persist across turns within session
3. Agent explicitly resolves items with brief findings
4. Unresolved items influence continuation decisions

This creates **observable reasoning chains** and prevents cognitive overload in single turns.
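
The lifecycle can be sketched in a few lines. The primitive names come from the table above; the `ScratchPad` class and its method names are hypothetical and stand in for whatever tool interface the agent actually uses.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Primitive(Enum):
    EXPLORE = "Explore"
    WEIGH = "Weigh"
    TENSION = "Tension"
    VALIDATE = "Validate"
    UNCERTAIN = "Uncertain"

@dataclass
class Item:
    primitive: Primitive
    note: str
    finding: Optional[str] = None  # set when the agent resolves the item

    @property
    def resolved(self) -> bool:
        return self.finding is not None

@dataclass
class ScratchPad:
    items: list[Item] = field(default_factory=list)  # persists across turns

    def add(self, primitive: Primitive, note: str) -> Item:
        item = Item(primitive, note)
        self.items.append(item)
        return item

    def resolve(self, item: Item, finding: str) -> None:
        if not finding.strip():
            raise ValueError("resolution needs a brief finding")
        item.finding = finding

    def pending(self) -> list[Item]:
        return [i for i in self.items if not i.resolved]

    def should_continue(self) -> bool:
        # Unresolved items influence the continuation decision.
        return bool(self.pending())
```

Requiring a finding on `resolve` is what makes the chain observable: every closed item records why it was closed.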

---

## 3. Context Injection Protocol

Each turn, RCE constructs the context payload through:

```
1. Base KB injection (current state of all profiles)
2. Temporal data with freshness annotations
3. Active reasoning state (unresolved scratch pad items)
4. Historical context (with freshness-based filtering)
5. Tool schemas (available capabilities)
```

The protocol ensures the model always reasons against **current truth** while maintaining awareness of what it has previously concluded.
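
A sketch of the assembly step, under the assumption that each input is already a plain dict or list. The key names and the placeholder format are illustrative; the ordering follows the five steps above.

```python
def inject_history(history: list[dict]) -> list[dict]:
    """Apply freshness-based filtering to historical context."""
    out = []
    for entry in history:
        if entry["freshness"] == "expired":
            # Removed, but leave a marker so the model knows it once existed.
            out.append({"id": entry["id"], "placeholder": "expired"})
        else:
            out.append(entry)
    return out

def build_context_payload(kb: dict, temporal: list[dict], scratch_pad: list[dict],
                          history: list[dict], tool_schemas: list[dict]) -> dict:
    """Build the per-turn payload; dict insertion order matches the protocol."""
    return {
        "knowledge_base": kb,                 # 1. current state of all profiles
        "temporal": temporal,                 # 2. already carries freshness annotations
        "reasoning_state": [i for i in scratch_pad if not i["resolved"]],  # 3.
        "history": inject_history(history),   # 4.
        "tools": tool_schemas,                # 5.
    }
```

Because the payload is rebuilt from the live KB on every turn, a mutation made by a tool call in one turn is visible in the next without any cache invalidation step.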

---

## 4. Accuracy Guarantees

### 4.1 KB Consistency

- Mutations from tools immediately reflected in next-turn injection
- No stale reads within session scope
- Related entity updates propagate to dependent views

### 4.2 Temporal Consistency

- All temporal data tagged with generation timestamp and applicable window
- Queries outside cached windows trigger refresh
- Granularity mismatches flagged to model with upgrade/downgrade guidance

### 4.3 Reasoning Consistency

- Externalized reasoning state prevents "forgotten" threads
- Resolution required before conclusions treated as settled
- Contradiction detection via tension primitives

---

## 5. Integration Pattern

RCE operates as a **context middleware** between the application layer and the LLM:

```mermaid
flowchart LR
  APP["Application layer"] -->|request| RCE["RCE engine"]:::accent
  RCE -->|fresh context| LLM["LLM provider"]
  LLM -.->|response| RCE
  RCE -.->|state| APP
  RCE --> CS["Context store"]
  classDef accent fill:#eb6623,stroke:#eb6623,color:#ffffff
```

The application layer provides:

- User identity and session scope
- Domain-specific tool definitions
- Freshness configuration per data type

RCE handles:

- KB hydration and mutation tracking
- Temporal data management
- Reasoning state persistence
- Context window optimization

---

## 6. Conclusion

RCE addresses the fundamental tension between context richness and context accuracy in production AI systems. By treating context as a managed, living system rather than a retrieval artifact, we enable agents that maintain consistency across extended interactions while supporting complex, multi-step reasoning.

---

_We're hiring engineers who want to push the boundaries of what's possible with LLMs in production. If building context management systems, knowledge bases, and reasoning frameworks excites you, check out our open positions or reach out directly. Let's build thoughtful AI together._

## Stateful Agent Orchestration - Human-Like Conversational AI

URL: https://inside.anuvaya.com/notes/stateful-agent-orchestration
Author: Nitesh Kumar Niranjan
Published: 2025-08-15

## Abstract

Current conversational AI feels robotic - responses arrive instantaneously, users cannot interrupt, and agents wait passively between turns. We present an orchestration architecture that enables **human-like conversational behavior**: natural delivery pacing, mid-response interruptibility, autonomous continuation, and graceful recovery.

---

## 1. Problem Statement: Why Agents Feel Robotic

### 1.1 The Human Conversation Baseline

When humans converse, certain behaviors are natural:

- **Interruptible**: You can cut someone off mid-sentence; they adapt
- **Self-paced**: Speech has rhythm - pauses, emphasis, natural cadence
- **Self-directed**: A speaker can continue elaborating without being prompted
- **Recoverable**: After interruption, conversation resumes coherently

### 1.2 How Current Agents Fail

| Human Behavior | Agent Failure |
|----------------|---------------|
| Interruptible | Response is atomic - wait until complete or lose context |
| Self-paced | Instant dump of text - no rhythm, overwhelming |
| Self-directed | Strict request-response - halts until user speaks |
| Recoverable | Interruption causes state divergence or phantom references |

### 1.3 The Atomicity Trap

Most agent frameworks treat responses as atomic transactions:

```mermaid
flowchart LR
  A["User input"] --> B["Model generates response"] --> C["Deliver"] --> D["Update state"] --> E["Wait"]
```

This creates:
- **Blocked interaction**: User cannot interrupt verbose responses
- **Wasted compute**: Generation continues after user intent shifts
- **Phantom content**: Model may reference content user never saw
- **Passive agents**: System halts, waiting for input that may not come

---

## 2. Four Pillars of Human-Like Agents

### 2.1 Natural Delivery Pacing

Instant delivery feels mechanical. Human-like agents implement **naturalistic pacing**:

- Variable speed based on content complexity
- Contextual acceleration (urgent responses faster)
- Pause patterns at semantic boundaries
- Rhythm that allows comprehension

The state manager signals delivery urgency:

| Signal | Delivery Behavior |
|--------|-------------------|
| `normal` | Standard pacing - conversational rhythm |
| `urgent` | Accelerated - user signaled time pressure |
| `measured` | Slower - complex content needs absorption |

### 2.2 Mid-Response Interruptibility

Users can interrupt at any point. When they do:

1. In-flight generation terminates cleanly
2. Only delivered content is preserved
3. New turn begins with accurate context
4. No "phantom content" - model never references undelivered text

**Example:**
```
Agent generating: "First, let me explain X. Second, here's Y. Third..."
User interrupts after "Second, here's Y"

What gets saved: "First, let me explain X. Second, here's Y."
What model sees next turn: Only the above - no "Third..."
User experience: Seamless pivot to their new input
```

### 2.3 Autonomous Continuation

Traditional agents operate in strict ping-pong:

```mermaid
flowchart LR
  U["User speaks"] --> A["Agent responds"] --> W["Agent waits"] --> U
```

This breaks when users want to **passively receive** - listening to extended analysis without prompting each segment.

Human-like agents **self-evaluate** whether to continue:

| Signal | Agent Decision |
|--------|----------------|
| Unresolved reasoning threads | Continue - more to explore |
| User sent `"hmm"`, `"ok"`, `"go on"` | Continue - passive listening |
| User asked specific question | Pivot - respond to that |
| All threads resolved | Wait - natural stopping point |
| Low confidence on next topic | Pause - seek confirmation |

This enables extended analytical sessions:
```
Turn 1: Agent covers point 1, checks → 3 pending items → continues
Turn 2: Agent covers point 2, checks → 2 pending items → continues
Turn 3: Agent covers points 3-4, checks → 0 pending items → waits

[User received complete analysis without prompting each section]
```
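
The decision table can be sketched as a single function. Treating every non-acknowledgement message as a pivot is a simplification, and the `max_depth` and `min_confidence` defaults are placeholder values.

```python
from typing import Optional

ACKNOWLEDGEMENTS = {"hmm", "ok", "go on"}

def next_action(pending_items: int, last_user_message: Optional[str],
                confidence: float, depth: int,
                max_depth: int = 5, min_confidence: float = 0.6) -> str:
    """Return one of: 'pivot', 'wait', 'pause', 'continue'."""
    message = (last_user_message or "").strip().lower()
    if message and message not in ACKNOWLEDGEMENTS:
        return "pivot"      # the user asked something specific: answer that
    if pending_items == 0:
        return "wait"       # all threads resolved: natural stopping point
    if depth >= max_depth:
        return "wait"       # continuation limit reached
    if confidence < min_confidence:
        return "pause"      # low confidence: seek confirmation first
    return "continue"       # unresolved threads, or passive listening
```

User input is checked first so that a real question always wins over the agent's own agenda.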

### 2.4 Graceful Recovery

After any interruption or pause, conversation resumes coherently:

```
User: "Actually, continue what you were saying"
Agent: [Resumes from exact point of interruption - has accurate context]
```

This works because the system maintains **buffer coherence** - what the user saw is exactly what's persisted and what the model knows.

---

## 3. The Core Invariant: Buffer Coherence

Everything above depends on one principle:

```
User Perception = Persistent State = Model Context
```

At any moment, these three must be identical:
- If the user saw it → it's saved
- If it's saved → model knows it
- If model references it → user saw it

Violating this invariant causes:
- Phantom references (model mentions unseen content)
- Lost content (user saw it but it's not in context)
- State drift (what's saved ≠ what happened)

---

## 4. Architecture: How It Works

### 4.1 Process Separation

Two distinct processes enable the behaviors above:

| Process | Responsibility |
|---------|---------------|
| **State Manager** | Conversation state, tool execution, model interaction |
| **Delivery Process** | Response buffering, pacing, user-facing output |

This separation enables:
- State manager remains responsive during delivery
- Delivery can be interrupted without corrupting state
- Pacing is independent of generation speed

### 4.2 Dual-Buffer Design

```mermaid
flowchart LR
  M["Model output"] --> AB["Active buffer · accumulating"]:::accent
  AB -->|flush on deliver| FB["Flushed buffer · delivered"]
  FB --> PS["Persistent state"]
  classDef accent fill:#eb6623,stroke:#eb6623,color:#ffffff
```

**Active Buffer**: Provisional content from model - not yet shown to user.

**Flushed Buffer**: Delivered content - the source of truth.

**Flow:**
1. Model generates content chunk
2. Chunk accumulated in active buffer
3. Complete semantic unit detected (sentence, paragraph)
4. Unit sent to delivery process with pacing
5. Delivery confirms presentation to user
6. Content moves to flushed buffer

**On interruption:**
- Active buffer discarded (undelivered)
- Flushed buffer persisted (delivered)
- Buffer coherence maintained
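
A minimal single-process sketch of the two buffers. Sentence detection here is a naive regex split, and `DualBuffer` with its method names is hypothetical; in the architecture above, confirmation would come from the separate delivery process.

```python
import re
from dataclasses import dataclass, field

# A sentence ends at ., ! or ? followed by whitespace (naive on purpose).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

@dataclass
class DualBuffer:
    active: str = ""                                   # provisional, not yet shown
    flushed: list[str] = field(default_factory=list)   # delivered: source of truth

    def feed(self, chunk: str) -> list[str]:
        """Accumulate model output; return complete sentences ready to deliver."""
        self.active += chunk
        *complete, self.active = SENTENCE_END.split(self.active)
        return complete

    def confirm_delivered(self, unit: str) -> None:
        """Called once the delivery process confirms the user saw the unit."""
        self.flushed.append(unit)

    def interrupt(self) -> str:
        """Discard undelivered text; return exactly what the user saw."""
        self.active = ""
        return " ".join(self.flushed)
```

Replaying the earlier interruption example:

```python
buf = DualBuffer()
for chunk in ["First, let me ", "explain X. Second, here's Y. Thi", "rd..."]:
    for unit in buf.feed(chunk):
        buf.confirm_delivered(unit)
saved = buf.interrupt()  # "First, let me explain X. Second, here's Y."
```

Units returned by `feed` but never confirmed are not persisted, so an interrupt during delivery cannot leak text the user did not see.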

### 4.3 Continuation Engine

For autonomous continuation, the agent introspects after each segment:

```mermaid
flowchart LR
  I1["Pending reasoning items"] --> CC
  I2["User engagement signals"] --> CC
  I3["Topic completion state"] --> CC
  I4["Confidence level"] --> CC
  CC{"Continuation check"}:::accent --> O(["continue / wait"])
  classDef accent fill:#eb6623,stroke:#eb6623,color:#ffffff
```

The **scratch pad** (externalized reasoning state) fuels this:
- Agent adds items during analysis: `"Explore: X"`, `"Tension: A vs B"`
- Items persist across turns
- Agent marks items resolved with findings
- Pending items → continuation warranted

### 4.4 State Transitions

The agent operates as a state machine:

- **Ready**: Awaiting user input
- **Generating**: Model producing response
- **Delivering**: Content being paced to user
- **Executing**: Tool operations in progress

Key behaviors:
- User input can interrupt any state (except during a critical tool confirmation)
- Timeouts start after delivery completes, not after generation
- Tool results included only if execution completed before interruption
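
A sketch of the state machine and the interrupt rule. The four states come from the list above; the allowed transitions and the `critical_confirmation` flag are assumptions made for the example.

```python
from enum import Enum

class State(Enum):
    READY = "ready"
    GENERATING = "generating"
    DELIVERING = "delivering"
    EXECUTING = "executing"

# Allowed moves are an assumption for this sketch; the text lists states only.
ALLOWED = {
    State.READY: {State.GENERATING},
    State.GENERATING: {State.DELIVERING, State.EXECUTING},
    State.EXECUTING: {State.GENERATING},
    State.DELIVERING: {State.GENERATING, State.READY},
}

class AgentState:
    def __init__(self) -> None:
        self.state = State.READY
        self.critical_confirmation = False  # True while a critical tool confirmation is open

    def advance(self, target: State) -> None:
        if target not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {target.value} not allowed")
        self.state = target

    def on_user_input(self) -> bool:
        """Return True if the input starts a new turn, interrupting if needed."""
        if self.state is State.EXECUTING and self.critical_confirmation:
            return False  # hold the input until the confirmation resolves
        self.state = State.GENERATING
        return True
```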

---

## 5. Guardrails

Autonomous and interruptible agents need boundaries:

| Guardrail | Purpose |
|-----------|---------|
| **Max continuation depth** | Prevent runaway monologues |
| **User interrupt priority** | Any user input breaks continuation loop |
| **Confidence threshold** | Low confidence → pause and check |
| **Topic drift detection** | Stay coherent to original query |
| **Timeout after delivery** | Don't time out while still speaking |

---

## 6. Observability

Key metrics for human-like agents:

| Metric | Indicates |
|--------|-----------|
| Interruption rate | Response length/relevance calibration |
| Buffer discard ratio | Wasted generation |
| Continuation depth | Autonomous reasoning utilization |
| Coherence violations | Buffer management bugs |
| Pacing satisfaction | Delivery speed calibration |

---

## 7. Conclusion

Human-like conversational AI requires rethinking the atomic response model. By implementing natural pacing, true interruptibility, autonomous continuation, and strict buffer coherence, we create agents that feel like responsive participants rather than query endpoints.

The key insight: **what the user experiences, what gets saved, and what the model knows must always be identical**. This invariant, maintained through dual-buffer architecture and process separation, enables all the human-like behaviors users expect from natural conversation.

---

_We're hiring engineers who are excited about building the next generation of conversational AI. If designing stateful systems, real-time architectures, and human-like agent behaviors sounds like your kind of challenge, check out our open positions or reach out directly. Let's build thoughtful AI together._

## The Boring Designer

URL: https://inside.anuvaya.com/notes/the-boring-designer
Author: Nitesh Kumar Niranjan
Published: 2025-04-12

Being both a designer and an engineer has given me a particular perspective on the complexity of user interfaces. Each role - from building a social platform, to simplifying complex visa processes, to now developing consumer-focused AI/LLM applications - has reinforced a core truth: the best designs should feel invisible.

## The best design is boring.

I don't mean cold or lifeless interfaces. I mean design that's so thoughtful, so well-crafted, that it becomes invisible. Design that doesn't need to shout "look at me!" to be effective.

At Anuvaya Labs, we've built our design philosophy around this idea. We believe in:

- Making interfaces predictable, not clever
- Breaking patterns only when absolutely necessary (about 1% of the time)

## Why This Matters

Think about the apps you use every day. The ones you rely on. They're probably not the most visually exciting, but they work. They're consistent. They don't surprise you with fancy animations or constantly changing interfaces.

That's not by accident. That's good design.

## Our Approach to Aesthetics

We care deeply about aesthetics, but not in the way you might think:

- We let whitespace do the heavy lifting
- We use color to inform, not decorate
- We treat typography as our interface's voice
- We embrace the digital medium instead of fighting it

Most importantly, we recognize that perfect design doesn't exist. Sometimes business needs push against design purity. Sometimes we have to compromise. What matters is making these decisions consciously, with purpose, not from convenience.

## The 99% Rule

Here's how we think about it: 99% of your interface should be predictable, almost boring. Save those moments of delight for the 1% - those rare, specific interactions where breaking the pattern truly serves the user.

This approach requires more discipline than you might think. It's easier to add than to subtract. It's easier to be clever than to be clear. It's easier to follow trends than to stay focused on what works.

## A Different Kind of Design Culture

If you're nodding along to this, if the idea of thoughtful restraint resonates with you, if you believe that great design is about serving users rather than stroking egos - well, we might just think alike.

We're always looking for designers who:

- Find beauty in simplicity
- Think in systems, not pages
- Care about the why as much as the how
- Aren't afraid to be boring when boring works better

## Looking Ahead

Design trends come and go. What stays constant is the need for interfaces that serve real humans doing real work. That's what we're building at Anuvaya Labs, one boring-but-beautiful interface at a time.

If this resonates with you, if you find yourself fighting against unnecessary complexity in your own work, if you believe in the power of thoughtful restraint - we should talk. We're always looking for people who think deeply about design, even (especially) when that means embracing the beauty of boring.

---

_We're hiring designers / design engineers who believe in these principles. If this post resonated with you, check out our open positions or reach out directly. Let's build boring-but-beautiful things together._