The previous post in this series built an eval harness that scores retrieval quality: does the right documentation page appear in the retrieved chunks? 7/8 passing, 88%. A useful signal.

But retrieval quality and answer quality are different things. A test can pass retrieval scoring and still produce a bad answer. A test can fail retrieval scoring and still produce a correct one. Source URL retrieval is a proxy — a fast, cheap proxy that catches a lot of problems, but not all of them.

This post adds an LLM judge to the eval: after Claude generates an answer, a second Claude call scores whether that answer is accurate, grounded, and complete. Running both together reveals failure modes that neither check catches alone.


How the LLM judge works

The judge takes three inputs: the original question, the retrieved context (the raw chunk text passed to Claude), and the generated answer. It returns a score from 1 to 3 and a one-sentence explanation.

import json
import os

import anthropic


def judge_answer(question: str, context: str, answer: str) -> dict:
    """Use Claude as a judge to score answer quality."""
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

    judge_prompt = f"""
    You are evaluating the quality of an AI-generated answer
    to a question about the Anthropic documentation.

    Question: {question}

    Context provided to the AI (retrieved documentation chunks):
    {context}

    AI-generated answer:
    {answer}

    Score the answer on the following scale:
    1 - Poor: The answer is incorrect, hallucinated, or contradicts the context
    2 - Acceptable: The answer is mostly correct but incomplete or imprecise
    3 - Good: The answer is accurate, grounded in the context, and directly
        addresses the question

    Respond with JSON only in this exact format:
    {{"score": <1, 2, or 3>, "reasoning": "<one sentence explanation>"}}
    """

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content": judge_prompt}]
    )

    # Assumes the judge returns bare JSON, as the prompt instructs.
    return json.loads(message.content[0].text)

The scoring rubric matters. A 1-3 scale is deliberate: binary pass/fail loses too much signal, and a 1-5 scale introduces too much subjectivity at the margins. The judge needs to distinguish between wrong (1), incomplete (2), and correct (3). Those three categories are distinct enough for a judge to apply consistently.

The judge also needs access to the retrieved context, not just the question and answer. Without context, the judge can only evaluate whether the answer sounds correct. It can’t evaluate whether it’s grounded, and grounding is the whole point of RAG.

This required a small change to query() in rag.py — adding context to the return value:

# Join chunk texts, each tagged with its source URL, into one context string.
context = "\n\n---\n\n".join(
    f"[{c['source_url']}]\n{c['text']}" for c in chunks
)

return {
    "answer": message.content[0].text,
    "sources": [c["source_url"] for c in chunks],
    "chunks_used": len(chunks),
    "context": context  # new: exposed so the judge can check grounding
}
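
Wiring the two together in the eval loop then looks roughly like this. A sketch, not the harness's exact code: run_eval and the test-case fields (name, question, expected_url) are assumptions; query() and judge_answer() are as defined above.

from rag import query

def run_eval(test_cases: list[dict]) -> None:
    """Score each test case on retrieval (expected source URL present?)
    and on answer quality (LLM judge, 1-3)."""
    for case in test_cases:
        result = query(case["question"])
        retrieval_pass = case["expected_url"] in result["sources"]
        verdict = judge_answer(case["question"], result["context"], result["answer"])
        status = "PASS" if retrieval_pass else "FAIL"
        print(f"{case['name']}: retrieval {status}, "
              f"quality {verdict['score']}/3 ({verdict['reasoning']})")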

Results

============================================================
Retrieval: 7/8 passing (88%)
Answer quality: 2.9/3.0 average
============================================================

Seven of eight answers scored 3/3. One scored 2/3. The retrieval failure from the previous eval still fails retrieval — but its answer scored 3/3.

The full results:

Test                    Retrieval  Quality  Notes
Implement tool use      PASS       3/3
Available models        PASS       3/3      Rerank fell back
Extended thinking       PASS       3/3      Rerank fell back
Client vs server tools  PASS       3/3
Few-shot examples       FAIL       3/3      Retrieval failed, answer correct
Chain of thought        PASS       2/3      Retrieval passed, answer incomplete
Be clear and direct     PASS       3/3      Rerank fell back
Computer use            PASS       3/3

Two cases break the expected pattern. Both are worth examining closely.


Failure mode 1: retrieval fails, answer is still correct

Test 5 (few-shot examples) failed retrieval. The prompt-engineering/use-examples page wasn’t retrieved. Instead, Claude received chunks from implement-tool-use, be-clear-and-direct, chain-of-thought, and computer-use.

The answer scored 3/3. The judge noted it was “directly grounded in the provided context without hallucination” and covered best practices for few-shot prompting including the recommended 3-5 example range and XML tag structure.

How? The topic of few-shot examples appears across multiple pages in the Anthropic docs — prompt engineering concepts bleed into tool use guides, chain-of-thought explanations, and elsewhere. Claude found enough signal in the retrieved chunks to construct a correct answer even without the dedicated page.

This is the most interesting finding in the eval and the most important one to understand. It looks like a success — the answer is correct — but it exposes a real weakness in the grounding guarantee.

The grounding guarantee is weaker than it appears when topics overlap.

RAG is supposed to ensure that answers are grounded in retrieved content, not generated from training data. But when a topic appears across many pages, Claude can answer correctly using retrieved content that was retrieved for the wrong reasons. The answer is grounded — just not in the content you intended to ground it in.

In this case that’s fine. In a higher-stakes domain — medical information, legal guidance, internal policy documentation — a correct answer from the wrong source is a problem, because the same mechanism that produced a correct answer today might produce a confidently wrong one tomorrow when the query is slightly different.

The fix is to add the use-examples URL to the corpus. Once that page is ingested, retrieval will correctly surface it and the grounding will be intentional rather than incidental. But the pattern is worth knowing: a corpus gap doesn’t always produce a bad answer, which means retrieval scoring alone won’t catch every corpus gap.
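
Concretely, that's a one-line corpus change. A sketch, assuming the ingestion script keeps its crawl list in a constant (SOURCE_URLS is a hypothetical name, and the exact docs path may differ):

# ingest.py -- add the page the few-shot test expected,
# so the grounding becomes intentional rather than incidental
SOURCE_URLS = [
    # ... existing pages ...
    "https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/use-examples",
]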


Failure mode 2: retrieval passes, answer is incomplete

Test 6 (chain of thought prompting) passed retrieval. The extended-thinking page was retrieved, which contains extensive discussion of thinking blocks, structured reasoning, and manual chain-of-thought techniques.

The answer scored 2/3. The judge’s reasoning was specific: “The answer correctly identifies that chain of thought prompting involves step-by-step reasoning and appropriately acknowledges that the provided context doesn’t contain a complete definition, but it misses the opportunity to infer more from the extensive discussion of thinking blocks, structured reasoning, and manual CoT techniques throughout the context.”

In other words: the right content was there, Claude found it, and then Claude told the user “I don’t have a complete definition” instead of synthesizing an answer from what it did have.

This is a prompt engineering problem, not a retrieval problem. The system prompt instructs Claude to say so clearly if the context doesn’t contain enough information to answer the question. That instruction is correct for cases where the context genuinely doesn’t help — but here it caused Claude to be unnecessarily conservative about content that was present and relevant.

The fix is to refine the system prompt:

SYSTEM_PROMPT = """You are a helpful assistant that answers questions about
the Anthropic documentation. Answer based on the provided context.
If the context contains relevant information but not a complete definition,
synthesize an answer from what is available and note what additional detail
can be found in the documentation. Only say the context is insufficient if
it contains no relevant information at all.
Always cite the source URL when referencing specific documentation."""

The change: replace “If the context doesn’t contain enough information, say so clearly” with more nuanced guidance that distinguishes between no relevant information (say so) and partial relevant information (synthesize and note the gap).

This is the kind of refinement that only surfaces through evaluation. Without the judge, test 6 would look like a pass — retrieval succeeded, an answer was generated, the pipeline ran without errors. The judge caught that the answer was technically correct but less useful than it could have been.


The rerank rate limit pattern

Four of eight tests hit Voyage AI's free-tier tokens-per-minute (TPM) cap on the rerank call and fell back to vector similarity order. Tests 2, 3, 5, and 7 all logged:

[WARN] Rerank rate-limited; falling back to vector similarity order

All four still passed retrieval and three of the four scored 3/3 on answer quality (test 5 failed retrieval for a different reason). The fallback architecture from the previous post held up — reranking improves quality when available, but the pipeline degrades gracefully without it.

On a paid Voyage AI tier the rate limits are substantially higher and the fallback would trigger rarely or never. On our lowly free tier, the fallback is the correct design. The eval confirms it works.
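
For reference, the fallback has roughly this shape. A sketch, not the pipeline's exact code: rerank_chunks is a hypothetical name, the chunk dict layout follows the query() snippet above, and the exception class follows the voyageai client's error module.

import voyageai

def rerank_chunks(query: str, chunks: list[dict], top_k: int = 4) -> list[dict]:
    """Rerank retrieved chunks with Voyage AI; on a rate limit,
    keep the original vector-similarity order."""
    try:
        vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
        result = vo.rerank(query, [c["text"] for c in chunks],
                           model="rerank-2", top_k=top_k)
        return [chunks[r.index] for r in result.results]
    except voyageai.error.RateLimitError:
        print("[WARN] Rerank rate-limited; falling back to vector similarity order")
        return chunks[:top_k]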


What the combined eval tells you

Running retrieval scoring and answer quality scoring together gives you a 2x2 matrix of outcomes:

Retrieval pass + Quality 3/3: Everything working as intended. The right content was retrieved and used well. Six of eight tests landed here.

Retrieval pass + Quality 2/3: The right content was retrieved but the answer was incomplete or imprecise. A prompt engineering problem. One test landed here (chain of thought).

Retrieval fail + Quality 3/3: The right page wasn’t retrieved but the answer was correct anyway — topic overlap across the corpus. A corpus gap that doesn’t manifest as a visible failure. One test landed here (few-shot examples). This is the most dangerous quadrant because it’s invisible to retrieval scoring alone.

Retrieval fail + Quality 1-2: Both retrieval and generation failed. A clear problem. Zero tests landed here in this eval.

The third quadrant — retrieval fail, quality pass — is the one that source URL scoring alone can’t see. Without the judge, the few-shot examples test would have looked like a retrieval failure with an unknown answer quality. With the judge, you can see that the answer was actually correct, understand why, and make a deliberate decision about whether that’s acceptable.
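
If you want the harness to report the quadrant explicitly, a small classifier does it. The labels here are mine, not the harness's:

def classify_outcome(retrieval_pass: bool, quality: int) -> str:
    """Map one test's results onto the 2x2 outcome matrix."""
    if retrieval_pass and quality == 3:
        return "working as intended"
    if retrieval_pass:
        return "prompt engineering problem"  # right content, weak answer
    if quality == 3:
        return "corpus gap masked by topic overlap"  # invisible to retrieval scoring
    return "retrieval and generation both failed"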


Is an LLM judge reliable?

Using Claude to evaluate Claude's own outputs raises a reasonable concern. A few mitigations worth noting:

  • The judge evaluates against a concrete criterion: is the answer grounded in the provided context?

  • The judge has access to the retrieved context, not just the answer. It's evaluating grounding, which is verifiable, not correctness in the abstract, which isn't.

For this small project, the judge is reliable enough to be useful. For production use, a stronger approach is to use a different model as the judge — if you’re running Claude as the generator, use GPT-4 as the judge, or vice versa. Cross-model evaluation reduces the risk of shared blind spots.
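
A minimal sketch of that swap, assuming the OpenAI Python client and reusing the judge prompt built earlier (the model name is illustrative):

import json
from openai import OpenAI

def judge_answer_cross_model(judge_prompt: str) -> dict:
    """Score an answer with a non-Claude judge to reduce shared blind spots."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any capable non-Claude model works
        max_tokens=256,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return json.loads(response.choices[0].message.content)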


The final eval harness

After three posts, the eval harness measures:

  • Retrieval quality: does the right source appear in retrieved chunks?
  • Answer quality: is the answer accurate, grounded, and complete?
  • Failure mode classification: retrieval fail, quality fail, or both?
  • Rerank fallback behavior: is the pipeline degrading gracefully under rate limits?

7/8 retrieval (88%), 2.9/3.0 answer quality. One corpus gap, one prompt engineering opportunity. Both have clear fixes. The pipeline is working.


Source code

Full code for the pipeline and eval harness: github.com/tylerwellss/rag-agent


Series navigation

Previous: How to Design RAG Eval Test Cases

Full project: github.com/tylerwellss/rag-agent

Voyage AI docs: docs.voyageai.com

Anthropic API docs: docs.anthropic.com