Getting a RAG pipeline working is easy. Knowing whether it will keep working after you change something is harder, and most projects skip that part entirely.
This post focuses on designing an eval harness that catches real problems, using the Anthropic docs RAG agent as the running example.
What an eval harness does
An eval harness is a script that runs a fixed set of test cases against your pipeline and produces a pass/fail score. Run it before and after a change — if the score drops, the change broke something. If it improves, the change helped.
The value isn’t in any single run. It’s in repeatability. An eval you can run in five minutes and trust lets you make changes confidently. Without one, every change is a guess.
For a RAG pipeline specifically, an eval should answer: does my retrieval surface the right content for these questions?
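In code, the whole harness can fit in one small script. The sketch below is a minimal version, not the project's actual script: retrieve(question) stands in for whatever your pipeline exposes for retrieval, and the check callable is whatever pass condition you choose (the next section picks one). The module and function names in the __main__ block are placeholders.

# eval_harness.py -- minimal sketch of a retrieval eval
import sys

def run_eval(test_cases, retrieve, check):
    """Run every test case and return the pass rate.

    retrieve(question) -> list of retrieved chunks
    check(case, chunks) -> True if the case passes
    """
    passed = 0
    for case in test_cases:
        chunks = retrieve(case["question"])
        ok = check(case, chunks)
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {case['question']}")
    score = passed / len(test_cases)
    print(f"{passed}/{len(test_cases)} passed ({score:.0%})")
    return score

if __name__ == "__main__":
    # Hypothetical wiring; exit nonzero below a threshold so the eval can
    # gate changes in CI or a pre-commit hook.
    from pipeline import retrieve, TEST_CASES, check_retrieval  # placeholder names
    if run_eval(TEST_CASES, retrieve, check_retrieval) < 0.8:
        sys.exit(1)

Printing the per-case results matters as much as the total: a failing score is only actionable if you can see which question failed.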
Choosing a proxy metric
There are two ways to score RAG quality:
Source URL retrieval — check whether the expected documentation page appears in the retrieved chunks. Fast, cheap, deterministic.
Answer quality — have a second LLM judge whether the generated answer correctly addresses the question. More accurate, but slower, more expensive, and harder to make stable across runs.
This post covers source URL retrieval. It’s the right starting point for most RAG projects because it’s cheap enough to run on every change and gives you a clear signal about retrieval quality, separate from generation quality. Answer quality scoring can come later.
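For this metric, the pass condition only needs to look at the source metadata on the retrieved chunks. A minimal sketch, assuming each chunk carries its page URL under a metadata["source"] key (adjust the field name to whatever your vector store actually stores):

def source_url_check(case, chunks):
    """Pass if any retrieved chunk came from the expected documentation page."""
    expected = case["expected_source"]
    for chunk in chunks:
        source = chunk.get("metadata", {}).get("source", "")
        # Match on the path suffix so "tool-use/implement-tool-use" matches
        # the full URL regardless of domain or path prefix.
        if source.rstrip("/").endswith(expected):
            return True
    return False

Passed to the harness above as the check argument, this stays deterministic and costs nothing beyond the retrieval call itself.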
How to write good test cases
A test case has two parts: a question and an expected source URL. Writing good test cases is harder than it sounds.
Match questions to specific pages
The question should have one clearly correct source. Avoid questions that could legitimately be answered from multiple pages — they produce ambiguous results that are hard to act on.
Too broad:
{"question": "How does Claude work?", "expected_source": "welcome"}
This question could pull from a dozen pages. If it fails, you don’t know which retrieval problem to fix.
More specific:
{
"question": "How do I implement tool use with Claude?",
"expected_source": "tool-use/implement-tool-use"
}
This question has one right answer in the corpus. If it fails, you know exactly what retrieval missed.
Test the pages you care most about
Your corpus has pages of varying importance. The pages that matter most for your use case should have test cases. For the Anthropic docs RAG agent, the highest-value pages are the ones developers are most likely to ask about: tool use, models, prompt engineering. Those get test cases. The welcome page and getting-started page don’t need dedicated tests — they’re thin and unlikely to be the correct source for any specific question.
Write questions that sound like real user queries
Test questions should match how users actually phrase things, not how the documentation is organized. Users don’t ask “what is the extended thinking feature overview” — they ask “how does extended thinking work.” The mismatch between documentation structure and natural language is exactly what embedding models are supposed to handle, and your tests should exercise that.
Documentation-flavored (avoid):
{"question": "Extended thinking feature overview", "expected_source": "extended-thinking"}
Natural language (prefer):
{"question": "How does extended thinking work?", "expected_source": "extended-thinking"}
Include cases that stress retrieval
Questions that use the exact terminology of the source page will almost always pass. The useful tests are the ones that require retrieval to work semantically, not lexically.
For example, “prompt engineering” appears verbatim in several page titles. A question like “how do I write better prompts?” doesn’t use that exact phrase and requires the embedding model to understand that “write better prompts” relates to “prompt engineering.” That’s a more meaningful test.
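If you want to sanity-check which of your questions are lexical gimmes and which actually stress the embeddings, you can compare keyword overlap against embedding similarity directly. The sketch below uses sentence-transformers with a small general-purpose model as a stand-in; the pipeline's actual embedding model may rank things differently.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in model, not the pipeline's

query = "How do I write better prompts?"
titles = ["Prompt engineering overview", "Tool use with Claude", "Extended thinking"]

def keyword_overlap(a, b):
    """Count shared words -- a rough proxy for lexical matching."""
    return len(set(a.lower().split()) & set(b.lower().split()))

q_emb = model.encode(query, convert_to_tensor=True)
t_embs = model.encode(titles, convert_to_tensor=True)
sims = util.cos_sim(q_emb, t_embs)[0]

for title, sim in zip(titles, sims):
    print(f"overlap={keyword_overlap(query, title)}  cosine={float(sim):.2f}  {title}")

A question worth keeping in the eval is one where the keyword overlap with the correct page is near zero but its embedding similarity is still the highest of the set.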
The test cases for this project
Here’s the full set used for the Anthropic docs RAG agent, with notes on the reasoning behind each:
TEST_CASES = [
{
# Tests the most important page in the corpus.
# Uses natural phrasing rather than the page title.
"question": "How do I implement tool use with Claude?",
"expected_source": "tool-use/implement-tool-use",
},
{
# Short, direct question. The models page should rank clearly.
"question": "What Claude models are currently available?",
"expected_source": "models/overview",
},
{
# Tests a feature with a dedicated page and distinct terminology.
"question": "How does extended thinking work?",
"expected_source": "extended-thinking",
},
{
# Tests whether the retriever distinguishes between two related pages.
# Known weak spot: short pages with overlapping content tend to lose
# to their parent page. Included deliberately to surface this.
"question": "What is the difference between client and server tools?",
"expected_source": "tool-use",
},
{
# Tests a prompt engineering sub-page with specific terminology.
# "Few-shot examples" is the natural phrasing for this concept.
"question": "How do I use few-shot examples in prompts?",
"expected_source": "prompt-engineering/use-examples",
},
{
# Tests another prompt engineering sub-page.
# "Chain of thought" is well-established terminology likely
# to appear in the source page.
"question": "What is chain of thought prompting?",
"expected_source": "prompt-engineering/chain-of-thought",
},
{
# Tests a page with a specific, actionable framing.
# The natural phrasing ("direct and clear") matches the page title
# closely, making this an easier test.
"question": "How do I get Claude to be more direct and clear?",
"expected_source": "prompt-engineering/be-clear-and-direct",
},
{
# Tests a page with distinctive terminology.
# "Computer use" is specific enough that retrieval should work well.
"question": "How do I use Claude for computer use tasks?",
"expected_source": "computer-use",
},
]
What the results tell you
Running this against the pipeline produced 7/8 passing (88%). The one failure — few-shot examples — was caused by a missing URL in the ingested corpus. The prompt-engineering/use-examples page wasn’t scraped, so it couldn’t be retrieved regardless of how good the retrieval was.
What this eval makes clear is the difference between a retrieval problem and a corpus problem. Without the eval, a failure on the few-shot examples question would look like bad retrieval. With it, you can see that every other test passed: the retrieval is working; the content just isn’t there.
The fix for a corpus gap is to add the URL and re-ingest. The fix for a retrieval problem is to adjust TOP_K, tune chunk size, or improve the embedding strategy. They’re completely different interventions, and the eval tells you which one you need.
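You can even automate that diagnosis. A small sketch, assuming you can list the distinct source URLs in your chunk store (the all_sources argument below; how you obtain it depends on your vector store):

def diagnose_failure(case, all_sources):
    """Tell a corpus gap apart from a retrieval problem for a failing case."""
    expected = case["expected_source"]
    if not any(src.rstrip("/").endswith(expected) for src in all_sources):
        return "corpus gap: the page was never ingested, so add the URL and re-ingest"
    return "retrieval problem: tune TOP_K, chunk size, or the embedding strategy"

# Example with a hypothetical helper for listing stored sources:
# all_sources = vector_store.distinct("metadata.source")
# print(diagnose_failure(failing_case, all_sources))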
What the results don’t tell you
A few things worth being explicit about:
Passing doesn’t mean the answer is correct. The eval checks whether the right page was retrieved. It doesn’t check whether Claude used that content accurately in the generated answer. A retrieved page can still produce a bad answer if the relevant information was in a different chunk than the ones returned, or if the prompt doesn’t give Claude enough context to use the retrieved content well.
Failing doesn’t always mean retrieval is broken. The Q4 test (client vs server tools) passed, but the dedicated sub-pages for those topics didn’t surface: the parent tool-use page ranked higher because its content overlaps with the sub-pages and it has more chunks. That’s technically a pass, but the most relevant pages weren’t retrieved. A more granular eval would check for the sub-page URLs specifically; a sketch of that follows below.
88% on 8 questions is not a statistically meaningful number. It’s a useful development signal, not a production quality metric. With 8 test cases, one failure moves you from 100% to 88%. A real production eval would have 50-100+ test cases to smooth out that variance.
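As noted for the client vs server tools case, one way to make the eval more granular is to let a test case name the specific sub-pages it expects instead of a single URL. This is an assumed extension of the format, not something the project does today, and the sub-page paths below are placeholders:

def source_url_check_any(case, chunks):
    """Pass if any retrieved chunk matches any of the acceptable sources."""
    expected = case.get("expected_sources") or [case["expected_source"]]
    sources = [c.get("metadata", {}).get("source", "") for c in chunks]
    return any(s.rstrip("/").endswith(e) for e in expected for s in sources)

STRICTER_CASE = {
    "question": "What is the difference between client and server tools?",
    # Placeholder paths: require the dedicated sub-pages, so retrieving only
    # the parent tool-use page counts as a failure.
    "expected_sources": ["tool-use/client-tools", "tool-use/server-tools"],
}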
Maintaining the eval over time
An eval is most useful when it’s easy to run and run often. A few practices that help:
Add a test case every time you add a URL. When you expand the corpus, write a test question for the new page before you run ingestion. This keeps coverage proportional to corpus size and catches regressions immediately.
Don’t delete failing test cases. If a test fails and you fix the underlying problem, keep the test. Deleting failing tests is how you end up with an eval that always passes but doesn’t catch anything.
Track scores over time. Even a simple log of “date, score, what changed” is enough to see whether your changes are consistently improving or degrading retrieval quality.
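The log itself can be one small function. A sketch that appends a JSON line per run; the filename and fields are arbitrary choices:

import datetime
import json

def log_run(score, note, path="eval_log.jsonl"):
    """Append one record per eval run: date, score, and what changed."""
    record = {
        "date": datetime.date.today().isoformat(),
        "score": round(score, 3),
        "change": note,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# log_run(0.88, "re-ingested corpus with the missing use-examples URL added")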
The next step: answer quality scoring
Source URL retrieval is a ceiling check: retrieval quality sets the upper bound on answer quality, so this eval tells you whether the right content was retrieved, not whether it was used well. The natural next step is adding an LLM judge that scores answer quality directly: send the question, the retrieved chunks, and the generated answer to Claude and ask it to rate whether the answer is accurate, complete, and grounded in the context.
That’s a more expensive eval to run but it catches a different class of problems. The next post in this series covers building it.
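As a preview of the shape of that eval, the judge call itself is short. The sketch below uses the Anthropic Python SDK; the model name, the 1-5 rubric, and the bare-number parsing are assumptions, and making the judge stable across runs is exactly the hard part the next post deals with.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_answer(question, chunks, answer, model="claude-sonnet-4-5"):
    """Ask Claude to grade a generated answer against its retrieved context."""
    context = "\n\n".join(chunks)
    prompt = (
        "You are grading an answer produced by a RAG system.\n\n"
        f"Question: {question}\n\n"
        f"Retrieved context:\n{context}\n\n"
        f"Answer:\n{answer}\n\n"
        "On a scale of 1-5, is the answer accurate, complete, and grounded "
        "in the retrieved context? Reply with the number only."
    )
    response = client.messages.create(
        model=model,  # placeholder model name
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.content[0].text.strip())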
Series navigation
Previous: RAG Retrieval: Chunking, Embeddings, Reranking, and an Eval
Next: Scoring RAG Answer Quality with an LLM Judge
Source code
Full project: github.com/tylerwellss/rag-agent