Scoring RAG Answer Quality with an LLM Judge

The previous post in this series built an eval harness that scores retrieval quality: does the right documentation page appear in the retrieved chunks? 7/8 passing, 88%. A useful signal. But retrieval quality and answer quality are different things. A test can pass retrieval scoring and still produce a bad answer. A test can fail retrieval scoring and still produce a correct one. Source URL retrieval is a proxy — a fast, cheap proxy that catches a lot of problems, but not all of them. ...
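For orientation before the full post: the simplest form of an LLM judge is a single model call that grades a candidate answer against a reference and returns a verdict. A minimal sketch, assuming the official anthropic Python SDK; the prompt, the one-word rubric, and the model name are illustrative placeholders, not necessarily the judge the post builds.

```python
# Minimal LLM-judge sketch: one call, one-word verdict.
# Assumes the official anthropic SDK; prompt and model are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """You are grading a RAG system's answer.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Reply with exactly one word: PASS if the candidate answer is factually
consistent with the reference and actually answers the question, else FAIL."""

def judge_answer(question: str, reference: str, candidate: str) -> bool:
    """Return True if the judge model grades the candidate answer as PASS."""
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; pin whichever judge model you trust
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate
            ),
        }],
    )
    return message.content[0].text.strip().upper().startswith("PASS")
```

A one-word verdict keeps parsing trivial; richer judges usually ask for a short rationale before the verdict.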

January 26, 2026 · 9 min · Tyler

How to Design RAG Eval Test Cases

Building a working RAG pipeline is easy. Knowing whether it will keep working after you change something is harder, and most projects skip that part entirely. Here the focus is designing an eval harness that catches real problems, using the Anthropic docs RAG agent as the example. An eval harness is a script that runs a fixed set of test cases against your pipeline and produces a pass/fail score. Run it before and after a change: if the score drops, the change broke something; if it improves, the change helped. ...
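In skeleton form that loop is only a few lines. A minimal sketch of the shape, not the post's actual harness; EvalCase, run_pipeline, and the URL-containment check are illustrative stand-ins.

```python
# Minimal eval-harness sketch: a fixed list of test cases, one pass/fail
# check per case, and a summary score. Names are illustrative; run_pipeline
# stands in for whatever your RAG entry point is.
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    expected_url: str  # the doc page that should appear in retrieval

def run_pipeline(question: str) -> list[str]:
    """Placeholder for the real pipeline: returns URLs of retrieved chunks."""
    raise NotImplementedError

def run_harness(cases: list[EvalCase]) -> float:
    passed = 0
    for case in cases:
        retrieved = run_pipeline(case.question)
        ok = case.expected_url in retrieved
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {case.question}")
    score = passed / len(cases)
    print(f"{passed}/{len(cases)} passing ({score:.0%})")
    return score
```

Running the same fixed cases before and after a change is the whole trick: the score only means something relative to itself.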

January 24, 2026 · 7 min · Tyler

RAG Retrieval: Chunking, Embeddings, Reranking, and an Eval

This series covers building a RAG pipeline to answer questions about the Anthropic documentation. A RAG agent answers questions by first searching a private knowledge base, then passing the relevant excerpts to an LLM as context — the model reads the actual source material before it responds, rather than guessing from training data. Here the focus is the retrieval layer: how to chunk text, embed it, retrieve it, and measure whether retrieval is actually working. ...
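As a rough sketch of those steps, assuming fixed-size character chunks, sentence-transformers as a stand-in embedding model, and cosine similarity for retrieval; the series may well use a different embedder and chunking scheme, and reranking is omitted here.

```python
# Minimal retrieval sketch: chunk, embed, retrieve by cosine similarity.
# sentence-transformers is a stand-in embedder, not necessarily the series' choice.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-size character chunks with overlap; crude, but a common baseline."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(docs: list[str]) -> tuple[list[str], np.ndarray]:
    """Chunk every document and embed all chunks into one matrix."""
    chunks = [c for d in docs for c in chunk(d)]
    vecs = model.encode(chunks, normalize_embeddings=True)
    return chunks, vecs

def retrieve(query: str, chunks: list[str], vecs: np.ndarray, k: int = 5) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vecs @ q  # cosine similarity, since the vectors are normalized
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```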

January 22, 2026 · 9 min · Tyler