Scoring RAG Answer Quality with an LLM Judge
The previous post in this series built an eval harness that scores retrieval quality: does the right documentation page appear in the retrieved chunks? It passed 7 of 8 tests (88%), which is a useful signal. But retrieval quality and answer quality are different things. A test can pass retrieval scoring and still produce a bad answer, and a test can fail retrieval scoring and still produce a correct one. Source URL retrieval is a fast, cheap proxy that catches a lot of problems, but not all of them. ...
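
To make the retrieval check concrete, here is a minimal sketch of what "does the right page appear in the retrieved chunks" could look like. It is not the harness from the previous post: the `Chunk` class, its `source_url` field, the function names, and the example.com URLs are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A retrieved chunk. `source_url` is an assumed metadata field."""
    text: str
    source_url: str


def retrieval_hit(expected_url: str, chunks: list[Chunk]) -> bool:
    """Return True if any retrieved chunk came from the expected page."""
    return any(chunk.source_url == expected_url for chunk in chunks)


def pass_rate(results: list[bool]) -> float:
    """Fraction of test cases whose retrieval check passed."""
    return sum(results) / len(results) if results else 0.0


# Hypothetical retrieved chunks for a single test question.
chunks = [
    Chunk("Install with pip ...", "https://docs.example.com/install"),
    Chunk("Configure the client ...", "https://docs.example.com/config"),
]

hit = retrieval_hit("https://docs.example.com/config", chunks)
miss = retrieval_hit("https://docs.example.com/auth", chunks)

# Seven passing cases and one failing case, matching the 7/8 figure above.
rate = pass_rate([True] * 7 + [False])
print(f"hit={hit} miss={miss} pass_rate={rate}")
```

Note that nothing in this check looks at the generated answer. It only inspects where the chunks came from, which is why a hit here cannot guarantee a good answer and a miss cannot rule one out.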