Eval on Tyler Wells

Recent content in Eval on Tyler Wells
https://blog-theta-seven-23.vercel.app/tags/eval/
Mon, 26 Jan 2026 00:00:00 +0000

Scoring RAG Answer Quality with an LLM Judge
https://blog-theta-seven-23.vercel.app/posts/rag-answer-judging/
Mon, 26 Jan 2026 00:00:00 +0000
[RAG Series 3/3] Source URL retrieval tells you whether the right content was retrieved. It doesn't tell you whether the answer was any good. Adding an LLM judge to the eval harness reveals two failure modes that retrieval scoring alone can't see.

How to Design RAG Eval Test Cases
https://blog-theta-seven-23.vercel.app/posts/design-rag-eval-test-cases/
Sat, 24 Jan 2026 00:00:00 +0000
[RAG Series 2/3] How to write test cases that catch real retrieval problems, why source URL retrieval is a useful proxy metric, and when it isn't enough.

RAG Retrieval: Chunking, Embeddings, Reranking, and an Eval
https://blog-theta-seven-23.vercel.app/posts/rag-retrieval-quality/
Thu, 22 Jan 2026 00:00:00 +0000
[RAG Series 1/3] Covers chunking strategy, embedding model consistency, reranking, and building an eval harness, including what happened when Voyage AI's free-tier rate limits forced a more resilient architecture.