This series covers building a RAG pipeline to answer questions about the Anthropic documentation. A RAG agent answers questions by first searching a private knowledge base, then passing the relevant excerpts to an LLM as context — the model reads the actual source material before it responds, rather than guessing from training data.

Here the focus is the retrieval layer: how to chunk text, embed it, retrieve it, and measure whether retrieval is actually working.

By the end, the pipeline has a ChromaDB vector store, Voyage AI embeddings, reranking with graceful fallback, and an eval harness scoring 88% on its first run.


Chunking strategy

After scraping, the raw page text gets split into overlapping chunks before embedding. The parameters matter more than they appear to.

CHUNK_SIZE = 500     # words per chunk
CHUNK_OVERLAP = 50   # words of overlap between consecutive chunks

def chunk_text(text: str, url: str) -> list[dict]:
    words = text.split()
    chunks = []
    i = 0
    chunk_index = 0
    while i < len(words):
        chunk_words = words[i:i + CHUNK_SIZE]
        chunks.append({
            "text": " ".join(chunk_words),
            "source_url": url,
            "chunk_index": chunk_index
        })
        i += CHUNK_SIZE - CHUNK_OVERLAP
        chunk_index += 1
    return chunks

Why 500 words? Too small and chunks lose context — a chunk that says “The following example shows how to…” with the example cut off is useless for retrieval. Too large and chunks are unfocused — a 2,000-word chunk about tool use will match many different queries weakly instead of matching the right query strongly. 500 words is a reasonable starting point for documentation content; adjust based on your source material.

Why overlap? Concepts that span chunk boundaries don’t get lost. A 50-word overlap means the end of one chunk and the start of the next share content, so a query about something that straddles a boundary still finds relevant material.
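To make the overlap concrete, here's a self-contained sanity check (same logic as chunk_text above, run on dummy words) showing that consecutive chunks share exactly CHUNK_OVERLAP words:

```python
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

def chunk_text(text: str, url: str) -> list[dict]:
    words = text.split()
    chunks, i, chunk_index = [], 0, 0
    while i < len(words):
        chunks.append({
            "text": " ".join(words[i:i + CHUNK_SIZE]),
            "source_url": url,
            "chunk_index": chunk_index,
        })
        i += CHUNK_SIZE - CHUNK_OVERLAP  # advance by 450, so 50 words repeat
        chunk_index += 1
    return chunks

doc = " ".join(f"w{n}" for n in range(1200))  # 1,200 dummy words
chunks = chunk_text(doc, "https://example.com/page")

first = chunks[0]["text"].split()
second = chunks[1]["text"].split()
# The last 50 words of chunk 0 are the first 50 words of chunk 1.
assert first[-CHUNK_OVERLAP:] == second[:CHUNK_OVERLAP]
print(len(chunks))  # 1,200 words -> 3 chunks at this size/overlap
```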

The 14 pages ingested produced 76 chunks — an average of about 5-6 chunks per page, which feels right for documentation pages of varying length.


Embedding model consistency

The most important rule in RAG and the easiest to get wrong: use the same embedding model for ingestion and for querying.

Embedding models map text into a high-dimensional vector space where semantically similar content sits close together. If you embed your documents with model A and your queries with model B, the vectors live in different spaces. Similarity search becomes meaningless — you’re measuring distance between points that were never meant to be compared.
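The similarity measure behind "sits close together" is typically cosine similarity. A minimal pure-Python version, with toy vectors standing in for real embeddings; the score is only meaningful when both vectors come from the same model:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of vector magnitudes: 1.0 is
    # identical direction, 0.0 is orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (real embedding models emit hundreds of dims).
doc_vec = [0.2, 0.8, 0.1]
query_vec = [0.25, 0.75, 0.05]
print(round(cosine_similarity(doc_vec, query_vec), 3))
```

If doc_vec came from model A and query_vec from model B, this function would still return a number, which is exactly the trap: nothing errors out, the results are just silently meaningless.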

This project uses Voyage AI’s voyage-3 model for both:

# During ingestion
result = vo.embed(texts, model="voyage-3", input_type="document")

# During query
result = vo.embed([query], model="voyage-3", input_type="query")

The input_type parameter is Voyage-specific and worth understanding. Voyage optimizes embeddings differently depending on whether the text is a document being stored or a query being searched. Using input_type="document" during ingestion and input_type="query" at query time isn’t optional — it’s part of how the model is designed to be used and measurably improves retrieval quality.


Vector storage with ChromaDB

ChromaDB is a local, file-based vector store: no signup, no infrastructure, no cloud account. It’s the right choice for a portfolio project and easy to swap for Pinecone, Weaviate, or pgvector in production.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    "anthropic_docs",
    metadata={"hnsw:space": "cosine"}  # Chroma defaults to L2; request cosine explicitly
)

collection.add(
    ids=ids,
    embeddings=embeddings,
    documents=texts,
    metadatas=metadatas
)

The PersistentClient writes to disk, so the vector index survives process restarts. chroma_db/ is excluded from git — it’s a build artifact, not source code. Anyone cloning the repo runs ingest.py to build their own local index.
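One detail the add() call glosses over: the ids must be unique, and ideally stable across re-ingestion runs. A hypothetical scheme (not necessarily what the project uses) derives them from the source URL and chunk index:

```python
import hashlib

def chunk_id(source_url: str, chunk_index: int) -> str:
    # Stable ID: the same URL + index always hashes to the same ID,
    # so re-running ingest overwrites entries instead of duplicating them.
    digest = hashlib.sha256(f"{source_url}#{chunk_index}".encode()).hexdigest()
    return digest[:16]

# Hypothetical URL, for illustration only.
print(chunk_id("https://docs.anthropic.com/en/docs/tool-use", 0))
```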

At query time:

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=TOP_K,
    include=["documents", "metadatas", "distances"]
)

distances here are cosine distances — lower is more similar. These aren’t used directly in the final answer but are useful for debugging retrieval quality.


Reranking

Vector similarity search is fast but imprecise. It compares the query embedding to every document embedding independently, with no knowledge of how the query and document relate to each other as text.

Reranking fixes this by running a second pass over the retrieved candidates using a cross-encoder model that reads the query and each document together and scores their relevance as a pair. This is slower but significantly more accurate for ambiguous queries.

The pattern: retrieve a larger candidate set with vector similarity, then rerank to a smaller final set.

import os

import chromadb
import voyageai

COLLECTION_NAME = "anthropic_docs"
TOP_K = 10          # Retrieve more candidates initially
RERANK_TOP_K = 5    # Keep top 5 after reranking

def retrieve(query: str) -> list[dict]:
    vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_collection(COLLECTION_NAME)

    # Embed query
    result = vo.embed([query], model="voyage-3", input_type="query")
    query_embedding = result.embeddings[0]

    # Retrieve top-10 by vector similarity
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=TOP_K,
        include=["documents", "metadatas", "distances"]
    )

    candidates = [
        {
            "text": results["documents"][0][i],
            "source_url": results["metadatas"][0][i]["source_url"],
            "distance": results["distances"][0][i]
        }
        for i in range(len(results["documents"][0]))
    ]

    # Rerank candidates, fall back to vector order on rate limit
    docs = [c["text"] for c in candidates]
    try:
        reranked = vo.rerank(query, docs, model="rerank-2", top_k=RERANK_TOP_K)
        return [candidates[r.index] for r in reranked.results]
    except voyageai.error.RateLimitError:
        return candidates[:RERANK_TOP_K]

The try/except around vo.rerank() deserves explanation. It’s not just defensive coding; it’s the result of a specific failure mode that took a few iterations to handle correctly.


The rate limit problem

Voyage AI’s free tier enforces two limits simultaneously:

  • 3 requests per minute (RPM)
  • 10,000 tokens per minute (TPM)

With reranking added, each query now makes two Voyage AI calls: one embed call and one rerank call. The eval harness runs 8 tests back-to-back, which means 16 Voyage AI calls total.

First attempt: add a sleep between tests. A 5-second sleep between tests was fine before reranking: one embed call per test, and with Claude's generation time added on top of the sleep, the effective request rate stayed inside the 3 RPM cap. Reranking doubled it to 2 calls per test, and the eval immediately hit the RPM limit, crashing on test 2 every time.

Second attempt: increase the sleep. Bumping to 65 seconds, then 90 seconds between tests. Still crashed on test 3. The problem wasn’t just RPM — it was TPM.

With TOP_K=10 and chunks averaging ~500 tokens each, each rerank call sends approximately 5,000 tokens to Voyage AI. Two consecutive tests consumed the full 10,000 TPM budget. Test 3’s rerank was rejected regardless of how long we waited between tests, because the token budget was exhausted within the rolling window.
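The arithmetic above can be written out as a back-of-envelope check, using the same rough estimate of ~500 tokens per chunk:

```python
TOP_K = 10
TOKENS_PER_CHUNK = 500   # rough estimate; real tokenizers yield more per 500 words
TPM_LIMIT = 10_000       # Voyage AI free-tier tokens per minute

tokens_per_rerank = TOP_K * TOKENS_PER_CHUNK       # ~5,000 tokens per rerank call
reranks_per_minute = TPM_LIMIT // tokens_per_rerank  # budget exhausted after 2
print(tokens_per_rerank, reranks_per_minute)
```

Two rerank calls fill the rolling window, so test 3 fails no matter how long the sleep between tests is.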

Third attempt: graceful fallback. Rather than keep inflating the sleep (which would make the eval take 15+ minutes to run), the rerank call was made to fail gracefully. A try/except catches RateLimitError and falls back to the original vector similarity order.

This means:

  • The eval completes fully instead of crashing
  • The answer is still useful — vector similarity is good, just not reranked
  • Production is now resilient to rate spikes, not just development

The fallback triggered on test 3 (extended thinking). It still passed — the correct source was already in the top 5 by vector similarity. That’s the right outcome: reranking improves quality when available, but the pipeline degrades gracefully when it isn’t.

The broader lesson: when you add a third-party dependency to a hot path, design for its failure from the start. The fallback isn’t a workaround — it’s the correct architecture.
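The pattern generalizes beyond reranking. A minimal sketch of a reusable fallback wrapper, with toy stand-ins for the reranker and the vector order (all names here are hypothetical, not from the project):

```python
def with_fallback(primary, fallback, exceptions):
    """Return a callable that runs primary(); on any listed exception,
    it returns fallback() on the same arguments instead."""
    def run(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except exceptions:
            return fallback(*args, **kwargs)
    return run

# Toy demo: a "reranker" that always rate-limits, vector order as fallback.
class RateLimitError(Exception): ...

def flaky_rerank(docs): raise RateLimitError()
def vector_order(docs): return docs[:5]

retrieve_top5 = with_fallback(flaky_rerank, vector_order, (RateLimitError,))
print(retrieve_top5(list(range(10))))  # falls back to the first 5 in vector order
```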


The eval harness

The approach is simple: for each test question, define the source URL that should appear in the retrieved chunks. If it does, the test passes.

import time
from rag import query

TEST_CASES = [
    {
        "question": "How do I implement tool use with Claude?",
        "expected_source": "tool-use/implement-tool-use",
    },
    {
        "question": "What Claude models are currently available?",
        "expected_source": "models/overview",
    },
    {
        "question": "How does extended thinking work?",
        "expected_source": "extended-thinking",
    },
    {
        "question": "What is the difference between client and server tools?",
        "expected_source": "tool-use",
    },
    {
        "question": "How do I use few-shot examples in prompts?",
        "expected_source": "prompt-engineering/use-examples",
    },
    {
        "question": "What is chain of thought prompting?",
        "expected_source": "prompt-engineering/chain-of-thought",
    },
    {
        "question": "How do I get Claude to be more direct and clear?",
        "expected_source": "prompt-engineering/be-clear-and-direct",
    },
    {
        "question": "How do I use Claude for computer use tasks?",
        "expected_source": "computer-use",
    },
]

def run_eval():
    results = []
    passed = 0

    for i, case in enumerate(TEST_CASES):
        print(f"Running test {i + 1}/{len(TEST_CASES)}: {case['question'][:60]}...")
        result = query(case["question"])
        hit = any(case["expected_source"] in s for s in result["sources"])
        status = "PASS" if hit else "FAIL"
        if hit:
            passed += 1
        results.append({**case, "retrieved": result["sources"], "pass": hit})
        print(f"  {status}")
        if i < len(TEST_CASES) - 1:
            time.sleep(5)

    print(f"\nResults: {passed}/{len(TEST_CASES)} passed ({passed/len(TEST_CASES)*100:.0f}%)")
    return passed, len(TEST_CASES)

if __name__ == "__main__":
    run_eval()

Results

Test                       Result
Tool use implementation    PASS
Available models           PASS
Extended thinking          PASS (rerank fell back to vector similarity)
Client vs server tools     PASS
Few-shot examples          FAIL
Chain of thought           PASS
Be clear and direct        PASS
Computer use               PASS

7/8 passing (88%). The one failure is a corpus gap: the prompt-engineering/use-examples page wasn’t in the ingested URL list. Adding it and re-ingesting would bring this to 8/8. The next post goes deeper on why that distinction matters and how to write test cases that catch it.


Streaming responses

One UI improvement worth calling out: replacing the full-response spinner with streaming token-by-token rendering.

Claude’s API supports streaming via client.messages.stream(). The FastAPI endpoint sends a server-sent events (SSE) stream, and Streamlit renders each token as it arrives:

# In rag.py
import os

import anthropic

def query_stream(question: str):
    chunks = retrieve(question)
    prompt = build_prompt(question, chunks)
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    sources = [c["source_url"] for c in chunks]

    with client.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        for text in stream.text_stream:
            yield text

    yield {"sources": sources, "chunks_used": len(chunks)}

The perceived latency difference is significant. A 10-second response that streams feels faster than a 3-second response that produces nothing then dumps everything at once. For a demo this matters.
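On the consuming side, a generator with this shape (text tokens, then one trailing metadata dict) can be drained like so; the stand-in generator below is hypothetical, mirroring query_stream's protocol:

```python
def consume(stream):
    """Collect text tokens and the trailing metadata dict from a
    query_stream-style generator that yields str chunks, then one dict."""
    text_parts, meta = [], None
    for item in stream:
        if isinstance(item, str):
            text_parts.append(item)   # token: append to the running answer
        else:
            meta = item               # final {"sources": ..., "chunks_used": ...}
    return "".join(text_parts), meta

# Stand-in generator with the same shape as query_stream:
def fake_stream():
    yield from ["Claude ", "supports ", "streaming."]
    yield {"sources": ["https://example.com/streaming"], "chunks_used": 3}

answer, meta = consume(fake_stream())
print(answer, meta["chunks_used"])
```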


The broader point

Retrieval quality is bounded by corpus quality.

The one eval failure wasn’t caused by a bad embedding model, poor chunking, or broken reranking. It was caused by a missing URL. No amount of tuning the retrieval pipeline would have fixed it — the page simply wasn’t there to retrieve.

In practice, the ceiling on answer quality is set at ingestion time. If the right content isn’t in your vector store, the retrieval can’t surface it. Improving retrieval quality and improving corpus quality are both necessary, and corpus quality comes first.

A related point: not every page on a docs site is useful for RAG. Auto-generated API reference pages, changelogs, and heavily structured schema tables tend to produce chunks that don’t retrieve well and consume embedding budget without improving answer quality. The Anthropic docs’ /en/api/messages page — 88 chunks of serialized schema content — was cut from this corpus for exactly that reason. Removing it also eliminated the TPM cap problem during ingestion: at ~39,000 words it would have blown Voyage AI’s 10,000-token-per-minute free-tier budget in a single batch regardless of delay between calls. Curating your URL list is part of corpus design, not just an ingestion detail.

Before tuning TOP_K, adjusting chunk size, or adding reranking, make sure the content you need is actually ingested — and that content you don’t need isn’t. An eval harness makes corpus gaps visible; without one, you’d tune retrieval parameters for hours trying to fix a problem that has nothing to do with retrieval.
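One cheap guard against this failure mode: diff the eval's expected sources against the ingested URL list before touching the retrieval pipeline at all. A sketch with hypothetical URLs:

```python
# Hypothetical ingested URL list, for illustration.
INGESTED_URLS = [
    "https://docs.anthropic.com/en/docs/tool-use/implement-tool-use",
    "https://docs.anthropic.com/en/docs/models/overview",
]
EXPECTED_SOURCES = [
    "tool-use/implement-tool-use",
    "models/overview",
    "prompt-engineering/use-examples",   # the missing page from the eval
]

def corpus_gaps(expected: list[str], ingested: list[str]) -> list[str]:
    # An expected source is covered if any ingested URL contains it,
    # matching the substring check the eval harness uses.
    return [e for e in expected if not any(e in url for url in ingested)]

print(corpus_gaps(EXPECTED_SOURCES, INGESTED_URLS))
```

Run this before the eval and the one guaranteed failure surfaces in milliseconds instead of after a full retrieval run.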


Series navigation

Next: How to Design RAG Eval Test Cases


Source code

Full project: github.com/tylerwellss/rag-agent

Voyage AI docs: docs.voyageai.com

ChromaDB docs: docs.trychroma.com