This series covers building a RAG pipeline to answer questions about the Anthropic documentation. A RAG agent answers questions by first searching a private knowledge base, then passing the relevant excerpts to an LLM as context — the model reads the actual source material before it responds, rather than guessing from training data.
Here the focus is the retrieval layer: how to chunk text, embed it, retrieve it, and measure whether retrieval is actually working.
By the end, the pipeline has a ChromaDB vector store, Voyage AI embeddings, reranking with graceful fallback, and an eval harness scoring 88% on its first run.
Chunking strategy
After scraping, the raw page text gets split into overlapping chunks before embedding. The parameters matter more than they appear to.
CHUNK_SIZE = 500       # words per chunk
CHUNK_OVERLAP = 50     # words of overlap between consecutive chunks

def chunk_text(text: str, url: str) -> list[dict]:
    words = text.split()
    chunks = []
    i = 0
    chunk_index = 0
    while i < len(words):
        chunk_words = words[i:i + CHUNK_SIZE]
        chunks.append({
            "text": " ".join(chunk_words),
            "source_url": url,
            "chunk_index": chunk_index
        })
        i += CHUNK_SIZE - CHUNK_OVERLAP
        chunk_index += 1
    return chunks
Why 500 words? Too small and chunks lose context — a chunk that says “The following example shows how to…” with the example cut off is useless for retrieval. Too large and chunks are unfocused — a 2,000-word chunk about tool use will match many different queries weakly instead of matching the right query strongly. 500 words is a reasonable starting point for documentation content; adjust based on your source material.
Why overlap? Concepts that span chunk boundaries don’t get lost. A 50-word overlap means the end of one chunk and the start of the next share content, so a query about something that straddles a boundary still finds relevant material.
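A quick sanity check of the overlap behavior (an illustrative snippet using the chunk_text function above on dummy input, not part of the pipeline):

# Illustrative only: consecutive chunks share exactly CHUNK_OVERLAP words
sample = " ".join(f"w{i}" for i in range(1200))   # 1,200 dummy "words"
chunks = chunk_text(sample, url="https://example.com/page")

print(len(chunks))   # 3 chunks: words 0-499, 450-949, 900-1199
print(chunks[0]["text"].split()[-50:] == chunks[1]["text"].split()[:50])   # True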
The 14 pages ingested produced 76 chunks — an average of about 5-6 chunks per page, which feels right for documentation pages of varying length.
Embedding model consistency
The most important rule in RAG and the easiest to get wrong: use the same embedding model for ingestion and for querying.
Embedding models map text into a high-dimensional vector space where semantically similar content sits close together. If you embed your documents with model A and your queries with model B, the vectors live in different spaces. Similarity search becomes meaningless — you’re measuring distance between points that were never meant to be compared.
This project uses Voyage AI’s voyage-3 model for both:
# During ingestion
result = vo.embed(texts, model="voyage-3", input_type="document")
# During query
result = vo.embed([query], model="voyage-3", input_type="query")
The input_type parameter is Voyage-specific and worth understanding.
Voyage optimizes embeddings differently depending on whether the text is a
document being stored or a query being searched. Using input_type="document"
during ingestion and input_type="query" at query time isn’t optional — it’s
part of how the model is designed to be used and measurably improves retrieval
quality.
Vector storage with ChromaDB
ChromaDB is a local, file-based vector store: no signup, no infrastructure, no cloud account. It’s the right choice for a portfolio project and easy to swap for Pinecone, Weaviate, or pgvector in production.
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
# Chroma's default distance metric is L2; configure cosine so the
# distances returned at query time are cosine distances
collection = client.get_or_create_collection(
    "anthropic_docs",
    metadata={"hnsw:space": "cosine"}
)

collection.add(
    ids=ids,
    embeddings=embeddings,
    documents=texts,
    metadatas=metadatas
)
The PersistentClient writes to disk, so the vector index survives process
restarts. chroma_db/ is excluded from git — it’s a build artifact, not
source code. Anyone cloning the repo runs ingest.py to build their own
local index.
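For context, here is a sketch of how the ids, embeddings, documents, and metadatas lists passed to collection.add() might be assembled from the chunk dicts produced by chunk_text. The repo's ingest.py may do this differently; all_chunks is a hypothetical name for the combined chunk list, and vo is the voyageai.Client shown earlier.

# Hedged sketch of the glue between chunking, embedding, and storage
texts = [c["text"] for c in all_chunks]
metadatas = [
    {"source_url": c["source_url"], "chunk_index": c["chunk_index"]}
    for c in all_chunks
]
ids = [f"{c['source_url']}#{c['chunk_index']}" for c in all_chunks]

# Same model and input_type="document" as described above
embeddings = vo.embed(texts, model="voyage-3", input_type="document").embeddings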
At query time:
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=TOP_K,
    include=["documents", "metadatas", "distances"]
)
distances here are cosine distances — lower is more similar. These aren’t
used directly in the final answer but are useful for debugging retrieval
quality.
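A small debugging sketch (not part of the pipeline) that prints the distance next to each retrieved source:

# Illustrative: eyeball retrieval quality by printing distance, source, and a snippet
for doc, meta, dist in zip(
    results["documents"][0], results["metadatas"][0], results["distances"][0]
):
    print(f"{dist:.3f}  {meta['source_url']}  {doc[:60]}...")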
Reranking
Vector similarity search is fast but imprecise. It compares the query embedding to every document embedding independently, with no knowledge of how the query and document relate to each other as text.
Reranking fixes this by running a second pass over the retrieved candidates using a cross-encoder model that reads the query and each document together and scores their relevance as a pair. This is slower but significantly more accurate for ambiguous queries.
The pattern: retrieve a larger candidate set with vector similarity, then rerank to a smaller final set.
import os

import chromadb
import voyageai

COLLECTION_NAME = "anthropic_docs"
TOP_K = 10          # Retrieve more candidates initially
RERANK_TOP_K = 5    # Keep top 5 after reranking

def retrieve(query: str) -> list[dict]:
    vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_collection(COLLECTION_NAME)

    # Embed the query (input_type="query" to match the document/query split)
    result = vo.embed([query], model="voyage-3", input_type="query")
    query_embedding = result.embeddings[0]

    # Retrieve TOP_K candidates by vector similarity
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=TOP_K,
        include=["documents", "metadatas", "distances"]
    )
    candidates = [
        {
            "text": results["documents"][0][i],
            "source_url": results["metadatas"][0][i]["source_url"],
            "distance": results["distances"][0][i]
        }
        for i in range(len(results["documents"][0]))
    ]

    # Rerank candidates; fall back to vector order on rate limit
    docs = [c["text"] for c in candidates]
    try:
        reranked = vo.rerank(query, docs, model="rerank-2", top_k=RERANK_TOP_K)
        return [candidates[r.index] for r in reranked.results]
    except voyageai.error.RateLimitError:
        return candidates[:RERANK_TOP_K]
The try/except around vo.rerank() deserves explanation. It’s not just
defensive coding; it’s the result of a specific failure mode that took a few
iterations to handle correctly.
The rate limit problem
Voyage AI’s free tier enforces two limits simultaneously:
- 3 requests per minute (RPM)
- 10,000 tokens per minute (TPM)
With reranking added, each query now makes two Voyage AI calls: one embed call and one rerank call. The eval harness runs 8 tests back-to-back, which means 16 Voyage AI calls total.
First attempt: add a sleep between tests. A 5-second sleep had been enough before reranking, when each test made a single Voyage call. After reranking, each test makes two calls, which immediately hit the RPM cap. The eval crashed on test 2 every time.
Second attempt: increase the sleep. Bumping to 65 seconds, then 90 seconds between tests. Still crashed on test 3. The problem wasn’t just RPM — it was TPM.
With TOP_K=10 and chunks of up to 500 words each, a single rerank call sends
on the order of 5,000 tokens to Voyage AI. Two consecutive tests consumed the
full 10,000 TPM budget, so test 3's rerank was rejected no matter how long
the sleep between tests was: the token budget was already exhausted within
the rolling window.
Third attempt: graceful fallback.
Rather than keep inflating the sleep (which would make the eval take 15+
minutes to run), the rerank call was made to fail gracefully. A
try/except catches RateLimitError and falls back to the original vector
similarity order.
This means:
- The eval completes fully instead of crashing
- The answer is still useful — vector similarity is good, just not reranked
- Production is now resilient to rate spikes, not just development
The fallback triggered on test 3 (extended thinking). It still passed — the correct source was already in the top 5 by vector similarity. That’s the right outcome: reranking improves quality when available, but the pipeline degrades gracefully when it isn’t.
The broader lesson: when you add a third-party dependency to a hot path, design for its failure from the start. The fallback isn’t a workaround — it’s the correct architecture.
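One way to express that principle more generally (a hypothetical helper, not something in the repo) is to wrap any optional enhancement so its failure degrades to a known-good default:

def with_fallback(primary, fallback, exceptions):
    """Run primary(); if it raises one of the listed exceptions, run fallback()."""
    try:
        return primary()
    except exceptions:
        return fallback()

# Mirrors the rerank fallback above: reranked order if available, vector order if not
final = with_fallback(
    primary=lambda: [
        candidates[r.index]
        for r in vo.rerank(query, docs, model="rerank-2", top_k=RERANK_TOP_K).results
    ],
    fallback=lambda: candidates[:RERANK_TOP_K],
    exceptions=(voyageai.error.RateLimitError,),
)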
The eval harness
The approach is simple: for each test question, define the source URL that should appear in the retrieved chunks. If it does, the test passes.
import time
from rag import query

TEST_CASES = [
    {
        "question": "How do I implement tool use with Claude?",
        "expected_source": "tool-use/implement-tool-use",
    },
    {
        "question": "What Claude models are currently available?",
        "expected_source": "models/overview",
    },
    {
        "question": "How does extended thinking work?",
        "expected_source": "extended-thinking",
    },
    {
        "question": "What is the difference between client and server tools?",
        "expected_source": "tool-use",
    },
    {
        "question": "How do I use few-shot examples in prompts?",
        "expected_source": "prompt-engineering/use-examples",
    },
    {
        "question": "What is chain of thought prompting?",
        "expected_source": "prompt-engineering/chain-of-thought",
    },
    {
        "question": "How do I get Claude to be more direct and clear?",
        "expected_source": "prompt-engineering/be-clear-and-direct",
    },
    {
        "question": "How do I use Claude for computer use tasks?",
        "expected_source": "computer-use",
    },
]

def run_eval():
    results = []
    passed = 0
    for i, case in enumerate(TEST_CASES):
        print(f"Running test {i + 1}/{len(TEST_CASES)}: {case['question'][:60]}...")
        result = query(case["question"])
        hit = any(case["expected_source"] in s for s in result["sources"])
        status = "PASS" if hit else "FAIL"
        if hit:
            passed += 1
        results.append({**case, "retrieved": result["sources"], "pass": hit})
        print(f"  {status}")
        if i < len(TEST_CASES) - 1:
            time.sleep(5)
    print(f"\nResults: {passed}/{len(TEST_CASES)} passed ({passed/len(TEST_CASES)*100:.0f}%)")
    return passed, len(TEST_CASES)

if __name__ == "__main__":
    run_eval()
Results
| Test | Result |
|---|---|
| Tool use implementation | PASS |
| Available models | PASS |
| Extended thinking | PASS (rerank fell back to vector similarity) |
| Client vs server tools | PASS |
| Few-shot examples | FAIL |
| Chain of thought | PASS |
| Be clear and direct | PASS |
| Computer use | PASS |
7/8 passing (88%). The one failure is a corpus gap: the
prompt-engineering/use-examples page wasn’t in the ingested URL list.
Adding it and re-ingesting would bring this to 8/8. The next post goes
deeper on why that distinction matters and how to write test cases that
catch it.
Streaming responses
One UI improvement worth calling out: replacing the full-response spinner with streaming token-by-token rendering.
Claude’s API supports streaming via client.messages.stream(). The FastAPI
endpoint sends a server-sent events (SSE) stream, and Streamlit renders each
token as it arrives:
# In rag.py
def query_stream(question: str):
    chunks = retrieve(question)
    prompt = build_prompt(question, chunks)
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    sources = [c["source_url"] for c in chunks]

    with client.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        for text in stream.text_stream:
            yield text

    yield {"sources": sources, "chunks_used": len(chunks)}
The perceived latency difference is significant. A 10-second response that streams feels faster than a 3-second response that produces nothing then dumps everything at once. For a demo this matters.
The broader point
Retrieval quality is bounded by corpus quality.
The one eval failure wasn’t caused by a bad embedding model, poor chunking, or broken reranking. It was caused by a missing URL. No amount of tuning the retrieval pipeline would have fixed it — the page simply wasn’t there to retrieve.
In practice, the ceiling on answer quality is set at ingestion time. If the right content isn’t in your vector store, the retrieval can’t surface it. Improving retrieval quality and improving corpus quality are both necessary, and corpus quality comes first.
A related point: not every page on a docs site is useful for RAG.
Auto-generated API reference pages, changelogs, and heavily structured schema
tables tend to produce chunks that don’t retrieve well and consume embedding
budget without improving answer quality. The Anthropic docs’ /en/api/messages
page — 88 chunks of serialized schema content — was cut from this corpus for
exactly that reason. Removing it also eliminated the TPM cap problem during
ingestion: at ~39,000 words it would have blown Voyage AI’s 10,000-token-per-
minute free-tier budget in a single batch regardless of delay between calls.
Curating your URL list is part of corpus design, not just an ingestion detail.
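In code, that curation can be as simple as an exclusion filter applied before scraping (an illustrative sketch; EXCLUDE_PATTERNS and candidate_urls are hypothetical names, not from the repo):

# Drop page types that tend to chunk poorly before they reach the embedding step
EXCLUDE_PATTERNS = ("/en/api/", "/release-notes/")

urls = [
    u for u in candidate_urls
    if not any(pattern in u for pattern in EXCLUDE_PATTERNS)
]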
Before tuning TOP_K, adjusting chunk size, or adding reranking, make sure the content you need is actually ingested — and that content you don’t need isn’t. An eval harness makes corpus gaps visible; without one, you’d tune retrieval parameters for hours trying to fix a problem that has nothing to do with retrieval.
Series navigation
Next: How to Design RAG Eval Test Cases
Source code
Full project: github.com/tylerwellss/rag-agent
Voyage AI docs: docs.voyageai.com
ChromaDB docs: docs.trychroma.com