What I Learned Building a Multi-Agent Document Analysis System

This is the retrospective for the multi-agent document analysis project. The first posts covered why to use multiple agents, how the specialist agents work, and how the coordinator synthesizes findings. This one covers what worked, what broke, and what I would change. In short, the architecture worked, the coordinator was the most valuable part, and chunking caused the worst failure mode. What worked: the BaseAgent abstraction was enough. I did not need a framework. A simple base class handled the repeated LLM-call logic: model name, system prompt, max tokens, response cleaning, and JSON parsing. ...
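For readers who skipped the earlier posts, the base class being referred to really is small. A minimal sketch, assuming the Anthropic Python SDK (the series does not pin the client, and the class and method names here are illustrative, not the project's actual code):

```python
# Illustrative sketch of a shared agent base class: one place for the
# model name, system prompt, max tokens, response cleaning, and JSON parsing.
import json
from anthropic import Anthropic

class BaseAgent:
    def __init__(self, model: str, system_prompt: str, max_tokens: int = 2048):
        self.client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        self.model = model
        self.system_prompt = system_prompt
        self.max_tokens = max_tokens

    def run(self, text: str) -> dict:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=self.max_tokens,
            system=self.system_prompt,
            messages=[{"role": "user", "content": text}],
        )
        raw = response.content[0].text
        # Response cleaning: strip markdown fences the model sometimes adds.
        cleaned = raw.strip().removeprefix("```json").removeprefix("```")
        cleaned = cleaned.removesuffix("```").strip()
        return json.loads(cleaned)
```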

April 24, 2026 · 7 min · Tyler

Coordinating Multiple LLM Agents: Cross-Domain Synthesis

After building the specialist agents, the output looked impressive. It was not useful enough. The system produced 12 technical findings, 14 risk findings, 10 cost findings, and timeline findings. That is a lot of analysis. It is also a lot to read. The coordinator is the piece that turns those separate findings into something a person can act on. Aggregation is not synthesis: the first version of the coordinator just ran the agents and returned their results. ...
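To make "aggregation is not synthesis" concrete: instead of concatenating four result lists, the coordinator feeds every finding back into a single synthesis prompt. A hedged sketch, where `coordinator_agent` is any LLM-backed agent with a `run` method (like the base-class sketch earlier on this page) and the prompt wording is an assumption, not the post's actual prompt:

```python
# Illustrative coordinator step: synthesize across domains rather than
# concatenating per-agent results. Prompt wording is an assumption.
import json

def synthesize(coordinator_agent, findings_by_domain: dict) -> dict:
    prompt = (
        "Below are findings from four specialist analyses of the same "
        "document. Do not restate each list. Identify cross-domain "
        "conflicts, dependencies between findings, and the top actions "
        "a reader should take. Return JSON with keys 'conflicts', "
        "'dependencies', and 'actions'.\n\n"
        + json.dumps(findings_by_domain, indent=2)
    )
    return coordinator_agent.run(prompt)
```

The design point is that the coordinator gets all four result sets in one context window, so it can notice that, say, a cost finding contradicts a timeline finding.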

April 23, 2026 · 6 min · Tyler

Building Specialist LLM Agents: Technical, Risk, Cost, and Timeline Analysis

The first post covered why I split document analysis into multiple agents. This one covers how the specialists are actually built. The Python code is not the hard part. The specialist behavior mostly comes from: the system prompt, the output schema, and the boundaries around what the agent should ignore. The code is intentionally repetitive. Once you’ve written a couple of agents, it’s a breeze. The shared base class: every agent needs the same basic execution logic: ...
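As a taste of what the post walks through, here is an illustrative specialist built on that shared base class. The prompt text, output schema, and model id are placeholders, not the project's actual values:

```python
# Illustrative specialist: the behavior lives in the prompt and schema,
# not the code. Prompt, schema, and model id are placeholders.
RISK_SYSTEM_PROMPT = """You are a risk analyst reviewing a contract or RFP.
Report only risks: liabilities, penalties, ambiguous obligations, missing terms.
Ignore cost, schedule, and technical feasibility; other agents cover those.
Return JSON: {"findings": [{"risk": str, "severity": "low|medium|high",
"evidence": str}]}"""

class RiskAgent(BaseAgent):  # BaseAgent as sketched earlier on this page
    def __init__(self):
        super().__init__(
            model="claude-sonnet-4-5",  # placeholder model id
            system_prompt=RISK_SYSTEM_PROMPT,
        )
```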

April 22, 2026 · 7 min · Tyler

Why Multi-Agent Systems Beat Single Agents for Complex Documents

I built a document analysis system for RFPs and contracts using multiple specialist LLM agents instead of one general-purpose prompt. The architecture is simple: PDF → text extraction → Technical Analyzer → Risk Analyzer → Cost Analyzer → Timeline Analyzer → Coordinator synthesis → final report. The interesting part is not that it calls an LLM. That’s easy. The interesting part is how much the output changes when the model is forced to analyze the same document through different lenses before producing a final answer. ...
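The pipeline shape above translates to a few lines of orchestration. A sketch under stated assumptions: the agents and the `synthesize` helper are hypothetical stand-ins with a `run` method, and `pypdf` is just one way to do the text-extraction step:

```python
# Sketch of the pipeline shape in the diagram above. The agent objects and
# synthesize() are hypothetical stand-ins, not the post's actual code.
from pypdf import PdfReader

def extract_text(pdf_path: str) -> str:
    return "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)

def analyze(pdf_path: str, specialists: dict, coordinator) -> dict:
    text = extract_text(pdf_path)
    # Each specialist reads the same text through its own lens.
    findings = {name: agent.run(text) for name, agent in specialists.items()}
    # The coordinator synthesizes the four result sets into one report.
    return synthesize(coordinator, findings)
```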

April 21, 2026 · 7 min · Tyler

Building the Catalog and Ingestion Pipeline: Archetypes, Embeddings, and ChromaDB

The first post covered architecture. Here the focus shifts to data: how to generate a realistic product catalog at scale, why description quality matters for RAG, and how the ingestion pipeline embeds everything into ChromaDB. The pipeline produced 1180 products with rich descriptions, embedded them in 39 seconds, and returned retrieval results that actually held up. The archetype strategy: writing 1180 product descriptions by hand is infeasible. Having Claude write them one by one is slow and produces inconsistent output. The solution: archetype-based generation. ...
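The ingestion step maps onto ChromaDB's client API roughly like this. The collection name and metadata fields are assumptions; the add-and-query calls are standard ChromaDB:

```python
# Hedged sketch of ingestion: embed product descriptions into ChromaDB.
# Collection name and metadata fields are assumptions.
import chromadb

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection(name="products")

def ingest(products: list[dict]) -> None:
    # Chroma applies its default embedding function to `documents`.
    collection.add(
        ids=[p["sku"] for p in products],
        documents=[p["description"] for p in products],
        metadatas=[{"name": p["name"], "category": p["category"]} for p in products],
    )

# Retrieval: top-5 nearest descriptions for a free-text query.
results = collection.query(query_texts=["lightweight 3-season tent"], n_results=5)
```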

April 13, 2026 · 9 min · Tyler

Building AI Search for a Retail Website: The Stack and Why

I built Ozark Ridge, a mock outdoor gear retail site with AI-powered product search and a Rufus-style product assistant. The project exists to demonstrate RAG (Retrieval-Augmented Generation) in a realistic e-commerce context. This is the first post in a series documenting the build. This one covers the architecture and stack decisions. Later posts cover the RAG pipeline, the keyword vs. semantic search comparison, and building the AI assistant. What it does, in two features: ...

April 12, 2026 · 7 min · Tyler

What I Learned Building a LangGraph Agent From Scratch

I wanted to understand what it actually takes to build something that makes real decisions. So I built a job research agent using LangGraph: give it a company name and it autonomously gathers information from multiple sources, evaluates whether it has enough to work with, and loops back if it doesn’t. This post is about what that process taught me about state, nodes, and conditional edges. The Problem With Linear Pipelines: a typical “agent” pattern looks like this: ...
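The loop-back shape described here is what LangGraph's conditional edges express. A minimal runnable sketch, with the state fields, node names, and sufficiency check all illustrative rather than the post's actual agent:

```python
# Minimal sketch of the gather/evaluate/loop-back shape. State fields,
# node names, and the sufficiency check are illustrative.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class ResearchState(TypedDict):
    company: str
    notes: list[str]
    attempts: int

def gather(state: ResearchState) -> dict:
    # Fetch from one or more sources; stubbed here.
    return {"notes": state["notes"] + [f"info about {state['company']}"],
            "attempts": state["attempts"] + 1}

def enough_info(state: ResearchState) -> str:
    # Loop back until there is enough material (or we give up).
    return "done" if len(state["notes"]) >= 3 or state["attempts"] >= 5 else "again"

graph = StateGraph(ResearchState)
graph.add_node("gather", gather)
graph.set_entry_point("gather")
graph.add_conditional_edges("gather", enough_info, {"done": END, "again": "gather"})
app = graph.compile()

result = app.invoke({"company": "Acme", "notes": [], "attempts": 0})
```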

March 30, 2026 · 5 min · Tyler

Your MCP Server Is Only as Good as Its Docstrings

I built a college football data MCP server that connects Claude to CollegeFootballData.com, a free API with deep historical stats, advanced metrics, recruiting data, and play-by-play going back decades. Its data goes beyond what frontier AI models are trained on. Getting it working was straightforward — there’s a gofastmcp.com tutorial for that. Getting Claude to use it well required understanding something that’s easy to overlook: the key interface between an LLM and a tool is the docstring. ...
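The point generalizes to any FastMCP server: the docstring is the part the model actually reads. A hedged sketch of a tool where the docstring carries the routing hints; the tool name, parameters, and wording are illustrative, not this server's actual code:

```python
# Illustrative FastMCP tool: the docstring tells the model when to use
# the tool and when not to. Name, params, and wording are assumptions.
from fastmcp import FastMCP

mcp = FastMCP("college-football-data")

@mcp.tool()
def get_team_season_stats(team: str, year: int) -> dict:
    """Get season-level stats for one college football team.

    Use this when the user asks about a specific team's performance in a
    specific season (wins, yards, turnovers). Do NOT use it for single
    games or player stats; other tools cover those.

    Args:
        team: Official school name, e.g. "Alabama", not "Bama".
        year: Four-digit season year, e.g. 2023.
    """
    return {"team": team, "year": year}  # stub: call the CFBD API here

if __name__ == "__main__":
    mcp.run()
```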

March 15, 2026 · 9 min · Tyler

Scoring RAG Answer Quality with an LLM Judge

The previous post in this series built an eval harness that scores retrieval quality: does the right documentation page appear in the retrieved chunks? 7/8 passing, 88%. A useful signal. But retrieval quality and answer quality are different things. A test can pass retrieval scoring and still produce a bad answer. A test can fail retrieval scoring and still produce a correct one. Source URL retrieval is a proxy — a fast, cheap proxy that catches a lot of problems, but not all of them. ...
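The judge pattern the post builds toward looks roughly like this: grade an answer against a reference and parse a structured verdict. A sketch assuming the Anthropic Python SDK, with a placeholder model id and rubric wording:

```python
# Sketch of an LLM-judge call. Model id and rubric wording are assumptions.
import json
from anthropic import Anthropic

client = Anthropic()

def judge(question: str, answer: str, reference: str) -> dict:
    prompt = (
        "Grade the ANSWER to the QUESTION against the REFERENCE.\n"
        'Return JSON only: {"correct": true|false, "reason": str}.\n\n'
        f"QUESTION: {question}\nREFERENCE: {reference}\nANSWER: {answer}"
    )
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(msg.content[0].text)
```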

January 26, 2026 · 9 min · Tyler

How to Design RAG Eval Test Cases

Getting a RAG pipeline working is easy. Knowing whether it will keep working after you change something is harder, and most projects skip that part entirely. Here the focus is designing an eval harness that catches real problems, using the Anthropic docs RAG agent as the example. What an eval harness does: it is a script that runs a fixed set of test cases against your pipeline and produces a pass/fail score. Run it before and after a change — if the score drops, the change broke something. If it improves, the change helped. ...
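The whole harness can be a page of Python. A minimal sketch of the shape: the `run_pipeline` callable and the test-case fields are assumptions, not the post's actual cases:

```python
# Minimal eval-harness shape: fixed test cases in, pass/fail score out.
# run_pipeline and the test-case fields are assumptions.
TEST_CASES = [
    {"query": "How do I set max_tokens?",
     "expected_url": "https://docs.anthropic.com/en/api/messages"},
    # ...more cases, one per failure mode you care about
]

def run_evals(run_pipeline) -> float:
    passed = 0
    for case in TEST_CASES:
        retrieved_urls = run_pipeline(case["query"])  # returns source URLs
        ok = case["expected_url"] in retrieved_urls
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case['query']}")
    score = passed / len(TEST_CASES)
    print(f"{passed}/{len(TEST_CASES)} passing ({score:.0%})")
    return score
```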

January 24, 2026 · 7 min · Tyler