This is the final post in the series: a retrospective on the multi-agent document analysis project.

The first posts covered:

  • why use multiple agents
  • how the specialist agents work
  • how the coordinator synthesizes findings

This one covers what worked, what broke, and what I would change.

In short, the architecture worked, the coordinator was the most valuable part, and chunking caused the worst failure mode.


What worked

The BaseAgent abstraction was enough. I did not need a framework. A simple base class handled the repeated LLM-call logic: model name, system prompt, max tokens, response cleaning, and JSON parsing.

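For context, a minimal sketch of what that base class might look like. The run method name and the fence-stripping details are my reconstruction, but the responsibilities match the list above, and the call shape follows the Anthropic Messages API:

import json

class BaseAgent:
    def __init__(self, client, model, system_prompt, max_tokens=2048):
        self.client = client
        self.model = model
        self.system_prompt = system_prompt
        self.max_tokens = max_tokens

    def run(self, user_prompt: str) -> dict:
        # One LLM call using the shared configuration.
        response = self.client.messages.create(
            model=self.model,
            max_tokens=self.max_tokens,
            system=self.system_prompt,
            messages=[{"role": "user", "content": user_prompt}],
        )
        # Models sometimes wrap JSON in markdown fences; strip before parsing.
        text = response.content[0].text.strip()
        text = text.removeprefix("```json").removesuffix("```").strip()
        return json.loads(text)
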
That made each specialist mostly prompt design:

class RiskAnalyzer(BaseAgent):
    def __init__(self, client, model="claude-haiku-4-5"):
        super().__init__(
            client=client,
            model=model,
            system_prompt=RISK_ANALYZER_SYSTEM_PROMPT,
            max_tokens=2048,
        )

By the time I built the third specialist, the process felt routine. That is a good sign. The abstraction removed the mechanical work and left the interesting part: deciding what each specialist should care about.

Structured output made the system usable. Without JSON, this would just be a pile of markdown. With structured outputs, the coordinator can consume specialist findings consistently.

A risk finding has likelihood, impact, mitigation, and evidence. A cost finding has cost type, estimated impact, and vendor questions. A timeline finding has dependencies and schedule impact.

The schema shapes the reasoning.
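
To make that concrete, a single risk finding might look something like this (the values are illustrative, not from a real run):

risk_finding = {
    "category": "integration",
    "description": "Vendor API specifications are not included in the RFP.",
    "severity": "high",
    "likelihood": "likely",
    "impact": "Integration work cannot be scoped until specs are provided.",
    "evidence": "Section 3 references 'existing systems' without naming them.",
    "mitigation": "Request API documentation before committing to a bid.",
}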

The coordinator added real value. The specialist reports were useful, but too granular. The coordinator turned them into decision guidance.

The difference is obvious:

Without coordinator:
- 12 technical findings
- 14 risk findings
- 10 cost findings
- plus a set of timeline findings

With coordinator:
- 6 top concerns
- overall assessment
- recommended next steps

That is the actual product.

The multi-agent pattern found cross-domain issues. The strongest findings were not isolated. They appeared across agents.

Integration requirements were a technical ambiguity, a delivery risk, a cost driver, and a timeline dependency. The coordinator elevated that into a top concern.

That’s the point of the architecture.


What didn’t work as well

The technical agent drifted into other domains. The first technical prompt was too broad. It started flagging budget and schedule issues. Those were useful findings, but they belonged to other agents.

The fix was adding negative instructions:

Do not analyze budget, delivery timeline, vendor/legal risk, or general project
management risk unless it directly affects technical feasibility.

Specialist prompts need boundaries. Positive instructions are not enough.

The coordinator was overconfident. The first synthesis output looked great but included unsupported benchmark claims and large cost ranges. It sounded like an expert report, which made it more dangerous.

The fix was prompt constraints:

Do not state specific dollar ranges or industry benchmarks unless directly
supported by input.

When uncertainty exists, label it as uncertainty instead of guessing.

This is the main risk with synthesis: small inferences can get amplified into confident recommendations.

Chunking broke the output. This was the most interesting failure.

I added chunking so the system could handle longer documents:

def chunk_text(text: str, chunk_size: int = 3000, overlap: int = 300) -> list[str]:
    # Fixed-size character windows with overlap, so content near a
    # boundary lands intact in at least one chunk.
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        # Advance by less than a full chunk to create the overlap.
        start += chunk_size - overlap

    return chunks

Then I tested it on a short sample RFP.

Bad idea.

The document split into two chunks. One chunk ended mid-section. The agent treated the chunk boundary like the document was actually truncated. The coordinator then produced a top concern about the RFP being incomplete.

The RFP was not incomplete. The chunk was.

That is an important failure mode.

Chunking solves context limits, but it removes global context. If you chunk when you don’t need to, you can make the output worse.

The fix was simple:

def should_chunk(text: str, threshold: int = 20000) -> bool:
    # Chunk only when the document is too long for a single pass.
    return len(text) > threshold

Short documents go through the full-document path. Long documents use chunking.

def analyze(self, rfp_text: str) -> dict:
    # Full-document analysis keeps global context; chunk only when the
    # document is too long for a single pass.
    if should_chunk(rfp_text):
        chunks = chunk_text(rfp_text, chunk_size=4000, overlap=300)
        agent_results = self.analyze_chunks(chunks)
    else:
        agent_results = self.analyze_full_document(rfp_text)

    # The coordinator synthesizes across all four specialist reports.
    summary = self.synthesize(
        agent_results["technical"],
        agent_results["risk"],
        agent_results["cost"],
        agent_results["timeline"],
    )

    return {
        "summary": summary,
        **agent_results,
    }

The rule is not sophisticated. It does not need to be yet.

The important lesson is that chunking is not automatically better. It is a fallback for large documents.


The biggest risk to output quality

There are four major quality risks:

A) agent prompts
B) coordinator prompt
C) chunking strategy
D) PDF extraction

For this project, the answer is split: each option wins on a different axis.

The most likely thing to break output quality is chunking strategy. It already caused the worst hallucination.

The most important thing for system value is the coordinator prompt. That is where raw findings become decision guidance.

The most reusable foundation is the BaseAgent + specialist pattern.

The weakest production dependency is probably PDF extraction. Text-based PDFs work. Scanned PDFs need OCR. Tables and figures are not handled well. For contracts and RFPs, that matters.


Cost and latency trade-offs

A single-agent system might require one LLM call.

This system requires at least:

TechnicalAnalyzer
RiskAnalyzer
CostAnalyzer
TimelineAnalyzer
Coordinator

Five calls for one short document.

For chunked documents, the cost scales quickly:

number_of_chunks × number_of_agents + coordinator

A 10-chunk document with four specialists is 41 calls.
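
The arithmetic, as a one-liner:

def estimated_calls(num_chunks: int, num_agents: int) -> int:
    # Every agent sees every chunk, then the coordinator runs once.
    return num_chunks * num_agents + 1

estimated_calls(10, 4)  # 41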

That is why chunking decisions matter. It is also why parallel execution would be the first performance improvement.

Sequential execution is easier to debug:

Technical → Risk → Cost → Timeline → Coordinator

Parallel execution is better once the architecture is stable:

Technical ┐
Risk      ├── Coordinator
Cost      │
Timeline  ┘

The agents are independent, so the design supports parallelism later.
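
A minimal sketch of that fan-out with a thread pool (analyze_document is a stand-in for whatever per-agent entry point ends up existing; the calls are I/O-bound API requests, so threads are enough):

from concurrent.futures import ThreadPoolExecutor

def run_specialists(agents: dict, rfp_text: str) -> dict:
    # Submit all specialist calls at once, then collect results by name.
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {
            name: pool.submit(agent.analyze_document, rfp_text)
            for name, agent in agents.items()
        }
        return {name: f.result() for name, f in futures.items()}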


What I would build in v2

Pydantic validation. Every agent should have a schema. If the model omits evidence or returns an invalid severity level, the system should catch it immediately.

from typing import Literal

from pydantic import BaseModel

class RiskFinding(BaseModel):
    category: str
    description: str
    severity: Literal["low", "medium", "high", "critical"]
    likelihood: str
    impact: str
    evidence: str
    mitigation: str
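
With that in place, a malformed finding fails fast instead of flowing into synthesis (model_validate is the Pydantic v2 entry point; v1 would use parse_obj):

# raw_finding stands for one parsed JSON object from an agent.
finding = RiskFinding.model_validate(raw_finding)  # raises ValidationError on bad data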

Finding IDs and traceability. The coordinator should cite supporting findings by ID.

{
  "concern": "Integration specifications are insufficient",
  "supporting_findings": ["TECH-004", "RISK-005", "COST-004", "TIME-003"]
}

This would make the synthesis auditable.
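
A sketch of how the IDs could be attached before synthesis (assign_ids is a hypothetical helper; the zero-padded format just mirrors the example above):

def assign_ids(findings: list[dict], prefix: str) -> list[dict]:
    # Give each finding a stable, citable ID like "TECH-004".
    for i, finding in enumerate(findings, start=1):
        finding["id"] = f"{prefix}-{i:03d}"
    return findings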

Better chunking. Character-based chunking is a starting point, not a final solution. Section-aware chunking would be better for RFPs and contracts.

Instead of splitting every 4,000 characters, split on headings:

1. Overview
2. Functional Requirements
3. Security & Compliance
4. Timeline
5. Budget

That preserves document structure and reduces the “truncated section” problem.
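
A rough sketch of what heading-based splitting could look like, assuming numbered headings like the ones above (real RFP formatting would need a more forgiving pattern):

import re

def chunk_by_sections(text: str) -> list[str]:
    # Split immediately before lines like "3. Security & Compliance",
    # keeping each heading attached to its section body.
    parts = re.split(r"(?m)^(?=\d+\.\s)", text)
    return [part.strip() for part in parts if part.strip()]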

Deduplication before synthesis. Chunked analysis can produce duplicate findings. The coordinator can filter them, but it is better to reduce noise before synthesis.
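
Even a cheap pass would help. A sketch, assuming each finding is a dict with a description field:

def dedupe_findings(findings: list[dict]) -> list[dict]:
    # Exact-match dedup on normalized text; fuzzy matching or embeddings
    # would catch more, but this removes the obvious chunk-overlap repeats.
    seen = set()
    unique = []
    for finding in findings:
        key = finding["description"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(finding)
    return unique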

Parallel execution. The agents should run concurrently. This is an obvious latency win.

Evaluation harness. This is the big one. The system needs test RFPs with expected findings:

  • known technical gaps
  • known risk issues
  • known cost drivers
  • known timeline conflicts
  • expected coordinator concerns

Without this, prompt changes are subjective. With it, you can measure whether changes improve or degrade output quality.
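
Even a crude version goes a long way. A sketch, where known issues in a test RFP are labeled by keyword (everything here is hypothetical):

def finding_recall(expected_keywords: list[str], findings: list[dict]) -> float:
    # Fraction of known issues that show up anywhere in the findings.
    text = " ".join(f["description"].lower() for f in findings)
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)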


When this architecture is worth it

This is overkill for simple summarization.

It is worth it when:

  • documents are complex
  • findings span multiple domains
  • users need recommendations, not just extraction
  • missing an issue has real consequences
  • the output needs to support a decision

RFPs and contracts fit that pattern.

A contract clause can be:

  • a legal risk
  • a cost exposure
  • an operational constraint
  • a timeline dependency

A single-agent prompt can notice that. A multi-agent system makes it explicit and structured.


The broader lesson

The main thing I learned: multi-agent systems are not about having more agents.

They are about structuring analysis.

The useful pattern is:

document
→ specialized perspectives
→ structured findings
→ coordinator synthesis
→ decision guidance

The agents produce observations. The coordinator produces judgment.

That is the difference between a demo and something that feels like a real tool.


Source code

Full project: github.com/tylerwellss/multi-agent-doc-analysis


Series navigation

Previous: Coordinating Multiple LLM Agents: Cross-Domain Synthesis

Series index:

  1. Why Multi-Agent Systems Beat Single Agents for Complex Documents
  2. Building Specialist LLM Agents: Technical, Risk, Cost, and Timeline Analysis
  3. Coordinating Multiple LLM Agents: Cross-Domain Synthesis
  4. What I Learned Building a Multi-Agent Document Analysis System