I built a document analysis system for RFPs and contracts using multiple specialist LLM agents instead of one general-purpose prompt.

The architecture is simple:

PDF → text extraction
    → Technical Analyzer
    → Risk Analyzer
    → Cost Analyzer
    → Timeline Analyzer
    → Coordinator synthesis
    → final report
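
In code, the flow above can be sketched roughly like this. The `extract_text` helper, the agent objects, and `coordinator.synthesize` are hypothetical stand-ins for the components described later in the post:

```python
# Rough sketch of the pipeline wiring; the helpers passed in are
# stand-ins, not the project's actual implementation.
def analyze_document(pdf_path, extract_text, agents, coordinator):
    text = extract_text(pdf_path)            # PDF -> text extraction
    findings = {name: agent.analyze(text)    # one pass per specialist
                for name, agent in agents.items()}
    return coordinator.synthesize(findings)  # coordinator synthesis -> report
```

The point of the sketch is the shape: extraction happens once, every specialist sees the same text, and synthesis happens at the end.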

The interesting part is not that it calls an LLM. That’s easy. The interesting part is how much the output changes when the model is forced to analyze the same document through different lenses before producing a final answer.

This post covers why the multi-agent structure is worth the extra complexity.


The problem with one big prompt

The obvious first version is a single prompt:

Analyze this RFP. Identify technical issues, risks, costs, timeline concerns,
and recommended next steps. Return JSON.

This works in the same way most first-pass LLM apps work: it produces something that looks useful.

The problem is that the output blends everything together. A finding about SSO might be written as a technical issue, a security risk, a compliance issue, and a cost driver all in one paragraph. That sounds smart, but it is hard to use programmatically.

The model is doing all the reasoning at once:

  • extracting requirements
  • identifying risks
  • estimating cost exposure
  • judging delivery feasibility
  • prioritizing findings
  • formatting the final report

That’s too much responsibility in one pass.

The result is usually broad but shallow.


The specialist approach

Instead of one generalist prompt, the system uses specialist agents:

TechnicalAnalyzer → architecture, integrations, APIs, performance, security implementation
RiskAnalyzer      → legal, compliance, operational, vendor, delivery, security risk
CostAnalyzer      → licensing, infrastructure, migration, support, hidden costs
TimelineAnalyzer  → milestones, dependencies, schedule realism, delivery risk

Each agent receives the same document text. The difference is the system prompt and the output schema.

The technical agent is not asked to think about every risk. It is asked to find technical ambiguity and feasibility gaps.

The risk agent is not asked to price the system. It is asked to identify what could go wrong, how likely it is, and what the impact would be.

The cost agent is not asked to design the architecture. It is asked to identify cost drivers, hidden cost categories, and vendor questions.

Same document. Same model. Different reasoning lens.
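
The post does not show the fan-out code, but because the specialists are independent and receive the same text, nothing forces the calls to run serially. A minimal sketch, assuming agent objects with an `analyze(text)` method like the `BaseAgent` below (the parallelism is my addition, not necessarily the project's implementation):

```python
from concurrent.futures import ThreadPoolExecutor


def run_specialists(agents: dict, text: str) -> dict:
    """Run every specialist over the same document text in parallel."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {name: pool.submit(agent.analyze, text)
                   for name, agent in agents.items()}
        # Collect results under the same keys the agents were registered with.
        return {name: future.result() for name, future in futures.items()}
```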


The architecture

The first useful abstraction was a BaseAgent:

import json


def clean_json_response(text: str) -> str:
    """Strip the markdown code fences the model sometimes wraps JSON in."""
    text = text.strip()

    if text.startswith("```json"):
        text = text.removeprefix("```json").strip()

    if text.startswith("```"):
        text = text.removeprefix("```").strip()

    if text.endswith("```"):
        text = text.removesuffix("```").strip()

    return text


class BaseAgent:
    def __init__(self, client, model, system_prompt, max_tokens=1024):
        self.client = client
        self.model = model
        self.system_prompt = system_prompt
        self.max_tokens = max_tokens

    def analyze(self, text: str) -> dict:
        """Call the model with the shared system prompt and parse the JSON reply."""
        message = self.client.messages.create(
            model=self.model,
            max_tokens=self.max_tokens,
            system=self.system_prompt,
            messages=[
                {
                    "role": "user",
                    "content": text,
                }
            ],
        )

        response_text = message.content[0].text
        cleaned = clean_json_response(response_text)
        return json.loads(cleaned)

This class does not know anything about RFPs, risks, costs, or timelines. It only knows how to:

  • call the model
  • pass a system prompt
  • receive text
  • clean JSON
  • return a Python dict

That makes the specialist agents small.

from agents.base_agent import BaseAgent


TECHNICAL_ANALYZER_SYSTEM_PROMPT = """
You are a technical requirements analyst.

Analyze RFPs, contracts, and technical specifications.

Focus on:
- architecture ambiguity
- integration requirements
- API requirements
- performance and scalability
- security implementation details
- missing technical specifications

Do not analyze budget, delivery timeline, vendor/legal risk, or general project
management risk unless it directly affects technical feasibility.

Return JSON only in this format:
{
  "agent": "technical",
  "findings": [
    {
      "category": "",
      "description": "",
      "severity": "",
      "evidence": ""
    }
  ]
}
"""


class TechnicalAnalyzer(BaseAgent):
    def __init__(self, client, model="claude-sonnet-4-6"):
        super().__init__(
            client=client,
            model=model,
            system_prompt=TECHNICAL_ANALYZER_SYSTEM_PROMPT,
            max_tokens=2048,
        )

The RiskAnalyzer, CostAnalyzer, and TimelineAnalyzer follow the same pattern. The implementation is boring by design. The only thing that changes is the prompt and schema.

That is exactly what you want.
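
For illustration, the RiskAnalyzer's prompt might look like the following. The wording is my paraphrase of the responsibilities described above, not the project's actual prompt, and the class wrapping it is identical to TechnicalAnalyzer apart from the constant and the token budget:

```python
# Paraphrased illustration of a risk-analysis prompt; the project's
# real wording is not shown in the post.
RISK_ANALYZER_SYSTEM_PROMPT = """
You are a risk analyst for RFPs and contracts.

Focus on:
- legal and compliance risk
- operational and vendor risk
- delivery and security risk

For each finding, state what could go wrong, how likely it is,
and what the impact would be.

Do not analyze technical design, cost, or scheduling unless it
directly creates risk.

Return JSON only in this format:
{
  "agent": "risk",
  "findings": [
    {
      "category": "",
      "description": "",
      "likelihood": "",
      "impact": "",
      "evidence": ""
    }
  ]
}
"""
```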


Why structured output matters

The first time I asked Claude for JSON, it returned this:

````text
```json
{
  "requirements": [...]
}
```
````

That looks fine in a chat UI. It breaks `json.loads()`.

This is one of the annoying realities of building LLM applications: "return JSON only" does not mean the model will return parseable JSON every time. Sometimes it wraps the JSON in markdown fences. Sometimes it adds a sentence before the object. Sometimes it produces malformed JSON.

The cleaner above is not a complete solution, but it handles the common markdown wrapper failure mode.

For a real production system, I would add:
- Pydantic validation
- retry-on-invalid-JSON logic
- structured model outputs if the provider supports them
- schema-specific error handling

For this project, the lightweight cleaner was enough to keep moving.
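
As one concrete example of the retry idea, here is a stdlib-only sketch. `call_model` is a hypothetical wrapper around the provider SDK that returns the response text; this is not the project's code:

```python
import json


def parse_with_retry(call_model, text, max_attempts=3):
    """Call the model; if the reply is not valid JSON, retry with an
    error hint appended to the prompt."""
    prompt = text
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            # Feed the parse error back so the next attempt can self-correct.
            prompt = (f"{text}\n\nYour previous reply was not valid JSON "
                      f"({err}). Return JSON only.")
    raise ValueError(f"no valid JSON after {max_attempts} attempts")
```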


Single-agent vs multi-agent output

The single-agent version tends to produce one mixed list of findings:

```text
- The 6-month timeline is risky because of integrations, migration, compliance,
  and cost uncertainty.
- The budget may be too low because the technical requirements are complex.
- Security requirements are vague and create compliance risk.
```

That’s not wrong. It’s just not very structured.

The specialist version separates the same issue into different dimensions:

Technical:
- Integration requirements lack data sync, authentication, and error handling details.

Risk:
- Undefined integrations create high likelihood of delivery delays and post-award disputes.

Cost:
- Salesforce, Okta, and data warehouse integrations may require custom development,
  creating $30K-$100K+ in professional services exposure.

Timeline:
- Integration discovery and testing are not accounted for in the 6-month deployment plan.

That is more useful because the coordinator can now see the same underlying issue from multiple angles.

If three specialists independently flag integrations, that issue deserves to be elevated.
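
A crude way to surface that overlap mechanically, before any coordinator call, is to count how many distinct agents mention a theme. The keyword matching below is a naive stand-in; in the actual system, the coordinator model does this matching:

```python
from collections import Counter


def cross_agent_themes(agent_findings: dict, keywords: list) -> Counter:
    """Count how many different agents mention each keyword in a
    finding description. Naive substring matching, for illustration only."""
    hits = Counter()
    for kw in keywords:
        # A boolean per agent: did any of its findings mention the keyword?
        hits[kw] = sum(
            any(kw in f["description"].lower() for f in findings)
            for findings in agent_findings.values()
        )
    return hits
```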


The coordinator makes the specialists useful

The specialist outputs are not the final product.

Without synthesis, the system just produces four reports. That’s better than one messy report, but still too much for a decision-maker.

The coordinator takes the specialist findings and produces the final report:

{
  "summary": {
    "overall_assessment": "",
    "top_concerns": [],
    "recommended_next_steps": []
  },
  "cross_domain_findings": [
    {
      "theme": "",
      "description": "",
      "related_agents": [],
      "severity": "",
      "action_required": ""
    }
  ],
  "agent_findings": {
    "technical": [],
    "risk": [],
    "cost": [],
    "timeline": []
  }
}

The coordinator’s job is not aggregation. Aggregation would be “here are all the findings.”

The coordinator’s job is synthesis: “these are the few issues that matter most, and here is what to do next.”

That distinction is the whole project.
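
One way to wire that up, assuming the coordinator is just another LLM call: serialize the specialist outputs into a single user message that explicitly asks for synthesis rather than aggregation. The framing text here is illustrative, not the project's actual prompt:

```python
import json


def build_coordinator_prompt(agent_findings: dict) -> str:
    """Assemble the specialist outputs into one user message for the
    coordinator call. Illustrative framing, not the project's prompt."""
    sections = [
        f"## {name} findings\n{json.dumps(findings, indent=2)}"
        for name, findings in agent_findings.items()
    ]
    return (
        "Synthesize the specialist findings below into a final report. "
        "Elevate issues flagged by multiple agents; do not simply aggregate.\n\n"
        + "\n\n".join(sections)
    )
```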


What the system found on a sample RFP

I tested the system on a mock RFP for an enterprise document management system. The RFP included:

  • SSO with SAML or OAuth2
  • RBAC
  • full-text search
  • 99.9% uptime
  • 1 million documents
  • 5,000 concurrent users
  • Salesforce, data warehouse, and Okta integrations
  • GDPR, SOX, and SOC 2 requirements
  • 6-month deployment timeline
  • $500K-$1M budget

The coordinator elevated issues that crossed multiple domains:

1. The 6-month timeline is not realistic for the stated scope.
2. GDPR, US data residency, and long-term audit retention need legal clarification.
3. Security requirements are named but not specified.
4. Infrastructure requirements are not detailed enough to support the SLA.
5. Budget is not broken down enough to compare vendors.
6. Integration requirements are too vague for pricing or technical design.

The strongest finding was not any single agent’s output. It was the overlap.

Technical saw integration ambiguity. Risk saw delivery and vendor risk. Cost saw unpriced custom development. The coordinator turned that into a concrete recommendation: define integration scope before issuing the RFP.

That’s the value of the multi-agent pattern.


The trade-off

This architecture costs more than a single call.

A short document with four specialists plus coordinator synthesis is at least five LLM calls:

TechnicalAnalyzer
RiskAnalyzer
CostAnalyzer
TimelineAnalyzer
Coordinator

A long document can multiply that by the number of chunks.
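
The call budget is simple arithmetic. The sketch below assumes every chunk goes to every specialist and the coordinator fits all findings into one final call:

```python
def llm_call_count(num_chunks: int, num_specialists: int = 4) -> int:
    """Total LLM calls: one per (chunk, specialist) pair, plus a single
    coordinator synthesis pass at the end."""
    return num_chunks * num_specialists + 1
```

A one-chunk document costs 5 calls; a ten-chunk document already costs 41.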

The trade-off is worth it when:

  • the document is complex
  • multiple kinds of judgment are required
  • the output needs to support a decision
  • missing something has real consequences

It is probably overkill for:

  • short summaries
  • simple extraction
  • low-stakes documents
  • tasks where one schema is enough

This is the same pattern as most AI architecture decisions. More intelligence means more orchestration, more cost, and more failure modes. Use it when the extra structure buys you something.


The broader lesson

The value of a multi-agent system is not that it has multiple agents.

The value is that each agent has:

  • a narrow responsibility
  • a structured output
  • a different reasoning lens

Then a coordinator turns those perspectives into a decision.

That’s the pattern:

complex document → specialist analysis → cross-domain synthesis → action

Series navigation

Next: Building Specialist LLM Agents: Technical, Risk, Cost, and Timeline Analysis