Creating a Multi-Agent System with Python and LangChain

I had a pipeline that kept eating itself.

The task was straightforward: research current trends in vector database pricing, then write a short summary. One agent, a handful of tools, a careful system prompt. It started well — pulling results, skimming content, taking notes. Then halfway through composing a paragraph, something it wrote made it doubt its own data. So it searched again. Then wrote some more. Then re-searched because a sentence it drafted referenced a number it couldn't confirm.

Four minutes and $0.80 in API calls later, I had output that mixed raw search snippets with half-composed paragraphs. Not a bug — just the expected behavior of one model trying to do two incompatible jobs at once.

You'll need Python 3.10+, an Anthropic API key, a Tavily API key, and four packages: langchain, langchain-anthropic, langchain-community, and tavily-python.

The Architecture Before You Write a Line of Code

Split reasoning from retrieval.

The simplest useful multi-agent setup has three components: a Researcher that searches the web and returns structured findings, a Writer that takes those findings and produces prose, and a Supervisor function that coordinates the handoff between them. No agent needs to know the others exist.

Search and synthesis are genuinely different cognitive tasks. Search rewards breadth and tolerance for noisy, conflicting results. Synthesis rewards focus, consistency of voice, and the ability to ignore irrelevant detail. When one agent switches between those modes mid-task, the context window fills with partial work in both states — and the output is often partial in both senses.

There's a practical argument for separation too. A Researcher that's pulled 2,000 words of raw search results across four queries has a very different context window state than a Writer starting fresh from a structured brief. Keeping them separate keeps each agent's context clean and predictable.
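
To make the shape concrete, here's a minimal sketch with both agents stubbed out as plain functions. The names stub_researcher, stub_writer, and supervisor are placeholders I've made up for illustration, and nothing here calls LangChain or a model yet:

```python
from typing import Callable

# Hypothetical stand-in type: each "agent" is just a function from text to text.
Agent = Callable[[str], str]

def stub_researcher(topic: str) -> str:
    # A real Researcher would call a search tool; this returns canned notes.
    return f"- fact about {topic}\n- caveat about {topic}"

def stub_writer(notes: str) -> str:
    # A real Writer would call an LLM; this just joins the notes into one line.
    facts = [line.lstrip("- ").strip() for line in notes.splitlines() if line.strip()]
    return "Summary: " + "; ".join(facts) + "."

def supervisor(topic: str, researcher: Agent, writer: Agent) -> str:
    # The supervisor is the only code that holds a reference to both agents.
    notes = researcher(topic)
    return writer(notes)
```

The part worth noticing is the signatures: neither stub takes the other as an argument, so either one can be swapped out or tested alone.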

When Does a Single Agent Fall Apart?

Three specific failure modes worth naming.

Context drift. When an agent is searching, writing, and re-evaluating all in the same window, it loses track of which phase it's in. I've watched agents treat their own intermediate output as freshly retrieved research — re-reading something they wrote two steps ago and citing it as a source. It looks like thoroughness. It's circular.

Tool interference. A writing agent that has access to a search tool will use it, even mid-paragraph. The model doesn't know that pulling three more results mid-sentence will dilute the synthesis it was building. An available action is an available action.

Output format instability. Ask a single agent to produce "research notes and then a summary" and you'll often get something that's technically both but practically neither: bullet fragments mixed with prose, neither format clean enough to pass downstream. The label usually comes after you've been burned by it.

None of these are bugs.

They're the expected behavior of a capable model doing two incompatible things at the same time.
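
Of the three, output format instability is the one you can catch mechanically. Here's a rough heuristic, assuming bullets start with "-", "*", or "•" and that a 20% minority share counts as "mixed". Both are arbitrary choices of mine, not anything standard, and format_mix and is_mixed are made-up helper names:

```python
def format_mix(text: str) -> dict[str, int]:
    """Count bullet-style lines versus prose lines, ignoring blank lines."""
    counts = {"bullets": 0, "prose": 0}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith(("- ", "* ", "• ")):
            counts["bullets"] += 1
        else:
            counts["prose"] += 1
    return counts

def is_mixed(text: str, threshold: float = 0.2) -> bool:
    """Flag text where the less common format still makes up `threshold` of the lines."""
    counts = format_mix(text)
    total = counts["bullets"] + counts["prose"]
    if total == 0:
        return False
    minority = min(counts["bullets"], counts["prose"])
    return minority / total >= threshold
```

A check like this won't tell you why the output is muddled, only that it is, which is often enough to decide whether to pass it downstream.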

Set Up Your Agents

pip install langchain langchain-anthropic langchain-community tavily-python

Export both API keys before running:

export ANTHROPIC_API_KEY="your-anthropic-key"
export TAVILY_API_KEY="your-tavily-key"
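
If you'd rather fail fast than discover a missing key halfway through a run, a small check along these lines works. The helper name missing_keys is mine, and the only assumption is that both clients read their keys from the environment variables exported above:

```python
import os
from collections.abc import Mapping
from typing import Optional

REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "TAVILY_API_KEY")

def missing_keys(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]

# Typical use at the top of a script:
#   if missing := missing_keys():
#       raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```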

The Researcher uses create_tool_calling_agent, which is the right choice for Claude — it uses the native tool-calling API rather than the text-based ReAct format. The Writer doesn't need an agent wrapper at all. It's a direct LLM call with a focused system prompt:

from langchain_anthropic import ChatAnthropic
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatAnthropic(model="claude-sonnet-4-6")  # or claude-opus-4-8 for complex tasks

search_tool = TavilySearchResults(max_results=4)

researcher_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are a research specialist. Use the search tool to gather accurate, "
        "up-to-date information on the topic you're given. Return structured findings: "
        "key facts, recent developments, relevant context, and any notable caveats. "
        "Return organized notes only — not prose."
    ),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

researcher_agent = create_tool_calling_agent(llm, tools=[search_tool], prompt=researcher_prompt)
researcher = AgentExecutor(
    agent=researcher_agent,
    tools=[search_tool],
    handle_parsing_errors=True,
)

WRITER_SYSTEM = (
    "You are a technical writer. You receive structured research findings and "
    "synthesize them into clear, accurate prose. Write in a direct, informative style. "
    "You have no search tools — work only with what you're given. "
    "If the research contains gaps, note them briefly rather than filling them in with assumptions."
)

The Researcher is an agent because it needs to decide when and how to call the search tool — it might call it once or four times depending on what the topic requires. The Writer doesn't need that flexibility. It takes structured text in and produces prose out. There's no reason to wrap a straight LLM call in agent overhead just because everything else is called an agent.

Wire Up the Supervisor

The supervisor isn't an agent. It's a function that knows the pipeline:

from typing import TypedDict

class PipelineState(TypedDict):
    topic: str
    research: str
    final_output: str

def run_pipeline(topic: str) -> PipelineState:
    state: PipelineState = {"topic": topic, "research": "", "final_output": ""}

    # Researcher gathers findings
    result = researcher.invoke({
        "input": (
            f"Research this topic and return structured findings:\n{topic}\n\n"
            "Include key facts, recent developments, context, and any significant caveats."
        )
    })
    state["research"] = result["output"]

    # Writer synthesizes — direct LLM call, no agent overhead
    response = llm.invoke([
        SystemMessage(content=WRITER_SYSTEM),
        HumanMessage(content=(
            f"Write a clear, 400-word summary based on these findings:\n\n"
            f"{state['research']}\n\n"
            f"Topic: {topic}. Audience: technical developers."
        )),
    ])
    state["final_output"] = response.content

    return state

The PipelineState TypedDict is optional but useful. It makes the intermediate research output independently inspectable: you can log it, store it, run a quality check against it, or pass it to additional downstream agents without tangling the function logic. And if you persist that state once the research step finishes, a failed Writer call can be retried from the checkpoint without re-running the Researcher. As written, run_pipeline keeps the state in memory only, so that persistence is something you add.
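
One way to make the pipeline resumable is to write the state to disk right after the research step. This is a sketch, assuming a local JSON file is an acceptable checkpoint store. The two steps are passed in as plain callables, and research_fn, write_fn, and run_with_checkpoint are hypothetical names rather than anything from LangChain:

```python
import json
from pathlib import Path
from typing import Callable, Optional, TypedDict

class PipelineState(TypedDict):
    topic: str
    research: str
    final_output: str

def save_state(state: PipelineState, path: Path) -> None:
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")

def load_state(path: Path) -> Optional[PipelineState]:
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))

def run_with_checkpoint(
    topic: str,
    research_fn: Callable[[str], str],
    write_fn: Callable[[str, str], str],
    path: Path,
) -> PipelineState:
    state = load_state(path)
    # Reuse saved research only if it belongs to the same topic and is non-empty.
    if state is None or state["topic"] != topic or not state["research"]:
        state = {"topic": topic, "research": research_fn(topic), "final_output": ""}
        save_state(state, path)  # persist before the Writer runs
    state["final_output"] = write_fn(state["research"], topic)
    save_state(state, path)
    return state
```

If write_fn raises, the file already holds the research, so calling run_with_checkpoint again with the same topic and path skips straight to the Writer.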

One thing that's genuinely fiddly about this pattern is conditional routing. If the Researcher comes back with thin results — two sources, heavy caveats — there's no built-in mechanism to re-run the search with a different query before handing off to the Writer. You'd add that as an if-branch in the supervisor function. For simple cases that works fine. For more complex pipelines where you want graph-based conditional routing between agents, that's when LangGraph earns its abstraction cost — but it is a real cost, not a free upgrade.
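
That if-branch might look something like this. The word-count threshold is a crude stand-in for a real quality check, and the rephrased query string is just an example. research_fn is any callable that takes a query and returns notes, such as a thin wrapper around researcher.invoke; the function names here are my own:

```python
from typing import Callable

def looks_thin(research: str, min_words: int = 150) -> bool:
    # Assumed heuristic: short notes usually mean few sources were found.
    return len(research.split()) < min_words

def research_with_retry(
    topic: str,
    research_fn: Callable[[str], str],
    max_attempts: int = 2,
    min_words: int = 150,
) -> str:
    """Re-run the research step with a broader query while results look thin."""
    query = topic
    best = ""
    for _ in range(max_attempts):
        notes = research_fn(query)
        # Keep whichever attempt produced the most material so far.
        if len(notes.split()) > len(best.split()):
            best = notes
        if not looks_thin(best, min_words):
            break
        # Example rephrasing only; a real pipeline might ask the model for one.
        query = f"{topic}: broader overview with recent sources"
    return best
```

The cap on attempts matters more than the heuristic. Without it, a topic with few good sources turns back into the loop this design was meant to avoid.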

Run It End-to-End

if __name__ == "__main__":
    topic = "Current pricing models for managed vector database services"

    result = run_pipeline(topic)

    print("=== Research Findings ===")
    print(result["research"])
    print("\n=== Final Summary ===")
    print(result["final_output"])

On a typical run, the Researcher calls the search tool two to four times, processes 800–1,500 words of raw results into structured notes, and returns 300–500 words of organized findings. The Writer then synthesizes those in a single pass — no second-guessing, no re-searches.

Total latency is typically 15–25 seconds on claude-sonnet-4-6 with a live search step. That's too slow for interactive use, but that's not the point: this pattern belongs in background jobs, scheduled pipelines, and async workflows where the user triggers a task and comes back to a finished result.
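
One way to push the pipeline into the background without adding a job queue is asyncio.to_thread, which runs a blocking call in a worker thread. This is a sketch, assuming the pipeline is a plain synchronous function shaped like run_pipeline; run_in_background and run_many are names I've made up:

```python
import asyncio
from typing import Callable

async def run_in_background(pipeline: Callable[[str], dict], topic: str) -> dict:
    # asyncio.to_thread (Python 3.9+) keeps the event loop free while the
    # blocking pipeline call runs in a worker thread.
    return await asyncio.to_thread(pipeline, topic)

async def run_many(pipeline: Callable[[str], dict], topics: list[str]) -> list[dict]:
    # gather returns results in the same order as the topics passed in.
    return await asyncio.gather(*(run_in_background(pipeline, t) for t in topics))

# Usage, assuming run_pipeline is defined as earlier in this post:
#   results = asyncio.run(run_many(run_pipeline, ["topic one", "topic two"]))
```

Running several topics concurrently multiplies your API spend and may hit rate limits, so treat the fan-out as something to size deliberately.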

The quality improvement over a single-agent approach shows up most clearly in debuggability. When something goes wrong, the separate state makes it obvious whether the failure was in the research or the synthesis. With a combined single-agent pipeline, those failure modes are tangled in one context window and much harder to separate.
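
To get that separation into your logs and not just your head, you can wrap each step in a small recorder. This is a sketch under my own assumptions: run_stage is a hypothetical helper, and the fields it records (character count, elapsed seconds) are just the ones I'd reach for first:

```python
import time
from typing import Any, Callable

def run_stage(name: str, fn: Callable[..., Any], log: list[dict], *args: Any) -> Any:
    """Run one pipeline stage and append a record of how it went to `log`."""
    start = time.perf_counter()
    try:
        result = fn(*args)
    except Exception as exc:
        # Record which stage failed before re-raising, so the log names the culprit.
        log.append({
            "stage": name,
            "ok": False,
            "error": repr(exc),
            "seconds": time.perf_counter() - start,
        })
        raise
    log.append({
        "stage": name,
        "ok": True,
        "chars": len(str(result)),
        "seconds": time.perf_counter() - start,
    })
    return result
```

With both steps routed through it, a failed run leaves a log whose last entry tells you whether to look at the search results or the writing prompt.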

Wrapping Up

Where it gets genuinely hard is feedback loops — a fact-checker that routes questionable claims back through the Researcher, or a critic agent that scores output quality before the result is returned. I don't have a clean answer for when that complexity pays off. LangGraph handles it well, but you're now thinking in terms of state machines, conditional graph edges, and typed state schemas — real design overhead.
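
For a sense of what the simplest version of such a loop involves without LangGraph, here's a bounded critic loop as a sketch. write_fn and critic_fn are hypothetical callables (an LLM call each, in practice), and the 0.7 threshold and three-round cap are placeholder values, not recommendations:

```python
from typing import Callable

def write_with_critic(
    research: str,
    write_fn: Callable[[str, str], str],            # (research, feedback) -> draft
    critic_fn: Callable[[str], tuple[float, str]],  # draft -> (score, feedback)
    min_score: float = 0.7,
    max_rounds: int = 3,
) -> tuple[str, float, int]:
    """Redraft until the critic's score clears min_score or the rounds run out."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = ""
    draft, score, rounds = "", 0.0, 0
    for rounds in range(1, max_rounds + 1):
        draft = write_fn(research, feedback)
        score, feedback = critic_fn(draft)
        if score >= min_score:
            break
    # Returns the last draft even if it never cleared the bar.
    return draft, score, rounds
```

Even this small version raises the questions that make feedback loops hard: what the critic scores against, and what to do with a draft that never passes.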

For pipelines that are actually linear — research, then write, then done — the sequential approach above is easier to build, easier to debug, and easier to extend incrementally. Start there. Add the graph abstraction when you hit the specific limits of a sequential design, not as the default opening move.