Over the last couple of years we've gone from simple chatbots to copilots embedded in products to full-blown ecosystems built around agents. Frameworks keep popping up not because people are bored, but because the community is genuinely trying to answer a hard question: How do we turn raw models into reliable, evolving systems?
LangGraph, AutoGen, AG2, CrewAI, LlamaIndex, n8n, LangChain, you name it. Each one is exploring a different angle: graphs, conversations, crews, indices, visual workflows. That's exciting, but it can also be confusing. When you're actually trying to ship something, it's hard to see past the diagrams and marketing pages. So instead of comparing feature lists, I wanted to zoom out and ask a simpler question: What are the actual primitives these frameworks are built on?
And once we know those primitives, how do the major frameworks line up against them? This post is my attempt at an autopsy of agent orchestration, not to dunk on anything, but to put the shared building blocks on the table so it's easier to reason about trade-offs.
The anchor use-case: an AI App Generator
To keep this concrete, I'm going to use one running example throughout:
AI Application Generator: an agentic system that takes a textual idea and produces a small web app end-to-end.
Roughly, the flow looks like this:
- Requirements: Understand what needs to be built, constraints, stack preferences, non-goals.
- Architecture: Break it down into components, data flow, APIs, contracts between frontend and backend.
- Code generation (with ReAct loops): Generate backend + UI code; while coding, call tools to: run tests, check types, lint, inspect logs, re-consult the architecture.
- Tests & Fixes: Run the app, run tests, capture failures, loop on fixes.
- Packaging: Produce a repo / zip / deployment manifest + README.
Along the way we're maintaining some shared state:
- reqs: user intent, constraints, functional and non-functional requirements
- arch: components, dataflow, interfaces
- plan: ordered list of work items, budgets, stop conditions
- code_artifacts: files, diffs, build logs
- test_results: unit / e2e logs, failing steps
- references: web / docs snippets, with citations
- kb: a knowledge base / retriever over requirements + references (+ optional KG)
- run_meta, run_id: checkpoints, budgets, trace IDs
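If it helps to see that as code, here's a rough sketch of the shared state as a TypedDict. The field names come straight from the list above; the concrete types (dicts and lists everywhere) are just my assumption:

from typing import TypedDict

class AppGenState(TypedDict, total=False):
    reqs: dict              # user intent, constraints, functional / non-functional requirements
    arch: dict              # components, dataflow, interfaces
    plan: list              # ordered work items, budgets, stop conditions
    code_artifacts: dict    # files, diffs, build logs
    test_results: dict      # unit / e2e logs, failing steps
    references: list        # web / docs snippets, with citations
    kb: object              # knowledge base / retriever handle (vector DB, optional KG)
    run_meta: dict          # checkpoints, budgets, trace IDs
    run_id: str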
Now, with that in the back of your mind, let's talk primitives.
The 8 real primitives of agentic systems
After reading docs and tutorials and poking at production-ish systems, I kept seeing the same eight ideas:
- Orchestration & Coordination - Who does what, when, and with what state.
- Agent loop (ReAct) - The inner loop: observe → think → act (often via tools) → observe → repeat.
- Retrieval-Augmented Generation (RAG) - Getting the right external knowledge into the context window at the right time.
- Memory model - Short-term scratchpad vs. long-term memory vs. structured knowledge graph.
- Tools & registry - What actions agents can take (functions, APIs) and how those capabilities are exposed.
- Durability & recoverability - Checkpointing, resuming, and not losing your mind (or state) when something crashes.
- Observability & evaluation hooks - Being able to see what happened and decide if it was any good.
- Connectors - How agents talk to the rest of your world: Slack, Git, Notion, webhooks, CRMs, etc.
Let's walk through each primitive, using the AI App Generator as our anchor, and see how LangGraph, AG2/AutoGen, and CrewAI express them.
A quick naming note: AG2 is the open-source evolution of the original AutoGen 0.2 project, while Microsoft's current AutoGen (v0.4 and beyond) continues as a separate actor-model framework. When I say "AG2/AutoGen" below, I mean this conversation-first, supervisor-driven style of multi-agent orchestration, not a single codebase.
1. Orchestration & Coordination
What this is → The control and collaboration plane of the system: it decides which step runs next and who owns it.
Why it matters → Agent systems are still workflows with varying autonomy. Underneath the vibes, there's a deterministic decision tree:
- When do we move from requirements → architecture → code?
- What happens if tests fail?
- Who gets called when we need a re-plan?
Making that tree explicit is what separates "cute demo" from "debuggable system".
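Strip away the frameworks and that decision tree is literally a routing function over the shared state. A minimal sketch (the phase names anticipate the mapping below; the exact conditions are mine):

def next_phase(state: dict) -> str:
    # the deterministic backbone: given what we already have, what runs next?
    if not state.get("arch"):
        return "DraftArchitecture"
    if not state.get("code_artifacts"):
        return "GenerateBackend"
    if not state.get("test_results"):
        return "RunServerTests"
    if not state["test_results"].get("ok"):
        return "FixLoop"
    return "Package"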
AI App Generator mapping
Phases: GatherRequirements → DraftArchitecture → ValidatePlan → GenerateBackend → GenerateUI → RunServerTests → RunUITests → FixLoop → Package.
Each handoff carries state (reqs, arch, code_artifacts, test_results) forward. We want clear checkpoints after architecture, backend, and tests.
LangGraph: graphs and typed state as the control plane
LangGraph is graph-first. You model the above flow as nodes and edges over an explicit State object.
from typing import TypedDict
from langgraph.graph import StateGraph
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    reqs: str
    arch: dict
    code_artifacts: dict
    test_results: dict

def draft_arch(state: State) -> State:
    state["arch"] = llm_plan_architecture(state["reqs"])
    return state

def run_server_tests(state: State) -> State:
    state["test_results"] = run_tests(state["code_artifacts"])
    return state

G = StateGraph(State)
G.add_node("DraftArchitecture", draft_arch)
G.add_node("RunServerTests", run_server_tests)
# ... plus GatherRequirements, FixLoop, Package nodes added the same way
G.add_edge("GatherRequirements", "DraftArchitecture")
G.add_conditional_edges(
    "RunServerTests",
    lambda s: "Package" if s["test_results"]["ok"] else "FixLoop",
)
app = G.compile(checkpointer=MemorySaver())  # checkpointing enables resume/HIL at node boundaries

The graph is the orchestration; the State is the collaboration medium.
AG2 / AutoGen: supervisor-driven messaging
AG2 (and AutoGen style) is conversation-first. You create agents and a manager that decides who speaks next.
manager = Agent(role="manager", rules=[
"after requirements -> architect",
"after architecture -> coder",
"if tests fail -> send failing logs to coder",
"if tests pass -> package",
])
architect = Agent(role="architect")
coder = Agent(role="coder", tools=[write_file, run_tests])
tester = Agent(role="tester", tools=[run_tests])
run(manager, participants=[architect, coder, tester], goal="build the app")

The "graph" now lives inside the manager's policy. You get more fuzziness and emergent behavior, but less explicit structure.
CrewAI: roles + tasks + process
CrewAI lives somewhere in between: coordination is expressed via roles and a process made of tasks.
architect = Agent("Architect")
coder = Agent("Coder")
tester = Agent("Tester")
plan_arch = Task("DraftArchitecture", agent=architect)
impl_code = Task("ImplementBackend", agent=coder)
run_tests = Task("RunServerTests", agent=tester)
process = Process([plan_arch, impl_code, run_tests])

Here the process object is effectively your orchestration graph; collaboration is framed in terms of who owns which task.
A spectrum: graphs → crews → conversations
In practice, I think of these frameworks as sitting on a spectrum of determinism vs. dynamic autonomy:
- LangGraph: most pre-declared and deterministic. The graph is the source of truth, every node and edge is explicit, and you get strong replay, checkpoints, and human-in-the-loop. I'd reach for this in higher-risk flows where debuggability and auditability matter (internal tools, compliance-sensitive automations, anything around money).
- CrewAI: a middle ground. You still give agents role prompts and tools, but the orchestration lives in a process made of tasks. It feels closer to a project plan: "Architect does this, then Coder, then Tester, with a critic in between". This maps nicely to customer service flows, research workflows, and other business pipelines.
- AG2 / AutoGen: most dynamic and emergent. A supervisor (or group chat manager) routes between agents based on conversation and state, often with the LLM itself helping decide what to do next. That makes it great for code generation and debugging, root-cause analysis, and helpdesk-style problem solving: anywhere you want more free-form reasoning and can tolerate a bit less determinism.
2. Agent loop (ReAct: perceive → decide → act → observe)
What this is → The inner loop inside a given phase:
- Look at context (state, logs, tests)
- Think
- Pick a tool and act
- Observe the result
- Repeat or stop
Why it matters → This is where "intelligence and reasoning" often lives.
AI App Generator mapping
We use ReAct loops in:
- GenerateBackend: incrementally write code, run tests, inspect failures, patch
- GenerateUI: same story for UI
Tools: write_file, run_tests, npm_install, launch_server, browser_check, git_diff.
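Before looking at how each framework dresses this up, here's the bare loop with a tool registry. It's a framework-agnostic sketch: llm_decide_next_action and the tool implementations are hypothetical placeholders.

TOOLS = {
    "write_file": write_file,
    "run_tests": run_tests,
    "browser_check": browser_check,
}

def react_step_loop(state: dict, max_steps: int = 8) -> dict:
    for _ in range(max_steps):                                     # budget as a stop condition
        action = llm_decide_next_action(state)                     # think: pick a tool + args, or stop
        if action["tool"] == "stop":
            break
        observation = TOOLS[action["tool"]](**action["args"])      # act
        state.setdefault("observations", []).append(observation)   # observe
    return state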
LangGraph: bounded loops inside nodes
One pattern is to keep the loop inside a node, and let the graph control whether we re-enter that node.
def fix_loop(state: State) -> State:
    for _ in range(8):  # budget
        patch = llm_propose_patch(state["test_results"], state["code_artifacts"])
        write_file(patch)
        state["test_results"] = run_tests(state["code_artifacts"])
        if state["test_results"]["ok"]:
            break
    return state

# registered like any other node: G.add_node("FixLoop", fix_loop)

The outer graph decides whether to go back into FixLoop or move forward.
AG2 / AutoGen: conversational loops
In AG2/AutoGen, your Coder agent iterates, and a Tester agent reports status; a supervisor watches budgets.
budget = 8
while budget:
msg = coder.step(messages) # may call write_file/run_tests
messages.append(msg)
if tester.sees_pass(messages):
break
    budget -= 1

Here the loop is an explicit control structure around the chat; in more advanced setups the supervisor might decide to break out based on test quality, cost, or time.
Or, in AutoGen's higher-level APIs:
assistant = AssistantAgent(name="coder", tools=[file_writer, test_runner])
user_proxy = UserProxyAgent(name="user", human_input_mode="NEVER")
user_proxy.initiate_chat(
assistant,
message="Generate a working Python Flask app."
)

The ReAct behavior is baked into the assistant's reasoning.
CrewAI: loop as process + critic
CrewAI pushes loop control up into the process:
coder_task = Task("FixLoop", tools=[write_file, run_tests], max_retries=8)
critic_task = Task("Critic")
process = Process([coder_task, critic_task])

Instead of an unbounded agent loop, you model "fix until critic approves or retries are exhausted".
3. Retrieval-Augmented Generation (RAG)
What this is → A pattern to ingest external knowledge → retrieve relevant chunks at runtime → ground model outputs with citations.
Why it matters → Most interesting apps don't live purely inside the model's weights. You want your AI App Generator to look at examples, docs, business constraints, etc., without copy-pasting them into every prompt.
Our placeholder → In code you'll see kb, kb_index, retriever; these stand in for whatever you actually use: a vector DB, SQL backend, LlamaIndex index, etc.
AI App Generator mapping
- During GatherRequirements / DraftArchitecture, we index web examples and docs into kb.
- During GenerateBackend / GenerateUI, we call retriever.query(...) to pull patterns and snippets.
- We store these in references and include citations in the final README.
LangGraph: retrieval as nodes
RAG becomes just another couple of nodes:
def index_references(state: State) -> State:
    state["kb_ids"] = kb_index.add(fetch_web_snippets(state["reqs"]))
    return state

def retrieve_context(state: State) -> State:
    state["ctx"] = retriever.query(state["arch"], top_k=8)
    return state

Both are added to the graph like any other node, and edges make sure retrieve_context runs before GenerateBackend.
AG2 / AutoGen: retrieval tools
In AG2/AutoGen, retrieval is a tool/skill:
def retrieve(q: str):
return retriever.query(q, top_k=8)
coder.tools += [retrieve]
if supervisor.needs_context(messages):
ctx = retrieve(query_from(arch))
messages.append(
Message(role="system", content=ctx)
)

The supervisor (or agent policy) decides when to refresh context.
CrewAI: retrieval tasks
In CrewAI, you give retrieval tools to your research / coding tasks and enforce "retrieve before generate" in the process definition.
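Roughly, and spelled out with CrewAI's actual constructor arguments rather than the shorthand above (the tool name, goals, and task descriptions are mine, and I'm reusing the retriever placeholder):

from crewai import Agent, Task, Crew, Process
from crewai.tools import tool

@tool("Retrieve references")
def retrieve_references(query: str) -> str:
    """Pull relevant snippets from the shared knowledge base."""
    return "\n\n".join(str(hit) for hit in retriever.query(query, top_k=8))

researcher = Agent(
    role="Researcher",
    goal="Collect reference patterns and constraints for the requested app",
    backstory="Reads docs and prior projects so the coder doesn't have to.",
    tools=[retrieve_references],
)

gather_context = Task(
    description="Retrieve reference material relevant to the architecture",
    expected_output="A short list of snippets with sources",
    agent=researcher,
)

# "retrieve before generate" falls out of task order in a sequential process
crew = Crew(
    agents=[researcher, coder],                  # coder / impl_code as sketched earlier
    tasks=[gather_context, impl_code],
    process=Process.sequential,
)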
4. Memory model
What this is → How the system remembers:
- Short-term: in-run scratchpad (state, messages)
- Long-term: persistent knowledge (vector DB, KG)
- Structured memory: entities/relationships (knowledge graph)
AI App Generator mapping
- Short-term: reqs, arch, plan, test_results, diffs.
- Long-term: references from previous runs or prior architectural decisions.
- Optional KG: Component A depends_on Component B, Page X tested_by suite Y.
LangGraph: typed state + external stores
Short-term lives in a typed state:
class MemoryState(TypedDict):
reqs: str
arch: dict
scratch: dict
def load_long_term(state: MemoryState) -> MemoryState:
    docs = kb_retriever.query(state["reqs"], top_k=5)
    state["scratch"]["docs"] = docs
    return state

Long-term memory is whatever backends your tools call (vector DB, KG).
AG2 / AutoGen: conversation + memory tools
By default, your message history is memory. You can layer on tools:
memory = VectorStore(...)
@register_tool
def remember(key: str, value: str):
memory.add(key, value)
@register_tool
def recall(key: str) -> str:
return memory.get(key)
coder.tools += [remember, recall]

Now Coder can "remember" and "recall" across steps or even runs.
CrewAI: task-scoped + shared memory
CrewAI exposes both per-task context and shared crew memory. Conceptually: each task sees local context and can query/update a central store via tools (vector or KG).
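In recent CrewAI versions, the shared part can be as simple as switching memory on for the crew; the exact stores it provisions (short-term, long-term, entity memory) depend on configuration, so treat this as a sketch:

from crewai import Crew, Process

crew = Crew(
    agents=[architect, coder, tester],           # as sketched earlier
    tasks=[plan_arch, impl_code, run_tests],
    process=Process.sequential,
    memory=True,   # enable CrewAI's built-in short-term / long-term / entity memory
)
result = crew.kickoff()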
5. Tools & registry
What this is → The verbs of your system. Tools are functions, APIs or actions the agent can call; the registry defines which agent can do what.
Why it matters → Tools are where things get real: file writes, HTTP calls, DB mutations, payment APIs. If you don't constrain and observe them, you're essentially giving the model root access to your life.
AI App Generator mapping
We'll keep using: web_search, write_file, read_file, run_tests, npm_install, launch_server, browser_check, git_diff, zip_export, retrieve, kg_update.
LangGraph: tools as nodes / callables
@tool
def run_tests(path: str) -> dict:
    """Run the project's test suite and return structured results."""
    return subprocess_run_tests(path)

builder.add_node("run_tests", run_tests)

Each node knows exactly which tools it can call. Very explicit.
AG2 / AutoGen: registered functions
@register_function
def web_search(query: str) -> str:
...
assistant = Agent(role="assistant", tools=[web_search])

Or, in AutoGen style, via Extensions:
search_ext = Extension(name="WebSearch", entrypoint=search_web)
assistant = AssistantAgent(name="assistant", extensions=[search_ext])

You pass a list of tools/skills/extensions into the agent; the supervisor and prompts enforce how they're used.
CrewAI: tools per role/task
researcher = Agent("Researcher", tools=[web_search])
writer_task = Task("WriteSpec", agent=researcher)

Tooling is attached to roles/tasks; your process defines where each capability can be exercised.
6. Durability & recoverability
What this is → Being able to pause, resume, rewind, or rerun a flow without losing your sanity (or your user's work).
AI App Generator mapping
We'd like to:
- checkpoint after architecture is locked in,
- after a stable backend build,
- after each test batch,
and resume from there if something crashes or a human wants to intervene.
LangGraph: checkpoints baked in
from langgraph.checkpoint.memory import MemorySaver

app = builder.compile(checkpointer=MemorySaver())  # swap in a SQLite/Postgres checkpointer for real durability

config = {"configurable": {"thread_id": "run-123"}}
state = app.invoke(initial_state, config)
# later… re-invoking with the same thread_id picks up from the last checkpoint
resumed = app.invoke(None, config)

Durability is a first-class concern; you can also wire in HIL gates at specific nodes.
AG2 / AutoGen: logs + reconstruction
Pattern-wise, you often do:
from collections import defaultdict

log_store = defaultdict(list)
while True:
    msg = manager.step(messages)
    log_store[run_id].append(msg)
    messages.append(msg)
    if done(msg):
        break
# on failure, rebuild the conversation from log_store[run_id]

Durability isn't "magic" here; you're building it with logs and replay logic.
CrewAI: process-level resume
result = crew.kickoff()
if result.failed_task:
    crew.resume(from_task=result.failed_task)

CrewAI recognises "long-running workflows that might fail mid-way" as a core use case. Checkpoints and retries live in the process definition.
7. Observability & evaluation hooks
What this is → The ability to see what happened, step by step, and judge whether it was good.
AI App Generator mapping
We care about:
- per-phase spans (requirements, architecture, code, tests),
- tool metrics (latency, failure rate),
- quality (lint/test pass rate, diff size).
LangGraph: LangSmith + traces
with langsmith_trace("ai_app_generator") as span:
result = app.invoke(initial_state)
    span.set_tag("tests_passed", result["test_results"]["ok"])

Because control flow is explicit, traces align nicely with graph topology.
AG2 / AutoGen: message-level logs
for msg in conversation:
    logger.info(f"{msg.sender}: {msg.content[:120]}")

Plus whatever observability infrastructure you plug underneath (OpenTelemetry, custom dashboards). AutoGen's Studio UI helps you visualise conversations and tool invocations.
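If OpenTelemetry is your thing, per-phase spans are only a few lines with the standard SDK; the span names, attributes, and run_backend_phase helper below are mine:

from opentelemetry import trace

tracer = trace.get_tracer("ai_app_generator")

with tracer.start_as_current_span("GenerateBackend") as span:
    result = run_backend_phase(messages)   # hypothetical: drives the coder/tester exchange
    span.set_attribute("tests_passed", result["test_results"]["ok"])
    span.set_attribute("tool.failure_rate", result.get("tool_failure_rate", 0.0))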
CrewAI: task logs + critics
@critic
def review_output(output: str) -> bool:
return passes_checks(output)
crew = Crew(agents=[coder, tester, critic], evaluators=[review_output])

CrewAI leans into critics and validation steps as explicit evaluation hooks, plus its own console for run-level introspection.
8. Connectors
What this is → How your agent world talks to the rest of your world: Slack, GitHub, Notion, CRMs, issue trackers, webhooks, etc.
AI App Generator mapping
Examples:
- send build logs to Slack,
- open a GitHub PR,
- file a ticket when the generator fails,
- push docs to Notion.
LangGraph: code-first, with Studio
@tool
def send_slack(message: str):
    """Post a message to the team Slack channel."""
    slack_client.post(message)

builder.add_node("notify", send_slack)

You write the adapters yourself; Studio gives you a visual view of the graph, but the connectors are code.
AG2 / AutoGen: tools again
@app.route("/webhook", methods=["POST"])
def handle_webhook():
payload = request.json
    user_proxy.initiate_chat(assistant, message=payload["summary"])

All integration is via Python: fine if you're comfortable living in code, less friendly if you expect a "Zapier for agents" out of the box.
CrewAI: closer to Zapier land
slack_trigger = SlackTrigger(channel="#deployments")
crew = Crew(agents=[pm, coder], triggers=[slack_trigger])

CrewAI ships with a growing set of ready-to-plug connectors (Slack, email, Notion, etc.) and a Studio UI that lets you wire these into your flows without rewriting glue each time.
Summary: LangGraph vs AG2/AutoGen vs CrewAI
This is where opinion kicks in:
- If you want hard edges, replayable runs, and strong HIL, LangGraph is very comfortable.
- If you want fuzzier, conversation-driven teams of agents, AG2 / AutoGen are a better fit.
- If you want role-based flows with a friendly console + connectors, CrewAI hits a nice middle ground.
Where do LlamaIndex and n8n fit?
Two other names show up a lot in these conversations, but they're playing a slightly different game.
LlamaIndex: the RAG backend
LlamaIndex (formerly GPT Index) is a specialized layer for data connectors, chunking, indexing, and retrieval.
It's not an orchestration framework. It's a knowledge substrate:
- Load data from files, DBs, APIs, whatever
- Chunk it, embed it, index it
- Expose a query interface (query, as_retriever(), etc.)
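Concretely, with the current llama_index packages that whole stack is a few lines (the directory path, queries, and top_k are placeholders):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# load → chunk → embed → index
docs = SimpleDirectoryReader("references/").load_data()
index = VectorStoreIndex.from_documents(docs)

# the query interface our orchestration layer has been calling "retriever"
retriever = index.as_retriever(similarity_top_k=8)
query_engine = index.as_query_engine()

hits = retriever.retrieve("FastAPI + SQLite CRUD patterns")
answer = query_engine.query("Which auth pattern do the reference apps use?")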
In our AI App Generator, LlamaIndex would sit behind RAG:
- LangGraph / AG2 / CrewAI orchestrate the flow
- LlamaIndex powers the kb_index / retriever we've been pretending exists
That's a great separation of concerns: "let the RAG library worry about data; let the orchestration layer worry about flow."
n8n: the visual glue
n8n, on the other hand, is a visual workflow engine with a ton of connectors and some AI nodes.
It's fantastic when you want to:
- trigger flows on webhooks / cron / SaaS events,
- stitch together 5–10 external systems,
- drop an "OpenAI" node in the middle and call it a day.
n8n is not trying to be an agent framework; it's more like the no-code shell around everything else. In the AI App Generator world:
- n8n might trigger the generator when a GitHub issue is created,
- call into a LangGraph / CrewAI API to actually run the agentic workflow,
- push results back to Slack / Jira / Notion.
You can think of it as the "outer orchestration": the thing that wraps your agent system into the rest of the business.
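The glue between those two layers is usually nothing fancier than an HTTP endpoint that n8n calls; here's a hypothetical FastAPI sketch (the route, payload shape, and reuse of the LangGraph app from earlier are all mine):

from fastapi import FastAPI
from pydantic import BaseModel

api = FastAPI()

class GenerateRequest(BaseModel):
    idea: str                 # e.g. the GitHub issue body n8n picked up
    run_id: str | None = None

@api.post("/generate")
def generate(req: GenerateRequest):
    # hand off to the agentic workflow (LangGraph app, Crew, etc.)
    config = {"configurable": {"thread_id": req.run_id or "adhoc"}}
    result = app.invoke({"reqs": req.idea}, config)
    return {"status": "ok", "tests_passed": result["test_results"]["ok"]}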
So… how should you choose?
My personal heuristic:
- Start from your primitives, not the logo. What do you need: hard determinism and HIL? fuzzier collaboration? rich connectors? durable RAG?
- If you know your decision tree and need replayable runs → Lean towards LangGraph (possibly with LlamaIndex underneath).
- If you're exploring emergent multi-agent behaviors → Start with AG2/AutoGen and a clear supervisor policy.
- If you want something that looks like a team of humans with a PM and Slack → Try CrewAI.
- If your real pain is "how do I get all this data into my agents?" → Reach for LlamaIndex.
- If your real pain is "how do I connect all these systems and triggers?" → Reach for n8n as the outer shell.
Once you start seeing frameworks in terms of these eight primitives, the landscape gets a lot less overwhelming. You stop asking: "Should I use LangGraph or AutoGen or CrewAI?" and start asking: "Where do I want determinism? Where do I want fuzziness? Where should memory live? Who owns the loop?"
That's the real autopsy. The rest is just skin.