How I Built a Multi-Agent Orchestration System from Scratch

Technical product management · System design · AI

A senior TPM’s step-by-step account of architecting and shipping an end-to-end agentic platform — every layer, every trade-off, every decision made

The Problem This System Solves

There is a specific moment in most AI product explorations where the demo works and the real product doesn’t.

The demo has one agent, one task, one happy path. The real product has ambiguous inputs, multiple simultaneous work streams, costs that compound quietly, and humans who need to stay in control without being bottlenecked.

I have sat in that gap many times. Over several months, I designed and built the system that closes it — an orchestration platform that takes a natural language request in Slack, decomposes it into parallel workstreams, executes them on cloud compute, and delivers a structured deliverable with full cost transparency.

Step 0: Choosing the Right Mental Model

Before writing a single line of code, I had to decide how to think about the system. This turned out to be the most consequential decision of the entire project.

The model I rejected

The intuitive model for agentic AI is a single powerful agent that does everything — you give it a goal, it figures out the steps, and it executes. This is the demo model. It fails in production for four predictable reasons:

The model I chose

I modelled the system after how a consulting firm operates:

A senior partner (Master Agent) handles intake, clarification, planning, and coordination. They never do deep execution work themselves.
Specialists (Build Master + sub-agents) receive scoped briefs, execute independently, and return structured outputs.
The senior partner stitches the outputs into a final deliverable.

This maps to a concrete architectural split:

Step 1: Settings Management & Configuration Layer

Input: Raw project requirements

Output: A centralised, environment-aware configuration system

Principle: Configuration lives in data, not in code

What I built

A settings layer that loads configuration from environment variables and `.env` files, validates required values at startup, and exposes them through a typed interface. Every downstream component reads from this layer — no component ever reads an environment variable directly.

Design decision

I chose to fail fast on missing configuration. If a required value is absent, the system refuses to start rather than discovering the gap mid-execution when a sub-agent tries to call a provider it has no key for. This is a deliberate trade-off: it makes first-time setup harder but eliminates an entire class of runtime failures.
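A minimal sketch of the fail-fast idea, using only the standard library. The variable names (`SUPABASE_URL`, `REDIS_URL`, `DEFAULT_BUDGET_USD`) and the `Settings` fields are illustrative assumptions, not the actual configuration keys, and `.env` file parsing is left out.

```python
import os
from dataclasses import dataclass
from typing import Mapping


class ConfigError(RuntimeError):
    """Raised at startup when required configuration is missing."""


@dataclass(frozen=True)
class Settings:
    # Hypothetical fields; a real system would have many more.
    supabase_url: str
    redis_url: str
    default_budget_usd: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    # Collect every missing key before failing, so first-time setup
    # reports all gaps at once instead of one per restart.
    required = ("SUPABASE_URL", "REDIS_URL")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise ConfigError(f"Missing required settings: {', '.join(missing)}")
    return Settings(
        supabase_url=env["SUPABASE_URL"],
        redis_url=env["REDIS_URL"],
        default_budget_usd=float(env.get("DEFAULT_BUDGET_USD", "5.0")),
    )
```

Calling `load_settings()` once at process start, and passing the resulting object to every component, keeps direct `os.environ` reads out of the rest of the codebase.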

Step 2: Secrets Handling

Input: Configuration layer from Step 1

Output: Secure secrets management with rotation support

Dependency: Step 1

What I built

A secrets layer that separates sensitive credentials from general configuration. API keys, database passwords, and tokens are stored and accessed through a dedicated path — never logged, never serialised to state, never included in LLM context.

Why this is its own step

It is tempting to fold secrets into the general configuration. The reason not to: agentic systems pass state between agents, serialise it to databases, and include it in prompts. A general configuration value that happens to contain an API key will eventually leak into an LLM call, a log file, or a Supabase record. Treating secrets as architecturally distinct — with explicit access patterns and automatic redaction — prevents this by design rather than by discipline.
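One way to get redaction by design is a small wrapper type plus a redaction pass that runs before state is logged, stored, or placed in a prompt. The `Secret` class and `redact` helper below are hypothetical names for illustration, not the system's actual API.

```python
from typing import Any


class Secret:
    """Wraps a sensitive value so it cannot be printed or formatted by accident."""

    def __init__(self, value: str) -> None:
        self._value = value

    def reveal(self) -> str:
        # The only explicit access path; call sites are easy to audit.
        return self._value

    def __repr__(self) -> str:
        return "Secret(***)"

    __str__ = __repr__


def redact(state: Any) -> Any:
    """Return a copy of nested state with every Secret replaced by a marker."""
    if isinstance(state, Secret):
        return "***REDACTED***"
    if isinstance(state, dict):
        return {key: redact(value) for key, value in state.items()}
    if isinstance(state, (list, tuple)):
        return [redact(item) for item in state]
    return state
```

Because `Secret` is not JSON-serialisable, an attempt to dump unredacted state with `json.dumps` fails loudly instead of leaking the value.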

Step 3: Structured Logging & Database Layer

Input: Steps 1–2

Output: Observable system internals + persistent state

Key infrastructure: Supabase (Postgres), SQLite, structured JSON logging

What I built

Two things in one step, because they are co-dependent:

Structured logging — Every significant system event (agent start, LLM call, tool invocation, state transition, error) is logged as structured JSON with consistent fields: timestamp, component, event type, correlation ID, and relevant metadata. No `print()` statements, no unstructured strings.
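As a rough illustration, structured JSON logging can be built on the standard `logging` module with a custom formatter. The field names follow the list above; the `JsonFormatter` and `get_logger` names are assumptions made for this sketch.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with consistent fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "component": record.name,
            # Values passed through `extra=` become attributes on the record.
            "event_type": getattr(record, "event_type", "log"),
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "metadata": getattr(record, "metadata", {}),
        }
        return json.dumps(payload, default=str)


def get_logger(component: str) -> logging.Logger:
    logger = logging.getLogger(component)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


log = get_logger("master_agent")
log.info(
    "LLM call finished",
    extra={"event_type": "llm_call", "correlation_id": "task-42", "metadata": {"tier": "utility"}},
)
```

The correlation ID is what makes the logs useful across processes: the same ID attached to every event of one task lets a single query reconstruct its full timeline.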

Database layer — Three-tier state persistence:

| Layer | Role | Durability | Failure Mode |
|---|---|---|---|
| Supabase (Postgres) | System of record | Durable | If this is lost, data is gone |
| SQLite (local) | Session cache | Ephemeral | Can be wiped and rebuilt from Supabase |
| Redis (Upstash) | Message bus | Transient | In-memory fallback keeps system running |

Why three layers

Because the failure modes are different and the access patterns are different.

Supabase handles writes that must survive: task records, agent runs, artifacts, cost ledgers, approval states. Every write is the authoritative record.
SQLite handles reads that must be fast: the Master Agent queries current task state dozens of times per orchestration cycle. Hitting Postgres over the network for each read adds latency that compounds across a multi-step pipeline.
Redis handles communication that must be real-time: the Master Agent and Build Master exchange progress updates, status signals, and approval requests through pub/sub channels.

Step 4: The Model Router

Input: Steps 1–3

Output: Unified LLM interface with tier selection, fallback, cost tracking, and budget enforcement

This is the single most critical piece of infrastructure in the entire system.

What I built

An abstraction layer that sits between every agent and every LLM provider. From the caller’s perspective, making an LLM call looks identical regardless of the underlying provider:

```python
response = router.call(
    tier="reasoning",  # or "execution" or "utility"
    messages=messages,
    task_id=task_id,
)
```

The router handles everything else:

Tier-based model selection — Maps the requested tier to the appropriate model
Provider fallback chains — If Anthropic is down, fall through to OpenAI, then Gemini
Cost recording — Every call’s token count and cost is written to the ledger
Health monitoring — Tracks error rates per provider, temporarily excludes unhealthy ones
Budget enforcement — Rejects calls that would exceed the task’s budget ceiling
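The selection, fallback, and health-tracking parts of that list can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the actual router: providers are modelled as plain callables, the `ModelRouter` constructor arguments are invented for the example, and cost recording and budget checks are omitted.

```python
from typing import Callable, Dict, List, Tuple

# A provider is modelled as a plain callable: (model, messages) -> reply text.
Provider = Callable[[str, List[dict]], str]


class AllProvidersFailed(RuntimeError):
    pass


class ModelRouter:
    def __init__(self, providers: Dict[str, Provider],
                 tiers: Dict[str, List[Tuple[str, str]]],
                 max_failures: int = 3) -> None:
        self.providers = providers
        self.tiers = tiers  # tier -> ordered (provider, model) fallback chain
        self.max_failures = max_failures
        self.failures = {name: 0 for name in providers}

    def call(self, tier: str, messages: List[dict], task_id: str) -> str:
        errors = []
        for provider_name, model in self.tiers[tier]:
            # Health check: skip providers with too many consecutive failures.
            if self.failures[provider_name] >= self.max_failures:
                continue
            try:
                reply = self.providers[provider_name](model, messages)
            except Exception as exc:  # any provider error triggers fallback
                self.failures[provider_name] += 1
                errors.append(f"{provider_name}: {exc}")
                continue
            self.failures[provider_name] = 0  # a success resets the counter
            return reply
        raise AllProvidersFailed(f"task {task_id}, tier {tier}: {errors}")
```

In this sketch an excluded provider stays excluded; a fuller version would reset the failure counter after a cooldown so the exclusion is temporary.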

The three-tier model

This is one of the highest-leverage cost decisions in the system:

| Tier | Used For | Model Class | Relative Cost |
|---|---|---|---|
| Reasoning | Planning, decomposition, stitching, quality judgment | Claude Opus / GPT-4o / Gemini Pro | 1× (baseline) |
| Execution | Module-level work: drafting, analysis, code generation | Claude Sonnet / GPT-4o-mini | 0.2×–0.4× |
| Utility | Classification, summarisation, routing decisions | Haiku / GPT-3.5 / Gemini Flash | 0.02×–0.05× |

The economics are stark: going by the relative costs above, running everything at reasoning tier costs roughly 2.5–5× more than necessary on execution-tier tasks and 20–50× more on utility-tier tasks, for the same output quality. Running everything at utility tier produces planning that is structurally unsound. Matching the model to the cognitive demand of the task is not a minor optimisation; it is the difference between a system that costs $2 per task and one that costs $40.

Step 5: The Master Agent (Local Orchestrator)

Input: Steps 1–4

Output: Slack-integrated agent that handles intake, clarification, decomposition, and coordination

Framework: LangGraph

What I built

The Master Agent is a LangGraph state machine that runs locally and manages the full lifecycle of a task:

Step 6: The Persona Template System

Input: Step 5

Output: YAML-based task templates that make the system generalisable without code changes

Principle: Configuration as data, not code

What I built

A library of persona YAML files, each defining the decomposition pattern for a category of work:

```yaml
# persona: patent_research.yaml
name: Patent Research & Analysis

classification_keywords:
  - patent
  - prior art
  - claims
  - novelty

modules:
  - id: prior_art_search
    tier: execution
    depends_on: []
    output_type: research_notes

  - id: novelty_analysis
    tier: reasoning
    depends_on: [prior_art_search]
    output_type: analysis_document

  - id: claims_drafting
    tier: reasoning
    depends_on: [novelty_analysis]
    output_type: structured_document

quality_checklist:
  - All cited patents include publication numbers
  - Novelty analysis addresses at least 3 closest prior art references
  - Claims follow standard patent claim structure
```

How classification works

When a new task arrives:

Keyword matching — Fast, cheap, deterministic. Checks the task description against each persona’s `classification_keywords`.
LLM classification (fallback) — If keywords are ambiguous or absent, a utility-tier model classifies the task against persona descriptions.
Template adaptation — The matched template provides the starting structure. A reasoning-tier model then adapts it to the specifics of the actual task — adding modules, removing irrelevant ones, adjusting dependencies.
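The keyword stage of this pipeline might look like the sketch below. Returning `None` is the assumed signal to hand the task to the LLM classifier; the scoring rule (count of keyword hits, ties treated as ambiguous) and the `market_analysis` persona are illustrative choices, not details taken from the system.

```python
from typing import Dict, List, Optional


def classify_by_keywords(task: str, personas: Dict[str, List[str]]) -> Optional[str]:
    """Return the persona with the most keyword hits, or None if unclear."""
    text = task.lower()
    scores = {
        name: sum(1 for keyword in keywords if keyword.lower() in text)
        for name, keywords in personas.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None  # no keyword present: defer to the LLM classifier
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between personas: ambiguous, defer as well
    return ranked[0][0]


personas = {
    "patent_research": ["patent", "prior art", "claims", "novelty"],
    "market_analysis": ["market", "competitor"],  # hypothetical second persona
}
print(classify_by_keywords("Check prior art and novelty for our sensor patent", personas))
```

Substring matching is deliberately crude here. It keeps the first stage deterministic and free, and anything it cannot decide falls through to the utility-tier model.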

Step 7: The Handoff Infrastructure

Input: Steps 5–6 (approved task spec from Master Agent)

Output: A running execution environment on cloud compute

Key infrastructure: GitHub API, Colab integration, Redis pub/sub

What happens when a task is approved

Four things execute in sequence:

1. CREATE GITHUB REPO
   └─ Task-specific repository
   └─ Contains: task spec, relevant source code, persona template

2. GENERATE JUPYTER NOTEBOOK
   └─ Self-contained execution notebook
   └─ Includes: Build Master code, sub-agent definitions, config

3. PUSH TO REPO
   └─ Notebook + all task assets committed

4. DELIVER LAUNCH URL
   └─ User receives Colab link in Slack
   └─ System subscribes to Redis channel for this task
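The notebook-generation and launch-URL steps can be illustrated with the standard library alone, since a Jupyter notebook is a JSON document. The function names and the `main` branch default are assumptions for this sketch; repository creation and the push itself need an authenticated GitHub client and are not shown.

```python
import json
from typing import List


def build_notebook(task_id: str, code_cells: List[str]) -> str:
    """Serialise a minimal Jupyter notebook (nbformat 4) as a JSON string."""
    cells = [{
        "cell_type": "markdown",
        "metadata": {},
        "source": [f"# Task {task_id}"],
    }]
    for source in code_cells:
        cells.append({
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            # Notebook sources are stored as a list of lines.
            "source": source.splitlines(keepends=True),
        })
    notebook = {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}
    return json.dumps(notebook, indent=1)


def colab_url(owner: str, repo: str, path: str, branch: str = "main") -> str:
    # Colab opens notebooks hosted on GitHub through this URL pattern.
    return f"https://colab.research.google.com/github/{owner}/{repo}/blob/{branch}/{path}"
```

Once the notebook file is committed to the task repository, the string returned by `colab_url` is what would be posted to the user in Slack.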

The GitHub-per-task model

This is more overhead than a shared repository. I chose it deliberately because:

Audit trail — Every task has a complete, isolated history of what was generated and when

Artifact linkage — When the Build Master pushes outputs back, they are permanently connected to the task that produced them

Execution isolation — No risk of one task’s state polluting another’s

Reproducibility — Any task can be re-run from its repo without reconstructing the context

Step 8: The Build Master (Remote Orchestrator)

Input: Task spec + repo from Step 7
Output: Orchestrated parallel execution of all modules
Runs on: Google Colab (GPU-enabled)
Framework: LangGraph

What I built

The Build Master is the remote counterpart to the Master Agent. It receives the approved task spec and executes it:

Parse the decomposition — Read the module list, dependencies, and tier assignments
Compute execution waves — Group modules by dependency level
Spawn sub-agents — One per module, each scoped to its specific brief
Monitor progress — Track completion, detect failures, trigger retries
Publish updates — Send real-time progress to the Master Agent via Redis

Parallel wave execution

Wave 1: Modules A & B — no dependencies, run in parallel
Wave 2: Modules C, D, E, F & G — depend on Wave 1, run in parallel
Wave 3: Module L — depends on all previous modules, runs last

**Wall-clock time is determined by the critical path, not the total module count.** Twelve modules that would take 60 minutes sequentially might be completed in 15 minutes across three waves. This is the same principle as a well-structured engineering sprint: parallelism is the multiplier, and the dependency graph determines its ceiling.
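Computing execution waves is a level-by-level topological sort of the dependency graph. The sketch below assumes the graph arrives as a mapping from module ID to its `depends_on` list; the five-module example is made up for illustration.

```python
from typing import Dict, List


def compute_waves(depends_on: Dict[str, List[str]]) -> List[List[str]]:
    """Group modules into waves; each wave depends only on earlier waves."""
    remaining = {module: set(deps) for module, deps in depends_on.items()}
    done: set = set()
    waves: List[List[str]] = []
    while remaining:
        # A module is ready once all of its dependencies have completed.
        ready = sorted(m for m, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError(f"Cycle or unknown dependency among: {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for module in ready:
            del remaining[module]
    return waves


# Hypothetical module graph.
graph = {"A": [], "B": [], "C": ["A"], "D": ["A", "B"], "E": ["C", "D"]}
print(compute_waves(graph))  # → [['A', 'B'], ['C', 'D'], ['E']]
```

Modules inside one wave have no dependencies on each other, so each wave can be handed to a thread pool or a set of async tasks and awaited as a group before the next wave starts.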

Step 9: Sub-Agent Execution

Input: Module brief from Build Master

Output: Structured module artifact

Model tier: Execution (by default, overridable per module)

How a sub-agent works

Each sub-agent is a focused executor. It receives:

– A scoped brief — exactly what this module must produce

– Context — outputs from upstream modules it depends on

– Output schema — the expected structure of its deliverable

– Budget ceiling — maximum spend for this module

It executes, produces a structured artifact, and returns it to the Build Master. It does not know about other modules. It does not make strategic decisions. It does the work it was assigned.
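The four inputs map naturally onto a small immutable record, and the output schema gives the Build Master something concrete to check a returned artifact against. The `ModuleBrief` fields and the flat name-to-type schema are simplifying assumptions for this sketch; a production system might use JSON Schema or a validation library instead.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class ModuleBrief:
    module_id: str
    instructions: str               # the scoped brief
    output_schema: Dict[str, type]  # required field name -> expected type
    budget_usd: float               # spend ceiling for this module
    context: Dict[str, Any] = field(default_factory=dict)  # upstream outputs


def validate_artifact(brief: ModuleBrief, artifact: Dict[str, Any]) -> List[str]:
    """Return a list of schema violations; an empty list means valid."""
    problems = []
    for name, expected in brief.output_schema.items():
        if name not in artifact:
            problems.append(f"missing field: {name}")
        elif not isinstance(artifact[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    return problems
```

Rejecting a malformed artifact at this boundary is cheaper than discovering the problem later, when a downstream module or the stitching step tries to consume it.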

Research agent escalation

When a sub-agent encounters something it cannot resolve (a factual question it is uncertain about, or a domain-specific requirement it does not understand), it signals confusion to the Build Master. The Build Master spawns a **research agent**, a lightweight, targeted information-gatherer, and the escalation proceeds as follows:

The research agent formulates specific search queries
It gathers relevant information
It returns structured research notes to the confused sub-agent
The sub-agent retries with the additional context

This is cheaper and more reliable than giving every sub-agent full research capabilities. Most modules do not need research. The ones that do get a specialist.
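The escalation loop can be expressed as a small control-flow wrapper. Here the confusion signal is modelled as an exception, and both agents are plain callables; these are assumptions chosen to keep the sketch self-contained, not a description of the actual signalling mechanism.

```python
from typing import Callable, Optional


class NeedsResearch(Exception):
    """Raised by a sub-agent that cannot resolve a question on its own."""

    def __init__(self, question: str) -> None:
        super().__init__(question)
        self.question = question


def run_with_escalation(
    sub_agent: Callable[[Optional[str]], str],
    research_agent: Callable[[str], str],
    max_escalations: int = 1,
) -> str:
    notes: Optional[str] = None
    for attempt in range(max_escalations + 1):
        try:
            return sub_agent(notes)
        except NeedsResearch as signal:
            if attempt == max_escalations:
                raise RuntimeError("sub-agent still blocked after research") from signal
            # Build Master role: spawn a research agent for this one question.
            notes = research_agent(signal.question)
    raise RuntimeError("max_escalations must be non-negative")
```

Capping the number of escalations matters: without a limit, a module that can never be unblocked would keep spawning research agents and spending budget.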

Step 10: Real-Time Communication Layer

Input: Steps 8–9
Output: Bidirectional communication between local and remote processes
Infrastructure: Redis pub/sub (Upstash), in-memory fallback

What flows through the bus

Build Master → Master Agent:

– Module completion events

– Progress percentages

– Cost updates (per module, cumulative)

– Error signals

– Research agent spawn notifications

Master Agent → Build Master:

– Approval responses

– Modification requests

– Abort signals

– Budget adjustments

Graceful degradation

If Redis becomes unavailable mid-execution:

The system detects the connection loss
Falls back to an in-memory event bus
Buffers messages locally
Retries Redis connection on a backoff schedule
Replays buffered messages when the connection is restored

The execution does not stop. The user temporarily loses real-time Slack updates, but the work continues.
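The buffer-and-replay behaviour could be sketched as follows. The `send` callable stands in for a Redis publish wrapper and is assumed to raise the built-in `ConnectionError` when the connection is down; the backoff timer that calls `flush()` periodically is not shown.

```python
from collections import deque
from typing import Callable, Deque, Tuple


class ResilientBus:
    """Publishes through a primary transport and buffers while it is down."""

    def __init__(self, send: Callable[[str, str], None]) -> None:
        self._send = send  # e.g. a thin wrapper around a Redis publish call
        self._buffer: Deque[Tuple[str, str]] = deque()

    def publish(self, channel: str, message: str) -> None:
        # Always enqueue first so ordering is preserved across outages.
        self._buffer.append((channel, message))
        self.flush()

    def flush(self) -> bool:
        """Try to deliver buffered messages in order; return True if drained."""
        while self._buffer:
            channel, message = self._buffer[0]
            try:
                self._send(channel, message)
            except ConnectionError:
                return False  # keep the message for a later retry
            self._buffer.popleft()
        return True
```

A message is only removed from the buffer after a successful send, so a failure mid-replay leaves the remaining messages queued in their original order.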

Step 11: Cost Tracking & Budget Enforcement

Input: All previous steps (cross-cutting concern)

Output: Per-call, per-module, per-task cost transparency
Written to: Supabase cost ledger

Every LLM call passes through the model router (Step 4). The router records:

– Provider and model used

– Input and output token counts

– Calculated cost (using per-model pricing tables)

– Task ID and module ID for attribution

– Timestamp

This produces a complete cost ledger that answers: *How much did this task cost? Which module was most expensive? Which provider? Which tier?*
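To make the attribution concrete, here is a sketch of a ledger entry and a grouping helper. The prices in `PRICING` are placeholders, not real provider pricing, and the model names are invented; the timestamp field is omitted for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

# Placeholder prices in USD per million tokens: (input, output). Not real pricing.
PRICING: Dict[str, Tuple[float, float]] = {
    "large-model": (10.0, 30.0),
    "small-model": (0.5, 1.5),
}


@dataclass(frozen=True)
class LedgerEntry:
    task_id: str
    module_id: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        price_in, price_out = PRICING[self.model]
        return (self.input_tokens * price_in + self.output_tokens * price_out) / 1_000_000


def cost_by(entries: Iterable[LedgerEntry], attribute: str) -> Dict[str, float]:
    """Sum cost grouped by any ledger attribute, e.g. 'module_id' or 'provider'."""
    totals: Dict[str, float] = defaultdict(float)
    for entry in entries:
        totals[getattr(entry, attribute)] += entry.cost_usd
    return dict(totals)
```

With entries stored in this shape, each of the questions above becomes a single call: `cost_by(entries, "module_id")`, `cost_by(entries, "provider")`, and so on.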

Budget enforcement

Each task has a budget ceiling set during approval (Step 5). The router checks cumulative spend before every call. If the next call would exceed the budget:

The call is rejected
The sub-agent receives a budget-exceeded signal
The Build Master is notified
The user is notified in Slack with current spend breakdown
The user can approve additional budget or abort

Without per-task budgets and per-call enforcement, costs are invisible until they are catastrophic.
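The pre-call check is simple enough to show in full. This is a sketch under assumptions: the `BudgetGuard` name and its methods are invented for illustration, and the cost of the next call is taken as an estimate supplied by the caller.

```python
class BudgetExceeded(Exception):
    def __init__(self, task_id: str, spent: float, ceiling: float) -> None:
        super().__init__(f"{task_id}: ${spent:.2f} spent of ${ceiling:.2f}")
        self.task_id, self.spent, self.ceiling = task_id, spent, ceiling


class BudgetGuard:
    def __init__(self, ceiling_usd: float) -> None:
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0

    def check(self, task_id: str, estimated_cost_usd: float) -> None:
        """Call before every LLM request; raises instead of overspending."""
        if self.spent_usd + estimated_cost_usd > self.ceiling_usd:
            raise BudgetExceeded(task_id, self.spent_usd, self.ceiling_usd)

    def record(self, actual_cost_usd: float) -> None:
        # Called after the request completes, with the real cost.
        self.spent_usd += actual_cost_usd

    def raise_ceiling(self, extra_usd: float) -> None:
        # Applied when the user approves additional budget in Slack.
        self.ceiling_usd += extra_usd
```

The exception carries the current spend and ceiling, which is the information needed to build the Slack notification and let the user decide between extending the budget and aborting.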

Step 12: Content Approval & Human-in-the-Loop

Input: Module outputs from Step 9
Output: Human-approved deliverables
Surface: Slack

The approval flow

At configurable checkpoints — typically after the Build Master stitches module outputs into a draft deliverable — the system pauses and presents the output for human review:

Draft deliverable is posted to Slack (formatted, with section links)
User reviews and responds with: approve, revise (with notes), or reject
If revised: the relevant modules re-execute with the revision notes as additional context
If approved: the deliverable is finalised and pushed to the task’s GitHub repo
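The three review outcomes form a small decision table. In the sketch below, the reply format (`revise: <notes>`), the action names, and closing the task on rejection are all assumptions made for illustration; the source does not specify them.

```python
from enum import Enum
from typing import List, Tuple


class Decision(Enum):
    APPROVE = "approve"
    REVISE = "revise"
    REJECT = "reject"


def parse_review(reply: str) -> Tuple[Decision, str]:
    """Split a reply such as 'revise: tighten section 2' into decision and notes."""
    keyword, _, notes = reply.strip().partition(":")
    # Decision(...) raises ValueError for anything other than the three keywords.
    return Decision(keyword.strip().lower()), notes.strip()


def next_actions(decision: Decision, affected_modules: List[str]) -> List[str]:
    if decision is Decision.APPROVE:
        return ["finalise_deliverable", "push_to_repo"]
    if decision is Decision.REVISE:
        # Re-run only the affected modules; notes travel as extra context.
        return [f"rerun:{module}" for module in affected_modules]
    return ["close_task"]  # assumed behaviour on rejection
```

Parsing the reply into an enum before acting on it means an unrecognised response fails at the boundary, and the user can be asked again before any work is triggered.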

Step 13: Final Assembly & Delivery

Input: Approved module outputs

Output: Structured deliverable, permanently archived
Model tier: Reasoning (for stitching and quality assessment)

What happens at the end

Stitching — A reasoning-tier model takes all approved module outputs and composes them into a coherent final deliverable, resolving cross-references, ensuring consistent terminology, and filling gaps
Quality check — The deliverable is evaluated against the persona’s quality checklist (Step 6)
Artifact push — Final deliverable is committed to the task’s GitHub repo
Cost summary — Complete cost breakdown is generated and posted to Slack
State closure — Task record in Supabase is marked complete with all metadata

The Core Insight

Autonomous AI systems are not hard because the AI is hard. The AI part is the easy part — call an API, get a response.

They are hard because they require the same engineering rigour as any distributed system:

Careful state management across multiple persistence layers
Explicit failure modes with designed degradation paths
Observable internals that let you understand what the system is actually doing
Principled trade-offs between cost, speed, and quality — made deliberately, not discovered after the fact

The architecture I built is not the only valid one for this problem. But every decision in it was made with a clear understanding of what it optimised for and what it sacrificed.