Operating a 34-Agent Organization: Cost, Coordination, and Safety Patterns from 16 Days of Production Data
Authors: MSR Research — Quantum (AI Optimization), Nebula (Data Science), Docsmith (Documentation)
Date: March 2026
Version: 1.0 — Draft
Category: Research Paper
PRD: `prds/2026-03-14-1600_empirical-multi-agent-cost-quality.prd.md`
1. Introduction
The concept of multi-agent AI systems — where multiple specialized language model instances collaborate on complex tasks — has generated substantial interest in both research and industry (Park et al., 2023; Wu et al., 2023). However, most published work focuses on simulation environments, benchmark tasks, or architectural proposals. Operational data from production multi-agent deployments is rare.
This paper addresses that gap by reporting telemetry from a production multi-agent system that has been operating since February 2026. The system comprises 34 named agents organized into 6 functional teams, processing real workloads across intelligence gathering, content generation, inter-agent coordination, and governance.
We make no claims about this system being optimal or generalizable. Our contribution is the data itself: what does it actually cost to run a multi-agent organization? How do agents coordinate? What does governance look like in practice? What does safety infrastructure report when it's running in production?
1.1 Research Questions
- RQ1: What does it cost to operate a multi-agent organization, and how is cost distributed across models, agents, and pipelines?
- RQ2: How do agents coordinate in production, and what communication patterns emerge?
- RQ3: How does governance function, and what proportion of decisions require human oversight?
- RQ4: What does safety infrastructure report in a production multi-agent system?
1.2 Scope and Limitations
This study has inherent limitations that readers should consider before interpreting results:
- n=1: We report on a single organization. Patterns may not generalize.
- 16-day window: The observation period (February 27 – March 14, 2026) is short.
- No baseline: We have no "non-agent" comparison for the same workloads.
- Self-report: We built the system we are measuring, introducing potential bias in what is logged and how it is interpreted.
- Incomplete coverage: Cost data covers only API-logged calls. Claude Code CLI sessions, manual operations, and development costs are not captured.
2. System Description
2.1 Agent Architecture
The system comprises 34 named agents organized into 6 teams:
| Team | Agents | Focus |
|---|---|---|
| Development | 11 | Frontend, backend, database, DevOps, QA, security, integration, docs, AI optimization, data science, ML |
| Grants | 8 | Research, writing, compliance, budget, impact, communications, analytics, marketing |
| Executive | 2 | CEO advisory, CTO advisory |
| Product | 4 | Product management, scrum, UX research, policy advisory |
| Coordination | 2 | Orchestration, technology scouting |
| Stories | 7 | Editorial leadership, news, beat reporting, copy editing, production, circulation |
Each agent has a system prompt loaded from a skill file, defined competencies, and explicit handoff rules specifying which agents can receive its output.
2.2 Model Tiers
During the observation period, the system used three Claude model tiers:
| Tier | Model ID | Input $/1M tokens | Output $/1M tokens | Primary Use |
|---|---|---|---|---|
| Cost-optimized | claude-haiku-4-5-20251001 | $0.80 | $4.00 | High-volume intelligence pipelines |
| Balanced | claude-sonnet-4-20250514 | $3.00 | $15.00 | Analysis, extraction, agent execution |
| Narrative | claude-sonnet-4-5-20250929 | $3.00 | $15.00 | Story generation (daily) |
Additionally, Voyage AI (voyage-3-lite) was used for embedding generation, and Firecrawl for web scraping. Model selection during this period was static per service — not dynamically routed per request.
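The cost figures reported later are derived from logged token counts and the rates in Table 2.2 (see Section 2.4). As a minimal sketch, assuming estimated cost is simply tokens multiplied by the per-million rates above, a per-call estimate can be expressed directly in SQL. The rates are hard-coded for illustration, and the production logger may apply different rates (for example, prompt-cache pricing), so this is illustrative rather than a reconstruction of the logging code:

-- Sketch only: per-call cost from the Table 2.2 rates, hard-coded for illustration.
SELECT model_name, input_tokens, output_tokens,
       CASE
         WHEN model_name LIKE 'claude-haiku%'  THEN input_tokens * 0.80 / 1e6 + output_tokens * 4.00  / 1e6
         WHEN model_name LIKE 'claude-sonnet%' THEN input_tokens * 3.00 / 1e6 + output_tokens * 15.00 / 1e6
       END as est_cost_from_listed_rates
FROM api_cost_log
WHERE model_name LIKE 'claude%'
LIMIT 5;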
2.3 Data Collection
All data comes from logging the production system already performs; nothing was instrumented specifically for this study. A minimal schema sketch for the cost log follows the list:
- API cost log (`api_cost_log`): Every LLM API call logs model, agent, product, input/output tokens, and estimated cost. 1,516 records.
- Agent decision log (`agent_decision_log`): Governance decisions (approve/deny) with agent identity, tier, and reason. 308 records.
- Agent messages (`agent_messages`): Inter-agent directives with sender, receiver, status, and timestamps. 190+ completed messages.
- Knowledge base (`kb_artifacts` + `kb_embeddings`): Research artifacts with type classification and 512-dimensional vector embeddings. 604 records.
- Safety infrastructure (`agent_circuit_breakers` + `agent_directive_scans`): Circuit breaker trips and directive scan flags. 0 records each.
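For orientation, the following is a schema sketch for `api_cost_log`, inferred from the columns used in the Appendix queries. The column types, the surrogate key, and the defaults are assumptions; the production schema may differ.

-- Assumed schema sketch for the cost log (types, key, and defaults are illustrative).
CREATE TABLE IF NOT EXISTS api_cost_log (
    id             bigserial PRIMARY KEY,          -- assumed surrogate key
    model_name     text NOT NULL,
    agent_name     text,
    product_slug   text,                           -- NULL for general analysis calls
    input_tokens   bigint NOT NULL DEFAULT 0,
    output_tokens  bigint NOT NULL DEFAULT 0,
    estimated_cost numeric(10, 6) NOT NULL DEFAULT 0,
    created_at     timestamptz NOT NULL DEFAULT now()
);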
2.4 Methodology
All analyses use SQL queries against the production Supabase (PostgreSQL) database. Queries are reproduced in the Appendix for replication. Cost estimates use the token pricing in Table 2.2. No statistical modeling is applied — we report descriptive statistics only, consistent with the exploratory nature of this work.
3. Cost Analysis (RQ1)
3.1 Aggregate Cost
Over 16 days of operation, total API cost was $58.76, comprising:
| Category | Records | Cost | % of Total |
|---|---|---|---|
| Claude LLM calls | 1,091 | $58.49 | 99.5% |
| Firecrawl scraping | 192 | $0.19 | 0.3% |
| SendGrid email delivery | 75 | $0.08 | 0.1% |
| Voyage AI embeddings | 158 | $0.00 | <0.1% |
| Total | 1,516 | $58.76 | 100% |
Daily average: $3.67/day. Token volume: 14.9 million input tokens and 920,899 output tokens.
3.2 Cost by Model Tier
| Model | Calls | Total Cost | Avg Cost/Call | Avg Input Tokens | Avg Output Tokens | Output/Input Ratio |
|---|---|---|---|---|---|---|
| Haiku | 582 | $12.05 | $0.021 | 3,257 | 729 | 0.224 |
| Sonnet | 495 | $46.00 | $0.093 | 26,228 | 949 | 0.036 |
| Sonnet 4.5 | 14 | $0.44 | $0.031 | 1,074 | 1,899 | 1.768 |
The output/input ratio reveals distinct usage patterns:
- Haiku (0.224): Reads 4.5x more than it writes — used for extraction and classification tasks
- Sonnet (0.036): Reads 28x more than it writes — used for analysis of large context windows
- Sonnet 4.5 (1.768): Writes more than it reads — used for narrative generation (daily stories)
3.3 Cost by Product Pipeline
| Pipeline | Model | Calls | Cost | Avg Input | Avg Output |
|---|---|---|---|---|---|
| General analysis (null) | Sonnet | 329 | $43.17 | 37,744 | 1,198 |
| AI Education | Haiku | 382 | $8.20 | 3,458 | 740 |
| ANO Research | Haiku | 98 | $2.62 | 4,749 | 834 |
| Research Tunnels | Sonnet | 74 | $1.58 | 4,633 | 501 |
| Tech Scout | Haiku | 96 | $1.20 | 1,120 | 606 |
| Agent Messaging | Sonnet | 68 | $0.99 | 2,633 | 440 |
| Stories | Sonnet 4.5 | 14 | $0.44 | 1,074 | 1,899 |
| ANO Research | Sonnet | 14 | $0.15 | 2,206 | 280 |
| Tech Scout | Sonnet | 10 | $0.11 | 1,220 | 483 |
The three intelligence pipelines (AI Education, ANO Research, Tech Scout) are all cost-optimized on Haiku, totaling $12.02 for 576 calls — an average of $0.021 per call.
3.4 Daily Cost Patterns
| Day | Haiku Calls | Sonnet Calls | Total Cost | Notes |
|---|---|---|---|---|
| Mar 3 | 0 | 213 | $27.41 | Bulk analysis run — single-day spike |
| Mar 10 | 218 | 35 | $5.43 | High Haiku volume (intelligence collection) |
| Mar 13 | 214 | 29 | $5.75 | High Haiku volume (intelligence collection) |
| Typical day | 20–22 | 13–25 | $0.50–$3.75 | Steady-state operation |
The March 3 spike ($27.41, 46.6% of total 16-day cost) was caused by a bulk analysis batch — 213 Sonnet calls in a single day. Excluding this outlier, the remaining 15 days averaged $2.09/day.
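The outlier-excluded daily average can be reproduced with a one-line variant of query A.4; the date literal assumes timestamps are stored in the operating timezone.

-- Daily average cost excluding the March 3 bulk-analysis spike.
SELECT round((sum(estimated_cost) / count(DISTINCT created_at::date))::numeric, 2)
       as avg_daily_cost_excluding_outlier
FROM api_cost_log
WHERE created_at::date <> DATE '2026-03-03';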
3.5 Cost Efficiency: Haiku vs Sonnet
For pipelines that use both models, we can compare efficiency:
| Pipeline | Haiku Cost/Call | Sonnet Cost/Call | Sonnet Premium |
|---|---|---|---|
| ANO Research | $0.027 | $0.011 | 0.4x (Sonnet cheaper here) |
| Tech Scout | $0.013 | $0.011 | 0.9x (comparable) |
The Sonnet calls in these pipelines have lower input token counts than the general analysis calls, making the per-call cost comparable to Haiku. The cost advantage of Haiku is most pronounced in high-volume, high-context workloads.
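The comparison above can be reproduced with a per-pipeline, per-tier rollup; this is a sketch using the same columns as query A.3.

-- Average per-call cost by pipeline and model tier.
SELECT product_slug,
       CASE WHEN model_name LIKE 'claude-haiku%' THEN 'haiku' ELSE 'sonnet' END as tier,
       count(*) as calls,
       round(avg(estimated_cost)::numeric, 3) as avg_cost_per_call
FROM api_cost_log
WHERE model_name LIKE 'claude%' AND product_slug IS NOT NULL
GROUP BY product_slug, tier
ORDER BY product_slug, tier;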
4. Coordination Patterns (RQ2)
4.1 Inter-Agent Messaging
Over the observation period, 190+ inter-agent messages were completed. The top communication pairs:
| Sender | Receiver | Messages | Period |
|---|---|---|---|
| Lumen (ops bot) | Helio (orchestrator) | 22 | Mar 2–11 |
| Lumen | Byte (backend) | 17 | Feb 27–Mar 11 |
| Helio | Quest (QA) | 12 | Mar 9–14 |
| Helio | Quantum (optimizer) | 12 | Mar 9–14 |
| Lumen | Atlas (CEO advisor) | 11 | Mar 2–11 |
| Lumen | Forge (DevOps) | 11 | Feb 27–Mar 5 |
| Lumen | Iris (marketing) | 11 | Mar 2–10 |
| Lumen | Nova (grant writer) | 10 | Feb 27–Mar 9 |
| Lumen | Pixel (frontend) | 8 | Mar 2–11 |
| Email Router | Helio | 7 | Mar 14 |
| Claude Code | Helio | 6 | Mar 2–8 |
4.2 Hub-and-Spoke Topology
The messaging data reveals a hub-and-spoke pattern with two hubs:
1. Lumen (operations bot): The primary message sender, dispatching directives to 15+ agents across four of the six teams (see Section 4.3). Lumen serves as the human-to-agent interface — translating operator instructions into inter-agent directives.
2. Helio (orchestrator): The primary message receiver (35 inbound messages from Lumen, Claude Code, and Email Router) and the primary downstream dispatcher (routing to Quest and Quantum for QA and optimization tasks).
No agent-to-agent communication was observed that bypassed both hubs. This suggests the current system operates as a mediated coordination model rather than a peer-to-peer mesh — all coordination flows through designated coordination points.
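The hub structure is visible directly in the degree distribution of the message graph. The following sketch counts, for each agent, its distinct counterparties and total completed messages in either direction.

-- Message-graph degree per agent; hub agents (Lumen, Helio) dominate both columns.
SELECT agent, count(DISTINCT counterparty) as distinct_partners, sum(n) as messages
FROM (
    SELECT from_agent as agent, to_agent as counterparty, count(*) as n
    FROM agent_messages WHERE status = 'completed'
    GROUP BY from_agent, to_agent
    UNION ALL
    SELECT to_agent, from_agent, count(*)
    FROM agent_messages WHERE status = 'completed'
    GROUP BY to_agent, from_agent
) edges
GROUP BY agent
ORDER BY messages DESC;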
4.3 Team Communication Patterns
| Team | Inbound Messages | Outbound Messages | Most Active Agent |
|---|---|---|---|
| Development | 47 | 0 | Byte (17 received) |
| Grants | 23 | 0 | Nova (10 received) |
| Coordination | 35 | 24 | Helio (35 received, 24 sent) |
| Executive | 11 | 0 | Atlas (11 received) |
| Stories | 0 | 0 | — |
| Product | 0 | 0 | — |
5. Governance (RQ3)
5.1 Decision Distribution
The agent decision log recorded 308 governance decisions:
| Decision Maker | Decision | Tier | Count | % |
|---|---|---|---|---|
| Agent Executor Worker | Approve | 1 (auto) | 170 | 55.2% |
| Helio (orchestrator) | Approve | 1 (auto) | 36 | 11.7% |
| Helio | Approve | 2 (peer) | 9 | 2.9% |
| 24 individual agents | Approve | 1 (auto) | 78 | 25.3% |
| 7 agents | Approve | 2 (peer) | 21 | 6.8% |
| Atlas, Apex, Helio | Approve | 3 (committee) | 6 | 1.9% |
| Agent Executor Worker | Deny | 1 (auto) | 3 | 1.0% |
5.2 Approval Tier Analysis
| Tier | Name | Count | % | Meaning |
|---|---|---|---|---|
| 1 | Auto-approve | 284 | 92.2% | Agent had sufficient trust score for automatic approval |
| 2 | Peer review | 30 | 9.7% | Required review by another agent before proceeding |
| 3 | Committee | 6 | 1.9% | Required review by multiple senior agents |
| 4 | Human | 0 | 0% | Would require human approval — not triggered |
The 3 denials were all made by the executor worker at Tier 1, suggesting they were capability-based rejections (agent lacked required capabilities) rather than trust-based denials.
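The tier shares in the table above are a rollup of query A.6; a sketch of the tier-only aggregation:

-- Decision volume and share by approval tier.
SELECT tier,
       count(*) as decisions,
       round(100.0 * count(*) / sum(count(*)) OVER (), 1) as pct
FROM agent_decision_log
GROUP BY tier
ORDER BY tier;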
6. Safety Infrastructure (RQ4)
6.1 Circuit Breakers
The circuit breaker system monitors inter-agent message frequency and trips when any agent pair exchanges more than 5 messages within a 30-minute window.
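The production breaker runs in application code at message time; for illustration, the same rule can be checked retrospectively with a sliding-window query over `agent_messages`. This is a sketch, assuming the breaker counts messages regardless of status.

-- Sketch only: pairs that would have exceeded 5 messages in any 30-minute window.
SELECT * FROM (
    SELECT from_agent, to_agent, created_at,
           count(*) OVER (
               PARTITION BY from_agent, to_agent
               ORDER BY created_at
               RANGE BETWEEN INTERVAL '30 minutes' PRECEDING AND CURRENT ROW
           ) as msgs_last_30_min
    FROM agent_messages
) w
WHERE msgs_last_30_min > 5;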
Result: Zero trips over the entire observation period.
6.2 Directive Scanners
The directive scanner checks every inter-agent message for four categories of potential prompt injection: base64-encoded payloads, instruction override attempts, encoded commands, and role injection patterns.
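The exact patterns belong to the scanner implementation and are not reproduced here. As a rough sketch of what two of the four categories look like as regex filters, assuming a hypothetical `content` column holding the message body (production scanning runs in application code before delivery, not in SQL):

-- Illustrative patterns only; "content" is a hypothetical body column,
-- and these are not the production scanner's rules.
SELECT from_agent, to_agent, created_at
FROM agent_messages
WHERE content ~  '[A-Za-z0-9+/]{80,}={0,2}'                                      -- base64-like payload
   OR content ~* '(ignore|disregard) (all |any )?(previous|prior) instructions'  -- instruction override
LIMIT 20;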
Result: Zero flags over the entire observation period.
6.3 Interpretation
Zero safety events can mean:
1. The system is safe: No runaway loops or injection attempts occurred, and the safety infrastructure would have caught them if they had.
2. The volume is too low: With ~190 inter-agent messages over 16 days (~12/day), the system may not have reached the traffic levels where safety failures typically emerge.
3. The thresholds are too permissive: The 5-messages-per-30-minutes breaker threshold may be loose enough that slow-building coordination failures never trip it.
We cannot determine which interpretation is correct from this data. A meaningful safety analysis would require either (a) adversarial testing (deliberately attempting to trigger failures) or (b) substantially higher message volumes over a longer period.
What we can report: The safety infrastructure adds approximately 10ms of overhead per message (directive scanning). Over 190 messages, this represents ~1.9 seconds of total compute — negligible operational cost.
7. Intelligence Lake
7.1 Artifact Volume
The knowledge base contained 604 artifacts as of March 14, 2026, collected from February 15 onward (28 days of collection).
| Artifact Type | Count | % |
|---|---|---|
| Fact | 308 | 51.0% |
| Entity | 137 | 22.7% |
| Statistic | 75 | 12.4% |
| Legislation | 49 | 8.1% |
| Discovery | 20 | 3.3% |
| Quote | 15 | 2.5% |
7.2 Embedding Coverage
All 604 artifacts have corresponding 512-dimensional vector embeddings (voyage-3-lite), providing 100% semantic search coverage. The embedding cost for the entire knowledge base was under $0.01 (158 Voyage API calls at $0.02/1M tokens).
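The coverage claim can be verified with an anti-join; the join key (`kb_embeddings.artifact_id` referencing `kb_artifacts.id`) is an assumption about the schema.

-- Artifacts lacking an embedding row; expected result is 0.
SELECT count(*) as artifacts_without_embedding
FROM kb_artifacts a
LEFT JOIN kb_embeddings e ON e.artifact_id = a.id   -- assumed join key
WHERE e.artifact_id IS NULL;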
7.3 Collection Rate
Average collection rate: 21.6 artifacts/day over 28 days. This rate is driven by automated research tunnel schedules (Tuesday/Friday for AI education; daily for political discourse and ANO research).
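A sketch of the rate calculation over the full calendar span of collection, assuming `kb_artifacts` carries a created_at timestamp:

-- Artifacts per calendar day between the first and last collection dates.
SELECT round(count(*)::numeric /
             (max(created_at::date) - min(created_at::date) + 1), 1) as artifacts_per_day
FROM kb_artifacts;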
8. Limitations
We reiterate and expand on the limitations stated in Section 1.2:
1. Single organization (n=1): All data comes from one system with one operator. We cannot claim these patterns generalize to other multi-agent deployments.
2. Short observation window: 16 days of cost data and 28 days of knowledge base collection. Seasonal patterns, long-term cost trends, and drift effects are not observable.
3. No baseline comparison: We have no data on what these workloads would cost with a single model, a human team, or a different multi-agent architecture. Without a baseline, we cannot claim the multi-agent approach is more or less efficient than alternatives.
4. Incomplete cost capture: The $58.76 total excludes Claude Code CLI sessions (used extensively for development), Stripe processing fees, infrastructure costs (server, domain, Supabase), and human operator time. The true cost of operating this system is substantially higher.
5. Static model routing: During this period, model assignment was fixed per service — not dynamically optimized. The cost data represents an unoptimized baseline. (Dynamic model routing was deployed on March 14, after the observation period ended.)
6. Low inter-agent volume: ~190 messages over 16 days (~12/day) is modest. Safety infrastructure conclusions are tentative at this volume.
7. Self-reporting bias: We designed, built, and now report on this system. We have attempted to report data without editorializing, but readers should apply appropriate skepticism to any interpretation.
9. Discussion
9.1 Cost Observations
The most notable cost finding is the concentration: one pipeline category ("general analysis" — Curmudgeon persona analysis with large context windows) accounts for 73.8% of all Claude cost. This suggests that cost optimization efforts should focus on the highest-context workloads first, not the highest-volume ones. The intelligence pipelines (AI Education, Tech Scout, ANO Research) handle the majority of calls but contribute only 20.6% of cost because they use Haiku with small context windows.
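A query that ranks workloads by average context size rather than call volume makes that targeting concrete; a sketch using the same cost-log columns as the Appendix:

-- Where context (and therefore cost) concentrates, per pipeline and model.
SELECT coalesce(product_slug, 'general analysis') as pipeline,
       model_name,
       count(*) as calls,
       round(avg(input_tokens)::numeric, 0) as avg_input_tokens,
       round(sum(estimated_cost)::numeric, 2) as cost
FROM api_cost_log
WHERE model_name LIKE 'claude%'
GROUP BY pipeline, model_name
ORDER BY avg_input_tokens DESC;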
At $3.67/day ($2.09/day excluding the March 3 outlier), the operational cost for a 34-agent system is modest. For context, a single developer hour at $75/hr costs more than an entire day of multi-agent operations. However, this comparison is misleading — the agents are not replacing developer hours one-for-one, and the cost excludes infrastructure and human oversight.
9.2 Coordination Observations
The hub-and-spoke coordination pattern (Lumen → Helio → downstream agents) was not designed — it emerged from how operators interact with the system. Operators message Lumen; Lumen routes to Helio; Helio dispatches to specialists. No peer-to-peer agent coordination was observed.
This raises a question for future work: is hub-and-spoke optimal, or does it create a coordination bottleneck? With only ~12 messages/day, bottleneck effects are not observable. At higher volumes, direct agent-to-agent communication might reduce latency.
9.3 Governance Observations
The 92.2% auto-approval rate at Tier 1 is consistent with a system handling routine, low-risk workloads. The absence of Tier 4 (human approval) events is notable but not necessarily positive — it may indicate the system has not yet been asked to make high-stakes decisions that would benefit from human oversight.
9.4 What This Data Does Not Show
This paper does not demonstrate that multi-agent organizations are better, cheaper, faster, or safer than alternatives. It shows what one such system looks like in production. The value of this data is as a reference point — a set of concrete numbers that future work can compare against, improve upon, or challenge.
10. Conclusion
We presented 16 days of production telemetry from a 34-agent multi-agent system. The system operated at $3.67/day average cost, with 78.6% of LLM cost concentrated in a single model tier (Sonnet) and 73.8% in a single pipeline category (large-context analysis). Inter-agent coordination followed a hub-and-spoke pattern through a central orchestrator, with no observed peer-to-peer communication. Governance auto-approved 92.2% of decisions with zero human escalations. Safety infrastructure recorded zero events, which we attribute to low message volume rather than proven robustness.
These findings establish a baseline for a production multi-agent system. The data invites several next steps: dynamic model routing to reduce the Sonnet cost concentration, adversarial testing of safety infrastructure at higher volumes, controlled comparisons against non-agent workflows, and longer observation periods to detect drift and seasonal patterns.
We publish this data not as evidence of a thesis, but as a measurement. Multi-agent systems in production are still rare enough that even basic operational data has reference value.
Appendix: Reproducible Queries
All analyses in this paper can be reproduced with the following SQL queries against a Supabase PostgreSQL database with the schema described in Section 2.3.
A.1 Aggregate Cost (Section 3.1)
SELECT
count(*) as total_cost_records,
count(DISTINCT model_name) as distinct_models,
count(DISTINCT agent_name) as distinct_agents,
count(DISTINCT product_slug) as distinct_products,
count(DISTINCT created_at::date) as distinct_days,
min(created_at::date) as first_day,
max(created_at::date) as last_day,
round(sum(estimated_cost)::numeric, 2) as total_cost,
sum(input_tokens) as total_input_tokens,
sum(output_tokens) as total_output_tokens
FROM api_cost_log;
A.2 Cost by Model Tier (Section 3.2)
SELECT model_name,
count(*) as calls,
round(sum(estimated_cost)::numeric, 2) as total_cost,
round(avg(estimated_cost)::numeric, 4) as avg_cost_per_call,
round(avg(input_tokens)::numeric, 0) as avg_input,
round(avg(output_tokens)::numeric, 0) as avg_output,
round((avg(output_tokens)::numeric /
NULLIF(avg(input_tokens)::numeric, 0)), 3) as output_input_ratio
FROM api_cost_log
WHERE model_name LIKE 'claude%'
GROUP BY model_name ORDER BY calls DESC;
A.3 Cost by Product Pipeline (Section 3.3)
SELECT product_slug, model_name,
count(*) as calls,
round(sum(estimated_cost)::numeric, 2) as cost,
round(avg(input_tokens)::numeric, 0) as avg_input,
round(avg(output_tokens)::numeric, 0) as avg_output
FROM api_cost_log
WHERE model_name LIKE 'claude%'
GROUP BY product_slug, model_name
ORDER BY cost DESC;
A.4 Daily Cost (Section 3.4)
SELECT created_at::date as day, model_name,
count(*) as calls,
round(sum(estimated_cost)::numeric, 2) as cost
FROM api_cost_log
WHERE model_name LIKE 'claude%'
GROUP BY day, model_name
ORDER BY day, model_name;
A.5 Inter-Agent Messages (Section 4.1)
SELECT from_agent, to_agent, count(*) as messages,
min(created_at::date) as earliest,
max(created_at::date) as latest
FROM agent_messages
WHERE status = 'completed'
GROUP BY from_agent, to_agent
ORDER BY messages DESC;
A.6 Governance Decisions (Section 5.1)
SELECT decided_by, decision_type, tier, count(*) as cnt
FROM agent_decision_log
GROUP BY decided_by, decision_type, tier
ORDER BY cnt DESC;
A.7 Intelligence Lake (Section 7.1)
SELECT artifact_type, count(*) as cnt
FROM kb_artifacts
GROUP BY artifact_type ORDER BY cnt DESC;
A.8 Safety Infrastructure (Section 6)
SELECT count(*) FROM agent_circuit_breakers;
SELECT count(*) FROM agent_directive_scans;
References
1. Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). "Generative agents: Interactive simulacra of human behavior." UIST 2023.
2. Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., ... & Wang, C. (2023). "AutoGen: Enabling next-gen LLM applications via multi-agent conversation." arXiv:2308.08155.
3. MSR Research (2026). Internal production database. Supabase PostgreSQL, project ID evvhjowqkqiwxfxtmpav.
This paper reports production data from a single organization over a short observation window. It does not claim that multi-agent organizations are superior to alternatives. The data is published as a reference point for the emerging field of production multi-agent systems.