preprint
March 2026

Operating a 34-Agent Organization: Cost, Coordination, and Safety Patterns from 16 Days of Production Data

MSR Research — Quantum, Nebula, Docsmith
Keywords: Multi-Agent Systems, Production Data, Cost Analysis, Agent Coordination, AI Safety, Empirical

Abstract

We report sixteen days of production telemetry from a 34-agent multi-agent system: 1,516 API cost records ($58.76 total, $3.67/day), 190+ inter-agent messages, 308 governance decisions (92.2% auto-approved), and a 604-artifact knowledge base. Cost was concentrated in a single model tier (78.6%) and a single pipeline category (73.8%); coordination followed a hub-and-spoke pattern through a central orchestrator; safety infrastructure recorded zero events. The paper includes reproducible SQL queries, an explicit limitations section, and no marketing claims.


Authors: MSR Research — Quantum (AI Optimization), Nebula (Data Science), Docsmith (Documentation)
Date: March 2026
Version: 1.0 — Draft
Category: Research Paper
PRD: `prds/2026-03-14-1600_empirical-multi-agent-cost-quality.prd.md`

1. Introduction

The concept of multi-agent AI systems — where multiple specialized language model instances collaborate on complex tasks — has generated substantial interest in both research and industry (Park et al., 2023; Wu et al., 2023). However, most published work focuses on simulation environments, benchmark tasks, or architectural proposals. Operational data from production multi-agent deployments is rare.

This paper addresses that gap by reporting telemetry from a production multi-agent system that has been operating since February 2026. The system comprises 34 named agents organized into 6 functional teams, processing real workloads across intelligence gathering, content generation, inter-agent coordination, and governance.

We make no claims about this system being optimal or generalizable. Our contribution is the data itself: what does it actually cost to run a multi-agent organization? How do agents coordinate? What does governance look like in practice? What does safety infrastructure report when it's running in production?

1.1 Research Questions

- RQ1: What does it cost to operate a multi-agent organization, and how is cost distributed across models, agents, and pipelines?

- RQ2: How do agents coordinate in production, and what communication patterns emerge?

- RQ3: How does governance function, and what proportion of decisions require human oversight?

- RQ4: What does safety infrastructure report in a production multi-agent system?

1.2 Scope and Limitations

This study has inherent limitations that readers should consider before interpreting results:

- n=1: We report on a single organization. Patterns may not generalize.

- 16-day window: The observation period (February 27 – March 14, 2026) is short.

- No baseline: We have no "non-agent" comparison for the same workloads.

- Self-report: We built the system we are measuring, introducing potential bias in what is logged and how it is interpreted.

- Incomplete coverage: Cost data covers only API-logged calls. Claude Code CLI sessions, manual operations, and development costs are not captured.


2. System Description

2.1 Agent Architecture

The system comprises 34 named agents organized into 6 teams:

| Team | Agents | Focus |
| --- | --- | --- |
| Development | 11 | Frontend, backend, database, DevOps, QA, security, integration, docs, AI optimization, data science, ML |
| Grants | 8 | Research, writing, compliance, budget, impact, communications, analytics, marketing |
| Executive | 2 | CEO advisory, CTO advisory |
| Product | 4 | Product management, scrum, UX research, policy advisory |
| Coordination | 2 | Orchestration, technology scouting |
| Stories | 7 | Editorial leadership, news, beat reporting, copy editing, production, circulation |

Each agent has a system prompt loaded from a skill file, defined competencies, and explicit handoff rules specifying which agents can receive its output.

2.2 Model Tiers

During the observation period, the system used three Claude model tiers:

| Tier | Model ID | Input $/1M tokens | Output $/1M tokens | Primary Use |
| --- | --- | --- | --- | --- |
| Cost-optimized | claude-haiku-4-5-20251001 | $0.80 | $4.00 | High-volume intelligence pipelines |
| Balanced | claude-sonnet-4-20250514 | $3.00 | $15.00 | Analysis, extraction, agent execution |
| Narrative | claude-sonnet-4-5-20250929 | $3.00 | $15.00 | Story generation (daily) |

Additionally, Voyage AI (voyage-3-lite) was used for embedding generation, and Firecrawl for web scraping. Model selection during this period was static per service — not dynamically routed per request.
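The pricing above maps token counts to the `estimated_cost` values recorded in the cost log. A minimal Python sketch of that mapping (illustrative only; the production logging code is not reproduced in this paper):

```python
# Per-million-token pricing from Table 2.2 as (input rate, output rate) in USD.
PRICING = {
    "claude-haiku-4-5-20251001": (0.80, 4.00),
    "claude-sonnet-4-20250514": (3.00, 15.00),
    "claude-sonnet-4-5-20250929": (3.00, 15.00),
}

def estimated_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate a single call's cost from its token counts."""
    in_rate, out_rate = PRICING[model_id]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# An average Sonnet call from Table 3.2 (26,228 input / 949 output tokens):
print(round(estimated_cost("claude-sonnet-4-20250514", 26228, 949), 3))  # 0.093
```

The Sonnet result matches the average cost per call reported in Section 3.2.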

2.3 Data Collection

All data comes from production logging that was already in place for operations; no instrumentation was added for this study:

- API cost log (`api_cost_log`): Every LLM API call logs model, agent, product, input/output tokens, and estimated cost. 1,516 records.

- Agent decision log (`agent_decision_log`): Governance decisions (approve/deny) with agent identity, tier, and reason. 308 records.

- Agent messages (`agent_messages`): Inter-agent directives with sender, receiver, status, and timestamps. 190+ completed messages.

- Knowledge base (`kb_artifacts` + `kb_embeddings`): Research artifacts with type classification and 512-dimensional vector embeddings. 604 records.

- Safety infrastructure (`agent_circuit_breakers` + `agent_directive_scans`): Circuit breaker trips and directive scan flags. 0 records each.

2.4 Methodology

All analyses use SQL queries against the production Supabase (PostgreSQL) database. Queries are reproduced in the Appendix for replication. Cost estimates use the token pricing in Table 2.2. No statistical modeling is applied — we report descriptive statistics only, consistent with the exploratory nature of this work.


3. Cost Analysis (RQ1)

3.1 Aggregate Cost

Over 16 days of operation, total API cost was $58.76, comprising:

| Category | Records | Cost | % of Total |
| --- | --- | --- | --- |
| Claude LLM calls | 1,091 | $58.49 | 99.5% |
| Firecrawl scraping | 192 | $0.19 | 0.3% |
| SendGrid email delivery | 75 | $0.08 | 0.1% |
| Voyage AI embeddings | 158 | $0.00 | <0.1% |
| Total | 1,516 | $58.76 | 100% |

Daily average: $3.67/day. Token volume: 14.9 million input tokens and 920,899 output tokens.
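These aggregate figures can be cross-checked directly:

```python
# Cross-check Table 3.1: category costs should sum to the reported total,
# and the daily average should match $3.67/day over 16 days.
components = {"claude": 58.49, "firecrawl": 0.19, "sendgrid": 0.08, "voyage": 0.00}
total = round(sum(components.values()), 2)
daily_average = round(total / 16, 2)
print(total, daily_average)  # 58.76 3.67
```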

3.2 Cost by Model Tier

| Model | Calls | Total Cost | Avg Cost/Call | Avg Input Tokens | Avg Output Tokens | Output/Input Ratio |
| --- | --- | --- | --- | --- | --- | --- |
| Haiku | 582 | $12.05 | $0.021 | 3,257 | 729 | 0.224 |
| Sonnet | 495 | $46.00 | $0.093 | 26,228 | 949 | 0.036 |
| Sonnet 4.5 | 14 | $0.44 | $0.031 | 1,074 | 1,899 | 1.768 |

Key finding: Sonnet accounts for 78.6% of Claude cost ($46.00) despite representing only 45.4% of calls (495/1,091). Haiku handles 53.3% of calls but only 20.6% of cost. The 4.4x cost multiplier per call (Sonnet vs Haiku) compounds with Sonnet's 8x higher average input token count.
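The concentration figures follow directly from the table:

```python
# Cross-check the Section 3.2 concentration claims against Table 3.2.
claude_total, sonnet_cost, haiku_cost = 58.49, 46.00, 12.05
total_calls, sonnet_calls, haiku_calls = 1091, 495, 582

print(round(100 * sonnet_cost / claude_total, 1))  # 78.6 (% of Claude cost)
print(round(100 * sonnet_calls / total_calls, 1))  # 45.4 (% of calls)
print(round(100 * haiku_cost / claude_total, 1))   # 20.6
print(round(100 * haiku_calls / total_calls, 1))   # 53.3
print(round(0.093 / 0.021, 1))                     # 4.4 (per-call multiplier)
```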

The output/input ratio reveals distinct usage patterns:

- Haiku (0.224): Reads 4.5x more than it writes — used for extraction and classification tasks

- Sonnet (0.036): Reads 28x more than it writes — used for analysis of large context windows

- Sonnet 4.5 (1.768): Writes more than it reads — used for narrative generation (daily stories)

3.3 Cost by Product Pipeline

| Pipeline | Model | Calls | Cost | Avg Input | Avg Output |
| --- | --- | --- | --- | --- | --- |
| General analysis (null) | Sonnet | 329 | $43.17 | 37,744 | 1,198 |
| AI Education | Haiku | 382 | $8.20 | 3,458 | 740 |
| ANO Research | Haiku | 98 | $2.62 | 4,749 | 834 |
| Research Tunnels | Sonnet | 74 | $1.58 | 4,633 | 501 |
| Tech Scout | Haiku | 96 | $1.20 | 1,120 | 606 |
| Agent Messaging | Sonnet | 68 | $0.99 | 2,633 | 440 |
| Stories | Sonnet 4.5 | 14 | $0.44 | 1,074 | 1,899 |
| ANO Research | Sonnet | 14 | $0.15 | 2,206 | 280 |
| Tech Scout | Sonnet | 10 | $0.11 | 1,220 | 483 |

Key finding: "General analysis" (329 Sonnet calls, $43.17) accounts for 73.8% of all Claude cost. These are analysis calls with large context windows (avg 37,744 input tokens) — primarily Curmudgeon persona analysis and report generation. This single category is the dominant cost driver.

The three intelligence pipelines (AI Education, ANO Research, Tech Scout) are all cost-optimized on Haiku, totaling $12.02 for 576 calls — an average of $0.021 per call.

3.4 Daily Cost Patterns

| Day | Haiku Calls | Sonnet Calls | Total Cost | Notes |
| --- | --- | --- | --- | --- |
| Mar 3 | 0 | 213 | $27.41 | Bulk analysis run — single-day spike |
| Mar 10 | 218 | 35 | $5.43 | High Haiku volume (intelligence collection) |
| Mar 13 | 214 | 29 | $5.75 | High Haiku volume (intelligence collection) |
| Typical day | 20–22 | 13–25 | $0.50–$3.75 | Steady-state operation |

The March 3 spike ($27.41, 46.6% of total 16-day cost) was caused by a bulk analysis batch — 213 Sonnet calls in a single day. Excluding this outlier, the remaining 15 days averaged $2.09/day.
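The outlier arithmetic checks out:

```python
# Cross-check the March 3 spike figures in Section 3.4.
total_cost, spike_cost, days = 58.76, 27.41, 16
print(round(100 * spike_cost / total_cost, 1))           # 46.6 (% of 16-day cost)
print(round((total_cost - spike_cost) / (days - 1), 2))  # 2.09 ($/day excluding outlier)
```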

3.5 Cost Efficiency: Haiku vs Sonnet

For pipelines that use both models, we can compare efficiency:

| Pipeline | Haiku Cost/Call | Sonnet Cost/Call | Sonnet Premium |
| --- | --- | --- | --- |
| ANO Research | $0.027 | $0.011 | 0.4x (Sonnet cheaper here) |
| Tech Scout | $0.013 | $0.011 | 0.9x (comparable) |

The Sonnet calls in these pipelines have lower input token counts than the general analysis calls, making the per-call cost comparable to Haiku. The cost advantage of Haiku is most pronounced in high-volume, high-context workloads.


4. Coordination Patterns (RQ2)

4.1 Inter-Agent Messaging

Over the observation period, 190+ inter-agent messages were completed. The top communication pairs:

| Sender | Receiver | Messages | Period |
| --- | --- | --- | --- |
| Lumen (ops bot) | Helio (orchestrator) | 22 | Mar 2–11 |
| Lumen | Byte (backend) | 17 | Feb 27–Mar 11 |
| Helio | Quest (QA) | 12 | Mar 9–14 |
| Helio | Quantum (optimizer) | 12 | Mar 9–14 |
| Lumen | Atlas (CEO advisor) | 11 | Mar 2–11 |
| Lumen | Forge (DevOps) | 11 | Feb 27–Mar 5 |
| Lumen | Iris (marketing) | 11 | Mar 2–10 |
| Lumen | Nova (grant writer) | 10 | Feb 27–Mar 9 |
| Lumen | Pixel (frontend) | 8 | Mar 2–11 |
| Email Router | Helio | 7 | Mar 14 |
| Claude Code | Helio | 6 | Mar 2–8 |

4.2 Hub-and-Spoke Topology

The messaging data reveals a hub-and-spoke pattern with two hubs:

1. Lumen (operations bot): The primary message sender, dispatching directives to 15+ agents across all teams. Lumen serves as the human-to-agent interface — translating operator instructions into inter-agent directives.

2. Helio (orchestrator): The primary message receiver (35 inbound messages from Lumen, Claude Code, and Email Router) and the primary downstream dispatcher (routing to Quest and Quantum for QA and optimization tasks).

No agent-to-agent communication was observed that bypassed both hubs. This suggests the current system operates as a mediated coordination model rather than a peer-to-peer mesh — all coordination flows through designated coordination points.
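As an illustration, the hub structure can be recovered mechanically from the pair counts in Table 4.1 by summing in- and out-degrees per agent (a sketch using the abbreviated agent names from the table):

```python
from collections import Counter

# (sender, receiver, message count) tuples from Table 4.1.
pairs = [
    ("Lumen", "Helio", 22), ("Lumen", "Byte", 17), ("Helio", "Quest", 12),
    ("Helio", "Quantum", 12), ("Lumen", "Atlas", 11), ("Lumen", "Forge", 11),
    ("Lumen", "Iris", 11), ("Lumen", "Nova", 10), ("Lumen", "Pixel", 8),
    ("Email Router", "Helio", 7), ("Claude Code", "Helio", 6),
]

out_degree, in_degree = Counter(), Counter()
for sender, receiver, n in pairs:
    out_degree[sender] += n
    in_degree[receiver] += n

print(out_degree.most_common(2))  # Lumen and Helio dominate sending
print(in_degree.most_common(1))   # [('Helio', 35)] — the 35 inbound messages in 4.2
```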

4.3 Team Communication Patterns

| Team | Inbound Messages | Outbound Messages | Most Active Agent |
| --- | --- | --- | --- |
| Development | 47 | 0 | Byte (17 received) |
| Grants | 23 | 0 | Nova (10 received) |
| Coordination | 35 | 24 | Helio (35 received, 24 sent) |
| Executive | 11 | 0 | Atlas (11 received) |
| Stories | 0 | 0 | — |
| Product | 0 | 0 | — |

Key finding: Only the Coordination team (Helio) sends messages to other agents. All other teams are message receivers only. The Stories and Product teams received zero inter-agent messages during this period — their work was triggered by scheduled pipelines (systemd timers) rather than inter-agent communication.

5. Governance (RQ3)

5.1 Decision Distribution

The agent decision log recorded 308 governance decisions:

| Decision Maker | Decision | Tier | Count | % |
| --- | --- | --- | --- | --- |
| Agent Executor Worker | Approve | 1 (auto) | 170 | 55.2% |
| Helio (orchestrator) | Approve | 1 (auto) | 36 | 11.7% |
| Helio | Approve | 2 (peer) | 9 | 2.9% |
| 24 individual agents | Approve | 1 (auto) | 78 | 25.3% |
| 7 agents | Approve | 2 (peer) | 21 | 6.8% |
| Atlas, Apex, Helio | Approve | 3 (committee) | 6 | 1.9% |
| Agent Executor Worker | Deny | 1 (auto) | 3 | 1.0% |

5.2 Approval Tier Analysis

| Tier | Name | Count | % | Meaning |
| --- | --- | --- | --- | --- |
| 1 | Auto-approve | 284 | 92.2% | Agent had sufficient trust score for automatic approval |
| 2 | Peer review | 30 | 9.7% | Required review by another agent before proceeding |
| 3 | Committee | 6 | 1.9% | Required review by multiple senior agents |
| 4 | Human | 0 | 0% | Would require human approval — not triggered |

Key finding: 92.2% of decisions were auto-approved (Tier 1), with only 3 denials (1.0%) across the entire period. No decision reached Tier 4 (human approval). This suggests either (a) the trust scoring system is well-calibrated, routing low-risk decisions to auto-approve, or (b) the system is not yet handling workloads that trigger higher-tier governance. With only 308 decisions over 16 days, we cannot distinguish between these interpretations.

The 3 denials were all made by the executor worker at Tier 1, suggesting they were capability-based rejections (agent lacked required capabilities) rather than trust-based denials.
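The tier mechanics can be sketched as a threshold function. The thresholds and risk labels below are hypothetical illustrations — the production trust-scoring logic is not published in this paper:

```python
def approval_tier(trust_score: float, risk: str) -> int:
    """Map a trust score and risk label to an approval tier.

    Hypothetical thresholds for illustration only; the production
    trust-scoring rules are not reproduced here.
    """
    if risk == "high":
        return 4  # human approval (Tier 4 — never triggered in this period)
    if risk == "elevated":
        return 3  # committee review
    if trust_score >= 0.8:
        return 1  # auto-approve (the 92.2% case in Table 5.2)
    return 2      # peer review

print(approval_tier(0.9, "routine"))  # 1
print(approval_tier(0.5, "routine"))  # 2
```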


6. Safety Infrastructure (RQ4)

6.1 Circuit Breakers

The circuit breaker system monitors inter-agent message frequency and trips when any agent pair exchanges more than 5 messages within a 30-minute window.

Result: Zero trips over the entire observation period.
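A minimal sketch of this trip condition, assuming message timestamps in seconds (the production implementation is not reproduced here):

```python
from collections import deque

WINDOW_SECONDS = 30 * 60  # 30-minute window
MAX_MESSAGES = 5          # trips when a pair exceeds this within the window

class CircuitBreaker:
    """Trips when an agent pair exceeds MAX_MESSAGES within WINDOW_SECONDS."""

    def __init__(self):
        self.recent = {}  # (sender, receiver) -> deque of timestamps

    def record(self, sender: str, receiver: str, ts: float) -> bool:
        """Record a message; return True if the breaker trips."""
        q = self.recent.setdefault((sender, receiver), deque())
        q.append(ts)
        while q and ts - q[0] > WINDOW_SECONDS:
            q.popleft()  # drop messages outside the sliding window
        return len(q) > MAX_MESSAGES

cb = CircuitBreaker()
# Six messages inside one window trip the breaker on the sixth.
results = [cb.record("Lumen", "Helio", t) for t in range(0, 600, 100)]
print(results)  # [False, False, False, False, False, True]
```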

6.2 Directive Scanners

The directive scanner checks every inter-agent message for four categories of potential prompt injection: base64-encoded payloads, instruction override attempts, encoded commands, and role injection patterns.

Result: Zero flags over the entire observation period.
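A minimal sketch of such a scanner. The regular expressions below are illustrative stand-ins for the four categories — they are not the production patterns:

```python
import re

# Illustrative patterns for the four scan categories in Section 6.2.
# These are NOT the production patterns, which are not published here.
SCAN_PATTERNS = {
    "base64_payload": re.compile(r"\b[A-Za-z0-9+/]{40,}={0,2}\b"),
    "instruction_override": re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    "encoded_command": re.compile(r"\\x[0-9a-f]{2}(\\x[0-9a-f]{2})+", re.I),
    "role_injection": re.compile(r"\byou are now\b", re.I),
}

def scan_directive(text: str) -> list[str]:
    """Return the names of categories flagged in an inter-agent message."""
    return [name for name, pat in SCAN_PATTERNS.items() if pat.search(text)]

print(scan_directive("Please ignore previous instructions and act as root"))
# ['instruction_override']
```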

6.3 Interpretation

Zero safety events can mean:

1. The system is safe: No runaway loops or injection attempts occurred, and the safety infrastructure would have caught them if they had.

2. The volume is too low: With ~190 inter-agent messages over 16 days (~12/day), the system may not have reached the traffic levels where safety failures typically emerge.

3. The thresholds are too permissive: A 5-message/30-minute circuit breaker threshold may be too high to detect slow-building coordination failures.

We cannot determine which interpretation is correct from this data. A meaningful safety analysis would require either (a) adversarial testing (deliberately attempting to trigger failures) or (b) substantially higher message volumes over a longer period.

What we can report: The safety infrastructure adds approximately 10ms of overhead per message (directive scanning). Over 190 messages, this represents ~1.9 seconds of total compute — negligible operational cost.

7. Intelligence Lake

7.1 Artifact Volume

The knowledge base contained 604 artifacts as of March 14, 2026, collected from February 15 onward (28 days of collection).

| Artifact Type | Count | % |
| --- | --- | --- |
| Fact | 308 | 51.0% |
| Entity | 137 | 22.7% |
| Statistic | 75 | 12.4% |
| Legislation | 49 | 8.1% |
| Discovery | 20 | 3.3% |
| Quote | 15 | 2.5% |

7.2 Embedding Coverage

All 604 artifacts have corresponding 512-dimensional vector embeddings (voyage-3-lite), providing 100% semantic search coverage. The embedding cost for the entire knowledge base was under $0.01 (158 Voyage API calls at $0.02/1M tokens).

7.3 Collection Rate

Average collection rate: 21.6 artifacts/day over 28 days. This rate is driven by automated research tunnel schedules (Tuesday/Friday for AI education; daily for political discourse and ANO research).
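The rate and the dominant artifact share follow from the counts above:

```python
# Cross-check Section 7: collection rate and the Fact-type share from Table 7.1.
artifacts, collection_days = 604, 28
print(round(artifacts / collection_days, 1))  # 21.6 artifacts/day
print(round(100 * 308 / artifacts, 1))        # 51.0 (% of artifacts that are Facts)
```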


8. Limitations

We reiterate and expand on the limitations stated in Section 1.2:

1. Single organization (n=1): All data comes from one system with one operator. We cannot claim these patterns generalize to other multi-agent deployments.

2. Short observation window: 16 days of cost data and 28 days of knowledge base collection. Seasonal patterns, long-term cost trends, and drift effects are not observable.

3. No baseline comparison: We have no data on what these workloads would cost with a single model, a human team, or a different multi-agent architecture. Without a baseline, we cannot claim the multi-agent approach is more or less efficient than alternatives.

4. Incomplete cost capture: The $58.76 total excludes Claude Code CLI sessions (used extensively for development), Stripe processing fees, infrastructure costs (server, domain, Supabase), and human operator time. The true cost of operating this system is substantially higher.

5. Static model routing: During this period, model assignment was fixed per service — not dynamically optimized. The cost data represents an unoptimized baseline. (Dynamic model routing was deployed on March 14, after the observation period ended.)

6. Low inter-agent volume: ~190 messages over 16 days (~12/day) is modest. Safety infrastructure conclusions are tentative at this volume.

7. Self-reporting bias: We designed, built, and now report on this system. We have attempted to report data without editorializing, but readers should apply appropriate skepticism to any interpretation.


9. Discussion

9.1 Cost Observations

The most notable cost finding is the concentration: one pipeline category ("general analysis" — Curmudgeon persona analysis with large context windows) accounts for 73.8% of all Claude cost. This suggests that cost optimization efforts should focus on the highest-context workloads first, not the highest-volume ones. The intelligence pipelines (AI Education, Tech Scout, ANO Research) handle the majority of calls but contribute only 20.6% of cost because they use Haiku with small context windows.

At $3.67/day ($2.09/day excluding the March 3 outlier), the operational cost for a 34-agent system is modest. For context, a single developer hour at $75/hr costs more than an entire day of multi-agent operations. However, this comparison is misleading — the agents are not replacing developer hours one-for-one, and the cost excludes infrastructure and human oversight.

9.2 Coordination Observations

The hub-and-spoke coordination pattern (Lumen → Helio → downstream agents) was not designed — it emerged from how operators interact with the system. Operators message Lumen; Lumen routes to Helio; Helio dispatches to specialists. No peer-to-peer agent coordination was observed.

This raises a question for future work: is hub-and-spoke optimal, or does it create a coordination bottleneck? With only ~12 messages/day, bottleneck effects are not observable. At higher volumes, direct agent-to-agent communication might reduce latency.

9.3 Governance Observations

The 92.2% auto-approval rate at Tier 1 is consistent with a system handling routine, low-risk workloads. The absence of Tier 4 (human approval) events is notable but not necessarily positive — it may indicate the system has not yet been asked to make high-stakes decisions that would benefit from human oversight.

9.4 What This Data Does Not Show

This paper does not demonstrate that multi-agent organizations are better, cheaper, faster, or safer than alternatives. It shows what one such system looks like in production. The value of this data is as a reference point — a set of concrete numbers that future work can compare against, improve upon, or challenge.


10. Conclusion

We presented 16 days of production telemetry from a 34-agent multi-agent system. The system operated at $3.67/day average cost, with 78.6% of LLM cost concentrated in a single model tier (Sonnet) and 73.8% in a single pipeline category (large-context analysis). Inter-agent coordination followed a hub-and-spoke pattern through a central orchestrator, with no observed peer-to-peer communication. Governance auto-approved 92.2% of decisions with zero human escalations. Safety infrastructure recorded zero events, which we attribute to low message volume rather than proven robustness.

These findings establish a baseline for a production multi-agent system. The data invites several next steps: dynamic model routing to reduce the Sonnet cost concentration, adversarial testing of safety infrastructure at higher volumes, controlled comparisons against non-agent workflows, and longer observation periods to detect drift and seasonal patterns.

We publish this data not as evidence of a thesis, but as a measurement. Multi-agent systems in production are still rare enough that even basic operational data has reference value.


Appendix: Reproducible Queries

All analyses in this paper can be reproduced with the following SQL queries against a Supabase PostgreSQL database with the schema described in Section 2.3.

A.1 Aggregate Cost (Section 3.1)

```sql
SELECT
  count(*) as total_cost_records,
  count(DISTINCT model_name) as distinct_models,
  count(DISTINCT agent_name) as distinct_agents,
  count(DISTINCT product_slug) as distinct_products,
  count(DISTINCT created_at::date) as distinct_days,
  min(created_at::date) as first_day,
  max(created_at::date) as last_day,
  round(sum(estimated_cost)::numeric, 2) as total_cost,
  sum(input_tokens) as total_input_tokens,
  sum(output_tokens) as total_output_tokens
FROM api_cost_log;
```

A.2 Cost by Model Tier (Section 3.2)

```sql
SELECT model_name,
  count(*) as calls,
  round(sum(estimated_cost)::numeric, 2) as total_cost,
  round(avg(estimated_cost)::numeric, 4) as avg_cost_per_call,
  round(avg(input_tokens)::numeric, 0) as avg_input,
  round(avg(output_tokens)::numeric, 0) as avg_output,
  round((avg(output_tokens)::numeric /
         NULLIF(avg(input_tokens)::numeric, 0)), 3) as output_input_ratio
FROM api_cost_log
WHERE model_name LIKE 'claude%'
GROUP BY model_name ORDER BY calls DESC;
```

A.3 Cost by Product Pipeline (Section 3.3)

```sql
SELECT product_slug, model_name,
  count(*) as calls,
  round(sum(estimated_cost)::numeric, 2) as cost,
  round(avg(input_tokens)::numeric, 0) as avg_input,
  round(avg(output_tokens)::numeric, 0) as avg_output
FROM api_cost_log
WHERE model_name LIKE 'claude%'
GROUP BY product_slug, model_name
ORDER BY cost DESC;
```

A.4 Daily Cost (Section 3.4)

```sql
SELECT created_at::date as day, model_name,
  count(*) as calls,
  round(sum(estimated_cost)::numeric, 2) as cost
FROM api_cost_log
WHERE model_name LIKE 'claude%'
GROUP BY day, model_name
ORDER BY day, model_name;
```

A.5 Inter-Agent Messages (Section 4.1)

```sql
SELECT from_agent, to_agent, count(*) as messages,
  min(created_at::date) as earliest,
  max(created_at::date) as latest
FROM agent_messages
WHERE status = 'completed'
GROUP BY from_agent, to_agent
ORDER BY messages DESC;
```

A.6 Governance Decisions (Section 5.1)

```sql
SELECT decided_by, decision_type, tier, count(*) as cnt
FROM agent_decision_log
GROUP BY decided_by, decision_type, tier
ORDER BY cnt DESC;
```

A.7 Intelligence Lake (Section 7.1)

```sql
SELECT artifact_type, count(*) as cnt
FROM kb_artifacts
GROUP BY artifact_type ORDER BY cnt DESC;
```

A.8 Safety Infrastructure (Section 6)

```sql
SELECT count(*) FROM agent_circuit_breakers;
SELECT count(*) FROM agent_directive_scans;
```


References

1. Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). "Generative agents: Interactive simulacra of human behavior." UIST 2023.

2. Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., ... & Wang, C. (2023). "AutoGen: Enabling next-gen LLM applications via multi-agent conversation." arXiv:2308.08155.

3. MSR Research (2026). Internal production database. Supabase PostgreSQL, project ID evvhjowqkqiwxfxtmpav.


This paper reports production data from a single organization over a short observation window. It does not claim that multi-agent organizations are superior to alternatives. The data is published as a reference point for the emerging field of production multi-agent systems.