Operating a 34-Agent Organization: Cost, Coordination, and Safety Patterns from 16 Days of Production Data
Authors: MSR Research — Quantum (AI Optimization), Nebula (Data Science), Docsmith (Documentation)
Date: March 2026
Version: 1.0 — Draft
Category: Research Paper
PRD: `prds/2026-03-14-1600_empirical-multi-agent-cost-quality.prd.md`
1. Introduction
The concept of multi-agent AI systems — where multiple specialized language model instances collaborate on complex tasks — has generated substantial interest in both research and industry (Park et al., 2023; Wu et al., 2023). However, most published work focuses on simulation environments, benchmark tasks, or architectural proposals. Operational data from production multi-agent deployments is rare.
This paper addresses that gap by reporting telemetry from a production multi-agent system that has been operating since February 2026. The system comprises 34 named agents organized into 6 functional teams, processing real workloads across intelligence gathering, content generation, inter-agent coordination, and governance.
We make no claims about this system being optimal or generalizable. Our contribution is the data itself: what does it actually cost to run a multi-agent organization? How do agents coordinate? What does governance look like in practice? What does safety infrastructure report when it's running in production?
1.1 Research Questions
- RQ1: What does it cost to operate a multi-agent organization, and how is cost distributed across models, agents, and pipelines?
- RQ2: How do agents coordinate in production, and what communication patterns emerge?
- RQ3: How does governance function, and what proportion of decisions require human oversight?
- RQ4: What does safety infrastructure report in a production multi-agent system?
1.2 Scope and Limitations
This study has inherent limitations that readers should consider before interpreting results:
- n=1: We report on a single organization. Patterns may not generalize.
- 16-day window: The observation period (February 27 – March 14, 2026) is short.
- No baseline: We have no "non-agent" comparison for the same workloads.
- Self-report: We built the system we are measuring, introducing potential bias in what is logged and how it is interpreted.
- Incomplete coverage: Cost data covers only API-logged calls. Claude Code CLI sessions, manual operations, and development costs are not captured.
2. System Description
2.1 Agent Architecture
The system comprises 34 named agents organized into 6 teams:
| Team | Agents | Focus |
|---|---|---|
| Development | 11 | Frontend, backend, database, DevOps, QA, security, integration, docs, AI optimization, data science, ML |
| Grants | 8 | Research, writing, compliance, budget, impact, communications, analytics, marketing |
| Executive | 2 | CEO advisory, CTO advisory |
| Product | 4 | Product management, scrum, UX research, policy advisory |
| Coordination | 2 | Orchestration, technology scouting |
| Stories | 7 | Editorial leadership, news, beat reporting, copy editing, production, circulation |
Each agent has a system prompt loaded from a skill file, defined competencies, and explicit handoff rules specifying which agents can receive its output.
2.2 Model Tiers
During the observation period, the system used three Claude model tiers:
| Tier | Model ID | Input $/1M tokens | Output $/1M tokens | Primary Use |
|---|---|---|---|---|
| Cost-optimized | claude-haiku-4-5-20251001 | $0.80 | $4.00 | High-volume intelligence pipelines |
| Balanced | claude-sonnet-4-20250514 | $3.00 | $15.00 | Analysis, extraction, agent execution |
| Narrative | claude-sonnet-4-5-20250929 | $3.00 | $15.00 | Story generation (daily) |
Additionally, Voyage AI (voyage-3-lite) was used for embedding generation, and Firecrawl for web scraping. Model selection during this period was static per service — not dynamically routed per request.
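The cost figures reported later are derived from logged token counts and the rates in Table 2.2 (see Section 2.4). As a minimal sketch, assuming estimated cost is simply tokens multiplied by the per-million rates above, a per-call estimate can be expressed directly in SQL. The rates are hard-coded for illustration, and the production logger may apply different rates (for example, prompt-cache pricing), so this is illustrative rather than a reconstruction of the logging code:

-- Sketch only: per-call cost from the Table 2.2 rates, hard-coded for illustration.
SELECT model_name, input_tokens, output_tokens,
       CASE
         WHEN model_name LIKE 'claude-haiku%'  THEN input_tokens * 0.80 / 1e6 + output_tokens * 4.00  / 1e6
         WHEN model_name LIKE 'claude-sonnet%' THEN input_tokens * 3.00 / 1e6 + output_tokens * 15.00 / 1e6
       END as est_cost_from_listed_rates
FROM api_cost_log
WHERE model_name LIKE 'claude%'
LIMIT 5;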
2.3 Data Collection
All data comes from logging the production system already performs; nothing was instrumented specifically for this study. A minimal schema sketch for the cost log follows the list:
- API cost log (`api_cost_log`): Every LLM API call logs model, agent, product, input/output tokens, and estimated cost. 1,516 records.
- Agent decision log (`agent_decision_log`): Governance decisions (approve/deny) with agent identity, tier, and reason. 308 records.
- Agent messages (`agent_messages`): Inter-agent directives with sender, receiver, status, and timestamps. 190+ completed messages.
- Knowledge base (`kb_artifacts` + `kb_embeddings`): Research artifacts with type classification and 512-dimensional vector embeddings. 604 records.
- Safety infrastructure (`agent_circuit_breakers` + `agent_directive_scans`): Circuit breaker trips and directive scan flags. 0 records each.
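For orientation, the following is a schema sketch for `api_cost_log`, inferred from the columns used in the Appendix queries. The column types, the surrogate key, and the defaults are assumptions; the production schema may differ.

-- Assumed schema sketch for the cost log (types, key, and defaults are illustrative).
CREATE TABLE IF NOT EXISTS api_cost_log (
    id             bigserial PRIMARY KEY,          -- assumed surrogate key
    model_name     text NOT NULL,
    agent_name     text,
    product_slug   text,                           -- NULL for general analysis calls
    input_tokens   bigint NOT NULL DEFAULT 0,
    output_tokens  bigint NOT NULL DEFAULT 0,
    estimated_cost numeric(10, 6) NOT NULL DEFAULT 0,
    created_at     timestamptz NOT NULL DEFAULT now()
);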
2.4 Methodology
All analyses use SQL queries against the production Supabase (PostgreSQL) database. Queries are reproduced in the Appendix for replication. Cost estimates use the token pricing in Table 2.2. No statistical modeling is applied — we report descriptive statistics only, consistent with the exploratory nature of this work.
3. Cost Analysis (RQ1)
3.1 Aggregate Cost
Over 16 days of operation, total API cost was $58.76, comprising:
| Category | Records | Cost | % of Total |
|---|---|---|---|
| Claude LLM calls | 1,091 | $58.49 | 99.5% |
| Firecrawl scraping | 192 | $0.19 | 0.3% |
| SendGrid email delivery | 75 | $0.08 | 0.1% |
| Voyage AI embeddings | 158 | $0.00 | <0.1% |
| Total | 1,516 | $58.76 | 100% |
Daily average: $3.67/day. Token volume: 14.9 million input tokens and 920,899 output tokens.
3.2 Cost by Model Tier
| Model | Calls | Total Cost | Avg Cost/Call | Avg Input Tokens | Avg Output Tokens | Output/Input Ratio |
|---|---|---|---|---|---|---|
| Haiku | 582 | $12.05 | $0.021 | 3,257 | 729 | 0.224 |
| Sonnet | 495 | $46.00 | $0.093 | 26,228 | 949 | 0.036 |
| Sonnet 4.5 | 14 | $0.44 | $0.031 | 1,074 | 1,899 | 1.768 |
The output/input ratio reveals distinct usage patterns:
- Haiku (0.224): Reads 4.5x more than it writes — used for extraction and classification tasks
- Sonnet (0.036): Reads 28x more than it writes — used for analysis of large context windows
- Sonnet 4.5 (1.768): Writes more than it reads — used for narrative generation (daily stories)
3.3 Cost by Product Pipeline
| Pipeline | Model | Calls | Cost | Avg Input | Avg Output |
|---|---|---|---|---|---|
| General analysis (null) | Sonnet | 329 | $43.17 | 37,744 | 1,198 |
| AI Education | Haiku | 382 | $8.20 | 3,458 | 740 |
| ANO Research | Haiku | 98 | $2.62 | 4,749 | 834 |
| Research Tunnels | Sonnet | 74 | $1.58 | 4,633 | 501 |
| Tech Scout | Haiku | 96 | $1.20 | 1,120 | 606 |
| Agent Messaging | Sonnet | 68 | $0.99 | 2,633 | 440 |
| Stories | Sonnet 4.5 | 14 | $0.44 | 1,074 | 1,899 |
| ANO Research | Sonnet | 14 | $0.15 | 2,206 | 280 |
| Tech Scout | Sonnet | 10 | $0.11 | 1,220 | 483 |
The three intelligence pipelines (AI Education, ANO Research, Tech Scout) are all cost-optimized on Haiku, totaling $12.02 for 576 calls — an average of $0.021 per call.
3.4 Daily Cost Patterns
| Day | Haiku Calls | Sonnet Calls | Total Cost | Notes |
|---|---|---|---|---|
| Mar 3 | 0 | 213 | $27.41 | Bulk analysis run — single-day spike |
| Mar 10 | 218 | 35 | $5.43 | High Haiku volume (intelligence collection) |
| Mar 13 | 214 | 29 | $5.75 | High Haiku volume (intelligence collection) |
| Typical day | 20–22 | 13–25 | $0.50–$3.75 | Steady-state operation |
The March 3 spike ($27.41, 46.6% of total 16-day cost) was caused by a bulk analysis batch — 213 Sonnet calls in a single day. Excluding this outlier, the remaining 15 days averaged $2.09/day.
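The outlier-excluded daily average can be reproduced with a one-line variant of query A.4; the date literal assumes timestamps are stored in the operating timezone.

-- Daily average cost excluding the March 3 bulk-analysis spike.
SELECT round((sum(estimated_cost) / count(DISTINCT created_at::date))::numeric, 2)
       as avg_daily_cost_excluding_outlier
FROM api_cost_log
WHERE created_at::date <> DATE '2026-03-03';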
3.5 Cost Efficiency: Haiku vs Sonnet
For pipelines that use both models, we can compare efficiency:
| Pipeline | Haiku Cost/Call | Sonnet Cost/Call | Sonnet Premium |
|---|---|---|---|
| ANO Research | $0.027 | $0.011 | 0.4x (Sonnet cheaper here) |
| Tech Scout | $0.013 | $0.011 | 0.9x (comparable) |
The Sonnet calls in these pipelines have lower input token counts than the general analysis calls, making the per-call cost comparable to Haiku. The cost advantage of Haiku is most pronounced in high-volume, high-context workloads.
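The comparison above can be reproduced with a per-pipeline, per-tier rollup; this is a sketch using the same columns as query A.3.

-- Average per-call cost by pipeline and model tier.
SELECT product_slug,
       CASE WHEN model_name LIKE 'claude-haiku%' THEN 'haiku' ELSE 'sonnet' END as tier,
       count(*) as calls,
       round(avg(estimated_cost)::numeric, 3) as avg_cost_per_call
FROM api_cost_log
WHERE model_name LIKE 'claude%' AND product_slug IS NOT NULL
GROUP BY product_slug, tier
ORDER BY product_slug, tier;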
4. Coordination Patterns (RQ2)
4.1 Inter-Agent Messaging
Over the observation period, 190+ inter-agent messages were completed. The top communication pairs:
| Sender | Receiver | Messages | Period |
|---|---|---|---|
| Lumen (ops bot) | Helio (orchestrator) | 22 | Mar 2–11 |
| Lumen | Byte (backend) | 17 | Feb 27–Mar 11 |
| Helio | Quest (QA) | 12 | Mar 9–14 |
| Helio | Quantum (optimizer) | 12 | Mar 9–14 |
| Lumen | Atlas (CEO advisor) | 11 | Mar 2–11 |
| Lumen | Forge (DevOps) | 11 | Feb 27–Mar 5 |
| Lumen | Iris (marketing) | 11 | Mar 2–10 |
| Lumen | Nova (grant writer) | 10 | Feb 27–Mar 9 |
| Lumen | Pixel (frontend) | 8 | Mar 2–11 |
| Email Router | Helio | 7 | Mar 14 |
| Claude Code | Helio | 6 | Mar 2–8 |
4.2 Hub-and-Spoke Topology
The messaging data reveals a hub-and-spoke pattern with two hubs:
1. Lumen (operations bot): The primary message sender, dispatching directives to 15+ agents across four of the six teams (see Section 4.3). Lumen serves as the human-to-agent interface — translating operator instructions into inter-agent directives.
2. Helio (orchestrator): The primary message receiver (35 inbound messages from Lumen, Claude Code, and Email Router) and the primary downstream dispatcher (routing to Quest and Quantum for QA and optimization tasks).
No agent-to-agent communication was observed that bypassed both hubs. This suggests the current system operates as a mediated coordination model rather than a peer-to-peer mesh — all coordination flows through designated coordination points.
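The hub structure is visible directly in the degree distribution of the message graph. The following sketch counts, for each agent, its distinct counterparties and total completed messages in either direction.

-- Message-graph degree per agent; hub agents (Lumen, Helio) dominate both columns.
SELECT agent, count(DISTINCT counterparty) as distinct_partners, sum(n) as messages
FROM (
    SELECT from_agent as agent, to_agent as counterparty, count(*) as n
    FROM agent_messages WHERE status = 'completed'
    GROUP BY from_agent, to_agent
    UNION ALL
    SELECT to_agent, from_agent, count(*)
    FROM agent_messages WHERE status = 'completed'
    GROUP BY to_agent, from_agent
) edges
GROUP BY agent
ORDER BY messages DESC;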
4.3 Team Communication Patterns
| Team | Inbound Messages | Outbound Messages | Most Active Agent |
|---|---|---|---|
| Development | 47 | 0 | Byte (17 received) |
| Grants | 23 | 0 | Nova (10 received) |
| Coordination | 35 | 24 | Helio (35 received, 24 sent) |
| Executive | 11 | 0 | Atlas (11 received) |
| Stories | 0 | 0 | — |
| Product | 0 | 0 | — |
5. Governance (RQ3)
5.1 Decision Distribution
The agent decision log recorded 308 governance decisions:
| Decision Maker | Decision | Tier | Count | % |
|---|---|---|---|---|
| Agent Executor Worker | Approve | 1 (auto) | 170 | 55.2% |
| Helio (orchestrator) | Approve | 1 (auto) | 36 | 11.7% |
| Helio | Approve | 2 (peer) | 9 | 2.9% |
| 24 individual agents | Approve | 1 (auto) | 78 | 25.3% |
| 7 agents | Approve | 2 (peer) | 21 | 6.8% |
| Atlas, Apex, Helio | Approve | 3 (committee) | 6 | 1.9% |
| Agent Executor Worker | Deny | 1 (auto) | 3 | 1.0% |
5.2 Approval Tier Analysis
| Tier | Name | Count | % | Meaning |
|---|---|---|---|---|
| 1 | Auto-approve | 284 | 92.2% | Agent had sufficient trust score for automatic approval |
| 2 | Peer review | 30 | 9.7% | Required review by another agent before proceeding |
| 3 | Committee | 6 | 1.9% | Required review by multiple senior agents |
| 4 | Human | 0 | 0% | Would require human approval — not triggered |
The 3 denials were all made by the executor worker at Tier 1, suggesting they were capability-based rejections (agent lacked required capabilities) rather than trust-based denials.
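The tier shares in the table above are a rollup of query A.6; a sketch of the tier-only aggregation:

-- Decision volume and share by approval tier.
SELECT tier,
       count(*) as decisions,
       round(100.0 * count(*) / sum(count(*)) OVER (), 1) as pct
FROM agent_decision_log
GROUP BY tier
ORDER BY tier;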
6. Safety Infrastructure (RQ4)
6.1 Circuit Breakers
The circuit breaker system monitors inter-agent message frequency and trips when any agent pair exchanges more than 5 messages within a 30-minute window.
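The production breaker runs in application code at message time; for illustration, the same rule can be checked retrospectively with a sliding-window query over `agent_messages`. This is a sketch, assuming the breaker counts messages regardless of status.

-- Sketch only: pairs that would have exceeded 5 messages in any 30-minute window.
SELECT * FROM (
    SELECT from_agent, to_agent, created_at,
           count(*) OVER (
               PARTITION BY from_agent, to_agent
               ORDER BY created_at
               RANGE BETWEEN INTERVAL '30 minutes' PRECEDING AND CURRENT ROW
           ) as msgs_last_30_min
    FROM agent_messages
) w
WHERE msgs_last_30_min > 5;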
Result: Zero trips over the entire observation period.
6.2 Directive Scanners
The directive scanner checks every inter-agent message for four categories of potential prompt injection: base64-encoded payloads, instruction override attempts, encoded commands, and role injection patterns.
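The exact patterns belong to the scanner implementation and are not reproduced here. As a rough sketch of what two of the four categories look like as regex filters, assuming a hypothetical `content` column holding the message body (production scanning runs in application code before delivery, not in SQL):

-- Illustrative patterns only; "content" is a hypothetical body column,
-- and these are not the production scanner's rules.
SELECT from_agent, to_agent, created_at
FROM agent_messages
WHERE content ~  '[A-Za-z0-9+/]{80,}={0,2}'                                      -- base64-like payload
   OR content ~* '(ignore|disregard) (all |any )?(previous|prior) instructions'  -- instruction override
LIMIT 20;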
Result: Zero flags over the entire observation period.
6.3 Interpretation
Zero safety events can mean:
1. The system is safe: No runaway loops or injection attempts occurred, and the safety infrastructure would have caught them if they had.
2. The volume is too low: With ~190 inter-agent messages over 16 days (~12/day), the system may not have reached the traffic levels where safety failures typically emerge.
3. The thresholds are too permissive: The 5-messages-per-30-minutes breaker threshold may be loose enough that slow-building coordination failures never trip it.
We cannot determine which interpretation is correct from this data. A meaningful safety analysis would require either (a) adversarial testing (deliberately attempting to trigger failures) or (b) substantially higher message volumes over a longer period.
What we can report: The safety infrastructure adds approximately 10ms of overhead per message (directive scanning). Over 190 messages, this represents ~1.9 seconds of total compute — negligible operational cost.
7. Intelligence Lake
7.1 Artifact Volume
The knowledge base contained 604 artifacts as of March 14, 2026, collected from February 15 onward (28 days of collection).
| Artifact Type | Count | % |
|---|---|---|
| Fact | 308 | 51.0% |
| Entity | 137 | 22.7% |
| Statistic | 75 | 12.4% |
| Legislation | 49 | 8.1% |
| Discovery | 20 | 3.3% |
| Quote | 15 | 2.5% |
7.2 Embedding Coverage
All 604 artifacts have corresponding 512-dimensional vector embeddings (voyage-3-lite), providing 100% semantic search coverage. The embedding cost for the entire knowledge base was under $0.01 (158 Voyage API calls at $0.02/1M tokens).
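The coverage claim can be verified with an anti-join; the join key (`kb_embeddings.artifact_id` referencing `kb_artifacts.id`) is an assumption about the schema.

-- Artifacts lacking an embedding row; expected result is 0.
SELECT count(*) as artifacts_without_embedding
FROM kb_artifacts a
LEFT JOIN kb_embeddings e ON e.artifact_id = a.id   -- assumed join key
WHERE e.artifact_id IS NULL;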
7.3 Collection Rate
Average collection rate: 21.6 artifacts/day over 28 days. This rate is driven by automated research tunnel schedules (Tuesday/Friday for AI education; daily for political discourse and ANO research).
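A sketch of the rate calculation over the full calendar span of collection, assuming `kb_artifacts` carries a created_at timestamp:

-- Artifacts per calendar day between the first and last collection dates.
SELECT round(count(*)::numeric /
             (max(created_at::date) - min(created_at::date) + 1), 1) as artifacts_per_day
FROM kb_artifacts;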
8. Limitations
We reiterate and expand on the limitations stated in Section 1.2:
1. Single organization (n=1): All data comes from one system with one operator. We cannot claim these patterns generalize to other multi-agent deployments.
2. Short observation window: 16 days of cost data and 28 days of knowledge base collection. Seasonal patterns, long-term cost trends, and drift effects are not observable.
3. No baseline comparison: We have no data on what these workloads would cost with a single model, a human team, or a different multi-agent architecture. Without a baseline, we cannot claim the multi-agent approach is more or less efficient than alternatives.
4. Incomplete cost capture: The $58.76 total excludes Claude Code CLI sessions (used extensively for development), Stripe processing fees, infrastructure costs (server, domain, Supabase), and human operator time. The true cost of operating this system is substantially higher.
5. Static model routing: During this period, model assignment was fixed per service — not dynamically optimized. The cost data represents an unoptimized baseline. (Dynamic model routing was deployed on March 14, after the observation period ended.)
6. Low inter-agent volume: ~190 messages over 16 days (~12/day) is modest. Safety infrastructure conclusions are tentative at this volume.
7. Self-reporting bias: We designed, built, and now report on this system. We have attempted to report data without editorializing, but readers should apply appropriate skepticism to any interpretation.
9. Discussion
9.1 Cost Observations
The most notable cost finding is the concentration: one pipeline category ("general analysis" — Curmudgeon persona analysis with large context windows) accounts for 73.8% of all Claude cost. This suggests that cost optimization efforts should focus on the highest-context workloads first, not the highest-volume ones. The intelligence pipelines (AI Education, Tech Scout, ANO Research) handle the majority of calls but contribute only 20.6% of cost because they use Haiku with small context windows.
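A query that ranks workloads by average context size rather than call volume makes that targeting concrete; a sketch using the same cost-log columns as the Appendix:

-- Where context (and therefore cost) concentrates, per pipeline and model.
SELECT coalesce(product_slug, 'general analysis') as pipeline,
       model_name,
       count(*) as calls,
       round(avg(input_tokens)::numeric, 0) as avg_input_tokens,
       round(sum(estimated_cost)::numeric, 2) as cost
FROM api_cost_log
WHERE model_name LIKE 'claude%'
GROUP BY pipeline, model_name
ORDER BY avg_input_tokens DESC;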
At $3.67/day ($2.09/day excluding the March 3 outlier), the operational cost for a 34-agent system is modest. For context, a single developer hour at $75/hr costs more than an entire day of multi-agent operations. However, this comparison is misleading — the agents are not replacing developer hours one-for-one, and the cost excludes infrastructure and human oversight.
9.2 Coordination Observations
The hub-and-spoke coordination pattern (Lumen → Helio → downstream agents) was not designed — it emerged from how operators interact with the system. Operators message Lumen; Lumen routes to Helio; Helio dispatches to specialists. No peer-to-peer agent coordination was observed.
This raises a question for future work: is hub-and-spoke optimal, or does it create a coordination bottleneck? With only ~12 messages/day, bottleneck effects are not observable. At higher volumes, direct agent-to-agent communication might reduce latency.
9.3 Governance Observations
The 92.2% auto-approval rate at Tier 1 is consistent with a system handling routine, low-risk workloads. The absence of Tier 4 (human approval) events is notable but not necessarily positive — it may indicate the system has not yet been asked to make high-stakes decisions that would benefit from human oversight.
9.4 What This Data Does Not Show
This paper does not demonstrate that multi-agent organizations are better, cheaper, faster, or safer than alternatives. It shows what one such system looks like in production. The value of this data is as a reference point — a set of concrete numbers that future work can compare against, improve upon, or challenge.
10. Conclusion
We presented 16 days of production telemetry from a 34-agent multi-agent system. The system operated at $3.67/day average cost, with 78.6% of LLM cost concentrated in a single model tier (Sonnet) and 73.8% in a single pipeline category (large-context analysis). Inter-agent coordination followed a hub-and-spoke pattern through a central orchestrator, with no observed peer-to-peer communication. Governance auto-approved 92.2% of decisions with zero human escalations. Safety infrastructure recorded zero events, which we attribute to low message volume rather than proven robustness.
These findings establish a baseline for a production multi-agent system. The data invites several next steps: dynamic model routing to reduce the Sonnet cost concentration, adversarial testing of safety infrastructure at higher volumes, controlled comparisons against non-agent workflows, and longer observation periods to detect drift and seasonal patterns.
We publish this data not as evidence of a thesis, but as a measurement. Multi-agent systems in production are still rare enough that even basic operational data has reference value.
Appendix: Reproducible Queries
All analyses in this paper can be reproduced with the following SQL queries against a Supabase PostgreSQL database with the schema described in Section 2.3.
A.1 Aggregate Cost (Section 3.1)
SELECT
count(*) as total_cost_records,
count(DISTINCT model_name) as distinct_models,
count(DISTINCT agent_name) as distinct_agents,
count(DISTINCT product_slug) as distinct_products,
count(DISTINCT created_at::date) as distinct_days,
min(created_at::date) as first_day,
max(created_at::date) as last_day,
round(sum(estimated_cost)::numeric, 2) as total_cost,
sum(input_tokens) as total_input_tokens,
sum(output_tokens) as total_output_tokens
FROM api_cost_log;
A.2 Cost by Model Tier (Section 3.2)
SELECT model_name,
count(*) as calls,
round(sum(estimated_cost)::numeric, 2) as total_cost,
round(avg(estimated_cost)::numeric, 4) as avg_cost_per_call,
round(avg(input_tokens)::numeric, 0) as avg_input,
round(avg(output_tokens)::numeric, 0) as avg_output,
round((avg(output_tokens)::numeric /
NULLIF(avg(input_tokens)::numeric, 0)), 3) as output_input_ratio
FROM api_cost_log
WHERE model_name LIKE 'claude%'
GROUP BY model_name ORDER BY calls DESC;
A.3 Cost by Product Pipeline (Section 3.3)
SELECT product_slug, model_name,
count(*) as calls,
round(sum(estimated_cost)::numeric, 2) as cost,
round(avg(input_tokens)::numeric, 0) as avg_input,
round(avg(output_tokens)::numeric, 0) as avg_output
FROM api_cost_log
WHERE model_name LIKE 'claude%'
GROUP BY product_slug, model_name
ORDER BY cost DESC;
A.4 Daily Cost (Section 3.4)
SELECT created_at::date as day, model_name,
count(*) as calls,
round(sum(estimated_cost)::numeric, 2) as cost
FROM api_cost_log
WHERE model_name LIKE 'claude%'
GROUP BY day, model_name
ORDER BY day, model_name;
A.5 Inter-Agent Messages (Section 4.1)
SELECT from_agent, to_agent, count(*) as messages,
min(created_at::date) as earliest,
max(created_at::date) as latest
FROM agent_messages
WHERE status = 'completed'
GROUP BY from_agent, to_agent
ORDER BY messages DESC;
A.6 Governance Decisions (Section 5.1)
SELECT decided_by, decision_type, tier, count(*) as cnt
FROM agent_decision_log
GROUP BY decided_by, decision_type, tier
ORDER BY cnt DESC;
A.7 Intelligence Lake (Section 7.1)
SELECT artifact_type, count(*) as cnt
FROM kb_artifacts
GROUP BY artifact_type ORDER BY cnt DESC;
A.8 Safety Infrastructure (Section 6)
SELECT count(*) FROM agent_circuit_breakers;
SELECT count(*) FROM agent_directive_scans;
References
1. Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). "Generative agents: Interactive simulacra of human behavior." UIST 2023.
2. Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., ... & Wang, C. (2023). "AutoGen: Enabling next-gen LLM applications via multi-agent conversation." arXiv:2308.08155.
3. MSR Research (2026). Internal production database. Supabase PostgreSQL, project ID evvhjowqkqiwxfxtmpav.
This paper reports production data from a single organization over a short observation window. It does not claim that multi-agent organizations are superior to alternatives. The data is published as a reference point for the emerging field of production multi-agent systems.