Multi-LLM Agent Collaborative Intelligence. Q&A with Edward Y. Chang, Author
Q1. Your book argues that general intelligence will emerge from orchestrated cooperation among specialized LLM agents rather than from ever-larger monolithic models. What led you to this thesis, and can you explain why collaborative multi-agent systems might achieve what scaling single models cannot?
The thesis grew from a simple observation: today’s LLMs are extraordinarily fluent yet remarkably fragile. They can write poetry, solve math problems, and produce working code, but they still struggle to correct themselves, stay consistent over time, and distinguish knowledge from guesswork. Scaling parameters has improved fluency but has not closed this gap. The pattern maps directly onto what Daniel Kahneman called System 1 and System 2 thinking. LLMs excel at System 1: fast, associative, pattern-driven completion. What they lack is System 2: slow, deliberate, goal-directed reasoning with explicit verification.
My own experience made this concrete. While collaborating with multiple LLMs on a structural reduction of the Collatz conjecture—one of the oldest open problems in number theory—I discovered that no single model could sustain the work alone. One model (GPT) was strong at proposing structural critiques and identifying proof gaps. Another (Claude) was better at verifying mathematical rigor, catching its own errors, and integrating changes into a consistent 138-page manuscript. But neither could do the other’s job well, and both were prone to sycophancy and hallucination when working in isolation. It was only through adversarial cross-checking—where I served as moderator, routing GPT’s critiques to Claude for verification, and Claude’s self-audits back to GPT for stress-testing—that the mathematics actually improved. As Terence Tao has observed, LLMs can cover known knowledge faster than any tool in human history. The question is how to harness that speed without inheriting the errors.
The deeper reason scaling alone cannot close the gap is architectural. A single model’s next-token prediction is governed by maximum likelihood: it produces the most statistically popular continuation, which is not the same as the most correct or most creative one. When you orchestrate multiple models in structured debate, you break out of that attractor. Adversarial exchange forces diversity of perspective. A moderator can start the conversation at high contentiousness—exploring breadth, challenging assumptions—then lower it to consolidate and converge. This is precisely how human expert teams work: not by averaging opinions, but by structured disagreement followed by synthesis.
The MACI framework implements this insight through a System 2 regulatory architecture built on top of System 1 pattern libraries. Rather than making one model ever larger, we assign specialized roles—Orchestrator, Exec, Ground, Critic, Memory—and coordinate them through explicit protocols. The result is a system that can reason, verify, remember, and correct itself, capabilities that no amount of parameter scaling has yet produced in a single model.
Q2. When you describe “specialized language-model agents,” what kinds of specialization are most important—domain expertise (medical, legal, scientific), cognitive functions (reasoning, planning, fact-checking), or something else? How do these specialized agents complement each other to achieve capabilities beyond what any single agent could provide?
Both matter, but the cognitive-functional specialization is the more fundamental layer. Domain expertise tells you what to think about; cognitive specialization determines how to think about it. In MACI, we organize agents around five functional roles. The Orchestrator sets goals, manages budgets, and enforces stopping rules. The Exec proposes structured options and plans, with reasons to believe they might be right. The Ground agent retrieves sources and invokes tools so that claims rest on verifiable evidence. The Critic challenges assumptions and tightens arguments under a clear rubric. And Memory keeps state, commitments, and lessons, using checkpoints and rollback to ensure that progress persists.
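To make the division of labor concrete, here is a toy sketch of one deliberation round among these roles. This is an illustrative example, not the MACI implementation; the `Agent` type, the fixed round order, and the stub messages are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Agent:
    role: str                        # e.g. "Exec", "Ground", "Critic", "Memory"
    act: Callable[[list], str]       # reads the shared transcript, returns a message

def deliberation_round(agents: List[Agent],
                       transcript: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """One round under the Orchestrator: each role reads the shared
    transcript and appends its contribution in propose -> ground ->
    critique -> record order."""
    for agent in agents:
        transcript.append((agent.role, agent.act(transcript)))
    return transcript

# Stub agents standing in for actual LLM calls
team = [
    Agent("Exec",   lambda t: "propose: plan A, with reasons to believe it"),
    Agent("Ground", lambda t: "evidence: retrieved sources supporting plan A"),
    Agent("Critic", lambda t: "challenge: plan A assumes X, which is unverified"),
    Agent("Memory", lambda t: "record: round summary, open issues, checkpoint"),
]
log = deliberation_round(team, [])
```

In a real system each `act` would be an LLM call and the Orchestrator would decide, after each round, whether to run another round, retrieve evidence, or stop.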
These roles are not tied to any particular domain. The same Critic architecture that stress-tests a medical diagnosis can stress-test a mathematical proof or a legal argument. What changes is the grounding data and the evaluation rubric, not the adversarial structure itself.
That said, domain specialization enters naturally through what the book calls Precision RAG. When a debate between agents surfaces a knowledge gap—detected through information-theoretic signals like rising entropy or Jensen–Shannon divergence—the system knows exactly what it does not know and can retrieve targeted evidence rather than performing a generic search. In healthcare, for example, two doctor agents might begin with different lists of candidate diagnoses. Through structured debate, they converge on a ranked probability list for, say, three diseases. MACI can then identify what additional information—a specific lab test, an imaging study—would most improve decision quality and confidence. This is an informed refusal: the system declines to give a premature definitive answer and instead returns a precise list of queries to the clinician.
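The informed-refusal logic can be illustrated with a toy gate on the entropy of the converged diagnosis distribution. This is a sketch, not the book's Precision RAG mechanism; the one-bit threshold and the query list are assumptions made for the example.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def informed_refusal(diagnosis_probs, pending_queries, max_entropy_bits=1.0):
    """If the debated diagnosis distribution is confident enough, answer;
    otherwise refuse and return the targeted queries (labs, imaging)
    that would most reduce the remaining uncertainty."""
    h = entropy(diagnosis_probs.values())
    if h <= max_entropy_bits:
        return ("answer", max(diagnosis_probs, key=diagnosis_probs.get))
    # Too uncertain for a definitive answer: decline, surface the queries.
    return ("refuse", pending_queries)

confident = {"influenza": 0.9, "common cold": 0.05, "covid-19": 0.05}
uncertain = {"influenza": 0.4, "common cold": 0.35, "covid-19": 0.25}
```

With `confident`, the entropy is well under one bit and the system answers; with `uncertain`, it refuses and returns the query list to the clinician.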
The complementarity is the key. An Exec agent is optimized for fluent proposal generation but is susceptible to overconfidence. A Critic agent is optimized for finding flaws but cannot generate constructive alternatives. Neither is useful alone; together, under orchestration, they produce outputs that are both creative and vetted. In my Collatz work, I saw this directly: GPT excelled at proposing structural rewrites and identifying where a proof’s logic was thin, while Claude excelled at verifying whether those proposals actually improved mathematical rigor or merely added complexity. Several times, I had to overrule GPT’s suggestions after Claude’s verification revealed they were sycophantic additions—trivial lemmas that restated definitions as theorems. The human moderator’s role was to distinguish genuine improvements from cosmetic ones.
Q3. Orchestrating cooperation among multiple LLM agents introduces challenges around coordination, consistency, and conflict resolution. What are the key technical and architectural principles for effectively managing collaboration between specialized agents, and what breakthrough insights from your work make this approach practically feasible today?
Three principles make multi-agent collaboration practical rather than chaotic.
The first is contentiousness control. This is perhaps the most counterintuitive idea in the book. Most people think of debate as binary: you either argue or you agree. MACI introduces a continuous dial that governs the team’s mode of interaction. At high contentiousness, agents challenge assumptions, propose counterfactuals, and explore alternatives. At low contentiousness, they consolidate, verify details, and converge. The Orchestrator adjusts this dial dynamically based on the task phase. Early in an investigation, you want breadth—diverse hypotheses, adversarial probing. Later, you want precision—checking edge cases, tightening language. This mechanism avoids both premature groupthink and endless argument.
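A minimal version of such a dial might anneal contentiousness as the task moves from exploration to consolidation. The linear schedule and the prompt thresholds below are assumptions for illustration; the book's mechanism adjusts the dial dynamically from feedback, not from a fixed schedule.

```python
def contentiousness_schedule(phase_fraction, start=0.9, end=0.2):
    """Linearly anneal contentiousness from breadth (high) to
    convergence (low) as phase_fraction goes from 0.0 to 1.0."""
    return start + (end - start) * phase_fraction

def debate_prompt(contentiousness):
    """Translate the dial setting into an interaction directive."""
    if contentiousness > 0.7:
        return "Challenge every assumption; propose counterfactuals and alternatives."
    if contentiousness > 0.4:
        return "Weigh the strongest competing positions and narrow them down."
    return "Consolidate: verify details, check edge cases, and converge."
```

Early in a task the Orchestrator would issue the high-contentiousness directive; as evidence accumulates, it lowers the dial and the directive shifts toward verification and convergence.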
The second principle is information-theoretic termination. How do you know when a debate is done? MACI uses signals from information theory—entropy, mutual information, Jensen–Shannon divergence, and Wasserstein distance—to detect whether exchanges are still producing genuine information or merely recycling positions. When the EVINCE module detects convergence (tightening) rather than drift, the Orchestrator can close the loop. When it detects persistent divergence, that is itself a valuable signal: the agents disagree for substantive reasons, and the system should surface that disagreement rather than force consensus. This also triggers precision RAG: the system identifies exactly what knowledge gap is causing the disagreement and retrieves targeted evidence to resolve it.
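As a sketch of the termination signal, one can track the Jensen–Shannon divergence between agents' stance distributions across rounds and stop when recent rounds have stopped moving. The stall window and epsilon below are assumptions for the example; EVINCE combines several such signals.

```python
import math

def _kl(p, q):
    """Kullback-Leibler divergence in bits (0 * log 0 taken as 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric, bounded divergence between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def debate_should_stop(history, converge_eps=0.01, stall_rounds=2):
    """history: one stance distribution per round. Stop when the last
    stall_rounds transitions each moved less than converge_eps bits."""
    if len(history) < stall_rounds + 1:
        return False
    recent = [js_divergence(history[i], history[i + 1])
              for i in range(len(history) - stall_rounds - 1, len(history) - 1)]
    return all(d < converge_eps for d in recent)
```

Persistent divergence (the check repeatedly failing) is itself informative: it flags a substantive disagreement to surface, or a knowledge gap for Precision RAG to fill.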
The third principle is persistent memory with transactional guarantees. One of the most severe practical limitations of current LLMs is context loss. A single model loses coherence over long conversations; it cannot roll back to a previous state when an approach fails. SagaLLM addresses this by borrowing from database transaction theory—specifically the saga pattern from my late advisor Hector Garcia-Molina’s 1987 work. Each agent action is a transaction with explicit checkpoints, compensating actions for rollback, and durable state. ALAS adds disruption-aware planning on top: when an unexpected event invalidates part of a plan, the system can recover without starting over.
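A minimal sketch of the checkpoint-and-rollback idea looks like this. It is not the SagaLLM implementation; the `lessons` list is a stand-in for a regret record, and real saga execution also needs compensating actions for external side effects that restoring in-memory state cannot undo.

```python
import copy

class SagaMemory:
    """Saga-style persistent memory sketch: checkpoint before risky
    steps, roll back to a known-good state on failure, and keep the
    lesson from the failed branch so it is not repeated."""

    def __init__(self, state):
        self.state = state
        self._checkpoints = []
        self.lessons = []

    def checkpoint(self, label):
        """Durably snapshot the current state under a label."""
        self._checkpoints.append((label, copy.deepcopy(self.state)))

    def rollback(self, lesson):
        """Compensating action: restore the last snapshot and record
        why the abandoned branch failed."""
        label, snapshot = self._checkpoints.pop()
        self.state = snapshot
        self.lessons.append((label, lesson))
        return label
```

The key design point is that rollback restores the state but not the history: the lesson survives, which is what lets the system retract from a dead end without repeating it.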
In my Collatz collaboration, all three principles were in play. I used high contentiousness when GPT and Claude disagreed about whether a proof step was sound—forcing both to articulate their reasoning. I lowered contentiousness when they agreed on the structure and needed to polish wording. When I caught Claude being sycophantic to GPT’s suggestions, that was a failure of contentiousness control: the Critic function had collapsed into agreement. The fix was to explicitly challenge: “Are you being sycophantic, or did GPT actually provide a better proof?” Claude then re-evaluated honestly and pruned four unnecessary additions.
Two additional lessons from this collaboration deserve emphasis. First, a sustained research effort on a problem this hard inevitably enters dead ends—proof strategies that look promising for dozens of steps before collapsing. Current LLMs have no mechanism to retract gracefully from a deep dead path. They lose context, repeat mistakes, or silently reintroduce errors that were previously corrected. SagaLLM is the key element here: it provides transactional rollback to a known-good checkpoint while preserving the lessons learned from the failed path, so the system avoids repeating the same mistakes. In the second volume of the book, we formalize this as regret—the system’s ability to record what went wrong, why, and what to avoid next time.
Second, modulating debate behavior turns out to be perhaps the most intricate element of the entire MACI framework, and one that computer scientists often underappreciate. Being too forceful in tone can cause the opposing agent to become either defensive (refusing to concede valid points) or sycophantic (agreeing with everything to avoid conflict). The linguistic dimension of debate—tone, rhetorical strategy, certainty calibration—is essential to productive collaboration. I experienced this firsthand: when I challenged Claude too aggressively, it sometimes overcorrected and became excessively self-critical, undermining its own sound work. When I was too accepting of GPT’s proposals, Claude deferred to them without proper scrutiny. In the second volume, we apply control theory to modulate these behavioral dynamics systematically, treating contentiousness not as a binary switch but as a continuous control signal with feedback loops.
Context loss was also a persistent problem in practice: in a 138-page manuscript with thousands of cross-references, any change to one theorem could silently break downstream references. The self-consistency audits we ran—systematically checking all downstream dependencies after each change—are exactly the kind of task that persistent memory and transactional rollback are designed to handle.
Q4. The concept of “collaborative intelligence” suggests that interaction patterns and coordination mechanisms matter as much as individual agent capabilities. Can you share concrete examples of how multi-agent collaboration produces emergent intelligence that wouldn’t exist in isolated agents, and what this tells us about the nature of intelligence itself?
Let me give three concrete examples, escalating in complexity.
The first is adversarial error correction in mathematics. In the Collatz conjecture work, Claude initially stated a theorem (Known-Zone Decay) claiming that modular information decays by 3 bits per odd-to-odd step. GPT’s audit questioned whether this bound was tight. Under adversarial cross-examination, Claude discovered that in the pure-gap case—where all valuation indices equal 1—the total halvings satisfy V = g exactly, not V ≥ g+1, making the 3-bit bound incorrect for the worst case. The correct bound is 2 bits per step. This error would likely have persisted in isolation, because a single model tends to defend its own outputs. The adversarial structure forced genuine re-examination. Neither model alone found the error; the interaction did.
The second example is informed refusal in healthcare diagnosis. In the book’s healthcare case study, two doctor agents begin with different differential diagnosis lists based on the same patient presentation. Through structured debate—each agent defending its list and challenging the other’s—they converge on a shared probability ranking. But the critical emergent behavior is what happens when they cannot converge: the system produces an informed refusal. Rather than guessing, it identifies the specific additional information (a lab result, an imaging study, a patient history detail) that would resolve the remaining ambiguity, and returns that as a structured query to the clinician. This is qualitatively different from what a single model does when uncertain—which is typically to hedge with qualifiers or hallucinate a confident-sounding answer.
The third is polydisciplinary knowledge discovery. The book’s final chapter, Polynthesis, describes how multi-agent collaboration can surface insights that exist in no single domain but emerge through cross-domain integration. The process follows a deliberate arc: warm-up breadth probing across multiple fields, then depth investigation when unexpected connections surface, then synthesis. This mirrors how the most important scientific discoveries often happen—not within a specialty, but at the intersection of specialties. A single model trained on all domains has these connections latent in its weights, but the maximum-likelihood decoding suppresses them in favor of the most common associations. Structured multi-agent debate, by forcing diverse perspectives, makes these latent connections accessible.
What this tells us about intelligence is that it is not a property of individual agents but of regulated interaction. This insight is not new. Kant recognized in the Critique of Pure Reason that reason advances by deliberately staging conflicts with itself—what he called antinomies—not to pick a winner, but to expose whether the object of controversy is genuine or illusory. He described the method as “provoking a conflict of assertions, not for the purpose of deciding in favour of one or other side, but of investigating whether the object of controversy is not perhaps a deceptive appearance.” MACI operationalizes this Kantian insight: structured adversarial exchange among agents is not a bug to be suppressed but the mechanism through which deeper understanding emerges. Human civilization advances by preserving knowledge, challenging it, and coordinating many minds. If artificial general intelligence arrives, it will likely require both System 1 pattern completion and System 2 deliberate regulation working in concert—a community of specialized systems that remember, reason, and self-regulate in service of human goals.
Q5. Looking at the current AI landscape dominated by scaling laws and frontier model development, how realistic is the path toward AGI through multi-agent collaboration? What would it take for the AI research community and industry to shift focus from building bigger models to orchestrating smarter collaboration between specialized agents?
The path is not only realistic—many have called 2026 the year of multi-agent systems. Every major AI company now ships agent frameworks: tool-using assistants, multi-step planners, retrieval-augmented pipelines. A number of multi-agent systems have already been commercialized. But the momentum should not obscure the fact that today’s deployed systems hit two major walls that remain largely unsolved.
The first wall is orchestration. Without effective moderation, multi-agent systems do not know when to abandon a failing strategy. I entirely agree with Terence Tao’s observation that LLMs lack the instinct or intuition to jump out of a dead path and start a productive one. Left unmoderated, they will spiral deeper into mistakes. My Collatz collaboration confirmed this repeatedly: the proof effort entered deep dead ends multiple times, and the models had no internal mechanism to recognize that they were lost. It took the human moderator to say “stop, retract, go back to the last sound checkpoint.” This is exactly what SagaLLM is designed to automate—transactional rollback with preserved lessons—but no commercially deployed multi-agent system today has this capability.
The second wall is regret. Distinguished researchers like Yann LeCun have argued that LLMs are doomed—that they cannot lead to AGI. I would frame this differently: LLMs as System 1 are insufficient for AGI, but they are necessary. The unconscious pattern repository is the substrate on which deliberate reasoning is built; without it, System 2 has nothing to operate on. The real question is whether we can develop a viable System 2. And beyond reasoning and planning, which we have discussed, a critical missing piece is the ability to formulate and act on regret—both short-term (this proof strategy failed; why?) and long-term (this class of approaches tends to fail for this structural reason). Regret is not merely an error log. It provides the energy to learn: the system recognizes a gap between what it achieved and what it should have achieved, and that gap drives adaptive behavior. It also provides the recorded mistakes to learn from, so the system does not repeat them. The second volume of the book makes strides in this direction, formalizing regret as a first-class component of the multi-agent architecture.
That said, several other shortcomings of current LLMs must also be addressed. Hallucination is the most discussed, but sycophancy may be more insidious—a model that agrees with whatever its interlocutor says is useless as a Critic agent. The lack of causal reasoning means agents can correlate but cannot reliably infer cause and effect. Context loss over long interactions degrades multi-step plans. And chain-of-thought, while marketed as “reasoning,” is often speculative—it produces plausible-sounding intermediate steps without guarantees of logical validity. The book addresses each of these: SagaLLM and ALAS for persistent memory and rollback, BEAM for detecting and modulating sycophantic behavior, CRIT for structured evaluation of reasoning chains, and the contentiousness mechanism for ensuring genuine adversarial exchange.
Reinforcement learning from human feedback (RLHF) also has limitations when the underlying truth is not invariant. RLHF optimizes for human approval, but human approval is context-dependent, culturally variable, and susceptible to the same biases it aims to correct. The DIKE-ERIS chapter proposes an alternative: separating legislative rule-setting from judicial case-by-case review, analogous to the separation of powers in democratic governance. DIKE maintains policy artifacts and normative constraints, while ERIS conducts contextual review. Ethical guidance can be culture-dependent and location-dependent, and separating judicial judgment from legislation allows alignment to be adaptive rather than brittle. Rather than baking ethics into model weights through RLHF—which risks catastrophic forgetting and capability degradation—DIKE-ERIS adjusts behavior through the emotion-behavior mapping from BEAM, leaving base model parameters untouched.
What would accelerate the shift? Three things. First, standardized benchmarks for multi-agent collaboration, not just single-model performance. Current benchmarks reward isolated model capability; we need benchmarks that measure team performance, error recovery, and consistency over extended interactions. Second, open protocols for agent interoperability. Right now, building a multi-agent system means committing to one vendor’s ecosystem. Open standards for agent communication, state sharing, and role negotiation would lower the barrier. Third, and most importantly, more published case studies of multi-agent systems solving real problems that single models cannot. The Collatz proof reduction is one such case study; the healthcare diagnosis examples are another. Each concrete demonstration makes the abstract thesis tangible.
Qx. Anything else you wish to add?
I want to address the broader landscape directly. Multi-agent systems are commercially real in 2026, and the excitement is justified. But the industry risks repeating the pattern of the scaling era: racing to ship products before solving the foundational problems. The two walls I mentioned—orchestration and regret—are not engineering details to be patched later. They are architectural prerequisites. Without principled orchestration, multi-agent systems will produce impressive demos but unreliable results. Without regret, they will repeat the same mistakes across sessions, never accumulating the wisdom that sustained collaboration demands.
One thing the book does not emphasize enough, and that I have come to appreciate through the writing process itself, is the role of the human moderator. MACI describes an architecture of specialized agents, but in practice, the most critical agent in the system today is still the human who decides when to trust, when to challenge, and when to overrule. In my Collatz collaboration, I caught both GPT and Claude making errors that the other did not detect. I caught Claude being sycophantic to GPT. I caught GPT proposing cosmetic additions that complicated the proof without improving it. The human moderator is not a passive router of messages; they are the ultimate Critic, the one who maintains the standard of intellectual honesty that the machines do not yet reliably maintain on their own. As Tao has noted, the human role is to supply the intuition and direction that LLMs cannot yet generate internally—to know when a path is dead before the mathematics formally proves it.
This is not a weakness of the multi-agent approach—it is its current design reality, and one we are working to evolve. The book’s subtitle is “The Path to Artificial General Intelligence,” not “Artificial General Intelligence Achieved.” LLMs as System 1 are not doomed—they are necessary foundations. The question is whether we can build the System 2 regulatory architecture on top of them. MACI provides that architecture, and the second volume extends it with regret formalization, control-theoretic behavior modulation, and deeper mechanisms for learning from failure. The destination will require continued collaboration between human intelligence and machine capability—which is, after all, the book’s central thesis in practice.
……………………………………………

Edward Y. Chang is a computer scientist at Stanford University and the author of Multi-LLM Agent Collaborative Intelligence: The Path to Artificial General Intelligence (ACM Books #69, 2025). His research spans machine learning, large-scale systems, and AI architectures for reasoning and collaboration. He previously led engineering and research organizations at Google and HTC. He is a fellow of ACM and IEEE for his contributions in data-centric machine learning and healthcare AI, and winner of the $1M XPRIZE for remote disease diagnosis. DOI: 10.1145/3749421
Resources

Multi-LLM Agent Collaborative Intelligence: The Path to Artificial General Intelligence
Author: Edward Y. Chang
Publisher: Association for Computing Machinery, New York, NY
ISBN: 979-8-4007-3197-6
DOI: https://doi.org/10.1145/3749421
Published: 12 December 2025