Microsoft 365 Copilot And The End Of The Single-model Era In Enterprise AI

Steve Gustavson, Microsoft’s corporate vice president for design and research. (Microsoft Photo)

[Editor’s Note: Agents of Transformation is an independent GeekWire series, underwritten by Accenture, exploring the adoption and impact of AI and agents. See coverage of our related event.]

Using an AI model still comes with an unspoken asterisk: Verify before you act. Fact-check it. Google it. Ask a colleague. The burden of accuracy has always landed on the human at the end of the day. But Microsoft thinks it has a way to shift that burden — have two AIs keep tabs on each other.

In an era when workforce tasks are increasingly being handled by AI agents, this multi-model strategy now reaches into something human workers assumed was theirs alone: the judgment call. The human-in-the-loop had long been the one non-negotiable in AI workflows. Microsoft’s approach doesn’t eliminate it, but it does raise the question of how much of that role we’re willing to hand over.

‘Two heads are better than one’

Microsoft isn’t alone in this bet. Amazon Web Services, Google, and others are building platforms that give enterprises access to multiple models through a single interface.

Amazon Bedrock, the AWS service, offers access to foundation models from multiple providers, while Google’s Gemini Enterprise presents a single front door for workplace AI. Microsoft’s distinction is that it’s embedding multi-model review directly into a productivity tool used by millions of workers.

We saw the first implementation of this plan last week with new upgrades to Microsoft 365 Copilot. Its Researcher agent can now use OpenAI’s GPT to draft a response, then have Anthropic’s Claude review it for accuracy, completeness, and citation quality before finalizing it.

“We intentionally want a diversity of opinions,” Steve Gustavson, Microsoft’s corporate vice president for design and research, told GeekWire in an interview. “Two heads are better than one when they come together.”

How much judgment to hand over to AI is not a trivial concern. Research has already shown that AI users tend to outsource critical thinking to models they perceive as authoritative. If we’re already surrendering judgment to a single model, can having a second one push back on the first be the check that’s been missing?

It’s a question Microsoft has been wrestling with in designing Critique and Council, the two new features within its Researcher agent.

“Our research consistently shows that workers continue to crave both deeper trust in AI and quality content,” Gustavson said. “People are either over-trusting AI — accepting claims they shouldn’t — or under-trusting it and not getting the full value. Both are design and technical opportunities.”

Take Microsoft’s Critique feature, for example. Gustavson said Microsoft designed it around a deliberate handoff: GPT leads the generation, and Claude steps in as the reviewer.

“The separation matters because evaluation is a different cognitive mode than generation,” he said. “When one model does both, you get the same blind spots twice. When a second model’s job is to validate the first, you get something structurally different.”
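For readers building similar workflows, the handoff Gustavson describes can be sketched roughly in code. This is a minimal illustration under stated assumptions, not Microsoft’s implementation: the `critique_loop` function, the `Review` type, the round limit, and the toy stand-ins for the two models are all hypothetical, and a real system would call actual model APIs where the stand-ins appear.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    """Hypothetical reviewer verdict: approval plus a list of issues."""
    approved: bool
    issues: List[str]


# Assumed shapes: a generator takes the prompt and prior feedback;
# a reviewer takes the prompt and a draft and returns a Review.
Generator = Callable[[str, List[str]], str]
Reviewer = Callable[[str, str], Review]


def critique_loop(prompt: str, generate: Generator, review: Reviewer,
                  max_rounds: int = 2) -> str:
    """One model drafts, a different model reviews, and the draft is
    revised until it is approved or the round limit is reached."""
    feedback: List[str] = []
    draft = generate(prompt, feedback)
    for _ in range(max_rounds):
        result = review(prompt, draft)
        if result.approved:
            break
        # Feed the reviewer's issues back to the generator for a revision.
        feedback = result.issues
        draft = generate(prompt, feedback)
    return draft


# Toy stand-ins so the sketch runs without any model API.
def toy_generator(prompt: str, feedback: List[str]) -> str:
    if "missing citation" in feedback:
        return f"Answer to {prompt!r} [source: 1]"
    return f"Answer to {prompt!r}"


def toy_reviewer(prompt: str, draft: str) -> Review:
    if "[source:" in draft:
        return Review(approved=True, issues=[])
    return Review(approved=False, issues=["missing citation"])


if __name__ == "__main__":
    print(critique_loop("Summarize Q3 sales", toy_generator, toy_reviewer))
```

The point of the structure is that the reviewer never writes the answer itself. It only returns a verdict and a list of issues, which keeps generation and evaluation in separate roles.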

This creates a “powerful feedback loop that delivers higher-quality results across factual accuracy, analytical breadth, and presentation,” Gaurav Anand, Microsoft’s corporate vice president for engineering, wrote in a technical blog post about M365’s Critique feature.

Multi-model isn’t just a proof of concept — it’s live, and it’s already the default experience inside Researcher. But Gustavson is quick to point out that most workers won’t care which models are running under the hood. The models, in his view, should be invisible.

“The average user wants phenomenal outputs. They want to be able to trust them,” he said. “Do they need to know it’s 5.2 versus whatever? I don’t think so.”

Gustavson disputes that this is a case of the “blind leading the blind,” stressing that tuning the models is how to avoid hallucinations. With Researcher, “Claude has proven to be a fantastic synthesizer and sort of check on what the GPT models might be doing.”

However, Gustavson said Microsoft is continuously evaluating the performance of single models versus double models, as well as putting “an LLM judge in between the two” to see the trade-offs.

Gustavson said Microsoft plans to move away from promoting specific model names altogether, shifting the focus to what a worker is trying to accomplish. For example, he said, workers could specify that they’re in finance, and Copilot would route work to whichever models best handle Excel, data synthesis, and analysis — no model-picking required.
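The article does not describe how such routing would work internally. As a rough sketch of the idea only, a task-based router could be as simple as a lookup table. The task labels, the placeholder model names, and the `route` function below are assumptions made for illustration, not anything Microsoft has described.

```python
from typing import Dict

# Hypothetical mapping from task type to a placeholder model name.
ROUTES: Dict[str, str] = {
    "spreadsheet": "model-a",
    "synthesis": "model-b",
    "analysis": "model-b",
}
DEFAULT_MODEL = "model-a"  # assumed fallback when no route matches


def route(task_type: str) -> str:
    """Pick a model by what the worker is doing, not by model name."""
    return ROUTES.get(task_type.strip().lower(), DEFAULT_MODEL)


if __name__ == "__main__":
    for task in ("Spreadsheet", "synthesis", "email"):
        print(task, "->", route(task))
```

In this sketch the worker only ever supplies the task, and the model choice stays behind the lookup.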

The enterprise AI pendulum

For Microsoft, multi-model is less of a feature than the inevitable direction of enterprise AI. Gustavson calls it a natural progression, noting that Copilot started out with a single model.

Since then, he said, the industry has been swinging between what models can do, what the product experience should be, and where the competitive moat exists.

“I think this is just a natural evolution,” he said. “Two models are better than one.”

With models leapfrogging each other every few months, Microsoft isn’t betting on any single one, but rather trying to build something that outlasts them all.

As organizations move from experimenting with AI to depending on it for consequential decisions, the single-model approach starts to show its limits. The question may be less whether enterprises should adopt multi-model than whether they’re ready to accept a system where checks are automated, models are invisible, and AI reviews AI before a human ever sees the output.

Beyond the initial integration into the Researcher agent, Gustavson said Microsoft plans to extend the multi-model approach to its other AI tools. He hopes the approach becomes standard across the industry. In his view, building multi-model review into agentic workflows is both good governance and good design.

For those building agentic experiences, Gustavson’s advice is simple: treat agents like any process with meaningful consequences. The key question: “Who checks the work?”