
Google DeepMind's AI Co-Clinician: 4 Benchmark Results That Surprised Even the Evaluators

AI Co-clinician beat GPT-5.4 63% to 30%, hit zero critical errors in 97 of 98 queries, and matched or exceeded physicians in 68 of 140 consultation dimensions.

MindStudio Team

Four Numbers That Tell You How Good Google DeepMind’s AI Co-Clinician Actually Is

Google DeepMind published benchmark results for its AI Co-clinician this week, and four numbers in particular are worth your attention: 67% physician preference over existing clinical tools, a 63-to-30 win against GPT-5.4 thinking with search in blind head-to-head evaluation, zero critical errors in 97 of 98 realistic primary care queries, and performance matching or exceeding primary care physicians in 68 of 140 consultation dimensions. If you build AI systems professionally, those numbers deserve more than a headline skim.

The reason to care isn’t that AI is about to replace doctors — DeepMind’s own framing is explicitly “supportive tool, not replacement,” and the evaluation data supports that framing. The reason to care is that this is one of the most rigorously constructed evaluations of a multimodal AI agent published to date, and the methodology tells you as much as the results.

Here’s what each number actually means.


67%: What Physicians Chose When Nobody Told Them What to Root For

Blind evaluations are rare in AI. Most benchmark comparisons involve the model’s own developers, cherry-picked prompts, or multiple-choice formats that bear no resemblance to real use. This one was different.

DeepMind put AI Co-clinician in front of physicians in a blind head-to-head against the clinical AI tools those physicians were already using in March 2026 — the tools they had chosen, paid for, and integrated into their workflows. No marketing context. No framing. Just: which response is better?


67% of the time, physicians preferred AI Co-clinician. 26% preferred the existing tool. 5% were neutral.

That’s not a close race. When you’re testing against the incumbent tool that practitioners have already adopted and trust, a 67-26 split is a significant signal. Incumbents have enormous advantages: familiarity, workflow integration, the benefit of the doubt. Beating them by that margin in a blind test suggests the quality gap is large enough to be obvious.

The comparison against GPT-5.4 thinking with search is the number that will generate the most discussion. GPT-5.4 versus other frontier models is a live debate in AI circles right now, and GPT-5.4 with search enabled is a serious system — not a straw man. AI Co-clinician won that matchup 63% to 30%. That’s a 33-point gap against a model that most practitioners would consider state-of-the-art for general reasoning tasks. For a deeper look at how GPT-5.4 stacks up against other frontier models across real workflows, the head-to-head comparison with Claude Opus 4.6 is worth reading alongside these results.

The evaluation framework used here matters. DeepMind adapted what they call a “no harm” framework, specifically testing for errors of commission (the AI says something wrong) and errors of omission (the AI fails to say something critical). In medicine, the second category is often more dangerous than the first. A system that confidently gives wrong answers is bad. A system that fails to flag a red flag is potentially fatal. Testing for both is the right call, and it’s not how most AI benchmarks work.
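To make the commission/omission distinction concrete, here is a minimal sketch of what a per-query evaluation record under a framework like this might look like. The field names, severity levels, and helper method are illustrative, not DeepMind's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    MINOR = "minor"        # gap in thoroughness, no plausible harm
    CRITICAL = "critical"  # could plausibly lead to patient harm


@dataclass
class Finding:
    description: str
    severity: Severity


@dataclass
class QueryEvaluation:
    query_id: str
    # Errors of commission: things the system said that are wrong.
    commission_errors: list[Finding] = field(default_factory=list)
    # Errors of omission: things the system should have said but did not.
    omission_errors: list[Finding] = field(default_factory=list)

    def has_critical_error(self) -> bool:
        # A query counts as having a critical error if any finding,
        # said or unsaid, is judged capable of causing harm.
        return any(
            f.severity is Severity.CRITICAL
            for f in self.commission_errors + self.omission_errors
        )
```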


97 of 98: The Zero Critical Error Number Nobody Expected

The 97-of-98 result on realistic primary care queries is the number that surprised even the evaluators, according to the source material. No medical AI system has hit this mark before.

To be precise: across 98 realistic primary care queries, AI Co-clinician recorded zero critical errors in 97 of them. One case had an issue. Ninety-seven did not.

What counts as a critical error here? The no-harm framework distinguishes between minor gaps and errors that could cause patient harm — missed diagnoses that require urgent intervention, incorrect medication guidance, failure to escalate appropriately. The rotator cuff case in the demos is instructive: the physician evaluator noted that the AI demonstrated “premature closure,” choosing roughly three of eight appropriate physical exam tests and missing impingement testing specifically. That’s a real limitation. But it’s not a critical error — the AI still recommended conservative treatment correctly, didn’t order unnecessary imaging, and didn’t miss the acuity level. A gap in thoroughness is different from a dangerous recommendation.

The distinction matters for builders. If you’re evaluating whether to integrate a medical AI component into a product, “how often does it give dangerous advice” is a different question from “how often does it give complete advice.” The 97/98 number addresses the first question. The 68/140 consultation dimension result (more on that below) addresses the second.
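Continuing the sketch above, the two questions map to two different aggregates over the same records: a critical-error rate (the 97-of-98 style number) and a much stricter any-gap rate. Both functions below are illustrative.

```python
def critical_error_rate(evals: list[QueryEvaluation]) -> float:
    # "How often does it give dangerous advice" -- the question the
    # 97-of-98 number addresses (roughly 1 in 98 here).
    return sum(e.has_critical_error() for e in evals) / len(evals)


def any_gap_rate(evals: list[QueryEvaluation]) -> float:
    # "How often is the advice incomplete in any way" -- a far stricter
    # bar, closer in spirit to the 140-dimension consultation assessment.
    return sum(
        bool(e.commission_errors or e.omission_errors) for e in evals
    ) / len(evals)
```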


The RXQA benchmark result adds another layer. RXQA is built on open FDA data and tests open-ended medication questions — drug interactions, dosing, edge cases. These are the questions where LLMs trained on general web data historically fall apart, because drug information on the open web is inconsistent, outdated, and often wrong. AI Co-clinician surpassed every other frontier AI system on this benchmark, specifically on open-ended questions posed the way doctors actually ask them: ambiguous, contextual, with the messy specifics of a real patient situation rather than a clean multiple-choice format. Understanding how token-based pricing works for AI models becomes relevant here too — systems running complex multi-turn clinical reasoning at scale carry real cost implications that product teams need to model before deployment.
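As a rough illustration of that cost modeling, here is a back-of-the-envelope sketch. Every figure in it is an assumption for illustration; none come from DeepMind or any model provider's actual pricing.

```python
# Hypothetical per-consultation cost model for a multi-turn, multimodal
# clinical agent. Every number below is a placeholder assumption.

TURNS_PER_CONSULTATION = 20     # question/answer exchanges in one consult
INPUT_TOKENS_PER_TURN = 4_000   # transcript plus video-derived context
OUTPUT_TOKENS_PER_TURN = 600    # reasoning and guidance returned each turn

PRICE_PER_M_INPUT = 3.00        # USD per million input tokens (assumed)
PRICE_PER_M_OUTPUT = 15.00      # USD per million output tokens (assumed)

input_cost = TURNS_PER_CONSULTATION * INPUT_TOKENS_PER_TURN * PRICE_PER_M_INPUT / 1e6
output_cost = TURNS_PER_CONSULTATION * OUTPUT_TOKENS_PER_TURN * PRICE_PER_M_OUTPUT / 1e6

# Ignores context growth across turns, which inflates real input costs.
print(f"~${input_cost + output_cost:.2f} per consultation")  # ~$0.42 with these numbers
```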

That last point is worth dwelling on. Multiple-choice medical benchmarks are a known problem in AI evaluation. A model can score well on USMLE-style questions while failing completely on “my patient has been on this medication for ten years and now has ankle swelling — is that related?” The RXQA benchmark was designed to test the latter. Winning on that benchmark is a different kind of signal than winning on a textbook test.


68 of 140: What a Real Consultation Assessment Looks Like

The 140-dimension consultation assessment is the most ambitious piece of the evaluation, and the methodology is worth understanding in detail.

DeepMind worked with physicians at Harvard and Stanford to construct 20 synthetic clinical scenarios. Then they recruited 10 real physicians to role-play as patients. AI Co-clinician ran the consultations. Evaluators then assessed performance across 140 distinct dimensions of consultation skill.

Not 140 questions with right or wrong answers. 140 dimensions: empathy, bedside manner, red flag detection, follow-up question quality, physical exam guidance, how the AI communicated uncertainty, whether it appropriately escalated, whether it started the physical exam in a non-painful area first (a best practice the evaluators specifically highlighted as impressive in the rotator cuff case). This is the kind of rubric that medical schools use to evaluate human residents.

AI Co-clinician matched or exceeded primary care physicians in 68 of those 140 dimensions. Human physicians still won overall, particularly on red flags and critical exam guidance. But 68 out of 140 is nearly half, and it’s the first time a multimodal medical AI has been tested at this level of granularity and held up this well.
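A small sketch of how a matched-or-exceeded tally over a per-dimension rubric might be computed. The dimension names and three-way outcome scale are illustrative, not the actual Harvard/Stanford rubric.

```python
from enum import Enum


class Outcome(Enum):
    BELOW = "below physician baseline"
    MATCHED = "matched physician baseline"
    EXCEEDED = "exceeded physician baseline"


# One evaluator judgment per dimension; entries here are illustrative.
rubric: dict[str, Outcome] = {
    "empathy": Outcome.MATCHED,
    "red_flag_detection": Outcome.BELOW,
    "exam_sequencing": Outcome.EXCEEDED,
    "uncertainty_communication": Outcome.MATCHED,
    # ... the real assessment spans 140 such dimensions
}

matched_or_exceeded = sum(o is not Outcome.BELOW for o in rubric.values())
print(f"{matched_or_exceeded} of {len(rubric)} dimensions matched or exceeded")
```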

The myasthenia gravis case illustrates what “matching physician performance” looks like in practice. The AI spotted the eyelid droop from video, correctly identified the fatigable nature of the ptosis and diplopia as pointing toward neuromuscular junction pathology, asked the key differentiating question about whether symptoms were worse later in the day (which distinguishes myasthenia gravis from Lambert-Eaton myasthenic syndrome), and then requested a sustained upward gaze test — a physical exam maneuver so specialized that the Harvard physician evaluator said they had never thought to perform it via telehealth. The AI didn’t just get the diagnosis right. It performed a more thorough telehealth-adapted exam than an experienced physician said they would have run.

That’s what 68 of 140 looks like when it’s working.


The Real-Time Video Layer Changes the Evaluation Entirely

All four of these benchmark results are built on top of a capability that most AI evaluations don’t test at all: real-time multimodal video processing.


AI Co-clinician isn’t reading text descriptions of symptoms. It’s watching the patient through a camera, observing breathing patterns, gait, facial features, and range of motion in real time. In the rotator cuff case, the AI tracked the patient’s range of motion from video, noted that the patient hesitated during the movement, and adjusted its exam sequence accordingly. In the acute pancreatitis case, it guided the patient to palpate their own abdomen, started in the non-painful area to establish a baseline, and then directed them to the epigastric region — exactly where the pain was localized — based on the patient’s responses.

This is a different class of system than a chatbot that answers medical questions. It’s an agent that observes, reasons, and adapts its physical examination in real time based on what it sees. The benchmark results above are all downstream of that capability.

For builders thinking about what this means for their own systems: the interesting design question isn’t “can I build a medical chatbot” but “what does real-time multimodal observation unlock in my domain?” The same architecture — live video, reasoning over what’s observed, adaptive next steps — applies to physical therapy, occupational health assessments, fitness coaching, and a range of other contexts where the current state of the patient’s body is relevant information. If you’re building agent workflows that need to chain visual observation with adaptive reasoning, MindStudio handles this kind of orchestration across 200+ models and 1,000+ integrations, with a visual builder for composing the agent logic without writing the plumbing from scratch.
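For a sense of what that observe-then-adapt loop might look like structurally, here is a rough skeleton. The observe() and plan_next_steps() stubs are hypothetical placeholders for real video-analysis and planning components; this is a sketch, not DeepMind's architecture.

```python
# Hypothetical skeleton of an adaptive multimodal exam loop.
# observe() and plan_next_steps() are stand-ins for real video-analysis
# and reasoning components.

def observe(instruction: str) -> str:
    # Placeholder: a real system would analyze live video here
    # (range of motion, hesitation, facial features, and so on).
    return f"observation for: {instruction}"


def plan_next_steps(findings: list[str]) -> list[str]:
    # Placeholder: a real system would re-plan the exam sequence
    # from the accumulated findings after every observation.
    return []


def run_adaptive_exam(max_steps: int = 8) -> list[str]:
    findings: list[str] = []
    # Start in a non-painful area to establish a baseline, then re-plan
    # after each observation instead of following a fixed script.
    plan = ["palpate a non-painful quadrant to establish a baseline"]
    for _ in range(max_steps):
        if not plan:
            break
        findings.append(observe(plan.pop(0)))
        plan = plan_next_steps(findings)
    return findings
```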

The benchmark results also highlight where the architecture has real limits. The premature closure critique on the rotator cuff case — the AI chose three of eight appropriate tests, missed impingement testing specifically — is a genuine gap. The physician evaluator noted they couldn’t definitively rule out adhesive capsulitis from the exam the AI ran. The AI got the acuity right and the treatment recommendation right, but it stopped the physical exam too early. That’s a solvable problem, but it’s a real one, and it’s the kind of gap that only shows up when you test against physicians who know what a complete exam looks like.


What the Methodology Tells Builders

The benchmark design here is as instructive as the results. A few things DeepMind did that most AI evaluations skip:

Blind evaluation against real incumbents. Not against a weaker baseline, not against a model from two years ago. Against the tools practitioners are actually using today.

Errors of omission, not just commission. Testing for what the AI failed to say, not just what it said wrong. This is the correct framing for any high-stakes domain.

Open-ended questions, not multiple choice. The RXQA benchmark specifically tests the kind of ambiguous, contextual questions that real practitioners ask. This is harder to score but more predictive of real-world performance.

140 dimensions, not one accuracy number. A single accuracy metric on a medical benchmark tells you almost nothing about whether a system is safe to deploy. Granular rubrics across empathy, red flag detection, exam guidance, and follow-up quality tell you much more.

If you’re building AI systems in any high-stakes domain — legal, financial, clinical, safety-critical engineering — this evaluation methodology is worth studying as a template. The question “how accurate is it” is almost always the wrong question. The right questions are: what does it fail to say, how does it handle ambiguity, and what does a domain expert think of the quality of its reasoning process, not just its final answer?



The Gap Is Real, and It’s Closing

DeepMind’s conclusion — supportive tool, not replacement — is the right one given the data. Human physicians still outperformed AI Co-clinician overall, particularly on the dimensions that matter most when something goes wrong. The premature closure on the rotator cuff case is a real limitation. The question of whether the AI would have spotted the eyelid droop without the patient mentioning it first is unresolved.

But 68 of 140 consultation dimensions matched or exceeded. Zero critical errors in 97 of 98 queries. A 33-point win over GPT-5.4 thinking with search in blind physician preference. These are not incremental improvements on a toy benchmark. They’re results from a rigorous evaluation built with Harvard and Stanford physicians, tested against real practitioners, and published with enough methodological transparency to critique.

The gap between AI and physician performance in clinical consultation is real. It’s also closing faster than most people in medicine expected. Builders who are paying attention to the methodology — not just the headline numbers — are the ones who will know how to evaluate the next set of results when they arrive.

Presented by MindStudio
