
Google DeepMind AI Co-clinician: 6 Benchmark Results That Redefine Medical AI in 2026

Preferred by physicians 67% of the time, zero critical errors in 97 of 98 cases, and a 63%-to-30% win over GPT-5.4 thinking — here's what the numbers actually show.

MindStudio Team

Six Numbers That Tell You Where Medical AI Actually Stands in 2026

Google DeepMind published benchmark results for its AI Co-clinician this spring, and the numbers are specific enough to be worth sitting with. Physicians preferred it 67% of the time over existing clinical tools. It recorded zero critical errors in 97 out of 98 realistic primary care queries. And it was assessed across 140 separate dimensions of consultation skill. Those three figures — 67% physician preference, zero critical errors in 97/98 cases, the 140-dimension assessment — are the scaffolding on which the rest of the results hang. If you build AI systems, or think seriously about where they’re headed, this is the benchmark report to understand.

Here is what each number actually shows.


The Physician Preference Test: 67% vs. 26%

DeepMind ran a blind head-to-head comparison. Physicians evaluated the AI Co-clinician against the clinical tools they were already using in March 2026. No marketing framing, no cherry-picked cases. The result: 67% preferred the AI Co-clinician, 26% preferred the existing tool, and 5% were neutral.
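
For readers who build evaluations themselves, here is a minimal sketch of what a blind pairwise preference protocol like this can look like in code. It is illustrative only, not DeepMind's actual harness; the function names and the "left/right/neutral" rater interface are assumptions.

```python
import random
from collections import Counter

def blind_trial(case, system_a, system_b, rater):
    """Show the rater two anonymized responses in random order and
    record which underlying system (if either) they preferred."""
    responses = [("A", system_a(case)), ("B", system_b(case))]
    random.shuffle(responses)  # blind the rater to which system produced which answer
    (left_label, left_text), (right_label, right_text) = responses
    choice = rater(case, left_text, right_text)  # expected to return "left", "right", or "neutral"
    return {"left": left_label, "right": right_label}.get(choice, "neutral")

def preference_rates(cases, system_a, system_b, rater):
    """Tally preferences across all cases and return proportions,
    e.g. {'A': 0.67, 'B': 0.26, 'neutral': 0.05}."""
    tally = Counter(blind_trial(c, system_a, system_b, rater) for c in cases)
    total = sum(tally.values())
    return {label: tally[label] / total for label in ("A", "B", "neutral")}
```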

That’s not a close race. When a new tool beats the incumbent by more than two-to-one in a blind test with the actual intended users, you’re looking at a meaningful capability gap, not a marginal improvement.


The test also included a second comparison that’s arguably more interesting. DeepMind ran the AI Co-clinician against GPT-5.4 thinking with search — the model most people paying $20/month currently use. The AI Co-clinician won that matchup 63% to 30%. That’s a 33-point gap against a frontier general-purpose model with search access, tested by physicians on clinical tasks. If you want context on where GPT-5.4 sits relative to other frontier models, the GPT-5.4 vs Claude Opus 4.6 comparison is a useful reference point. The preference isn’t explained by novelty; it’s explained by the system being purpose-built for this specific context.

What makes this preference test credible is the evaluation framework underneath it. DeepMind adapted what they call a “no harm” framework, developed with academic physicians, that tests for two distinct failure modes: errors of commission (the AI says something false) and errors of omission (the AI fails to mention something critical). In medicine, both can cause harm. A system that never lies but routinely omits red flags is still dangerous. The blind preference test was scored against both dimensions.
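
As a rough illustration of how a commission/omission split can be scored, here is a short Python sketch. The claim sets are assumed to come from expert review of the response; nothing here reflects DeepMind's actual rubric.

```python
def score_no_harm(response_claims, reference_facts, critical_items):
    """Score a single response under a commission/omission split.

    response_claims: claims a reviewer extracted from the AI's answer
    reference_facts: claims the reviewers judged clinically correct
    critical_items:  red flags or escalation advice the answer must surface
    """
    commission = [c for c in response_claims if c not in reference_facts]          # said something unsupported
    omission = [item for item in critical_items if item not in response_claims]    # failed to mention something critical
    return {
        "commission": commission,
        "omission": omission,
        "clean": not commission and not omission,  # clean only if both lists are empty
    }
```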


Zero Critical Errors in 97 of 98 Cases

The 98-query benchmark is the number that the medical AI community will argue about most. DeepMind ran the AI Co-clinician through 98 realistic primary care queries — the kind of open-ended, ambiguous questions that actual patients ask — and recorded zero critical errors in 97 of them.

One case had a critical error. Ninety-seven did not. That’s a 98.98% clean rate on a benchmark designed to surface failure.

The framing in the source material is careful: this is described as a number medical AI has not hit before. That’s a strong claim, and it’s worth holding up to the light. The benchmark is 98 queries, not 98,000. Primary care queries, not edge-case tertiary referrals. The cases were realistic but synthetic. None of that invalidates the result — it contextualizes it. A 97/98 clean rate on a rigorous, physician-designed benchmark is genuinely significant. It’s also a starting point, not a finish line.
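
One way to make that caveat concrete, not from the source but as a back-of-envelope check, is to put a confidence interval around 97 out of 98. The Wilson score interval below is a standard choice for proportions near 1.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 97 clean cases out of 98: the point estimate is ~99%,
# but the interval spans roughly 94.5% to 99.8%.
print(wilson_interval(97, 98))
```

At this sample size the data are still consistent with a true clean rate meaningfully below 99%, which is exactly the "starting point, not a finish line" caveat.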

The distinction between errors of commission and omission matters here too. An AI that hedges everything technically avoids commission errors but fails on omission. The AI Co-clinician was evaluated on both, which makes the 97/98 figure harder to game than a simpler accuracy metric.


The 140-Dimension Consultation Assessment

This is the benchmark that required the most infrastructure to run. DeepMind worked with physicians at Harvard and Stanford to build 20 synthetic clinical scenarios. Then they recruited 10 real physicians to role-play as patients. The AI Co-clinician ran the consultations. Evaluators then scored the consultations across 140 separate dimensions.

One hundred and forty. Not “did it get the diagnosis right” — that’s one dimension. The 140 include empathy, bedside manner, red flag detection, how it asks follow-up questions, whether it correctly guided a physical exam, how it handles uncertainty, how it communicates next steps. The full range of what a good consultation actually requires.

The headline result: the AI Co-clinician performed at a level comparable to or exceeding primary care physicians in 68 of those 140 dimensions. That’s 48.5% of the assessment criteria. Human physicians still won overall, particularly on red flag detection and guiding critical physical exams. DeepMind’s stated conclusion is that this is a supportive tool, not a replacement. That’s the responsible framing, and it’s also accurate given the 68/140 figure.
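
To be precise about what that figure is, 68/140 is a per-dimension tally, not a single overall score. A minimal sketch of how such a tally could be computed, with placeholder dimension names and scales:

```python
def dimensions_at_or_above(ai_scores, physician_scores):
    """Count rubric dimensions where the AI's mean rating meets or exceeds
    the physicians' mean rating. Both arguments map dimension name -> mean score."""
    wins = [d for d in physician_scores
            if ai_scores.get(d, float("-inf")) >= physician_scores[d]]
    return len(wins), len(physician_scores)

# Illustrative only: with 140 dimensions, a return value of (68, 140)
# would correspond to the reported result.
```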

But 68 out of 140 is not a rounding error. It means that in roughly half the dimensions of a clinical consultation, an AI system is now performing at or above the level of a trained physician. The gap is real, and it’s closing.


The RXQA Drug Knowledge Benchmark

Drug knowledge is where general-purpose LLMs have historically struggled most. The open web is full of medication misinformation, outdated dosing guidelines, and incomplete interaction data. Models trained on that corpus inherit its gaps.

The RXQA benchmark is built on open FDA data and tests AI systems on open-ended drug questions — the kind a doctor would actually ask during a consultation. Drug interactions, dosing edge cases, contraindications in complex patients. Not multiple choice. Open-ended, with the messy context of a real patient embedded in the question.

The AI Co-clinician surpassed every other frontier AI system on this benchmark. The margin was particularly pronounced on open-ended questions, which is the format that matters clinically. Multiple-choice medical benchmarks are essentially textbook tests. Real medicine is “my patient has been on this for ten years and now her ankles are swelling — is that the drug?” RXQA is designed to approximate that format, and the AI Co-clinician won it.
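
A rough sketch of how open-ended grading differs from multiple choice is below, with hypothetical field names (the required and unsafe points standing in for facts graders derive from FDA labeling). It is not the RXQA scoring code.

```python
from dataclasses import dataclass, field

@dataclass
class RxCase:
    question: str                                       # open-ended, with patient context embedded
    required_points: set = field(default_factory=set)   # facts graders expect the answer to cover
    unsafe_points: set = field(default_factory=set)     # claims that would count as errors

def grade_open_ended(answer_points, case):
    """Credit coverage of required points, flag unsafe claims.
    `answer_points` is the set of claims a reviewer extracted from
    the model's free-text answer."""
    covered = case.required_points & answer_points
    violations = case.unsafe_points & answer_points
    coverage = len(covered) / len(case.required_points) if case.required_points else 1.0
    return {"coverage": coverage, "unsafe_claims": sorted(violations)}
```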

This result is connected to the system’s architecture. The AI Co-clinician is a real-time, low-latency video model — it watches the patient through a camera, guides physical exams, and integrates what it sees with what it hears. That multimodal grounding appears to improve clinical reasoning in ways that text-only systems don’t replicate. A model that can observe a patient’s face while asking about their symptoms is working with more signal than one that’s reading a transcript. The broader trend toward multimodal, real-time AI is visible in other domains too — Pika Me’s real-time video chat capability is a useful comparison point for understanding how low-latency video inference is being deployed outside medicine.


The Video Exam Capability: What the Demos Actually Showed

The three demo cases published alongside the benchmarks are worth examining in detail, because they show both the capability and the limits.

In the acute pancreatitis case, a physician role-played a patient with severe abdominal pain. The AI Co-clinician guided a physical exam via video, correctly identified epigastric tenderness by asking the patient to palpate above the belly button (after first checking the belly button itself, which is standard practice — start where you don’t expect pain to establish a baseline), and recommended emergency care. The physician evaluator noted that the AI’s differential — appendicitis or pancreatitis — was reasonable, and that the next steps were completely correct. The diagnosis wasn’t pinned down definitively, but the triage was right.

The myasthenia gravis case is more technically interesting. The short version is that the AI performed a telehealth-adapted upward-gaze test that a Harvard physician evaluator said they had never thought to do via telehealth. The AI’s thought log showed it had correctly identified the fatigable nature of the ptosis and diplopia as pointing toward myasthenia gravis rather than Lambert-Eaton syndrome, which gets better as the day progresses rather than worse.


The rotator cuff case showed both the capability and a specific limitation. The AI correctly guided a range-of-motion exam, noted the patient’s hesitation during movement, and recommended conservative treatment — rest, ice, physical therapy, over-the-counter pain medication — rather than immediately ordering imaging. That’s the right call for the presentation. But the evaluating physician noted that the AI stopped at roughly three of the eight most relevant physical exam tests, didn’t complete the impingement testing, and may have reached a diagnosis before fully ruling out adhesive capsulitis. The technical term for this is premature closure, and it’s a known failure mode in human physicians too. The AI’s version of it is visible in the demo.


What the Evaluation Infrastructure Tells You

The benchmarks matter, but so does how they were built. DeepMind used 20 synthetic clinical scenarios developed with Harvard and Stanford physicians, 10 real physicians as patient role-players, and a 140-dimension rubric. That’s a more rigorous evaluation setup than most AI medical benchmarks, which tend to rely on multiple-choice question sets that don’t approximate real consultations.

The no-harm framework — testing separately for commission and omission errors — is also methodologically important. It’s the difference between measuring whether an AI sounds confident and measuring whether it’s actually safe. Most benchmark leaderboards don’t make that distinction.

Building this kind of evaluation infrastructure is genuinely hard. It requires clinical expertise, scenario design, role-player recruitment, and a rubric that captures the full complexity of a consultation. For teams thinking about how to evaluate AI agents in high-stakes domains, the DeepMind approach is a useful model: synthetic scenarios built with domain experts, real humans as evaluators, and a multi-dimensional rubric that separates different types of failure. MindStudio handles the orchestration layer for multi-model agents — 200+ models, 1,000+ integrations, visual workflow builder — and the same principle applies there: the evaluation design has to come from the domain itself, not from the platform. Meanwhile, tools like Remy apply a similar philosophy to software specification — you write annotated markdown that captures intent and edge cases precisely, and the full-stack application is compiled from it — the spec is the source of truth, not an afterthought.
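
For teams who want to borrow the structure, here is a minimal sketch of the shape of that evaluation plan, with all names invented for illustration: scenarios authored with domain experts, human role-players and raters, and a rubric that keeps error modes separate.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    id: str
    brief: str          # authored with domain experts
    role_player: str    # the human who plays the counterpart (patient, client, ...)

@dataclass
class EvalPlan:
    scenarios: list
    rubric_dimensions: list                            # e.g. 140 consultation dimensions
    error_modes: tuple = ("commission", "omission")    # failure types the rating function keeps separate
    raters: list = field(default_factory=list)         # human domain experts

def run_eval(plan, run_agent, rate):
    """Run each scenario through the agent, then have every rater score
    the transcript on every rubric dimension."""
    results = {}
    for scenario in plan.scenarios:
        transcript = run_agent(scenario)
        results[scenario.id] = {
            rater: {dim: rate(rater, scenario, transcript, dim)
                    for dim in plan.rubric_dimensions}
            for rater in plan.raters
        }
    return results
```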


The Limits the Numbers Don’t Show

The 97/98 clean rate is on 98 queries. The 68/140 performance is on 20 synthetic scenarios. The RXQA benchmark is built on FDA data, which is comprehensive but not exhaustive. These are strong results on well-designed benchmarks, and they’re also not the same as a clinical trial.

The myasthenia gravis case raised a specific question that the evaluators flagged: the AI’s thought log showed it had noticed the drooping eyelid, but it only verbalized that observation after the patient mentioned it. Would it have brought up the ptosis if the patient had talked about something else entirely? That’s not a rhetorical question — it’s a real uncertainty about whether the visual processing is proactive or reactive. The evaluators didn’t have a definitive answer.

The rebound tenderness question in the pancreatitis case is another example. The AI asked about rebound tenderness — a valid clinical question — but the physician evaluator noted that rebound is typically assessed with the patient lying down, and asking about it via telehealth is unusual. The AI asked a technically correct question in a clinically awkward context. That’s a subtle failure mode that a 140-dimension rubric might not fully capture.


So What Does 68 Out of 140 Actually Mean?

DeepMind’s conclusion — supportive tool, not a replacement — is accurate and appropriate. It’s also, if you look at the trajectory, a description of the current moment rather than a permanent ceiling.


The AI Co-clinician is a real-time, low-latency video model that can guide physical exams, reason about drug interactions, and perform at or above primary care physician level in 48.5% of consultation dimensions. It beat GPT-5.4 thinking with search by 33 points in blind physician preference. It recorded zero critical errors in 97 of 98 cases on a benchmark designed to find them.

For teams building AI systems in other high-stakes domains — legal, financial, educational — the DeepMind evaluation methodology is as instructive as the results. The combination of synthetic scenarios built with domain experts, real human role-players, multi-dimensional rubrics, and separate testing for commission and omission errors is a template worth borrowing. If you’re building agents that operate in domains where what the AI doesn’t say matters as much as what it does, that framework is the right starting point.

The 68 out of 140 number will be higher in the next evaluation. The question worth tracking is which 72 dimensions the AI still loses on, and how fast those gaps close.

The broader pattern here — purpose-built, multimodal, domain-specific systems pulling ahead of general-purpose models on specialized tasks — is visible across the frontier model landscape. The AutoResearch loop pattern that Karpathy described is one framework for thinking about how iterative, domain-grounded AI systems compound their advantages over time. The AI Co-clinician results are a concrete example of that compounding in a high-stakes vertical. It’s a pattern worth watching across every domain where the cost of what an AI omits is as high as the cost of what it gets wrong.

Presented by MindStudio
