Google AI Co-Clinician vs. GPT-5.4 with Search: Which Medical AI Do Physicians Actually Prefer?

In blind physician evaluations, Google's AI Co-clinician beat GPT-5.4 thinking with search 63% to 30%. Here's what drove the gap.

MindStudio Team

GPT-5.4 Lost to a Medical AI You’ve Probably Never Heard Of

In a blind physician evaluation, Google’s AI Co-clinician beat GPT-5.4 thinking with search 63% to 30%. Not a close call. Not a rounding error. Thirty-three percentage points, in a domain where GPT-5.4 with search is supposed to be the strongest general-purpose reasoning tool available to anyone with a browser.

If you build AI products, that result should stop you cold. Because the gap isn’t explained by raw intelligence. It’s explained by something more specific — and more instructive.

Why the Margin Matters

GPT-5.4 thinking with search is not a weak baseline. It’s the model most people in AI circles reach for when the task is hard and the stakes are real. It reasons through problems step by step, it can pull live information, and it has been trained on more medical text than any physician will read in a lifetime. If you asked a room of AI engineers to pick the best general-purpose model for a medical consultation task, most would pick something in that family.

And yet, in head-to-head physician preference evaluations, AI Co-clinician won by a margin that isn’t close. For context: the same evaluation also compared AI Co-clinician against existing clinician-facing tools that physicians already use in practice — tools they’ve presumably chosen because they find them useful. AI Co-clinician won that comparison 67% to 26%, with 5% neutral.
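
For readers who like to see the mechanics, a preference split like 67/26/5 is just a tally of blinded pairwise judgments divided by the total. The sketch below is purely illustrative; the Judgment type and helper names are invented, not drawn from DeepMind's evaluation data.

```typescript
// Illustrative only: tallying blinded pairwise preference judgments.
// The Judgment type and counts are hypothetical, not DeepMind's raw data.
type Judgment = "co_clinician" | "comparator" | "neutral";

function preferenceBreakdown(judgments: Judgment[]) {
  const total = judgments.length || 1; // avoid divide-by-zero on an empty set
  const count = (v: Judgment) => judgments.filter((j) => j === v).length;
  const pct = (n: number) => Math.round((n / total) * 100);
  return {
    coClinician: pct(count("co_clinician")),
    comparator: pct(count("comparator")),
    neutral: pct(count("neutral")),
  };
}

// e.g. preferenceBreakdown(allJudgments) yields splits like the ones reported above.
```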

These aren’t benchmark scores on a standardized test. These are physicians watching consultations and saying which one they’d rather have in their corner.

The no-harm framework used in the evaluation tested for two distinct failure modes: errors of commission (the AI says something wrong) and errors of omission (the AI leaves out something critical). In medicine, the second category is often more dangerous than the first. A physician who confidently gives the wrong diagnosis is bad. A physician who fails to ask about radiating back pain in a patient with severe abdominal pain is potentially catastrophic.
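
To make those two failure modes concrete, here is a minimal sketch of how a per-case review record might separate them. The type and field names are assumptions for illustration; the published framework's actual schema isn't described at this level of detail.

```typescript
// Hypothetical schema sketch for a no-harm-style review.
// Field and type names are invented; only the commission/omission
// distinction comes from the evaluation described above.
type CriticalError =
  | { kind: "commission"; statement: string; whyWrong: string }    // the AI said something wrong
  | { kind: "omission"; missedItem: string; whyCritical: string }; // the AI left out something critical

interface CaseReview {
  caseId: string;
  reviewer: string;        // evaluating physician
  errors: CriticalError[]; // an empty array means no critical errors for this case
}

// A case passes the no-harm bar only when both failure modes are absent.
const passesNoHarm = (review: CaseReview): boolean => review.errors.length === 0;
```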

Across 98 realistic primary care queries, AI Co-clinician recorded zero critical errors in 97 of them. That’s the number DeepMind is leading with, and they’re right to. Medical AI has not hit that mark before, at least not under this kind of evaluation rigor.

What GPT-5.4 Can’t See

Here’s the non-obvious part of this comparison: GPT-5.4 thinking with search is a text-in, text-out system. Even with search enabled, it’s reasoning from what you type. AI Co-clinician is watching you.

That distinction is load-bearing. The three demo cases DeepMind published make this concrete.

In the rotator cuff case, the AI didn’t just ask about shoulder pain — it watched the patient perform range-of-motion tests through a live video feed and noted that the patient hesitated during the movement. That hesitation is clinical signal. A physician watching a patient wince and pause before completing a motion reads that differently than a patient who says “yeah, it hurts a little.” The AI caught it. It also did something that the physician evaluator specifically called out as best practice: it started the physical exam in the area that didn’t hurt, establishing a baseline before moving to the painful region. That’s not a trick. That’s standard clinical technique, and the AI did it unprompted.

In the myasthenia gravis case, the AI spotted the eyelid droop from video before the patient explicitly described it — though the evaluators noted a legitimate question about whether it would have flagged the droop if the patient had never mentioned it. More impressive was what came next: the AI requested a sustained upward gaze test, a physical exam maneuver specific enough that the Harvard physician evaluator said they had never thought to perform it via telehealth. The AI didn’t just reach the right diagnosis (myasthenia gravis over Lambert-Eaton, based on the fatigability pattern worsening through the day). It adapted a specialist-level exam to a telehealth context in a way that a generalist physician hadn’t considered.

GPT-5.4 with search cannot do any of that. It can tell you what a sustained upward gaze test is. It cannot watch you perform one.

This is the architectural gap that explains the 63-30 result. The comparison isn’t really about which model has better medical knowledge. It’s about what counts as a medical consultation. A consultation involves observation. AI Co-clinician observes. GPT-5.4 reads.

The Evidence Assembled

The benchmark structure is worth understanding in detail, because it’s more rigorous than most AI evaluations you’ll read about.

DeepMind worked with physicians at Harvard and Stanford to build 20 synthetic clinical scenarios. Ten real physicians then role-played as patients. The AI Co-clinician ran the consultations. The assessments covered 140 distinct dimensions — not “did it get the diagnosis right,” but empathy, bedside manner, red flag detection, follow-up question quality, and physical exam guidance. AI Co-clinician matched or exceeded primary care physicians in 68 of those 140 dimensions.
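
To see how a rubric like that rolls up into a headline number, here is a minimal sketch under obvious assumptions: the dimension names, scores, and scoring scale are placeholders, since the real rubric isn't published in this article.

```typescript
// Illustrative roll-up of a multi-dimension consultation rubric.
// Dimension names and scores are placeholders, not DeepMind's rubric.
interface DimensionScore {
  dimension: string;      // e.g. "empathy", "red flag detection"
  aiScore: number;
  physicianScore: number;
}

function matchedOrExceeded(scores: DimensionScore[]): number {
  return scores.filter((s) => s.aiScore >= s.physicianScore).length;
}

// With 140 dimensions scored this way, a result like "68 of 140" is just
// matchedOrExceeded(scores) computed over the full rubric.
```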

Human physicians still won overall, particularly on red flags and critical exam guidance. DeepMind is explicit about this. Their framing is “supportive tool, not replacement,” and the evaluation data supports that framing rather than contradicting it.

The drug knowledge results are separately interesting. The RXQA benchmark, built on open FDA data, tests AI systems on open-ended medication questions — drug interactions, dosing, edge cases — posed the way physicians actually ask them, not as multiple-choice options. AI Co-clinician surpassed every other frontier AI system on this benchmark. The significance of that is easy to miss: most medical AI benchmarks are multiple choice, which is a textbook format, not a clinical one. Real medication questions sound like “she’s been on this for ten years and now her ankles are swelling, is that related?” Open-ended, ambiguous, with patient context embedded. That’s the format AI Co-clinician is winning on.
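
The grading difference is easy to show in code. In the sketch below, the item shapes are invented and the open-ended grader is a naive stand-in for whatever rubric or expert review a benchmark like RXQA actually uses; the point is only that open-ended grading measures coverage of clinically relevant points rather than recall of a single correct option.

```typescript
// Two benchmark item shapes. The open-ended grader is a crude stand-in for
// expert or rubric-based review; item shapes are hypothetical.
interface MultipleChoiceItem {
  question: string;
  options: string[];
  correctIndex: number;
}

interface OpenEndedItem {
  question: string; // e.g. a free-text medication question with patient context
  rubric: string[]; // points a complete answer must cover
}

// Multiple choice: exact match, which rewards recall of textbook facts.
const gradeMC = (item: MultipleChoiceItem, chosen: number): number =>
  chosen === item.correctIndex ? 1 : 0;

// Open-ended: coverage of rubric points, which rewards clinically usable answers.
// The substring check here is purely for illustration.
const gradeOpenEnded = (item: OpenEndedItem, answer: string): number =>
  item.rubric.filter((point) => answer.toLowerCase().includes(point.toLowerCase()))
    .length / item.rubric.length;
```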

For builders thinking about how to evaluate AI systems in specialized domains, this benchmark design is instructive. The broader comparison of GPT-5.4 against other frontier models shows how much benchmark design shapes apparent performance — the same model can look very different depending on whether you’re testing it on structured or open-ended tasks.

Where the AI Fell Short

The rotator cuff case is the honest part of this evaluation, and it’s worth dwelling on.

The AI correctly identified a probable rotator cuff injury. It correctly recommended conservative treatment — rest, ice, physical therapy, over-the-counter pain medication — and correctly determined that an MRI wasn’t immediately necessary. Those are the right calls. The physician evaluator confirmed them.

But the same evaluator noted that the AI demonstrated “premature closure.” It chose roughly three of the eight most appropriate physical exam tests. It didn’t perform impingement testing. Without that, you can’t confidently distinguish rotator cuff tendinitis from adhesive capsulitis, and the differential matters for treatment. The AI got the management right, but it got there with less diagnostic certainty than the case warranted.

That’s a specific, nameable failure mode. Premature closure — settling on a diagnosis before exhausting the relevant exam — is a known cognitive error in medicine. The fact that AI Co-clinician exhibits it is not surprising. The fact that the evaluators caught it and named it precisely is what makes this evaluation credible.

This is also why the “supportive tool” framing is the right one, not just the responsible one. A physician using AI Co-clinician as a second opinion or a consultation scaffold would catch the missed impingement test. A patient using it alone might not know what they’re missing.

What This Means for Builders

The 63-30 result is not primarily a story about Google beating OpenAI. It’s a story about what happens when you build a system that matches the actual structure of the task rather than approximating it with a general-purpose model.

Medical consultation is multimodal. It involves observation, real-time adaptation, and physical examination. A text-in, text-out model — even one with search and extended reasoning — is solving a different problem than the one physicians actually face. AI Co-clinician was built for the actual problem. That’s why it won.

The implication for anyone building AI products in specialized domains is direct: the general-purpose frontier model is not always the right foundation. Sometimes the task has structure that a general model can’t access, and the right move is to build something that fits the structure of the task rather than hoping the general model is good enough.

This is also where the question of evaluation design becomes critical. If DeepMind had tested AI Co-clinician on multiple-choice medical benchmarks, the results would look different — and less meaningful. The 140-dimension consultation assessment, the open-ended RXQA format, the no-harm framework with errors of omission — these evaluation choices are what make the results legible. The strategic differences between how Anthropic, OpenAI, and Google approach agent infrastructure partly explain why Google was positioned to build something like this: their investment in multimodal, real-time, low-latency video processing is infrastructure that OpenAI hasn’t prioritized in the same way.

For teams building AI workflows that need to chain models, route tasks, or integrate specialized agents with broader systems, the architecture question matters as much as the model choice. Platforms like MindStudio handle this orchestration layer — 200+ models, 1,000+ integrations, and a visual builder for composing agents and workflows — which is relevant when your system needs to combine a specialized model like AI Co-clinician with downstream tools, documentation, or clinical decision support.
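
As a deliberately simplified picture of that orchestration decision, the sketch below routes a consultation to a specialized multimodal model when video is available and to a general text model otherwise. The model identifiers and the callModel helper are hypothetical; MindStudio's builder is visual, and no real API is shown here.

```typescript
// Hypothetical routing sketch. Model names and callModel() are invented;
// the point is the shape of the orchestration decision, not a real API.
interface ConsultTask {
  transcript: string;
  videoFrames?: Uint8Array[]; // present only for multimodal consultations
}

async function routeConsult(
  task: ConsultTask,
  callModel: (model: string, payload: unknown) => Promise<string>
): Promise<string> {
  // Specialized multimodal model when there is something to observe...
  if (task.videoFrames && task.videoFrames.length > 0) {
    return callModel("specialized-multimodal-clinician", task);
  }
  // ...general text model when the task really is text-in, text-out.
  return callModel("general-purpose-llm", { text: task.transcript });
}
```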

The RXQA results also point to something builders often underestimate: domain-specific training data quality compounds. The open web is a poor source for drug interaction data. FDA data is structured and authoritative. A model trained on the right data for the right task will outperform a general model trained on more data of lower relevance. This applies well beyond medicine — it applies anywhere the general web is a noisy proxy for the actual knowledge domain.
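
As one concrete example of what structured and authoritative looks like, the FDA exposes drug labeling through its public openFDA API. The sketch below pulls the drug-interactions section of a label; the endpoint and field names reflect a reading of the openFDA documentation, so treat them as assumptions to verify against the docs before relying on them.

```typescript
// Sketch of pulling structured drug-label data from openFDA.
// Endpoint and field names are assumptions to verify against
// the official openFDA documentation.
async function fetchDrugInteractions(brandName: string): Promise<string[]> {
  const url =
    "https://api.fda.gov/drug/label.json?search=openfda.brand_name:%22" +
    encodeURIComponent(brandName) +
    "%22&limit=1";

  const res = await fetch(url);
  if (!res.ok) throw new Error(`openFDA request failed: ${res.status}`);

  const data = await res.json();
  // drug_interactions is one of the structured labeling sections, when present.
  return data.results?.[0]?.drug_interactions ?? [];
}
```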

For teams thinking about how to build specialized AI applications on top of existing infrastructure, the spec-driven approach is worth considering. Remy, MindStudio’s full-stack app compiler, takes annotated markdown specs as input and compiles them into complete TypeScript applications — backend, database, auth, deployment — treating the spec as the source of truth rather than the generated code. The same principle applies to AI system design: the cleaner and more precise your specification of the task, the better the system you can build for it.

The Gap Is Closing, and the Direction Is Clear

DeepMind’s conclusion — supportive tool, not replacement — is accurate based on the data. Human physicians still outperform AI Co-clinician overall, particularly on the highest-stakes decisions. The 68 of 140 dimensions where AI Co-clinician matched or exceeded primary care physicians is impressive, but the remaining 72 are still a lot of ground.

What’s notable is the trajectory. This is the first time a multimodal medical AI has been evaluated this rigorously and held up under that scrutiny. The evaluation methodology itself — real physicians as patient role-players, 140 consultation dimensions, open-ended drug questions, no-harm framework — is a contribution independent of the results.

And the 63-30 result against GPT-5.4 thinking with search tells you something specific: when the task has structure that a general model can’t access, a purpose-built system wins. Not because it’s smarter in the abstract. Because it’s solving the right problem.

That’s the lesson worth taking from this comparison. Not that Google beat OpenAI. That task-fit beats raw capability when the task is hard enough.

The broader benchmark comparisons between GPT-5.4, Claude Opus 4.6, and Gemini show similar patterns in other domains — the right model for a specific workflow often isn’t the one with the highest aggregate score. Medical consultation just makes the stakes legible in a way that coding benchmarks don’t.

Presented by MindStudio
