Google AI Co-clinician vs GPT-5.4 Thinking: Which Medical AI Do Physicians Actually Prefer?

In blind physician evaluations, Google's AI Co-clinician beat GPT-5.4 thinking with search 63% to 30%. Here's what drove the gap.

MindStudio Team

63% to 30%: What Physician Preference Tells You About Medical AI in 2026

When you’re choosing between Google’s AI Co-clinician and GPT-5.4 thinking with search for a medical AI application, you’re not choosing between a good option and a bad one. You’re choosing between two genuinely capable systems — and the gap between them turns out to be real, measurable, and instructive about what actually matters in clinical AI.

The headline number: in a blind physician preference test, AI Co-clinician beat GPT-5.4 thinking with search 63% to 30%. That’s not a rounding error. That’s a 33-point spread against the model most people currently pay $20/month for, evaluated by physicians who didn’t know which system they were looking at.

Understanding why that gap exists tells you something useful — not just about these two systems, but about what the next generation of medical AI actually needs to do.


The Test Was Designed to Be Hard to Game

Before you can interpret the 63-30 result, you need to understand what the evaluation was actually measuring. This wasn’t a benchmark where you feed the model a multiple-choice question and count correct answers. DeepMind built something more adversarial.

The evaluation framework they used is called a “no harm” framework, adapted with academic physicians. It tests for two distinct failure modes: errors of commission (the AI says something false) and errors of omission (the AI fails to surface something critical). In medicine, the second category is at least as dangerous as the first. A system that never hallucinates but routinely fails to flag a red flag is not a safe system.
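
To make that distinction concrete, here's a minimal sketch of how a reviewer-facing error record might separate the two failure modes and tally clean cases. The type names and fields are illustrative assumptions, not DeepMind's actual "no harm" schema.

```typescript
// Illustrative only: a toy typing of the commission/omission distinction.
// These names are assumptions, not DeepMind's published framework.

type ErrorKind = "commission" | "omission"; // said something false vs. failed to surface something critical

interface ReviewFinding {
  kind: ErrorKind;
  critical: boolean; // would a physician reviewer judge this capable of causing harm?
  note: string;      // reviewer's free-text rationale
}

interface CaseReview {
  caseId: string;
  findings: ReviewFinding[];
}

// A case clears the "no harm" bar only if it has zero critical findings of either kind.
const isClean = (review: CaseReview): boolean =>
  review.findings.every((f) => !f.critical);

function summarize(reviews: CaseReview[]): { clean: number; total: number } {
  return { clean: reviews.filter(isClean).length, total: reviews.length };
}
```

On the result described below, a summary like this would report 97 clean cases out of 98.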

Across 98 realistic primary care queries, AI Co-clinician recorded zero critical errors in 97 of 98 cases. That figure has been described as a bar medical AI has not hit before. The 98th case isn't described as a catastrophic failure — but the 97/98 result is the kind of specificity that separates a real evaluation from a marketing claim.

The physician preference comparison against GPT-5.4 thinking with search was run separately from the zero-error benchmark, but they’re measuring related things. Physicians preferred AI Co-clinician not because it was friendlier or had a better interface — they were evaluating clinical reasoning, appropriate triage, and the quality of the consultation itself.


What GPT-5.4 Thinking With Search Can’t See

Here’s the structural difference that explains most of the gap: AI Co-clinician is a real-time, low-latency video model. It watches the patient through a camera. GPT-5.4 thinking with search, however capable its reasoning, is processing text.

That distinction matters enormously in the three demo cases DeepMind published.

In the acute pancreatitis case, the AI didn't just ask about abdominal pain — it guided the patient through a physical exam via video, correctly identifying epigastric tenderness by instructing the patient to palpate above the belly button rather than at it (standard practice: start where you don't expect pain, to establish a baseline). It then tested for rebound tenderness. One of the physician evaluators noted they personally wouldn't attempt rebound tenderness assessment via telehealth, but acknowledged the question itself was clinically appropriate.

In the myasthenia gravis case, the AI spotted ptosis (eyelid droop) visually through the camera. There’s a nuance here worth flagging: the AI’s thought log showed it had noticed the drooping, but it only verbalized the observation after the patient mentioned it. Whether the AI would have raised it unprompted is genuinely unclear. The Harvard physician evaluator asked exactly this question: had the patient talked about something else entirely, would the AI have flagged the drooping eyelid? Unknown.

What is clear is that the AI then asked exactly the right follow-up question: does the double vision get worse as the day goes on? That question is designed to distinguish myasthenia gravis from Lambert-Eaton myasthenic syndrome — a condition that improves as the day progresses rather than worsening. That’s specialist-level differential diagnosis reasoning, not pattern matching on chief complaint.

The AI then instructed the patient to sustain upward gaze for 30 seconds — a telehealth-adapted physical exam maneuver for myasthenia gravis that the Harvard physician evaluator said they had never thought to do via telehealth. That’s not the AI following a checklist. That’s the AI generating a novel clinical approach appropriate to the constraints of the medium.

GPT-5.4 thinking with search can reason about myasthenia gravis. It can tell you about the Lambert-Eaton distinction. What it cannot do is watch your eyelid droop in real time, notice that you hesitated when lifting your arm, or instruct you through a range-of-motion exam and observe the result.


The Drug Knowledge Gap Is a Separate Problem

The physician preference comparison is one data point. The RXQA benchmark result is a different one, and it matters for a different reason.

RXQA is built on open FDA data and asks open-ended medication questions — the kind of questions that arise in real consultations. Drug interactions. Dosing edge cases. The patient who’s been on a medication for ten years and is now developing new symptoms. AI Co-clinician surpassed every other frontier AI system on this benchmark.

The key word is “open-ended.” Most medical AI benchmarks are multiple choice. Multiple choice tests whether a model can recognize the right answer when it’s presented. Open-ended tests whether the model can generate the right answer from scratch, with the messy context of an actual patient situation. Those are different cognitive tasks, and historically LLMs trained on general web data have struggled with the second one because drug information on the open web is unreliable.
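
As a rough illustration of why those are different tasks, the sketch below scores a multiple-choice item by checking the selected option, and an open-ended answer by checking what it covered and what it wrongly asserted. The keyword-matching grader is a deliberately crude stand-in, not the RXQA methodology.

```typescript
// Toy scoring contrast: recognizing an answer vs. generating one.
// The grading here is a keyword check for illustration only.

type MCItem = { options: string[]; answerIndex: number; modelChoice: number };
type OpenItem = { modelAnswer: string; mustCover: string[]; mustNotClaim: string[] };

// Multiple choice: the model only has to pick the right option.
const scoreMC = (item: MCItem): boolean => item.modelChoice === item.answerIndex;

// Open-ended: the generated answer is checked for what it covered (omission)
// and for anything it asserted that the rubric marks as false (commission).
function scoreOpen(item: OpenItem): { omissions: string[]; commissions: string[] } {
  const text = item.modelAnswer.toLowerCase();
  return {
    omissions: item.mustCover.filter((fact) => !text.includes(fact.toLowerCase())),
    commissions: item.mustNotClaim.filter((claim) => text.includes(claim.toLowerCase())),
  };
}

// Example: a dosing answer that covers renal adjustment but omits the
// interaction warning the rubric requires.
const result = scoreOpen({
  modelAnswer: "Reduce the dose in renal impairment and take with food.",
  mustCover: ["renal", "interaction"],
  mustNotClaim: ["safe in pregnancy"],
});
console.log(result); // { omissions: ["interaction"], commissions: [] }
```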

The RXQA result suggests AI Co-clinician has been trained or fine-tuned on higher-quality pharmaceutical data, or that its reasoning architecture handles the ambiguity of real drug questions better than general-purpose models. Probably both.


The 140-Dimension Assessment and What 68/140 Actually Means

DeepMind worked with physicians at Harvard and Stanford to build 20 synthetic clinical scenarios. Ten real physicians role-played as patients. The AI ran the consultations. Then they assessed performance across 140 dimensions of consultation skill.

Not 140 questions. 140 dimensions. Empathy. Bedside manner. Red flag detection. Follow-up question quality. Physical exam guidance. That’s a more complete picture of what a clinical consultation actually requires than any benchmark that reduces medicine to correct/incorrect.

AI Co-clinician matched or exceeded primary care physicians in 68 of those 140 dimensions. Human physicians still won overall, particularly on red flag detection and critical exam guidance. DeepMind’s stated conclusion is that this is a supportive tool, not a replacement.

That framing is accurate and also somewhat undersells what 68/140 represents. This is the first time a multimodal medical AI has been evaluated this rigorously and held up at this level. The gap is real. It’s also closing.

For builders thinking about where to deploy AI in clinical workflows, 68/140 is a useful map. The dimensions where AI performs at physician level are probably the ones where AI augmentation adds the most value with the least risk. The dimensions where humans still dominate are the ones that need human oversight in any deployment.
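
One way to operationalize that map is sketched below, with made-up dimension names and scores: treat any dimension where the AI matches or exceeds the physician baseline as a candidate for AI augmentation with oversight, and keep everything else human-led.

```typescript
// Sketch of using per-dimension results as a deployment map.
// Dimension names and scores are invented for illustration, not DeepMind's published rubric.

interface DimensionResult {
  name: string;
  aiScore: number;        // AI Co-clinician on this dimension
  physicianScore: number; // primary care physician baseline
}

type Placement = "ai-with-oversight" | "human-led";

function placeDimension(d: DimensionResult): Placement {
  return d.aiScore >= d.physicianScore ? "ai-with-oversight" : "human-led";
}

const dimensions: DimensionResult[] = [
  { name: "follow-up question quality", aiScore: 4.2, physicianScore: 4.0 },
  { name: "red flag detection", aiScore: 3.1, physicianScore: 4.5 },
];

for (const d of dimensions) {
  console.log(`${d.name}: ${placeDimension(d)}`);
}
```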


What This Means If You’re Building Medical AI Applications

The 63-30 physician preference result is not primarily a statement about which model to use. It’s a statement about what architecture to use.

If you’re building a medical AI application that operates purely in text — intake forms, documentation, prior authorization drafts, clinical note summarization — GPT-5.4 thinking with search is a genuinely capable tool. The gap between it and AI Co-clinician in that context is probably much smaller than 33 points, because the video capability that drives most of AI Co-clinician’s advantage simply doesn’t apply.

If you’re building anything that involves real-time patient interaction, physical assessment, or the kind of dynamic clinical reasoning that depends on observing the patient rather than just hearing them describe themselves, the architecture question becomes central. Text-only models have a structural ceiling in that context.

The rotator cuff case illustrates this precisely. The AI noticed that the patient hesitated while lifting his arm. It noticed that the patient only performed one of the two movements it requested, and correctly followed up on the missing one. Those observations came from watching video. A text-based system would have had to rely entirely on what the patient chose to report.

When you’re thinking about which models to chain together for a clinical workflow, the orchestration layer matters as much as the model selection. Platforms like MindStudio handle this kind of multi-model composition — 200+ models, 1,000+ integrations, and a visual builder for connecting agents and workflows — which becomes relevant when you’re trying to combine a video-capable model with downstream documentation, referral, or scheduling systems.
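
As a shape-of-the-architecture sketch, the snippet below routes a video-capable consultation model into text-only downstream steps for documentation and referral. Every interface and function name here is a hypothetical placeholder, not MindStudio's or Google's actual API.

```typescript
// Hypothetical orchestration shape: video-capable consultation up front,
// text-only models for the downstream paperwork.

interface ConsultationTranscript {
  patientId: string;
  observations: string[];   // what the video model saw, e.g. "hesitation lifting right arm"
  dialogue: string[];       // the spoken exchange
  triage: "self-care" | "routine-visit" | "urgent";
}

interface VideoConsultModel {
  runConsultation(patientId: string): Promise<ConsultationTranscript>;
}

interface TextModel {
  complete(prompt: string): Promise<string>;
}

// Documentation and referral only need text, so a text-only model is a
// reasonable fit there even if the consultation itself is not.
async function clinicalWorkflow(
  video: VideoConsultModel,
  text: TextModel,
  patientId: string
): Promise<{ note: string; referral: string | null }> {
  const transcript = await video.runConsultation(patientId);

  const note = await text.complete(
    `Draft a SOAP-style visit note from this transcript:\n${JSON.stringify(transcript)}`
  );

  const referral =
    transcript.triage === "urgent"
      ? await text.complete(`Draft an urgent referral summary for patient ${patientId}.`)
      : null;

  return { note, referral };
}
```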


The Honest Limitations

The physician evaluators in the demos were not uniformly impressed. The rotator cuff case drew specific criticism: the AI chose roughly three of the eight most appropriate physical exam tests, didn’t do impingement testing, and the evaluator noted that without those additional tests, you can’t definitively distinguish rotator cuff tendinitis from adhesive capsulitis. The AI reached the right conservative treatment recommendation, but via premature closure on the diagnosis.

The rebound tenderness question in the pancreatitis case was flagged as something the evaluator personally wouldn’t ask via telehealth — not because it’s wrong, but because rebound tenderness is typically assessed with the patient lying down in front of you.

And the ptosis question remains open. The AI's thought log shows it noticed the drooping eyelid, but it didn't mention it until the patient did. That's a meaningful difference from a human clinician, who would typically open with "I notice your eyelid is drooping."

These aren’t disqualifying failures. They’re the kind of specific, documented limitations that a responsible deployment needs to account for. DeepMind published them, which is more than most AI labs do with their medical evaluations.


The Model Comparison You Should Actually Be Running

The 63-30 result is a useful prior, but it’s not your answer. Your answer depends on your specific use case, your patient population, your regulatory context, and what failure modes you can tolerate.

For AI builders evaluating models for clinical applications, the comparison framework DeepMind used — errors of commission versus errors of omission, tested against realistic open-ended queries rather than multiple choice — is worth adopting regardless of which model you’re evaluating. If you want a sense of how GPT-5.4 stacks up against other frontier models on general reasoning tasks, the GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro benchmark comparison covers the broader landscape. For understanding how GPT-5.4 performs specifically in agentic and sub-agent contexts, the GPT-5.4 Mini vs Claude Haiku sub-agent comparison is relevant if you’re thinking about multi-step clinical workflows.
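
If you do run your own comparison, the core loop is simple to reproduce: the same open-ended query set goes to both systems, and a reviewer picks between unlabeled answers. A minimal sketch, with the models and reviewer callback left as placeholders you supply:

```typescript
// Minimal blinded pairwise comparison over an open-ended query set.
// modelA, modelB, and the reviewer callback are placeholders you provide.

async function blindComparison(
  queries: string[],
  modelA: (q: string) => Promise<string>,
  modelB: (q: string) => Promise<string>,
  // Reviewer sees two unlabeled answers and picks the better one, or a tie.
  review: (query: string, answers: [string, string]) => Promise<0 | 1 | "tie">
): Promise<{ A: number; B: number; tie: number }> {
  const tally = { A: 0, B: 0, tie: 0 };

  for (const q of queries) {
    const [a, b] = [await modelA(q), await modelB(q)];
    const swapped = Math.random() < 0.5;                 // hide which system is which
    const shown: [string, string] = swapped ? [b, a] : [a, b];

    const pick = await review(q, shown);
    if (pick === "tie") {
      tally.tie += 1;
    } else {
      const winnerIsA = swapped ? pick === 1 : pick === 0;
      tally[winnerIsA ? "A" : "B"] += 1;
    }
  }
  return tally;
}
```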

The strategic question for medical AI in 2026 isn’t “which model is smarter.” It’s “which architecture can observe the patient, reason about what it observes, and adapt the examination in real time.” AI Co-clinician’s 33-point lead over GPT-5.4 thinking with search is largely a function of that architectural difference, not raw reasoning capability.

That’s a useful thing to know when you’re deciding where to invest your development time. If you’re building clinical documentation tools, model selection matters and GPT-5.4 is competitive. If you’re building anything that involves watching a patient, you’re in different territory — and the evaluation framework needs to reflect that.

One practical note for teams building clinical AI applications that need to go from specification to deployed product: tools like Remy take a spec-driven approach where you write your application as annotated markdown and compile it into a complete TypeScript stack with backend, database, auth, and deployment. That matters when your clinical AI spec needs to be auditable and your generated code needs to be real, not a scaffold.

The gap between AI Co-clinician and GPT-5.4 in physician preference is 33 points. The gap between a text-only architecture and a video-capable one in clinical settings is probably larger. Build accordingly.


The Anthropic vs OpenAI vs Google agent strategy comparison covers the broader strategic picture of where each lab is placing its bets on agent infrastructure — which is the relevant context for understanding why Google built AI Co-clinician as a video-first system rather than a text-first one with vision bolted on. And if you’re tracking where the GPT-5.4 vs Claude Opus 4.6 capability comparison lands on non-medical tasks, that’s a useful baseline for understanding what GPT-5.4 thinking with search brings to the table before you factor in the clinical-specific training that AI Co-clinician adds.

The 97/98 zero-critical-error result is the number that will age well or poorly depending on what happens in deployment. Everything else in this evaluation is a snapshot of where medical AI stands in mid-2026. That number is a claim about safety, and safety claims in medicine get tested by reality eventually.

Presented by MindStudio
