Harvard and Stanford Physicians Built the Toughest Medical AI Benchmark Yet — Here's How AI Co-Clinician Scored

DeepMind's evaluation used 140 consultation dimensions, 20 synthetic clinical scenarios, and 10 real physicians as role-playing patients. Here are the results.

MindStudio Team

When you want to know if a medical AI is genuinely good, you don’t run it through a multiple-choice test. You build 20 synthetic clinical scenarios with physicians from Harvard and Stanford, recruit 10 real doctors to role-play as patients, and then score the AI across 140 distinct dimensions of consultation skill. That’s exactly what Google DeepMind did with AI Co-clinician — and the methodology itself is worth understanding before you look at the numbers.

Most AI benchmarks are designed to be easy to run at scale. Multiple choice, closed-ended, automated grading. That’s fine for measuring whether a model knows facts. It’s not fine for measuring whether a model practices medicine well. The 140-dimension consultation assessment is a different kind of instrument, and if you build AI systems for high-stakes domains, the design choices here are instructive.


Why Standard Benchmarks Fail Medical AI

The standard approach is to take a medical licensing exam — USMLE, MedQA, something like that — and see if the model passes. Models have been “passing” those exams for a couple of years now. GPT-4 cleared the USMLE threshold in 2023. The problem is that passing a licensing exam tells you almost nothing about whether the model can actually conduct a consultation.

A licensing exam is closed-world. The question gives you the relevant information. Real medicine is open-world. The patient gives you whatever they feel like telling you, in whatever order, with whatever emotional state they’re in, and your job is to figure out what’s missing and go get it. The exam tests recall. The consultation tests process.

DeepMind’s team understood this distinction. Their evaluation framework was built around what they call errors of commission and errors of omission — adapted from the no harm framework used in academic medicine. Errors of commission: the AI said something wrong or harmful. Errors of omission: the AI failed to ask about something critical, and that silence could hurt the patient. In medicine, what you don’t say can matter as much as what you do.

Out of 98 realistic primary care queries, AI Co-clinician recorded zero critical errors in 97 of them. That’s the omission/commission framework in action — not just “did it answer correctly” but “did it leave anything dangerous on the table.”
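
To make that distinction concrete in an eval harness of your own, the bookkeeping might look something like the following minimal Python sketch. The class, its fields, and the per-transcript review structure are illustrative assumptions, not DeepMind's published rubric:

```python
from dataclasses import dataclass, field

# Illustrative only: the categories and field names below are assumptions,
# not DeepMind's published rubric.

@dataclass
class ConsultationReview:
    transcript_id: str
    harmful_statements: list[str] = field(default_factory=list)         # errors of commission
    missed_critical_questions: list[str] = field(default_factory=list)  # errors of omission

    @property
    def has_critical_error(self) -> bool:
        # A consultation fails if the AI said something harmful OR stayed
        # silent on something the scenario required it to ask about.
        return bool(self.harmful_statements or self.missed_critical_questions)


def summarize(reviews: list[ConsultationReview]) -> str:
    clean = sum(1 for r in reviews if not r.has_critical_error)
    return f"{clean} of {len(reviews)} consultations free of critical errors"
```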


What 140 Dimensions Actually Covers

The number 140 sounds like it was chosen to impress. It wasn’t. When you actually think through what a good medical consultation requires, 140 dimensions is not that many.

Consider what you’re trying to measure. There’s the history-taking layer: does the AI ask about onset, duration, severity, associated symptoms, relevant history, medications, allergies? That’s already a dozen dimensions before you’ve touched anything else. Then there’s the reasoning layer: does it generate an appropriate differential? Does it rank the differential correctly? Does it know which diagnoses to rule out first because they’re dangerous, not just because they’re likely?

Then there’s the communication layer: empathy, bedside manner, how it delivers bad news, whether it checks for understanding. Then red flag detection — does it recognize when something needs emergency escalation versus watchful waiting? Then physical exam guidance — can it adapt standard exam maneuvers to a telehealth context? Then follow-up question quality — does it ask the right second question after the patient answers the first one?

Stack all of that up and 140 starts to feel like a reasonable lower bound.
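
As a rough sketch of how those layers could be organized in code (the grouping and every dimension name below are illustrative; the actual 140-item rubric wasn't published in this kind of breakdown):

```python
# Illustrative grouping only: the real rubric's dimension names were not
# published alongside this evaluation.
CONSULTATION_DIMENSIONS = {
    "history_taking": [
        "onset", "duration", "severity", "associated_symptoms",
        "relevant_history", "medications", "allergies",
    ],
    "clinical_reasoning": [
        "differential_generated", "differential_ranked",
        "dangerous_diagnoses_ruled_out_first",
    ],
    "communication": [
        "empathy", "delivery_of_bad_news", "checks_for_understanding",
    ],
    "safety": [
        "red_flag_detection", "emergency_vs_watchful_waiting",
    ],
    "examination": [
        "telehealth_adapted_maneuvers", "follow_up_question_quality",
    ],
}

total = sum(len(dims) for dims in CONSULTATION_DIMENSIONS.values())
print(f"{total} example dimensions across {len(CONSULTATION_DIMENSIONS)} layers")
```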

The 20 synthetic clinical scenarios were built specifically to stress-test this full range. The scenarios weren’t random — they were constructed to include cases where the obvious diagnosis is wrong, cases where the critical finding is something the patient doesn’t volunteer, and cases where the right answer is “this needs emergency care now” rather than “let’s monitor and follow up.” The Harvard and Stanford physicians who built them knew exactly which failure modes they were probing.


The Role-Playing Physicians Are the Key Design Choice

Here’s the part of the methodology that most coverage skips over: the 10 real physicians who role-played as patients weren’t just reading from a script. They were given specific instructions about what information to volunteer and what to withhold unless explicitly asked.

This matters enormously. If you let the patient volunteer everything, you’re not testing the AI’s ability to elicit information — you’re testing its ability to process information that’s already been handed to it. That’s a much easier problem. The hard problem is knowing what to ask for.
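
A hypothetical version of that patient-actor brief, sketched as a data structure. The field names and the rotator cuff details are illustrative; the actual instructions given to the role-playing physicians weren't published:

```python
from dataclasses import dataclass

# Hypothetical structure for a patient-actor brief. The fields and the
# example contents are illustrative, not the study's actual instructions.

@dataclass
class PatientBrief:
    scenario: str
    volunteer_freely: list[str]        # offered without being asked
    reveal_only_if_asked: list[str]    # withheld unless explicitly elicited
    never_state_directly: list[str]    # must be inferred or examined for

rotator_cuff = PatientBrief(
    scenario="gradual-onset shoulder pain",
    volunteer_freely=["pain when reaching overhead"],
    reveal_only_if_asked=["night pain", "weakness on abduction"],
    never_state_directly=["any suspected diagnosis"],
)
```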


In the rotator cuff scenario, the research scientist playing the patient was specifically told to downplay certain symptoms and not volunteer the full picture. The AI had to work for the diagnosis. And what the physician evaluator observed was instructive: the AI correctly identified that this was likely a rotator cuff issue and correctly recommended conservative treatment — rest, ice, physical therapy, no immediate MRI. But it only ran about three of the eight physical exam tests that would have been appropriate. The evaluator called this “premature closure” — a real clinical concept where a clinician settles on a diagnosis before fully ruling out alternatives.

That’s a meaningful finding. The AI got the disposition right but didn’t fully characterize the injury. In a real clinical setting, that gap matters — not because the patient gets the wrong treatment immediately, but because incomplete characterization can mean missed findings that change the picture later.
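
An automated harness could flag that pattern with a simple coverage check, sketched below. It assumes each scenario lists its expected exam maneuvers, and the 0.5 threshold is an arbitrary illustration rather than a clinical standard:

```python
# Sketch: flag possible premature closure when the agent ran well under the
# full set of exam maneuvers the scenario called for. The threshold is an
# arbitrary illustration, not a clinical standard.

def premature_closure_flag(expected: set[str],
                           performed: set[str],
                           threshold: float = 0.5) -> bool:
    if not expected:
        return False
    coverage = len(expected & performed) / len(expected)
    return coverage < threshold

# Rotator cuff case: 3 of 8 appropriate tests performed, so it gets flagged.
expected = {f"maneuver_{i}" for i in range(8)}
performed = {"maneuver_0", "maneuver_1", "maneuver_2"}
print(premature_closure_flag(expected, performed))  # True (3/8 = 0.375 < 0.5)
```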

The AI Co-clinician matched or exceeded primary care physicians in 68 of those 140 consultation dimensions. Human physicians still won overall, particularly on red flag detection and critical exam guidance. DeepMind’s framing — supportive tool, not replacement — is accurate to what the data shows.


The Myasthenia Gravis Case Reveals What 140 Dimensions Is Really Testing

The most technically interesting case in the evaluation was the myasthenia gravis presentation. A Harvard physician role-played a patient with a drooping right eyelid. The AI spotted the droop via video and immediately asked about fatigability — whether symptoms were worse later in the day. That’s the key differentiating question between myasthenia gravis and Lambert-Eaton myasthenic syndrome, a similar condition that improves rather than worsens with activity.

Then the AI requested a sustained upward gaze test — asking the patient to look at the ceiling and hold that gaze for 30 seconds to observe whether the eyelid drooped further under sustained effort. The Harvard physician evaluator said they had never thought to perform that maneuver via telehealth. It’s a specialized neurological exam adapted for a video context.

This is what the 140-dimension framework is designed to surface. “Did the AI ask the right follow-up question” is one dimension. “Did the AI adapt a physical exam maneuver appropriately for the telehealth context” is another. “Did the AI correctly identify the key differentiating feature between two similar diagnoses” is a third. None of those show up in a multiple-choice benchmark. All of them showed up here.

The thought log — the AI’s internal reasoning trace — showed it explicitly noting that the fatigable nature of the ptosis and diplopia “strongly suggests conditions affecting neuromuscular junctions, prominently myasthenia gravis.” That’s not pattern matching on surface features. That’s mechanistic reasoning about pathophysiology.


The RXQA Benchmark: Where Most Medical AI Falls Apart

The 140-dimension consultation assessment is the centerpiece, but the evaluation also included the RXQA benchmark — built on open FDA data — which tests open-ended medication questions: drug interactions, dosing edge cases, the kind of ambiguous questions that real physicians actually ask.

Most medical AI benchmarks use multiple choice because it’s easy to score. RXQA uses open-ended questions because that’s how medicine actually works. A patient says “my mom has been on this medication for 10 years and now her ankles are swelling, is that a problem?” That’s not a multiple-choice question. That’s a clinical reasoning problem with incomplete information and no answer key.
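
For a sense of what "built on open FDA data" means in practice, here's a minimal sketch against the public openFDA drug-label endpoint. The label sections queried (indications_and_usage, drug_interactions) are real parts of that API, but sections vary by product, so treat this as illustrative rather than RXQA's actual data pipeline:

```python
import json
import urllib.request

# Minimal sketch against the public openFDA drug-label endpoint. Label
# sections are optional and vary by product, so production code needs far
# more defensive handling; this is not RXQA's actual pipeline.
URL = (
    "https://api.fda.gov/drug/label.json"
    "?search=openfda.generic_name:%22lisinopril%22&limit=1"
)

with urllib.request.urlopen(URL) as resp:
    label = json.load(resp)["results"][0]

for section in ("indications_and_usage", "drug_interactions"):
    text = label.get(section, ["<section not present on this label>"])[0]
    print(f"{section}: {text[:200]}")
```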


AI Co-clinician surpassed every other frontier AI system on RXQA, including GPT-5.4 thinking with search — which it also beat 63% to 30% in blind physician preference evaluations. For context on where these frontier models sit relative to each other, the GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro benchmark comparison gives you a sense of the competitive field AI Co-clinician was operating against.

The drug data problem is particularly hard because LLMs trained on the open web absorb a lot of noise about medications. FDA data is structured and authoritative, but it’s also dense and requires clinical context to interpret correctly. Winning on RXQA with open-ended questions means the model isn’t just retrieving drug facts — it’s reasoning about them in clinical context.


What This Methodology Teaches AI Builders

If you build AI systems for any high-stakes domain — legal, financial, medical, safety-critical — the DeepMind evaluation design is worth studying carefully.

The key moves are: (1) use domain experts to build the test cases, not generalists; (2) use real practitioners as evaluators, not crowdworkers; (3) test for omission failures, not just commission failures; (4) use open-ended evaluation, not multiple choice; and (5) build scenarios that specifically probe known failure modes rather than sampling randomly from the problem space.
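
Here's one way to encode those five moves as a scenario record, sketched in Python. Every field name and the example contents are assumptions for illustration, not fields from DeepMind's framework:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative scenario record encoding the five design moves above.
# All field names and example values are assumptions, not DeepMind's schema.

@dataclass
class EvalScenario:
    authored_by: str                 # (1) domain expert, not a generalist
    evaluated_by: str                # (2) practicing clinician, not a crowdworker
    required_questions: list[str]    # (3) omission failures checked explicitly
    answer_key: Optional[str]        # (4) None means open-ended, rubric-graded
    probed_failure_mode: str         # (5) scenario targets a known weakness

chest_pain = EvalScenario(
    authored_by="board-certified internist",
    evaluated_by="practicing primary care physician",
    required_questions=["radiation to arm or jaw", "exertional pattern", "cardiac risk factors"],
    answer_key=None,
    probed_failure_mode="anchoring on the obvious but wrong diagnosis",
)
```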

This is harder and more expensive than running a model through a standard benchmark. It’s also the only way to get signal that actually predicts real-world performance. The gap between “passes USMLE” and “can conduct a safe primary care consultation” is enormous, and the only way to measure that gap is to build an evaluation that looks like the real task.

For teams building AI agents in complex domains, this is the evaluation design to copy. The 140-dimension framework isn’t specific to medicine — the underlying structure (history-taking, reasoning, communication, red flag detection, appropriate escalation) maps onto almost any expert consultation task. If you’re building a legal AI, your dimensions look different but the methodology is the same.

MindStudio handles the orchestration layer for this kind of multi-step agent work — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows — which matters when you’re trying to prototype evaluation pipelines without writing all the scaffolding from scratch.


The Premature Closure Problem Is the Honest Finding

The most honest result in this entire evaluation is the rotator cuff case. The AI got the disposition right. Conservative treatment was correct. No immediate MRI was correct. But it stopped the physical exam too early, ran three of eight appropriate tests, and missed impingement testing specifically.

Premature closure is one of the most common diagnostic errors in human medicine too. The physician evaluator noted that even experienced clinicians fall into it. The difference is that human physicians have years of training specifically designed to counteract it — checklists, differential diagnosis frameworks, attending supervision. The AI doesn’t have an equivalent corrective mechanism yet.

This is where the “supportive tool” framing is genuinely accurate rather than just legally cautious. An AI that stays free of critical errors in 97 of 98 consultations but still misses some exam steps is useful alongside a physician who can catch the gaps. It’s not yet safe as a standalone primary care provider.


The 68 out of 140 dimensions number is real progress. The remaining 72 dimensions where human physicians still lead are also real. Both things are true simultaneously, and the evaluation methodology is rigorous enough that you can actually trust both numbers.


What to Watch For Next

The evaluation framework DeepMind built here is arguably more valuable than the model itself. The 140-dimension rubric, the synthetic scenario construction methodology, the role-playing physician design — these are reusable instruments. If other labs adopt similar evaluation frameworks, you get comparability across systems. If everyone keeps running their own proprietary benchmarks, you get marketing.

The RXQA benchmark is already built on open FDA data, which means it’s reproducible. If it becomes a standard, that’s a meaningful step toward honest comparison across medical AI systems.

For builders working on multimodal AI — systems that process video, audio, and text simultaneously — the real-time video processing capability demonstrated here is worth tracking. The AI Co-clinician observes breathing, gait, and facial features through a camera and adapts its examination in real time. That’s a different class of capability than text-in, text-out medical AI. Google’s work on efficient multimodal models is relevant context here; what Google Gemma 4 actually is and how its Apache 2.0 open-weight model handles native audio and vision gives you a sense of where Google’s open-weight multimodal work sits relative to their frontier systems. And if you’re thinking about how frontier model architectures handle the compute efficiency required for real-time clinical inference, the Gemma 4 mixture of experts architecture breakdown is worth reading — running 26B parameters at 4B cost is the kind of efficiency gain that makes always-on clinical AI plausible.

The evaluation methodology is the thing to anchor on. When someone tells you their medical AI is good, ask them: good on what evaluation? Who built the test cases? Who did the evaluation? What failure modes did you specifically probe? If they can’t answer those questions with the specificity that DeepMind published here, the number they’re citing doesn’t mean much.

For teams building domain-specific AI agents — whether medical, legal, financial, or otherwise — the spec-driven approach to defining evaluation criteria is worth thinking about carefully. Remy, MindStudio’s spec-driven full-stack app compiler, takes a similar philosophy to software: you write annotated markdown that carries intent and precision, and the full-stack application is compiled from it. The parallel to evaluation design is real — the more precisely you specify what “good” looks like before you build, the more honest your results are when you measure it.

The gap between 68 and 140 is where the next few years of medical AI research will live. The evaluation framework to measure that gap now exists. That’s the actual news here.

Presented by MindStudio
