Google DeepMind's AI Co-clinician Tops the RXQA Drug Knowledge Benchmark — Beating Every Frontier Model
On RXQA — open FDA drug data, open-ended questions — Google's AI Co-clinician surpassed every other frontier AI system, including GPT-5.4 and Claude.
The Drug Knowledge Problem That Has Quietly Broken Medical AI
Google DeepMind published results this year showing their AI Co-clinician surpassed every other frontier AI system on the RXQA benchmark — a test built from open FDA data that asks open-ended drug questions the way doctors actually ask them. Not multiple choice. Not “which of the following is a contraindication.” Real questions, with messy patient context, the kind that begin with “my mom has been on this for ten years and now her ankles are swelling.”
That result matters more than it might appear. Drug knowledge is where medical AI has historically collapsed.
You can train a model to ace USMLE-style questions. Those are pattern-matching exercises dressed up as clinical reasoning. RXQA is different. It’s built from open FDA data — the actual pharmacological record — and the questions are open-ended, which means the model has to generate an answer, not select one. That’s a fundamentally harder task, and it’s the task that corresponds to real medicine.
The reason most LLMs fail here is structural. The open web is full of drug information, but it’s also full of wrong drug information, outdated drug information, and drug information written for patients rather than clinicians. A model trained on that corpus will confidently tell you something that was true in 2018 and dangerous in 2026.
Why Drug Knowledge Is the Hardest Part of Clinical AI
The benchmark result is the headline, but the underlying problem is worth understanding precisely.
Medical AI has two distinct failure modes. Errors of commission: the model says something false. Errors of omission: the model leaves out something critical. In most domains, errors of omission are annoying. In pharmacology, they can kill someone. The drug interaction you didn’t mention, the contraindication you glossed over, the dosing edge case you skipped because the question seemed routine — these are the failure modes that matter.
DeepMind’s evaluation framework explicitly tested for both. Across 98 realistic primary care queries, the AI Co-clinician recorded zero critical errors in 97 of 98 cases. That’s not a number any previous medical AI system has hit at this scale. The 98th case is worth noting too — one error in 98 queries is not a catastrophe, but it’s a reminder that “zero critical errors” is a ceiling, not a floor.
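For scale, here is a back-of-envelope uncertainty check on that result. This is my sketch, not DeepMind's analysis, and it assumes the 98 queries are independent and representative of real clinical traffic, which they may not be:

```python
# Exact (Clopper-Pearson) 95% interval on the critical-error rate,
# given 1 error observed in 98 queries.
from scipy.stats import beta

k, n = 1, 98   # critical-error cases observed, total queries
alpha = 0.05

lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
upper = beta.ppf(1 - alpha / 2, k + 1, n - k)

print(f"observed rate: {k / n:.2%}")          # ~1.02%
print(f"95% CI: [{lower:.2%}, {upper:.2%}]")  # roughly [0.03%, 5.6%]
```

An observed rate near 1% with an upper bound above 5% is the quantitative reading of "ceiling, not a floor": the sample is too small to promise much about the next thousand queries.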
The RXQA benchmark adds a second layer of difficulty. Most medical benchmarks are multiple choice, which means a model can score well by eliminating implausible options. Open-ended questions require the model to construct an answer from scratch, which surfaces gaps that multiple-choice formats hide. When you ask a model “what are the drug interactions I should know about before prescribing this to a 70-year-old with renal impairment,” there’s no answer key to reverse-engineer. The model either knows or it doesn’t.
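A toy contrast makes that grading gap concrete. None of this is RXQA's actual rubric; the answer key, fact checklist, and substring matching are all invented for illustration:

```python
# Multiple choice: grading collapses to a string comparison, and a model
# can score points by elimination alone.
ANSWER_KEY = "b"

def grade_mcq(model_choice: str) -> bool:
    return model_choice.strip().lower() == ANSWER_KEY

# Open-ended: the grader needs a checklist of facts drawn from the source
# record, and must judge whether free text covers each one. A real grader
# needs semantic matching; substring matching here is a stand-in.
REQUIRED_FACTS = [
    "dose adjustment for renal impairment",
    "increased bleeding risk",
    "avoid strong cyp3a4 inhibitors",
]

def grade_open_ended(answer: str) -> float:
    text = answer.lower()
    hits = sum(fact in text for fact in REQUIRED_FACTS)
    return hits / len(REQUIRED_FACTS)  # partial credit, no options to eliminate
```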
The AI Co-clinician didn’t just edge ahead on RXQA. It surpassed all other frontier systems, including GPT-5.4 thinking with search — the model that most people currently pay $20 a month for. That gap is meaningful because GPT-5.4 with search has access to real-time information retrieval. The Co-clinician beat it anyway, specifically on the open-ended questions that most closely resemble actual clinical practice. For a deeper look at how GPT-5.4 stacks up against other frontier models across different task types, this comparison of GPT-5.4 and Claude Opus 4.6 is worth reading alongside these results.
What the Benchmark Is Actually Measuring
RXQA is built from open FDA data, which is a deliberate choice. FDA drug labeling is the authoritative record — it’s what manufacturers are legally required to disclose, and it’s updated when new safety information emerges. A model that can reason accurately over FDA data is a model that’s working from the right source, not from a forum post or a ten-year-old review article.
The open-ended format is the other key design decision. Clinical pharmacology questions don’t arrive as multiple-choice problems. They arrive as: “this patient is on warfarin and just started fluconazole, should I be worried?” or “the patient says she’s been taking this OTC supplement for her joints, does that interact with anything?” The question is embedded in patient context, the relevant variables aren’t labeled, and the answer requires synthesis rather than recall.
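To make "open FDA data" concrete, here is a minimal sketch against the public openFDA labeling endpoint (https://api.fda.gov/drug/label.json). The field names follow openFDA's labeling schema; whether a given warfarin label's interactions text names fluconazole explicitly depends on the label version, so treat the output as illustrative:

```python
# Fetch the current FDA label for warfarin and scan its drug-interactions
# section, the kind of authoritative free text RXQA is built from.
import requests

resp = requests.get(
    "https://api.fda.gov/drug/label.json",
    params={"search": 'openfda.generic_name:"warfarin"', "limit": 1},
    timeout=10,
)
resp.raise_for_status()
label = resp.json()["results"][0]

# Label sections arrive as lists of free-text strings, not option lists,
# which is exactly why open-ended questions are the right test format.
interactions = " ".join(label.get("drug_interactions", [])).lower()
print("mentions fluconazole:", "fluconazole" in interactions)
```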
This is where the Co-clinician’s architecture matters. It’s a real-time, low-latency video model — it watches the patient through a camera, guides physical exams, and processes information across modalities simultaneously. That’s not a chatbot with a medical knowledge base bolted on. The drug knowledge is integrated into a system that’s already reasoning about the patient in front of it.
The distinction shows up in the benchmark results. When questions were posed the way doctors actually ask them — open-ended, ambiguous, with patient context — the Co-clinician’s advantage over other frontier models was most pronounced. On cleaner, more structured questions, the gap narrowed. That’s exactly what you’d expect from a system designed around clinical context rather than benchmark optimization.
It’s also worth noting that the frontier model landscape has been shifting rapidly. Models like Claude Mythos are pushing hard on reasoning benchmarks, and understanding where the Co-clinician sits relative to that moving target requires paying attention to what each benchmark is actually measuring — not just who topped the leaderboard this month.
The Broader Evaluation Picture
The RXQA result doesn’t exist in isolation. DeepMind ran a parallel evaluation that’s worth understanding in full, because it contextualizes what the drug knowledge benchmark actually means in practice.
They built 20 synthetic clinical scenarios with physicians from Harvard and Stanford. Then they recruited 10 real physicians to role-play as patients. The AI Co-clinician ran the consultations. Afterward, the evaluators assessed the consultations across 140 dimensions — not just diagnostic accuracy, but empathy, red flag detection, follow-up question quality, physical exam guidance, and bedside manner.
The Co-clinician matched or exceeded primary care physicians in 68 of those 140 dimensions. Human physicians still won overall, particularly on red flags and complex exam guidance. DeepMind’s stated conclusion is that this is a supportive tool, not a replacement. That framing is accurate and responsible. But 68 out of 140 is not a rounding error. It’s a system that has closed a meaningful portion of the gap.
The blind physician preference test adds another data point. Doctors compared the AI Co-clinician against existing clinical tools they already use — tools they chose, tools they’re familiar with. The Co-clinician won 67% to 26%, with 5% neutral. Then they ran the same test against GPT-5.4 thinking with search. The Co-clinician won 63% to 30%.
These aren’t benchmark numbers. These are physicians, in a blind test, saying they preferred one system over another. That’s a different kind of evidence.
Three Demos That Show Where the Drug Knowledge Connects to Practice
The benchmark results are abstract until you see what they correspond to in actual consultations.
In the acute pancreatitis demo, a physician role-played a patient with severe abdominal pain. The AI correctly identified epigastric tenderness through a guided physical exam — asking the patient to palpate above the belly button, not at it, which is the anatomically correct location for pancreatitis. It tested for rebound tenderness. It correctly identified the differential as appendicitis or pancreatitis and recommended emergency care. The physician evaluator noted that the next steps — emergency care, vitals, rehydration, labs, CT scan — were exactly right, even if the probability ranking between the two diagnoses could have been sharper.
In the myasthenia gravis case, a Harvard physician role-played a patient with a drooping eyelid. The AI spotted the ptosis visually through the camera, asked about fatigability (specifically whether symptoms were worse as the day progressed, which distinguishes myasthenia gravis from Lambert-Eaton syndrome), and then asked the patient to sustain upward gaze for 30 seconds. The Harvard physician evaluator said he had never thought to do that exam maneuver via telehealth. The AI’s thought log correctly identified “fatigable ptosis and diplopia” as strongly suggesting a neuromuscular junction condition, with myasthenia gravis as the primary concern.
In the rotator cuff case, the AI guided a range-of-motion exam via video, noticed the patient hesitated during the movement, and correctly assessed that conservative treatment — rest, ice, ibuprofen, physical therapy — was appropriate without immediate imaging. The physician evaluator noted the AI didn’t run all eight relevant tests, but that the acuity assessment and treatment recommendation were correct.
What connects these three cases to the RXQA result is the same underlying capability: the system is reasoning about the patient in context, not pattern-matching to a template. Drug knowledge works the same way. The question isn’t “what are the contraindications for drug X” in the abstract. It’s “what are the contraindications for drug X given this patient’s age, comorbidities, and current medication list.” That’s the question RXQA is designed to test, and it’s the question the Co-clinician is winning on.
What This Means for Anyone Building on Top of Medical AI
The RXQA result has a specific implication for builders, not just researchers.
Drug knowledge has been the weakest link in medical AI deployments. Teams building clinical decision support tools, medication review systems, or patient-facing health applications have had to build elaborate guardrails around pharmacology because the underlying models weren’t reliable enough to trust. The standard approach has been to restrict the model’s scope — don’t let it answer drug questions, route those to a database lookup, flag everything for human review.
That approach works, but it limits what you can build. A system that can’t reason about drug interactions can’t do a real medication review. It can surface a list of interactions from a database, but it can’t synthesize that list in the context of a specific patient’s situation and tell you which ones actually matter.
If the Co-clinician’s RXQA performance holds up under broader deployment — and that’s a real if, because benchmark performance and real-world performance diverge in medicine more than almost any other domain — it suggests the guardrails can be loosened. Not removed. Loosened. The difference between “don’t answer drug questions” and “answer drug questions with appropriate uncertainty and flag edge cases for review” is the difference between a limited tool and a useful one.
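A minimal sketch of that loosened policy, assuming invented thresholds, field names, and routing labels, might look like this:

```python
# Route drug answers instead of blocking drug questions: answer with
# surfaced uncertainty, and escalate edge cases rather than refusing.
from dataclasses import dataclass, field

@dataclass
class DrugAnswer:
    text: str
    confidence: float  # model-reported, 0..1
    edge_case_flags: list[str] = field(default_factory=list)
    # e.g. ["renal dosing", "narrow therapeutic index"]

REVIEW_THRESHOLD = 0.85  # invented; would be calibrated against outcomes

def route(answer: DrugAnswer) -> str:
    if answer.edge_case_flags:
        return "human_review"        # flag for review, don't block
    if answer.confidence < REVIEW_THRESHOLD:
        return "human_review"
    return "respond_with_caveats"    # answer, uncertainty surfaced to the user
```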
For teams building multi-model workflows, this is where model selection starts to matter in a concrete way. MindStudio is an enterprise AI platform with 200+ models, 1,000+ integrations, and a visual builder for orchestrating agents and workflows, which means you can route drug-related queries to the model with the strongest pharmacology performance while routing other tasks to models optimized for different capabilities, without writing the orchestration code yourself. The same principle applies when you need to swap in a newer model as the benchmark landscape shifts.
The broader question for builders is how to structure the knowledge that feeds these systems. The RXQA benchmark is built from FDA data, which is authoritative and structured. That’s not an accident. The Co-clinician’s advantage on open-ended drug questions likely reflects, at least in part, training on better-structured pharmacological data rather than scraped web content.

If you’re building a clinical application, the quality of your knowledge base matters as much as the model you’re running on top of it. This is the same logic that drives spec-driven development tools — when Remy compiles an application from an annotated markdown spec into a complete TypeScript app with backend, database, auth, and deployment, the precision of the spec determines the quality of the output. The source of truth is the constraint. In medical AI, the FDA label is that source of truth, and systems that treat it as such perform differently than systems that treat the open web as equivalent.
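As an illustration of treating the label as the source of truth, a hedged sketch: normalize openFDA label sections into versioned records before any retrieval layer touches them. The record type is invented; the field names follow openFDA's schema:

```python
# Flatten an openFDA label dict into typed, versioned knowledge records,
# so retrieval serves the authoritative section rather than scraped text.
from dataclasses import dataclass

@dataclass(frozen=True)
class LabelSection:
    generic_name: str
    section: str   # e.g. "contraindications", "drug_interactions"
    text: str
    version: str   # labels change; version every record

def to_records(label: dict) -> list[LabelSection]:
    name = label.get("openfda", {}).get("generic_name", ["unknown"])[0]
    version = str(label.get("version", "unknown"))
    sections = ("boxed_warning", "contraindications", "drug_interactions")
    return [
        LabelSection(name, s, t, version)
        for s in sections
        for t in label.get(s, [])
    ]
```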
The open-weight model ecosystem is also relevant here. Teams that want to run pharmacology-focused models on their own infrastructure — for compliance reasons, latency reasons, or cost reasons — are increasingly able to do so. Comparing open-weight options like Gemma 4 31B and Qwen 3.5 for agentic workflows is a reasonable next step for anyone building clinical tooling who can’t route everything through a proprietary API.
The One Failure Mode That Benchmarks Don’t Capture
There’s a result in the evaluation that deserves more attention than it’s gotten.
In the myasthenia gravis demo, the AI’s thought log showed it had noticed the patient’s drooping eyelid. But it didn’t say so until after the patient mentioned it. The Harvard physician evaluator raised the obvious question: did the AI actually see it proactively, or did it only process the visual information after the verbal cue?
This matters for drug knowledge too, in a different way. A model can score well on RXQA by correctly answering the questions it’s asked. What RXQA doesn’t test is whether the model proactively surfaces the drug information the clinician didn’t think to ask about. The interaction that wasn’t mentioned. The dosing adjustment the patient’s renal function requires, even though nobody asked. The contraindication that’s relevant because of a comorbidity buried in the history.
Errors of omission are harder to benchmark than errors of commission. You can test whether a model says something false. It’s harder to test whether a model failed to say something true that it should have said.
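In evaluation terms, commission maps to precision over the claims a model made, and omission maps to recall over the claims it should have made. A schematic pair of scoring functions, not DeepMind's actual framework, makes the asymmetry visible: the commission check needs only a fact base, while the omission check needs a must-mention list someone had to enumerate in advance:

```python
# Commission: of the claims the model made, how many are true?
def commission_score(claims: set[str], true_facts: set[str]) -> float:
    return len(claims & true_facts) / len(claims) if claims else 1.0

# Omission: of the facts the model was obliged to surface, how many
# did it actually mention? The hard part is building `required` at all.
def omission_score(claims: set[str], required: set[str]) -> float:
    return len(claims & required) / len(required) if required else 1.0
```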
DeepMind’s “no harm” framework explicitly tested for both, which is why the 97/98 zero critical error result is meaningful — it wasn’t just checking for false claims. But the myasthenia gravis observation is a reminder that proactive reasoning is a different capability than reactive reasoning, and the gap between them is where the most dangerous clinical failures live.
The Co-clinician’s RXQA result is real. The 67% physician preference in blind testing is real. The 68 out of 140 dimensions matching or exceeding primary care physicians is real. What’s also real is that the system is relatively new, the evaluations were conducted under controlled conditions, and the question of how it performs in the full distribution of clinical encounters — including the ones nobody thought to include in the benchmark — is still open.
That’s not a reason to dismiss the results. It’s a reason to read them carefully, which is exactly what the DeepMind team seems to want. Their conclusion — supportive tool, not a replacement — is the right conclusion. The interesting question is how long that qualifier holds.
The gap between “passes the benchmark” and “ready for clinical deployment” has always been wide in medicine. What’s different now is that the benchmark itself has gotten harder. RXQA is not a multiple-choice test. The 140-dimension consultation assessment is not a vibes check. The blind physician preference study is not a marketing exercise.
When the hardest tests start producing results like these, the gap is narrowing. Whether it narrows fast enough to matter — and for whom — is the question worth watching.