
How Google's AI Co-Clinician Uses Live Video to Guide Physical Exams — And What It Means for Telehealth Builders

AI Co-clinician processes real-time video to observe gait, breathing, and facial features — then guides physical exams through the camera. Here's how it works.

MindStudio Team

What Google’s AI Co-Clinician Actually Does During a Video Call

Most telehealth tools are just video calls with a doctor on the other end. You describe your symptoms, they listen, they prescribe. The camera is basically a telephone with a face.

Google DeepMind’s AI Co-clinician does something different. It uses real-time multimodal video processing to observe breathing patterns, gait, and facial features, then guides you through a physical exam via the camera, in real time, adjusting based on what it sees. Within a single consultation, it can notice an eyelid drooping before you mention it, track your arm’s range of motion as you lift it, and ask you to press on specific parts of your abdomen while watching your reaction.

That’s the thing worth understanding here. This isn’t a chatbot that happens to have a camera. It’s a system that treats the video feed as clinical data.

If you’re building anything in the telehealth, health-tech, or AI agent space, the architecture choices DeepMind made here are worth understanding in detail — because they reveal what’s actually hard about video-based AI consultation, and where the current limits are.


What the system is actually doing during a call

The core capability is low-latency video processing that runs continuously during the consultation. The AI isn’t just listening to what you say — it’s watching the video feed and forming observations that it feeds back into its reasoning.

In the acute pancreatitis demo, the AI noticed the patient appeared uncomfortable before the patient described their pain level. It logged that observation and used it to calibrate urgency. That’s not a language model trick — that’s the model reading the video frame and integrating it with the conversation history.

The physical exam guidance works in a specific loop:

  1. The AI forms a differential diagnosis from the conversation
  2. It identifies which physical exam maneuvers would reduce its diagnostic uncertainty
  3. It instructs the patient to perform those maneuvers
  4. It watches the video to observe the result
  5. It updates its differential and decides what to test next
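
To make the loop concrete, here's a minimal sketch in TypeScript. Every name in it (formDifferential, pickNextManeuver, watchVideoFor, and so on) is hypothetical; DeepMind hasn't published the Co-clinician's internals, so this illustrates the control flow, not the real implementation.

```typescript
// Hypothetical sketch of the five-step loop above. None of these names
// come from DeepMind's system; they illustrate the control flow only.

interface Hypothesis {
  condition: string;
  probability: number; // current belief, 0..1
}

interface Maneuver {
  name: string;        // e.g. "forward arm raise"
  instruction: string; // what the patient is asked to do
}

interface Observation {
  maneuver: Maneuver;
  finding: string;     // e.g. "hesitation at ~90 degrees"
  completed: boolean;  // did the video show the maneuver performed?
}

// Assumed model-backed steps, declared rather than implemented.
declare function formDifferential(history: string[]): Hypothesis[];
declare function pickNextManeuver(dx: Hypothesis[], done: Observation[]): Maneuver | null;
declare function instructPatient(m: Maneuver): Promise<void>;
declare function watchVideoFor(m: Maneuver): Promise<Observation>;
declare function updateDifferential(dx: Hypothesis[], obs: Observation): Hypothesis[];

async function runGuidedExam(history: string[]): Promise<Hypothesis[]> {
  let differential = formDifferential(history);        // step 1
  const done: Observation[] = [];

  for (;;) {
    const next = pickNextManeuver(differential, done); // step 2
    if (next === null) break;                          // no informative tests left

    await instructPatient(next);                       // step 3
    let obs = await watchVideoFor(next);               // step 4

    // Real-time correction: if the video shows the maneuver wasn't
    // actually performed, re-instruct instead of accepting a partial result.
    if (!obs.completed) {
      await instructPatient(next);
      obs = await watchVideoFor(next);
    }

    done.push(obs);
    differential = updateDifferential(differential, obs); // step 5
  }
  return differential;
}
```

The `!obs.completed` branch is the part worth noticing: the correction in the rotator cuff case falls out of re-checking the video, not of asking the patient whether they complied.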

This is meaningfully different from a static checklist. In the rotator cuff case, the AI asked the patient to raise his arm forward and to the side simultaneously. The patient only did one. The AI noticed — from the video — that only one movement was performed, and asked for the second. That kind of real-time correction requires the model to be tracking what’s happening in the frame, not just waiting for the patient to report back verbally.


The three cases that show the range (and the limits)

DeepMind ran three demo cases that each stress-test a different part of the video exam capability. They’re worth walking through specifically because the physician commentary reveals what’s actually impressive versus what’s still rough.

Case 1: Acute pancreatitis

The patient complained of severe abdominal pain. The AI asked them to lie down for the exam — the patient said they couldn’t. The AI adapted and asked them to press just above the belly button while seated.

Here’s the specific clinical reasoning that impressed the physician evaluators: the AI started the palpation exam at the belly button (not painful) before moving to the epigastric region (where it hurts in pancreatitis). That’s standard best practice — establish a baseline before moving to the painful area — and the AI did it without being prompted.

It also tested for rebound tenderness, where pain intensifies when you release pressure rather than when you apply it. One physician evaluator noted this was a questionable call for telehealth, since rebound is normally assessed with the patient lying flat, but acknowledged it was clinically appropriate reasoning.

The AI correctly identified the differential as appendicitis or pancreatitis and recommended emergency care, IV fluids, labs, and CT imaging. The physician evaluator said the next steps were completely right, even if they’d rank the two diagnoses differently in terms of probability.

Case 2: Myasthenia gravis

This is the most technically interesting case for video processing. The patient (role-played by a Harvard physician) had a visible right eyelid droop. The AI spotted it from the video feed and confirmed it in its internal reasoning log — but notably, it didn’t mention the droop until the patient brought it up first.

That’s a subtle but important observation from the evaluators: the AI may have seen the droop, but it waited for the patient to volunteer the symptom. Whether that’s appropriate clinical caution or a gap in proactive observation is an open question.

What happened next was genuinely unusual. The AI asked the patient to look up at the ceiling and hold that gaze for 30 seconds. This is a sustained upward gaze test — it induces fatigue in the levator muscle, which is a key diagnostic sign for myasthenia gravis. The Harvard physician evaluator said they had never thought to perform this test via telehealth. It’s a maneuver neurologists use in-person, and the AI adapted it for a video setting.

The AI also asked the right differentiating question: does the double vision get worse as the day goes on? That distinguishes myasthenia gravis (worsens with fatigue) from Lambert-Eaton myasthenic syndrome (improves with activity). The physician evaluator called it a spot-on assessment.

Case 3: Rotator cuff

This case shows the clearest current limitation. The AI tracked the patient’s range of motion from the video, noted that he hesitated during the movement (not just that it was painful, but that he hesitated), and correctly concluded that conservative treatment — rest, ice, physical therapy, over-the-counter pain medication — was appropriate without immediate MRI.

The physician evaluator agreed with the treatment plan. But they flagged something specific: the AI demonstrated what clinicians call “premature closure.” It chose roughly three of the eight physical exam tests that would have been appropriate, and it skipped impingement testing entirely. Without impingement testing, you can’t confidently distinguish rotator cuff tendinitis from adhesive capsulitis, which has a different treatment path.

The AI got the acuity right (not an emergency, no surgery needed) and the treatment direction right. But it stopped the exam too early.


What “real-time video processing” actually requires architecturally

For builders, the interesting question is: what does it take to build something like this?

The AI Co-clinician is running on a multimodal model that processes video frames continuously alongside audio and text. The key technical requirements are:

Low latency. The model needs to respond to what it sees in the video within the conversational flow. If there’s a 10-second delay between the patient performing a movement and the AI acknowledging it, the exam breaks down. DeepMind explicitly describes this as a “real-time, low-latency video model.”
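
One plausible way to meet that constraint (a sketch under assumed APIs, not DeepMind's design) is to sample frames at a fixed rate and drop frames while the model is busy, so observations track the current moment instead of a growing backlog:

```typescript
// Latency-bounded frame handling: sample at a fixed rate, drop frames
// while the model is busy. captureFrame and analyzeFrame are assumed
// stand-ins for a camera API and a multimodal model call.

declare function captureFrame(): Promise<Uint8Array>;
declare function analyzeFrame(frame: Uint8Array): Promise<string>;

function streamObservations(onObservation: (finding: string) => void): void {
  let busy = false;
  setInterval(async () => {
    if (busy) return; // drop the frame: a stale observation is worse than none
    busy = true;
    try {
      const frame = await captureFrame();
      onObservation(await analyzeFrame(frame));
    } finally {
      busy = false;
    }
  }, 200); // ~5 fps sampling; tune against the model's real latency
}
```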

Persistent reasoning state. The AI maintains a running differential diagnosis and updates it with each new observation. The internal “thought log” that evaluators could review showed the model explicitly updating its reasoning — “the fatigable nature of ptosis and diplopia strongly suggests myasthenia gravis” — as new information came in.
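
A minimal sketch of what that state might look like, with hypothetical names: a running differential plus an explicit, append-only thought log like the one evaluators reviewed:

```typescript
// Persistent reasoning state, hypothetical names: a running differential
// plus an explicit thought log like the one evaluators could review.

interface DifferentialEntry {
  condition: string;
  probability: number;
}

class ReasoningState {
  differential: DifferentialEntry[] = [];
  thoughtLog: string[] = [];

  observe(source: "video" | "audio" | "text", finding: string): void {
    this.thoughtLog.push(`[${source}] ${finding}`);
  }

  revise(condition: string, probability: number, why: string): void {
    const entry = this.differential.find(d => d.condition === condition);
    if (entry) entry.probability = probability;
    else this.differential.push({ condition, probability });
    this.thoughtLog.push(`revised ${condition} to p=${probability}: ${why}`);
  }
}

// e.g. state.observe("video", "right ptosis worsens on sustained upward gaze");
//      state.revise("myasthenia gravis", 0.7, "fatigable ptosis and diplopia");
```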

Adaptive exam sequencing. The AI doesn’t follow a fixed script. It decides what to test next based on what it’s already seen. In the pancreatitis case, it adjusted the palpation location based on the patient’s response. In the rotator cuff case, it tracked which movements had been completed and which hadn’t.
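
The `pickNextManeuver` step in the earlier loop sketch is where this lives. One toy heuristic (our assumption, not a published detail) is to score each untested maneuver by how differently it treats the hypotheses currently in play:

```typescript
// Toy scoring for "which test reduces uncertainty most": rank untested
// maneuvers by how differently they treat the live hypotheses. The
// Test shape and the spread heuristic are assumptions for illustration.

interface Test {
  name: string;
  discriminates: Record<string, number>; // how strongly a positive finding points at each condition
  performed: boolean;
}

function nextTest(differential: Record<string, number>, tests: Test[]): Test | null {
  let best: Test | null = null;
  let bestScore = 0;
  for (const t of tests) {
    if (t.performed) continue;
    const weights = Object.entries(differential).map(
      ([condition, p]) => p * (t.discriminates[condition] ?? 0),
    );
    if (weights.length === 0) continue;
    // A test that loads heavily on one hypothesis and not another separates them.
    const score = Math.max(...weights) - Math.min(...weights);
    if (score > bestScore) {
      bestScore = score;
      best = t;
    }
  }
  return best; // null means nothing informative remains: the premature-closure risk
}
```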

Multimodal integration. The model is simultaneously processing video (what it sees), audio (what it hears), and text (the conversation history). The clinical observations it makes — “the patient hesitated,” “the eyelid is drooping,” “the patient appears uncomfortable” — come from the video, not from what the patient says.
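
In implementation terms, one simple way to model this (hypothetical types, not DeepMind's) is a single time-ordered event stream where video observations sit alongside the transcript, so the model's context already interleaves what it saw with what was said:

```typescript
// One time-ordered event stream across modalities (assumed types).
// Sorting by timestamp lets a video observation precede the verbal
// report it anticipates, as in the pancreatitis demo.

type Modality = "video" | "audio" | "text";

interface ClinicalEvent {
  atMs: number;      // ms since consultation start
  modality: Modality;
  content: string;   // "patient appears uncomfortable", "pain is 8/10", ...
}

class ConsultationContext {
  private events: ClinicalEvent[] = [];

  add(event: ClinicalEvent): void {
    this.events.push(event);
  }

  // Render one chronological transcript for the model's next reasoning step.
  asPrompt(): string {
    return [...this.events]
      .sort((a, b) => a.atMs - b.atMs)
      .map(e => `${e.atMs}ms [${e.modality}] ${e.content}`)
      .join("\n");
  }
}
```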

This is a different architecture than most current AI agents, which are primarily text-in, text-out with occasional image attachments. If you’re thinking about building something in this space, the video processing capability is the hard part — and it’s what separates this from a well-prompted chatbot. Platforms like MindStudio handle the orchestration layer for multi-model AI workflows — 200+ models, 1,000+ integrations, and a visual builder for chaining agents — which is useful context when you’re thinking about how to compose these capabilities without building the infrastructure from scratch.

The real-time video interaction space is moving fast beyond clinical settings too. Tools like Pika Me are already enabling video-based AI interactions at the consumer layer — the real-time video chat AI agent model is becoming a standard interaction pattern, even where the clinical reasoning layer isn’t present. Understanding what DeepMind built helps you understand what the next generation of these tools will need to do. It’s also worth noting how the underlying model capabilities are evolving at the edge: running multimodal models locally on phones is becoming practical, which has direct implications for telehealth apps that can’t rely on cloud latency.


Where the benchmarks actually come from

The evaluation methodology matters here because it’s more rigorous than most medical AI benchmarks.

DeepMind worked with physicians at Harvard and Stanford to build 20 synthetic clinical scenarios. Ten real physicians role-played as patients. The AI ran the consultations and was assessed across 140 consultation dimensions — not just diagnostic accuracy, but empathy, bedside manner, red flag detection, follow-up question quality, and physical exam guidance.

The AI Co-clinician matched or exceeded primary care physicians in 68 of those 140 dimensions. Human doctors still won overall, particularly on red flags and critical exam guidance. DeepMind’s own framing is that this is a supportive tool, not a replacement — and the rotator cuff case (premature closure, missed impingement testing) is a concrete example of why that framing is honest rather than just defensive.

On the safety side: across 98 realistic primary care queries, the AI recorded zero critical errors in 97 of them. The evaluation used what’s called a “no harm framework” that tests for both errors of commission (saying something wrong) and errors of omission (leaving out something critical). In medicine, what an AI doesn’t say can matter as much as what it does.
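
DeepMind hasn't published the framework's mechanics, but the commission/omission split is easy to picture as a rubric check. A toy sketch, with an invented rubric format:

```typescript
// Toy commission/omission check against an invented rubric format.
// Real "no harm" grading is done by clinicians, not string matching;
// this only illustrates the two failure directions.

interface HarmRubric {
  mustInclude: string[];  // critical facts, e.g. "seek emergency care"
  mustNotState: string[]; // dangerous claims, e.g. "safe to double the dose"
}

function noHarmCheck(response: string, rubric: HarmRubric) {
  const text = response.toLowerCase();
  const omissions = rubric.mustInclude.filter(f => !text.includes(f.toLowerCase()));
  const commissions = rubric.mustNotState.filter(f => text.includes(f.toLowerCase()));
  return {
    pass: omissions.length === 0 && commissions.length === 0,
    omissions,   // errors of omission: critical content the response left out
    commissions, // errors of commission: wrong content the response asserted
  };
}
```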

In blind physician preference evaluations, 67% of physicians preferred the AI Co-clinician over existing clinician-facing tools. In a head-to-head against GPT-5.4 thinking with search, the AI Co-clinician won 63% to 30%. That’s a meaningful gap against a strong baseline.

The RXQA benchmark — built on open FDA data — tested open-ended medication questions: drug interactions, dosing edge cases, the kind of messy real-world queries that multiple-choice benchmarks don’t capture. The AI Co-clinician surpassed every other frontier AI system on this benchmark. Drug reasoning is historically where medical AI falls apart, because training data from the open web is unreliable for pharmacology. The FDA data grounding appears to make a real difference.


What this means if you’re building in telehealth

The honest answer is that the video physical exam capability is not something you can replicate today with off-the-shelf tools. It requires a multimodal model with real-time video processing, low latency, and persistent reasoning state — and those capabilities are only just becoming available at the infrastructure level.

What you can build now is the surrounding workflow: intake, triage, symptom history, medication review, follow-up scheduling, care coordination. These are high-value problems that don’t require real-time video processing, and they’re where most of the current opportunity is.

The video exam capability is a preview of where the ceiling is. For builders thinking about consistent UI and design systems across clinical workflow tools, the approach Google Stitch takes with structured design files is worth understanding: how Google Stitch’s design.md file works maps directly to the kind of spec-driven thinking that clinical software demands.

For builders thinking about the application layer on top of these capabilities: the spec-driven approach matters more than it might seem. When you’re building a clinical workflow tool, the requirements are precise — edge cases, data types, safety rules — and the implementation needs to match them exactly. Remy takes a similar philosophy to a different domain: you write an annotated spec in markdown, and it compiles into a complete TypeScript backend, database, auth, and deployment. The spec is the source of truth; the code is derived output. That kind of precision-first approach is the right mental model for clinical software, even if the specific tool isn’t clinical.


The real question the rotator cuff case raises

The premature closure finding in the rotator cuff case is the most useful thing in the entire evaluation — more useful than the benchmark wins.

The AI got the treatment direction right. It got the acuity right. But it stopped examining too early, and it didn’t know it had stopped too early. That’s a specific failure mode: the model doesn’t have a reliable signal for when its diagnostic confidence is actually warranted versus when it’s just run out of tests it knows to ask for.

Human physicians have a similar problem — premature closure is one of the most common diagnostic errors in medicine — but they have years of supervised practice that calibrates when to keep digging. The AI doesn’t have that calibration yet.

The 68 out of 140 dimensions number is real progress. The rotator cuff case is a reminder of what the remaining 72 dimensions look like in practice.

For anyone building AI-assisted clinical tools, that’s the design constraint worth holding onto: the system needs to know what it doesn’t know, and it needs to surface that uncertainty to the human in the loop rather than paper over it with a confident recommendation.
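
In code terms, that constraint looks like a gate before the exam closes, not a better final answer. A sketch with assumed thresholds and names:

```typescript
// Uncertainty gate before closing an exam (assumed threshold and shape):
// escalate when confidence is low OR discriminating tests remain untried.

interface ExamState {
  topProbability: number;           // confidence in the leading diagnosis
  untestedDiscriminators: string[]; // e.g. ["impingement testing"]
}

function shouldEscalate(state: ExamState): { escalate: boolean; reason?: string } {
  if (state.topProbability < 0.8) {
    return { escalate: true, reason: "diagnostic confidence below threshold" };
  }
  if (state.untestedDiscriminators.length > 0) {
    return {
      escalate: true,
      reason: `untested discriminators: ${state.untestedDiscriminators.join(", ")}`,
    };
  }
  return { escalate: false };
}
```

A guard like this wouldn't have fixed the rotator cuff exam's reasoning, but it would have flagged the skipped impingement testing to the human in the loop instead of closing with a confident recommendation.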

The video exam capability is genuinely new. The question of how to make it reliably safe is still open.

Presented by MindStudio
