Transcript
A (0:01)
Senior health leaders at Google DeepMind have released a blog post and technical report about how they're enabling a new model for healthcare with an AI co-clinician. Whenever DeepMind releases something with these kinds of claims, we need to take it seriously. So what's it all about? The vision is something like this: a patient logs into a telehealth portal, and instead of a human doctor, a system powered by artificial intelligence greets them via live video. The AI asks about their symptoms, watches them perform a guided physical examination over the camera, assesses their range of motion, and delivers a diagnosis. This scenario is the focus of Google DeepMind's latest technical report and its AI co-clinician. Processing real-time audio and video to conduct a medical consultation represents a massive leap in technical capability. Analyzing the methodology and the clinical mechanics of this demonstration, though, reveals the exact boundary between processing data and practicing medicine, a boundary that will define the next decade of healthcare technology.
B (0:57)
Google outlines how the global healthcare system faces a well-documented workforce shortage. The World Health Organization predicts a shortfall of over 10 million health workers by 2030. Technology companies view artificial intelligence as a primary mechanism to bridge that gap. DeepMind's recent announcements mark a transition from text-based models like Med-PaLM and AMIE to multimodal systems. The AI co-clinician uses the capabilities of the Gemini family of models and Project Astra to ingest continuous streams of audio and visual data. The system relies on a dual-agent architecture. The first agent, the Talker, acts as the primary patient interface. It manages low-latency communication, interprets immediate audio-visual cues, and maintains a conversational flow with the patient.
A (1:46)
The second agent, the Clinical Planner, operates in the background. It functions as a supervisory module, tracking symptoms, managing the differential diagnosis, and injecting specific clinical goals, like prompting a guided physical examination, into the Talker's workflow. This architecture aims to solve a known problem for conversational AI: forcing a single large language model to generate empathetic dialogue while simultaneously computing complex diagnostic reasoning often degrades performance in both areas. Separating the conversational interface from the clinical reasoning engine is an elegant technical solution. The results of this architectural choice become clear when examining the study data.

The technical report outlines a randomized, interface-blinded crossover simulation study. The evaluation involved 120 telemedicine encounters based on 20 standardized outpatient scenarios. The AI co-clinician was compared against human primary care physicians, a baseline AI without the Planner module, and OpenAI's GPT Realtime. Performance was graded using case-specific rubrics and universal clinical skills assessments known as tele-PACES. The data shows significant technological progress. The AI co-clinician approached primary care physician performance in generating differential diagnoses and management plans, and it outperformed GPT Realtime across all metrics. The dual-agent architecture proved important: the ablation study showed that removing the Clinical Planner caused performance to drop significantly across history taking and red flag detection. However, the AI system fell notably short of human physicians in two important areas: the physical examination and the identification of red flags. Unpacking these specific clinical failures provides the most valuable insight into the current state of multimodal medical AI.

Examining the evaluation methodology and the recorded interactions in more detail highlights several important clinical realities. First, the evaluation used internal medicine residents acting as the patients. These patient actors were portraying textbook, stereotypical presentations of diseases, and the physicians portraying the patients knew exactly what a stereotypical set of answers should be. Evaluating an AI on classic textbook cases is extremely safe territory: language models are fundamentally designed to excel at pattern matching against standard medical literature. The setup involves general physicians describing a textbook case and another general physician grading the AI's response to that textbook case. This creates a circular validation loop that plays directly to the inherent strengths of a large language model rather than testing it against the real-world complexity of medicine.

Second, the clinical technique demonstrated by the AI reveals a lack of true medical training. In one recorded interaction, the AI asks a compound question: it asks the patient whether they have changes in pupil size, double vision, and pain all in a single sentence. Eliciting multiple distinct symptoms simultaneously is poor clinical practice, known to confuse patients and yield inaccurate histories. A trained physician answering this question may be able to process those three things all at once, but patients in the real world would struggle.

Third, the physical examination attempts reveal a system operating without an actual understanding of physical reality.
During a case involving abdominal pain and suspected pancreatitis, the AI attempted to guide an abdominal examination while the patient was sitting completely upright. Palpating an abdomen in a seated position contradicts basic physical examination principles taught in medical school. If I had ever done this in my jobs in the emergency department, surgery, or gastroenterology, I'd have been rightly told off. In another scenario detailed in the technical report, the AI instructed a patient to "follow my finger" to test eye movements. The system does not possess a finger. It hallucinated a physical capability because "follow my finger" is the statistically probable next token in a transcript of a neurological examination.

The most revealing insight, though, comes from the video demonstration of a patient presenting with myasthenia gravis. The system successfully asks the patient to look at the camera to check for a drooping eyelid, known as ptosis. The narrators praise the AI for correctly identifying the droop, but in the video the physician actor is voluntarily lowering their eyebrow and squinting.
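To make the talker/planner split described earlier concrete, here is a minimal sketch in Python. It is not DeepMind's implementation: the class names, the shared consultation state, and the toy rule-based reasoning are all assumptions chosen purely to illustrate how a low-latency conversational agent and a background supervisory module can share one consultation.

```python
# Illustrative sketch only: a minimal talker/planner split for a conversational
# agent. Names and structure are assumptions, not DeepMind's implementation.
from dataclasses import dataclass, field


@dataclass
class ConsultationState:
    """Shared state that both agents read and write."""
    transcript: list[str] = field(default_factory=list)
    differential: list[str] = field(default_factory=list)
    pending_goals: list[str] = field(default_factory=list)  # e.g. "guide an abdominal exam"
    red_flags: list[str] = field(default_factory=list)


class ClinicalPlanner:
    """Background agent: tracks symptoms, maintains the differential diagnosis,
    and queues clinical goals for the Talker to act on."""

    def update(self, state: ConsultationState) -> None:
        last = state.transcript[-1].lower() if state.transcript else ""
        # Toy rules standing in for the planner's diagnostic reasoning.
        if "abdominal pain" in last and "pancreatitis" not in state.differential:
            state.differential.append("pancreatitis")
            state.pending_goals.append("ask about alcohol intake and radiation to the back")
        if "chest pain" in last:
            state.red_flags.append("possible acute coronary syndrome")
            state.pending_goals.append("advise urgent in-person assessment")


class Talker:
    """Patient-facing agent: keeps the conversation flowing and folds in
    whatever goal the planner has queued up."""

    def respond(self, state: ConsultationState) -> str:
        goal = state.pending_goals.pop(0) if state.pending_goals else None
        if goal:
            return f"Thank you for telling me that. Next I'd like to {goal}."
        return "I see. Can you tell me more about when this started?"


def run_turn(patient_utterance: str, state: ConsultationState,
             planner: ClinicalPlanner, talker: Talker) -> str:
    """One consultation turn: record the patient, let the planner reason,
    then let the talker reply."""
    state.transcript.append(patient_utterance)
    planner.update(state)
    reply = talker.respond(state)
    state.transcript.append(reply)
    return reply


if __name__ == "__main__":
    state, planner, talker = ConsultationState(), ClinicalPlanner(), Talker()
    print(run_turn("I've had severe abdominal pain since last night.", state, planner, talker))
    print(state.differential)  # ['pancreatitis']
```

In the system the report describes, both agents are model-driven rather than rule-based, and the planner's reasoning runs alongside the live conversation; the hand-written rules above simply stand in for that reasoning so the division of labour between the two agents stays visible.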
