Transcript
A (0:00)
Is AI now better than human clinicians at diagnosing real patients? Clinical diagnosis has long been considered the ultimate test of human cognition in clinical medicine. For over 60 years, the New England Journal of Medicine's Clinicopathological Conferences, or CPCs, have served as the gold standard for evaluating diagnostic excellence. These cases are notoriously difficult, often involving rare diseases or complex presentations designed to challenge the world's leading medical minds. A new study published in Science by a multi-institutional team, including Peter Broder, Thomas Buckley and colleagues from Harvard, Stanford and MIT, suggests that there's been a significant shift in this landscape. Their research evaluates the performance of OpenAI's o1 series, a large language model designed with enhanced reasoning capabilities, against hundreds of human physicians across a range of diagnostic and management tasks. The results indicate that the model not only matched but frequently exceeded physician performance in several key areas, including real-world emergency department scenarios. However, while the headline results are striking, a closer analysis of the methodology reveals some important nuances regarding how these tools might actually function in a messy, unselected clinical environment. The study architecture is comprehensive, spanning five distinct experiments. The first phase focused on the aforementioned New England Journal of Medicine CPCs, evaluating 143 cases published between 2021 and 2024. The o1-preview model included the correct diagnosis in its differential 78.3% of the time. For comparison, in a subset of 101 cases where human physician benchmarks were available, the model outperformed the human baseline in both top-one and top-ten diagnosis accuracy. This is a significant finding, because simpler benchmarks have become saturated while these cases remained out of reach for older AI models.
Where previous generations of AI struggled with the complex, multi-step reasoning required to solve a CPC, the o1 model's internal chain-of-thought processing appears to have bridged that gap. The researchers also tested the model on the New England Journal of Medicine Healer curriculum and the Grey Matters management cases. In these tasks, which require not just a diagnosis but a viable plan for next steps in care, the model's advantage was even more pronounced. On the Grey Matters cases, which were scored against a consensus of 25 physician experts, o1-preview achieved a median score of 89%. This is substantially higher than GPT-4 alone at 42%, and notably higher than physicians with access to conventional resources, who scored a median of 34%. These results are impressive. Yet the most impactful part of this research, and the one generating the most discussion, is the evaluation of the model on real-world patients in an emergency department setting. The team selected 80 patients who presented to a major tertiary academic medical center in Boston. These patients were not curated vignettes; they were real people presenting with unstructured data. The study compared o1-preview and GPT-4o against the performance of attending physicians at three distinct diagnostic touchpoints: first, initial triage; second, the emergency room physician encounter; and third, the point of admission to either a medical floor or an intensive care unit. In this real-world cohort, the o1 model identified the correct or a very close diagnosis in 67.1% of cases at the triage stage, the front door, rising to 81.6% at the time of admission. This outperformed the two attending physicians in the study, whose accuracy at the triage stage was measured at 55.3% and 50%.
On the surface, then, this suggests that AI might be significantly better at diagnosing patients when information is most scarce, which is the very definition of the triage challenge. However, an understanding of the study's inclusion criteria is necessary to appreciate the limits of this claim. The 80 patients selected for the study were all eventually admitted to the general medicine service or the medical intensive care unit. This means that the study population consisted entirely of true positives for significant illness. It didn't include the vast majority of patients who present to the emergency department front door, are triaged and evaluated, and are subsequently discharged home because their condition isn't life-threatening or requires only minor intervention. Furthermore, the prompting strategy used for the model is a very important variable. The o1-preview model was specifically instructed to list potentially life-threatening diagnoses that require immediate management. When you provide an LLM with a cohort of patients who are, by definition, unwell enough to require inpatient care, and then explicitly prompt that model to look for life-threatening emergencies, the model is being steered towards the correct answer by the study design itself. The physicians, by contrast, are usually operating in a different cognitive framework. In a real triage environment, the clinician's task is not just to identify the cannot-miss diagnosis, but to filter the high-stakes cases from the high-volume noise of non-urgent presentations. This requires judiciousness: the ability to decide when not to act, and when a serious diagnosis is statistically unlikely despite a concerning symptom. Because the study only looked at patients who were actually sick, we don't know how many false positives the model would have generated if applied to a thousand unselected patients in the waiting room.
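To make that last point concrete, here's a minimal back-of-the-envelope sketch of how the base rate drives false-alarm volume. The sensitivity and specificity figures are entirely hypothetical assumptions for illustration (the study reports neither a specificity nor an unselected cohort); only the 1,000-patient framing comes from the discussion above.

```python
def expected_alerts(n_patients, prevalence, sensitivity, specificity):
    """Expected (true positives, false positives) for a flagging tool."""
    sick = n_patients * prevalence          # patients who truly need escalation
    well = n_patients - sick                # patients who could go home
    true_pos = sick * sensitivity           # correctly flagged
    false_pos = well * (1.0 - specificity)  # flagged despite being well
    return true_pos, false_pos

# Study-like cohort: everyone was eventually admitted, so the prevalence of
# serious illness is effectively 1.0 and false positives cannot occur.
tp, fp = expected_alerts(1000, 1.0, 0.82, 0.90)
print(round(tp), round(fp))  # 820 0

# Hypothetical unselected waiting room: most patients are not critically ill.
tp, fp = expected_alerts(1000, 0.10, 0.82, 0.90)
print(round(tp), round(fp))  # 82 90 -> false alarms outnumber true catches
```

Even with a generous assumed specificity of 90%, a low base rate means the false alarms can outnumber the true catches, which is exactly the regime the study's admitted-only cohort cannot probe.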
If an AI tool consistently suggests rare, life-threatening conditions for every patient with chest pain or a headache, it could lead to significant overdiagnosis and a cascade of unnecessary, expensive and potentially harmful tests. While the model was highly accurate at identifying the cannot-miss diagnoses in these sick ED patients, the study doesn't address the risk of crying wolf for the broader population. It's a vital distinction for any clinician or manager considering the integration of AI into a workflow: action is essential in the intensive care unit, but in triage, the ability to de-escalate and avoid unnecessary intervention is equally, and sometimes more, important. There's also a slight methodological curiosity regarding the comparison between large language models. GPT-4o was seemingly given a different prompt to o1-preview, one that was not as heavily biased towards emergency cases. It's unclear why the prompts weren't standardised across the models, as this makes it difficult to determine whether o1's superior performance is due to its superior reasoning or simply to a more targeted set of instructions. The evaluation metrics also warrant scrutiny. The study used the Bond score, a one-to-five scale where five means the exact diagnosis and one is completely incorrect. The researchers defined a score of four or five as success. A score of three, however, represents a diagnosis that's closely related to the correct one, for example listing viral pneumonia when the patient actually has COVID-19 pneumonia. In many clinical settings a score of three is very useful and represents a safe, functional differential. By excluding these from the high-performance bracket, the study sets a very high bar, but it may also obscure the practical utility of both the humans and the models. Despite these caveats, the logistical feat achieved by the team at Beth Israel Deaconess and their collaborators is remarkable.
Setting up a blinded, head-to-head trial between AI and attending physicians using real-time, unstructured clinical data is incredibly difficult. It represents a major step forward from testing AI on cleaned-up data sets that have often been seen by models during their training phase. For the CPC cases, the authors did explicitly check for this sort of data contamination by comparing performance on cases published before and after the model's training cutoff date. They found no significant difference. Although the numbers involved are quite small, with quite high variance, this does strengthen the validity of their findings on the CPCs. What this study demonstrates most clearly is that LLMs have reached a level of reasoning that allows them to navigate complex medical information with a high degree of technical accuracy. The o1 model's ability to outperform physicians on management plans, where it has to weigh the risks and benefits of various tests and treatments, is perhaps more significant than its diagnostic accuracy. It suggests that AI could be a powerful tool for reducing diagnostic error, which remains a leading cause of patient harm. But that result was on the typical sort of vignettes, not on real patient data. The takeaway for the clinical community is not necessarily that AI is ready to replace the triage nurse or the emergency room attending. Instead, the study highlights the urgent need for prospective trials. We need to see what happens when these models are used in a non-selected population where the risk of overdiagnosis is real. We need to understand how a physician's decision-making might change if they're presented with an AI-generated list of worst-case scenarios for every patient they see. It might become more difficult to de-escalate care in those instances. The Harvard and MIT teams have initiated a really important path. They've proven that in a controlled environment, for the sickest patients, AI reasoning can now be elite.
The next stage of research must be to move beyond this isolated true-positive accuracy and begin to measure the broader impact on the health system, including the costs of possible over-testing and the psychological impact on clinicians and patients. The goal is to move towards a model of collaborative intelligence, where perhaps the AI provides an exhaustive cannot-miss list and the human physician provides the judiciousness to decide which of those possibilities warrants action. The study is an optimistic signal that we're entering an era where AI may one day be able to handle some of the heavy lifting of data synthesis, potentially freeing up clinicians to focus on the nuances of patient care. But there's still a lot of work to be done. We'll be trying to follow that work as closely as possible on the channel, so don't forget to hit like and subscribe if you don't want to miss out on that.
