Pre-, mid-, post-training - The Complete LLM Training Guide - The Health AI Brief

Summary (5 min read)

The Health AI Brief — Episode Summary

Episode Title: Pre-, Mid-, Post-Training – The Complete LLM Training Guide

Host: Stephen A
Date: April 23, 2026

Overview

In this episode, Stephen A delivers a rapid yet comprehensive breakdown of the full lifecycle of large language model (LLM) training for healthcare — from the initial "pre-training" on massive text corpora, through critical "mid-training" domain alignment, to post-training refinement with clinician feedback. The goal is to demystify how generative AI systems such as ChatGPT evolve from “digital scrapings” to trusted medical advisors, and to offer healthcare professionals a framework for evaluating the safety, reliability, and practical implications of AI models in medicine.

Key Discussion Points & Insights

1. Why Understanding LLM Training Matters

Black Box Issue: Many clinicians use generative AI without understanding the mechanisms behind it.
Clinical Analogy: Like studying pharmacokinetics of a drug, knowing an AI’s “mechanism” is crucial for trust and effective use.

“We wouldn't dream of prescribing a new monoclonal antibody without a working understanding of its mechanism… Yet in the clinical community, we often treat generative AI like a black box.” (00:09)

2. LLM Training: The Three-Phase Maturation

Phase 1: Pre-Training – The Knowledge Foundation

Purpose: Build a generalist “foundation model” — comparable to the early years of medical school.
Data: Trained on massive amounts of unlabelled text (internet, books, public medical literature, clinical guidelines).
Technique:
- Self-supervised learning (next-token prediction)
- Based on transformer architectures (pattern recognition via attention networks)
Capabilities:
- Forms a deep understanding of medical grammar, terms, and basic relationships (e.g., “diabetes and insulin”).
- Lacks clinical judgment; outputs are based on statistical probability, not truth.
- Prone to hallucinations and factual errors.
Quote:

“It's read the textbooks, but it's never seen a patient. It's prone to hallucinations because it prioritizes what sounds statistically probable over what's clinically true.” (03:10)
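The "probable, not true" point above can be made concrete with a toy next-token predictor. This is a minimal sketch, not how a transformer is trained: it simply counts which word follows which in a tiny corpus invented for illustration, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only (not real training data).
corpus = [
    "chest pain suggests myocardial infarction",
    "chest pain suggests anxiety",
    "chest pain suggests anxiety",
]

def train_bigram(sentences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts

def next_token_probs(counts, token):
    """Turn raw follow-up counts into a probability distribution."""
    total = sum(counts[token].values())
    return {nxt: n / total for nxt, n in counts[token].items()}

def predict_next(counts, token):
    """Return the single most frequent continuation seen in training."""
    return counts[token].most_common(1)[0][0]

model = train_bigram(corpus)
probs = next_token_probs(model, "suggests")
prediction = predict_next(model, "suggests")
print(probs, prediction)
```

The predictor picks whichever continuation was most frequent in its reading, regardless of what is true for the patient in front of you. That is the failure mode the quote describes.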

Phase 2: Mid-Training – Domain Alignment (Fine-Tuning)

Purpose: Aligns the foundation model to specific medical domains — akin to specialty training or residency.
Data:
- Shift to labeled clinical data (EHRs, medical notes, imaging, annotated labs).
Methods:
- Supervised fine-tuning (input-output pairs, e.g., triage categories)
- Model distillation: Using a large model to train a smaller, privacy-respecting local model.
Clinical Examples:
- RETFound: Foundation model for ophthalmology, trained on 1.6M retinal images; outperforms general models at eye/systemic disease detection. (09:15)
- Foresight/Foresight2: Predicts next clinical events from EHR timelines.
- AlphaFold: DeepMind model for protein structure, fine-tuned for protein "language".
Addressing Flaws: Tackles “tokenization” errors in medical codes (ICD-10, SNOMED) with symbolic foundation models.
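To illustrate the tokenization flaw, the sketch below contrasts a naive tokenizer that fragments a clinical code with one that keeps it whole. It is a tokenizer-level illustration only, not the symbolic foundation models mentioned in the episode, and the ICD-10-style pattern is a simplified assumption (a real system would look codes up in a proper vocabulary).

```python
import re

# Naive splitting: runs of letters, runs of digits, and single punctuation marks.
NAIVE = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")

# Simplified ICD-10-style pattern (an assumption for illustration only).
ICD10_LIKE = r"[A-Z]\d{2}(?:\.\d{1,4})?"

# Code-aware splitting: try to match a whole code before falling back to naive rules.
CODE_AWARE = re.compile(rf"{ICD10_LIKE}|[A-Za-z]+|\d+|[^\sA-Za-z\d]")

def naive_tokenize(text):
    """Split text without any knowledge of clinical codes."""
    return NAIVE.findall(text)

def code_aware_tokenize(text):
    """Split text while treating code-shaped strings as single units."""
    return CODE_AWARE.findall(text)

note = "Diagnosis E11.9 confirmed"
naive = naive_tokenize(note)
aware = code_aware_tokenize(note)
print(naive)
print(aware)
```

With the naive rules the code is scattered across four fragments, while the code-aware version passes it through as one token the model can learn a single meaning for.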
Quote:

“This is the phase that most people skip, but it's where the actual medical AI is born.” (04:14)
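Model distillation, listed under Methods above, is often implemented by training the small model to match the large model's output probabilities. The sketch below shows one common formulation (a temperature-softened KL divergence). The logits are hypothetical scores over three triage categories, and the episode does not say which loss any particular system uses.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; a higher temperature gives softer targets."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student distribution q is from the teacher distribution p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Loss is low when the student reproduces the teacher's soft predictions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(teacher, student)

# Hypothetical logits over three triage categories.
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 1.0, 4.0]

loss_close = distillation_loss(teacher_logits, close_student)
loss_far = distillation_loss(teacher_logits, far_student)
print(loss_close, loss_far)
```

Minimizing this loss over many examples pushes the smaller model toward the larger model's behavior, which is what allows it to run on local hardware afterwards.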

Phase 3: Post-Training – Expert Alignment & Safety

Purpose: Applies human (or expert AI) oversight to refine, align, and safeguard the model; analogous to board exams.
Techniques:
- RLHF (Reinforcement Learning from Human Feedback): Experts rank/correct model outputs.
- RLAIF (Reinforcement Learning from AI Feedback): Secondary AI system acts as a critic.
- GRPO (Group Relative Policy Optimization): Multi-answer comparison and reward models.
- Constitutional AI: Embedding ethical/clinical rules (e.g., “always adhere to NICE guidelines”).
Outcome: Models become “agentic” — capable of reasoning, safe responses, and autonomous tool use.
Agentic Models: Move beyond chatbots, now able to query databases or implement code for clinical tasks.
Quote:

“Even a fine-tuned model after mid-training can still be dangerous. It might give an accurate answer, but in a tone that's dismissive. Or it might hallucinate a reference to sound more authoritative.” (13:40)
“This phase creates an expert aligned model. It's what makes the difference between a chatbot that just knows things and an agent.” (16:20)
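The "multi-answer comparison" idea behind Group Relative Policy Optimization can be sketched in a few lines. The model samples several answers to the same prompt, each is scored, and every answer's advantage is its reward relative to the group. The rewards below are hypothetical, and the use of the population standard deviation is an assumption (implementations differ). The policy update itself is omitted.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical critic scores for four sampled answers to one clinical question.
rewards = [1.0, 0.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat the group average get a positive advantage and are reinforced, while below-average answers are discouraged. No separately trained value model is needed to judge what counts as good for that prompt.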

Special Segment: Multimodal Training in Medicine

Why Multimodal?

Medicine is not just about text — AI must process images, audio, video.

Pre-Training:

Masked Autoencoders: AI is shown millions of partially obscured images to learn visual reconstruction.
Contrastive Learning: Models associate images (X-rays) with their text reports, learning to map visual patterns to clinical concepts.
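The contrastive objective described above can be sketched as follows. Each image embedding should be most similar to the embedding of its own report and less similar to the other reports in the batch. The two-dimensional embeddings are invented for illustration, and only the image-to-text direction is shown; real systems use learned encoders and typically a symmetric loss.

```python
import math

def normalize(v):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Image-to-text InfoNCE: each image should score highest with its own report."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    loss = 0.0
    for i, img in enumerate(images):
        logits = [dot(img, txt) / temperature for txt in texts]
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        loss += log_sum - logits[i]  # cross-entropy with the matching report as target
    return loss / len(images)

# Hypothetical embeddings for two X-rays and their two reports.
image_embs = [[1.0, 0.1], [0.1, 1.0]]
aligned_reports = [[0.9, 0.2], [0.2, 0.9]]
swapped_reports = [[0.2, 0.9], [0.9, 0.2]]

loss_aligned = contrastive_loss(image_embs, aligned_reports)
loss_swapped = contrastive_loss(image_embs, swapped_reports)
print(loss_aligned, loss_swapped)
```

Training lowers this loss, which pulls each image toward its own report in the shared embedding space and pushes it away from the others.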

Mid-Training:

Video: Action recognition in surgical footage.
Audio: Learning from annotated murmurs or lung sounds to detect patterns invisible to human ears.

Post-Training:

Expert Validation: Radiologists or surgeons provide feedback on what the model focused on or the timing of video segmentation.
Real-World Output: Enables clinically relevant applications, e.g., surgical guidance, ambient AI scribes.
Memorable moment:

“For imaging, we use visual attention validation. A radiologist reviews the model's heat maps to ensure that when the AI diagnoses a lung nodule, it's actually looking at the nodule and not being distracted by a surgical clip.” (27:22)

Practical Tips for Clinicians

Metric Literacy:

“Intrinsic metrics, which you might see mentioned in things like BLEU or ROUGE scores… are useful for linguists, but mostly useless for us. We're much more interested in extrinsic metrics like the SCORE framework.” (32:00)
- Look for extrinsic metrics: Human expert ratings of safety, consensus, explainability.
Know Your Model:
- Local, distilled models may be safer for clinical notes and administrative tasks than cloud-based generalists.
Expect Agentic Capabilities:
- Modern models should be able to reason (“show their work”) and use external tools, not just answer questions.
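The Metric Literacy point can be shown with a simplified word-overlap score in the spirit of ROUGE-1 recall. This is a sketch rather than the official metric implementation, and the sentences are invented for illustration.

```python
from collections import Counter

def unigram_recall(reference, candidate):
    """ROUGE-1-style recall: share of reference words that also appear in the candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(n, cand_counts[w]) for w, n in ref_counts.items())
    return overlap / sum(ref_counts.values())

reference = "do not give aspirin to this patient"
unsafe = "do give aspirin to this patient"   # clinically opposite advice
safe_paraphrase = "avoid aspirin for them"   # same advice, different wording

unsafe_score = unigram_recall(reference, unsafe)
safe_score = unigram_recall(reference, safe_paraphrase)
print(unsafe_score, safe_score)
```

The unsafe sentence scores far higher than the safe paraphrase because it shares more words with the reference. This is why overlap metrics alone say little about clinical safety, and why expert ratings matter.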

Summary Table

| Phase | Analogy | Goal | Data/Method | Clinical Example |
|----------------|------------------------|------------------------------------------|--------------------------|-------------------------|
| Pre-Training | Med School | Basic grammar, world knowledge | Unlabelled text/images | GPT-5, Gemini 3 |
| Mid-Training | Residency/Fellowship | Domain expertise, reliability | Labeled clinical data | RETFound, Foresight 2 |
| Post-Training | Board Exams/Attending | Safe, aligned, reasoning, agentic AI | Human/AI feedback, rules | Chain-of-thought models |

Most Notable Quotes

On understanding AI’s development:

“It's the journey from a pile of digital scrapings to a model that can predict a patient's disease trajectory.” (01:18)
On the agentic shift:

“We're moving from models that simply answer questions to agentic models… that can now use external tools like search engines or calculators to verify their own work.” (35:10)
On clinical leadership:

“This pipeline is how we move from being passive users to informed clinical leaders in the AI era.” (37:55)

Timestamps for Key Segments

[00:09] Importance of understanding LLM mechanics
[03:10] Pre-training detailed explanation
[04:14] Introduction to mid-training and domain adaptation
[09:15] Clinical case studies (RETFound, Foresight, AlphaFold)
[13:40] Post-training & expert alignment
[16:20] Agentic models vs. chatbots
[27:22] Multimodal training & visual attention validation
[32:00] How to evaluate AI models (metrics)
[35:10] The agentic shift in medical AI
[37:55] Why clinicians must understand the pipeline

Conclusion

This episode is a concise, high-yield resource for clinical leaders who want a defensible understanding of how modern medical AIs are trained, evaluated, and aligned for real-world tasks. Mastering these concepts will empower healthcare professionals to critically appraise, deploy, and lead in the AI-driven future of medicine.


Transcript

[00:01]
A
Welcome to the Health AI Brief, breaking down the AI shaping our world one concept at a time. We wouldn't dream of prescribing a new monoclonal antibody without a working understanding of its mechanism or pharmacokinetics, the way that it's absorbed, distributed and metabolized. Yet in the clinical community, we often treat generative AI like a black box. We see the output, but we don't truly understand the mechanism and processes that created it. This lack of transparency is exactly why we see issues with hallucinations, bias and people-pleasing responses that don't hold up under clinical scrutiny. Previously, we've done some short snapshot episodes on individual parts of this process. Today we're bringing it all together into one, hopefully definitive, resource. We're going to look at the three-stage maturation of a large language model: pre-training, mid-training and post-training. This is the journey from a pile of digital scrapings to a model that can predict a patient's disease trajectory. By the end of this, you'll hopefully have a bit more of a definitive map of how a model goes from reading the Internet to advising you on a complex diagnosis.
So the first phase is pre-training. This is the knowledge-learning phase. Think of it as the basic medical school years, but on a massive global scale. It gives you the foundations of your scientific knowledge, and the goal here is to build a foundation model. This is the generalist engine that understands the structure of language and the basic rules of the world, just as in early medical school we learn the basic physiology of the body. The data is unlabelled data. This is raw information: the entirety of the Internet, millions of books and, importantly, the public side of medicine, things like clinical guidelines, published research papers and administrative documentation. So that's the data. What about the mechanics? This relies on the transformer architecture that we've discussed previously.
The transformer is essentially a pattern recognition machine that uses an attention network. It assigns weights to different words to understand context. During pre-training, the model performs self-supervised learning through prediction of next tokens, so-called next-token prediction. It's shown a sentence with a word missing and must guess the token, the chunk of text, that follows. So the clinical result at the end of the pre-training phase is that you have a model like the base versions of GPT-5 or Google Gemini 3. It develops a world-class grammar of medicine. It understands the relationship between diabetes and insulin, or shortness of breath and pulmonary embolism. It can explain what a myocardial infarction is. However, it has no clinical judgment at this stage. It doesn't know what's true, it only knows what's probable based on its reading. It's read the textbooks, but it's never seen a patient. It's prone to hallucinations because it prioritises what sounds statistically probable over what's clinically true.
And so then we have to progress to the next stage, phase two, or mid-training. This is domain alignment or fine-tuning. It's the phase that most people skip, but it's where the actual medical AI is born. In the literature it's often called domain adaptation. This is the equivalent of specialty training or a residency. The goal is to take a generalist foundation model and force it to understand the specific shorthand, nuances and data structures of a specific context, like healthcare. For the data, we move here from unlabelled to labelled data. This can include things like electronic healthcare records, clinical notes, lab test annotations and medical imaging data sets like X-rays or CT scans. And whereas in the pre-training phase there was self-supervised or unsupervised learning, here we have supervised fine-tuning, or SFT. We give the model specific input-output pairs.
For example, here's a patient presenting with symptoms, the input, and here's the correct triage category, the output. So we take the base model and keep training it, but only on high-quality medical journals, textbooks and internal hospital protocols. Another important concept here is something called model distillation. This is a vital clinical tool in some circumstances. Here you can take a massive, expensive flagship model and use it to teach a much smaller open-source model. This results in a domain-specific model that's small enough to actually run locally on things like hospitals' private air-gapped servers, completely separate from the Internet, which can help create models while protecting patient privacy.
So, some clinical examples of these. One is something called RETFound. This was a foundation model for ophthalmology. It was mid-trained on 1.6 million retinal images, learning to reconstruct missing pixels in fundus photographs. It now outperforms general models at detecting both eye disease and systemic issues like heart failure just from an eye scan. Another example is Foresight and Foresight 2. These are clinical transformers fine-tuned on electronic healthcare record timelines. They don't just talk, they forecast. They can predict a patient's next diagnosis or the likelihood of a complication by treating a patient's medical history like a sentence where the next word is the next clinical event. We've discussed this previously in an episode regarding Epic's CoMET model. AlphaFold, the model from DeepMind predicting protein structure, is also an example of this. These are fine-tuned for the language of proteins, allowing them to predict 3D structures or even engineer new fluorescent proteins that don't exist in nature.
Mid-training is where we fix a large flaw in AI, and that's tokenization. General AI often breaks clinical codes like ICD-10 or SNOMED into meaningless fragments. During mid-training, we can use symbolic foundation models that treat these codes as discrete, unbreakable units, making the AI's administrative and billing work significantly more accurate.
Then we move on to phase three, or post-training. This is where we give expert feedback and align the models. This is the equivalent of board exams and consultant or attending oversight. This is where we refine the model's behavior and safety. Even a fine-tuned model after mid-training can still be dangerous. It might give an accurate answer, but in a tone that's dismissive. Or it might hallucinate a reference to sound more authoritative. The goal here is to ensure that the model's safe, unbiased and provides reasoning rather than just answers. So, in terms of the actual mechanics here, they include reinforcement learning from human feedback, or RLHF. Expert clinicians rank a model's output. If the model suggests an inappropriate drug dose, the human penalizes that path. The model then updates its internal policy to avoid similar mistakes in future. Another approach is RLAIF, that's reinforcement learning from AI feedback, and GRPO, Group Relative Policy Optimization. This is used by newer models; DeepSeek-R1 was one of the early ones to pioneer it. Using AI feedback instead of relying only on humans, a second critic AI can evaluate responses. The model generates multiple answers and compares them against a critic or reward model and a set of clinical rules. This encourages chain-of-thought reasoning. The model literally thinks out loud in a hidden scratchpad before giving you its final conclusion. Another tool is constitutional AI. We give the model a set of rules, or a constitution, for example: always adhere to NICE guidelines; prioritise patient safety over brevity. The model then fine-tunes itself to ensure its outputs never violate these strict rules. This phase creates an expert-aligned model. It's what makes the difference between a chatbot that just knows things and an agent.
Like most of the state-of-the-art frontier models, these can autonomously solve multi-stage clinical problems like designing a drug trial or querying an external database to verify a dose. This phase can create agentic models: AI that can autonomously use external tools. For example, an agentic model wouldn't just guess a drug interaction. It would decide to query a verified pharmacology database, implement the code to check the interaction and then present you with the verified answer, rather than just trying to draw upon the answer from memory. So that's an outline of pre-training, mid-training and post-training.
Now, to hopefully try and really anchor these concepts, let's look at how this pipeline handles the sensory data that we use every day: vision, sound and motion, not just the text that we've largely described above. Because medicine is inherently multimodal, the training process has to be far more sophisticated than just reading text. In the pre-training phase, the goal is to teach the basic physics of medical data. For vision, we use things called masked autoencoders. The model is shown millions of images, like fundus photographs or CT slices, but with a random section blocked out. The AI's only job is to reconstruct those missing pixels. It's also during pre-training that we use something called contrastive learning to glue modalities together. We feed the model millions of pairs, for example, a chest X-ray paired with its formal radiologist's report. By doing this, the AI learns that the visual pattern of patchy opacification in the lower lobe mathematically corresponds to the linguistic term pneumonia. It's not just learning words, it's learning to map sights to concepts.
We then move to mid-training, where we begin the specialty residency. For video, this might involve feeding the model thousands of hours of annotated surgical footage. We use action recognition and temporal segmentation.
So the AI learns to distinguish a gallbladder dissection from a clip application. This is exactly how RETFound, the ophthalmology model we mentioned just before, was fine-tuned on 1.6 million retinal images until it could identify not just eye disease, but the subtle vascular patterns that predict heart failure. For audio, mid-training involves acoustic feature learning. We take raw audio recordings of things like heart murmurs or lung crackles and pair them with the eventual definitive diagnosis. This allows the AI to develop superhuman hearing, as it were, identifying the frequency shifts of a grade 2 systolic murmur that a human might miss in a noisy ward environment.
Finally, in post-training, we perform expert alignment to ensure these sensors are clinically reliable. For imaging, we use visual attention validation. A radiologist reviews the model's heat maps to ensure that when the AI diagnoses a lung nodule, it's actually looking at the nodule and not being distracted by a surgical clip at the edge of a rib. If the AI looks at the wrong thing, it's penalized through RLHF, the reinforcement learning from human feedback that we mentioned earlier, with a human expert giving that feedback. For video, surgeons might provide feedback on temporal accuracy, ensuring that a surgical guidance AI isn't just right about what's happening, but also when it's happening. The post-training is what allows for deployment in the real world, whether it's an ambient AI scribe using audio to draft a consultation note, or a surgical guidance system in the operating theatre that acts as a second set of eyes, alerting you if you're approaching the common bile duct. By the time it reaches your clinic, the AI has been through a rigorous sensory upbringing that mimics the years of observation that we undergo as trainees ourselves. So why does this exhaustive look at the pipeline matter to you?
First, it can help you distinguish between intrinsic and extrinsic metrics. Intrinsic metrics, which you might see mentioned in things like BLEU or ROUGE scores, tell us how well an AI mimics language. They're useful for linguists, but mostly useless for us. We're much more interested in extrinsic metrics like the SCORE framework. These involve human experts rating the outputs for safety, consensus, objectivity, reproducibility and explainability. When you're appraising a paper on a new medical AI, look for the extrinsic metrics. Next, it can be helpful to know the distillation status. If you're using a model for administrative tasks or clinical notes, a smaller distilled model running locally on your hospital's own hardware is often safer and more reliable than a massive cloud-based generalist. It also helps explain a lot of the big breakthroughs that have been made in the last year or so with the advent of reasoning models. This is a result of post-training for chain of thought, that is, rewarding models that provide this chain-of-thought reasoning process. If a model can't show you its work, the logical steps it took to reach a diagnosis, it shouldn't necessarily be considered as reliable for things like clinical decision support. It's also important to understand the agentic shift that we're seeing recently. We're moving from models that simply answer questions to agentic models. These are models that have been through the full pipeline and can now use external tools like search engines or calculators to verify their own work.
So, in summary: phase one, pre-training, builds the grammar, the basic physics of understanding. Phase two, mid-training, builds more domain expertise, like medical expertise. Phase three then builds the clinical judgment to ensure that outputs are safe and aligned to what experts would deem appropriate. Understanding this pipeline is how we move from being passive users to informed clinical leaders in the AI era.
If you found this helpful, please don't forget to hit like and subscribe so that you don't miss future content.