
Ron Alpa
So we basically opened the lab, we hired a team, we got all the instruments, we started sourcing tumor samples, and there was no prior here that any of this would work. Like, zero. We just started generating data, sourcing human tumors, processing them. We built this whole processing pipeline to get the tumors into these arrays and formats. So you've got these two-week runs where you're processing two slides, and we're just churning data for months, and we couldn't even train a model. So we built all this, and then, let's say 18 months later: hey, I wonder, can we train a model off of this? And it wasn't obvious.
Dan Baer
Yeah, there wasn't really like anything major to go off of. I mean there were like transformers developed for single cell data. There just like weren't really data sets out there that people had been able to develop on. We do a lot of like custom model building.
RJ Haneke
Hi there, I'm RJ Haneke and this is Brandon Anderson. We're the co-hosts of the Latent Space Science podcast. And today we're really happy to be in the studio with some of the people from Noetic.
Ron Alpa
I'm Ron Alpa, co-founder and CEO of Noetic, physician-scientist by training. My hobby is making hot takes about AI curing cancer.
Dan Baer
Hi, I'm Dan Baer, I'm VP of AI at Noetic. I'm a biologist by training, did a PhD in neuroscience, then moved into computational neuroscience, computer vision, and self-supervised learning, and I've been doing AI research at Noetic for the past few years.
RJ Haneke
Maybe we should start with: what is Noetic? Why did you found it? What is the difference between Noetic and the other virtual cell companies?
Ron Alpa
Yeah. Maybe I'll just start with a little bit of a contrarian thesis, which is really the reason for founding Noetic. We all know the numbers: 90%, 95% of cancer drugs fail in the clinic. Why do they fail? Our thesis is they fail not because we're bad at pharmacology, not because we're bad at target selection or at making the drug. We're actually better at that process than we have ever been in the history of drug development. Most of those drugs fail, we'd argue, because we're bad at selecting which patients those drugs are going to work in. And oftentimes you see trials where, well, there is no placebo effect in cancer: some patients respond to these drugs, and if you have a patient that responds, that tells you there's some biology that's active there. But you have a problem in patient selection. And so really that's the thesis behind Noetic: can we build models that fundamentally understand patient biology from the very beginning and help you position molecules in the right patient population?
RJ Haneke
So you're actually using the models, partly at least, to select the patient cohort. You can imagine it working either way. You could say, I think this molecule will do well because I know something about the patient population. But you could also say, I think this patient population is the match for this molecule.
Ron Alpa
And that's where the power of the models is. Once you've trained these models on patient data, you can use them on both sides of the equation. You can use them for discovering new targets directly from the patient data, which people often refer to as reverse translation: starting from humans, trying to understand which targets to go after, and then developing molecules against those targets. But you can also use them directly on patient data. If you have, let's say, a phase two or phase three trial, you can use these models to understand which patients, or what underlying biology of the patients in the trial, is a predictor of response. And we've been doing a ton of that recently.
RJ Haneke
Are you doing a lot of, like, rescuing trials that had a bad effect?
Ron Alpa
We are doing a lot of looking at data from phase 2 and phase 3 trials, and then using the models essentially to run inference on patient biopsies and understand whether there's underlying biology that would help us design the next trial. We haven't shared any of that yet, but you'll see this soon.
Brandon Anderson
So cancer is kind of infamous in that there are many, many different types of cancer. Whenever anyone says "cure cancer," that's almost a meaningless, vacuous statement. So your point is, even within cancer, or within a specific type of cancer and then a subtype and a subtype, there are a bunch of different patient populations, and each of them will respond differently to drugs. And your point is that you can figure out, right now, that some subpopulation will do well and respond to this drug when, generally speaking, the rest of the population would not, even though we have historically classified all of this as one type of cancer, one indication, and so on.
Dan Baer
Yeah, that's exactly right. And I would maybe even go further and say nobody actually knows what the subtypes are. There are cancers that originate in a certain tissue, like the lung, that have been classified into subtypes based on pathologists looking at them for more than a century. And those subtypes certainly have some connection to the real thing, to carving nature at its joints: what are the actual functional subtypes of disease there? But our thesis is that if you look at a much richer kind of data, the multimodal data that we're generating in our lab, we're going to see that actually what people thought was one subtype of lung cancer is really three distinct subtypes. And that is going to be critical for figuring out which patients should get which drugs.
Ron Alpa
Yeah, maybe I'll just go back to one of your first questions. I was saying many drugs fail in patients because we don't understand which patients they will work in. In oncology, why do we end up in that situation? Whenever you make a new drug, you do a set of experiments in cell culture, cells in a dish. Those cells are often cell lines. These cell lines have existed for 40, 50 years, and they're immortalized, so they have genomes that allow them to persist, with abnormal numbers of chromosomes. They have gene expression patterns that don't represent any known cell in the human body, really. These are sort of Frankensteinian cells; they were mostly derived from cancers. So you can do your experiments in these cell lines in a dish, or you can move them into animal models. In oncology, you often have a panel of different animal models with different cancer types that you'll test these in. And in doing these experiments, we sort of convince ourselves that some of these cell lines are, let's say, lung cancer cell lines or colon cancer cell lines, and even that some of them in the mouse context are colon cancer models and lung cancer models. And then in the mouse, we implant them under the skin, in weird places, and we treat the mice with drugs and see how they respond. But ultimately, there's a big gap, because they don't translate to patient biology most of the time. These cancer cell lines, most of them, even if they are derived from a colon cancer, don't even have the mutations that human colon cancers have, in many cases. And pharma has done this for 20, 30 years, where you develop a drug and test it against hundreds of these. It's not a hard experiment.
You can send this out to any CRO, they'll test your drug against hundreds of different cancer cell lines, and you can sit back and say, okay, which of the 50 colon lines responded to my drug, and which of the 50 ovarian cancer lines? And you could try to map that to human biology. But the problem is these cell lines, as an abstraction, do not relate in any way to human patients. And so what happens is, ultimately, no matter what you do preclinically, the molecule gets into the clinic, and the clinical team says, look, we don't really know how to design this trial because none of the data you've produced gives us any insight into which patients to enroll. So we're going to basically run an open-label study: we're going to enroll all tumors, all patients that are enrollable in this trial, and we're going to see where we get signal. Imagine doing that in an early-phase trial where, let's say, you have 50 patients and you're testing different doses, and you don't really know the dose of the drug, you don't know what the safety margins are, and you're also trying to figure out where your signal is. And then what if I told you that, in let's say lung cancer, hypothetically, there are 10 different subtypes, and you don't even know which subtype each patient is; it could be any of them. So, you know, this is what happens. And oftentimes you get to the end of these early-stage trials and you don't see very many responders, as you would expect statistically, and then these molecules get canceled.
RJ Haneke
So you're imagining that your Noetic system helps the pharmaceutical company characterize: we expect that people with a certain genetic profile, or even transcriptomic profile, will respond to this drug. And then you go and actually sequence the patient and say, yes, this is a match. Is that the sort of grand vision?
Ron Alpa
Yeah, I mean, I would say we are even less biased than that. We are saying, okay, we want the model to learn, let's say from lung cancers, how many different therapeutically relevant subtypes of lung cancer there are, just from self-supervised learning on the data. And those subtypes could be driven by large genetic changes, they could be driven by immune changes; it could really be any biology that the model is learning in the process of training. And we do see different types.
Dan Baer
Feel free to contradict this as the actual doctor here, but the biomarkers that people have been using are biased towards simplicity. Does the patient have this particular mutation? Sometimes you stain for a single protein, or do transcriptomics to look for a particular gene signature. But there's no reason to think that the biology of cancer is that simple, that you're going to capture most of the meaningful variation with such simple biomarkers. And most of them have weak correlations with clinical success. The hypothesis here really is: if you were to carve nature at its joints and figure out what's really going on, say there are these five subtypes, then the correlation between which patients you give a particular drug and whether you have success is much, much stronger than if you force yourself to go with these very simple biomarkers.
RJ Haneke
You mentioned the lab; you do a lot of data generation in the lab. Why do you think that's the right approach, versus using existing public repositories?
Ron Alpa
Yeah, we generate all our data in the lab, everything from sourcing the tumor samples themselves to processing them and generating the data. Maybe another hot take I have about AI in bio: you're just not at the order of magnitude of data that you have in other spaces for training models. And so it becomes really hard to brute-force these problems just by collecting data. We have a couple of pretty good examples of where someone has designed a data set. The PDB was designed and has been built over the past 50 years or so. It's not an accident that that data set exists. Someone decided: we are going to design this data set, we're going to collect this data over decades and decades, with the intuition that potentially this would help solve protein folding down the road. And it did. So it's not that the PDB is just a bunch of random data that people organized from the web. I think that in bio you really need to be intentional about the data that you generate and how you generate it, and have some foresight, from the very beginning, around what models you're going to want to train and what those models are going to need to learn. So that's why we've taken this approach.
Dan Baer
Yeah, and a good comparison is the ImageNet data set, which kicked off the deep learning revolution in computer vision with convolutional neural networks, actually demonstrating that neural networks can do better than other methods on object categorization. ImageNet, at least the part of it that people were developing models on, is 1.2 million images, very carefully curated. These are high-quality images, not random images from the Internet or multiple data sets cobbled together. And labeled. And I think with the data that we're generating, we're around that scale right now. But of course, people have gone much, much larger in image data sets, and in language and text data sets, obviously, for LLMs. So we think we need to get the data up to that scale before we can really see meaningful progress on the algorithm side. And on the scale of language data: yeah, language is really the only modality where people are seeing these very impressive scaling results, and part of that has to be just the scale of data that's there and that the models are trained on. That can't be the only thing, because there's a lot of video data as well; people are training on thousands of hours of video data and haven't seen the kind of scaling results that you have in language modeling. But having the right scale of data is necessary, if not sufficient, to really make progress here.
Brandon Anderson
Kind of a contrarian take to that, sure. So there's this whole concept about the jagged frontier of LLMs in generative AI, how certain regions can be really good at solving some problems and then remarkably stupid at solving nearby problems. And maybe the argument for what's happening is that with a lot of these frontier models, everything is just becoming in-distribution. Like, everything starts out OOD, and if you just get more data, it becomes in-distribution. Is it possible that for biological systems, because there are underlying physical processes here, you can basically make things in-distribution earlier, even though you can't actually cover the whole space? I have some follow-ups about the PDB, but I'm just curious at this point.
Dan Baer
Yeah, I mean, I think it's a good question: how much data, and what kind of diversity, do you need in biology to solve, say, the drug translation problem, figuring out which drugs are going to work in which patients? My intuition from working in biology for a while is that we're still pretty far from that, because we're building data sets that are focused right now on cancer, and we've generated data from thousands of patients in a few major cancer subtypes. But there's every other disease, there's healthy tissue, there are even other species. There's a lot of biology to learn. Especially if you think about it as: we have to learn the spatial and functional patterns of tens of thousands of genes, tens of thousands of proteins, how their spatial arrangement contributes to the function of organs, and so forth. My hunch is that biology is pretty complex and that we still need to generate a lot more data. But yeah, I don't know.
Brandon Anderson
Yeah, but as a cancer company, do you think you could actually do this hypothetically for cancer? I mean for at least some subclasses of cancer?
Dan Baer
Definitely, yeah. I think that we've done experiments that suggest that if we can generate data from several hundred patients in all of the major cancer indications and some of the less major indications that that will result in a model that can generalize pretty well to kind of any type of cancer we would throw at it.
RJ Haneke
Backing up: what is the data you're collecting? My understanding is you use some pretty specialized instruments and gather very specific data sets. So how did you come to that decision about how much data to collect, how much to spend on it, and what types of data?
Ron Alpa
I'll give a hat tip to my previous employer, Recursion. We spent six years at Recursion from the very beginning, and a lot of what we were doing in the early days was figuring out the things we didn't understand about the data sets and what the problems would be: batch effects, controls, how you orient samples on plates, things like that. Flash forward to the founding of Noetic: we started the company already with some principles around how we should think about building the data set, what are the things that we know matter. For example, over many years we learned that images are actually a really powerful data type for machine learning, for many reasons. One, they're scalable: we can put patient samples on slides, and on a single slide we can capture many patients' worth of biology. And the images themselves are very rich sources of biological information. So now we have a very information-dense modality, and we can decrease the cost of data generation, which means we can increase the amount of data generated over the whole data set. That's always been a really big benefit of image-based modalities over, let's say, sequencing, where every time you run a sequencing run, your n is basically one patient. That was one way to think about it. The other was: how do we design these data sets so we can control for things that we know are going to be important, such as batch effects? For example, if I have a slide and we run, let's say, spatial transcriptomics on that slide: you stain the slide, do a bunch of wet-lab processing, put it into a machine, and get data out. If you do that on two different days, there are going to be different variables that impact the data. Batch is going to be a large source of variation in data sets, so you want to be able to control for things like batch effects.
So really you want patients represented on multiple different slides, so you can process them in different batches. You want to be able to control for things like this so you can go downstream, look at the data, and say: okay, once we have, let's say, patient-level embeddings, we can ask, do the patient-level embeddings represent patient response to immune therapy, or do they represent staining batches?
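That "biology or batch?" question can be sketched as a simple nearest-neighbor probe on the embeddings. This is an illustrative check under made-up names and toy data, not Noetic's actual pipeline:

```python
# Hypothetical sketch: given one embedding per sample plus its labels,
# ask how often a sample's nearest neighbor shares the same label.
# High agreement on batch labels would mean the embeddings encode
# staining batch; high agreement on patient/response labels is the goal.
def neighbor_agreement(embeddings, labels):
    """Fraction of samples whose nearest neighbor (squared Euclidean
    distance) carries the same label."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    agree = 0
    for i, e in enumerate(embeddings):
        # nearest other sample
        j = min((k for k in range(len(embeddings)) if k != i),
                key=lambda k: sqdist(e, embeddings[k]))
        agree += labels[i] == labels[j]
    return agree / len(embeddings)

# Toy embeddings: two tight pairs; each pair is one patient, two batches
emb = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
by_patient = ["p1", "p1", "p2", "p2"]
by_batch = ["A", "B", "A", "B"]
```

Here `neighbor_agreement(emb, by_patient)` is high and `neighbor_agreement(emb, by_batch)` is low, which is the outcome the randomized slide design is meant to make possible to verify.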
RJ Haneke
So you're actually taking one patient and spreading them across multiple slides, so that you get sort of a calibration across the slides.
Ron Alpa
Yes. Our data looks very different from anyone else's in the space of generating data on histology or digital pathology specimens. We receive a sample, we sample it dozens of times to build these arrays, and each array has hundreds of different patient samples, randomized, and every patient is represented on multiple different arrays. So we're getting a lot of different representations of each patient that we're sending through the data processing pipeline, and that lets you, downstream, answer some of these questions and control for some of these variables.
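The layout idea, every patient replicated across several randomized arrays, can be sketched in a few lines. This is a hypothetical illustration of the design principle, not the actual layout algorithm; all names and sizes are made up:

```python
import random

def design_arrays(patient_ids, n_arrays, replicates, seed=0):
    """Place each patient on several distinct, randomly chosen arrays
    so that array/staining-batch effects can later be separated from
    patient biology downstream."""
    rng = random.Random(seed)
    return {pid: sorted(rng.sample(range(n_arrays), replicates))
            for pid in patient_ids}

# 12 patients, 6 arrays, each patient sampled onto 3 different arrays
layout = design_arrays([f"patient_{i}" for i in range(12)],
                       n_arrays=6, replicates=3)
```

Because no patient is confined to a single array, a downstream model can't trivially conflate "array 3's staining day" with "patient 7's biology."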
RJ Haneke
You mentioned some terms I just want to define for people. Spatial transcriptomics.
Ron Alpa
Yeah.
RJ Haneke
What is that?
Ron Alpa
Yeah. So, I mean, this was your first question: what are the data types? This is not my background in terms of spatial; everything we did previously was cell biology in a dish. But if you just sat back and said, okay, I want to train a foundation model that understands human biology, what does that mean? How would you go after that problem? That was really the starting point for the company: from first principles, how would we do this? So you probably want tissue-level biology; you want to understand tissue, because cells are organized into tissues. You probably want some modality that is relevant in clinical use, so you can relate clinical data to what your models are learning. That's why we generate pathology H&E. Every patient who gets a tumor removed gets this H&E stain, and that's what the pathologists look at.
Brandon Anderson
Can you explain what H&E is?
Ron Alpa
It's basically two different dyes, hematoxylin and eosin, and it really just creates contrast over the tissue. You've probably seen these purplish pathology specimens. Pathologists can look at those and identify different cellular structures, and they've used those to classify tumors into the classical classifications: adenocarcinomas, small cell carcinomas, things like that, which are basically defined by cellular structures.
Brandon Anderson
Okay. So there are specific patterns which show up when you add these two stains, and it is well established that, like,
Ron Alpa
yeah, you classify tumors based on pathology classifications. And this is what basically every tumor that gets processed in the hospital will get, this H&E stain, and it's how the pathologist typically classifies a tumor at the first level. Okay, so you want that. You probably also want to understand cell types. It's really hard to understand cell types from just that stain, because it doesn't reveal that much that a human can use to classify cell types, at least. So you can say, well, I want to know whether there are immune cells, and different subtypes of immune cells. We want to have some layer of cell biology.
Brandon Anderson
And you want to know about immune cells because you have these cancer cells, and oftentimes the immune response dictates whether or not you have an effective treatment.
Ron Alpa
Right, the immune environment of the tumor, we know, is a core constituent of whether a patient's going to respond or not. So you want to give the model all of this. The model is going to get this tissue-level information, but there's not enough cell-level information in there for it to learn the cell biology of all the different subtypes. So we also want to present it with some cell-level information. We use protein stains, standard immunofluorescence: you basically use antibodies against a small set of cell markers to label, you know, T cells, B cells, standard subtypes of cells in the tumor microenvironment.
RJ Haneke
So in this stain, just for those who aren't familiar, the antibody has a fluorescent molecule attached. When you hit it with a certain frequency of light, it fluoresces, so you can tell the antibody bound to a certain protein, because it now has a fluorophore attached to it.
Ron Alpa
Yep. And in terms of the data: from the tissue layer, you have an RGB image. From the next layer, you have a multi-channel image, with each channel representing, let's say, one color; so, for example, certain immune cells are each in a different channel. So you have this multi-channel image. Okay, that's great: we've got tissue, we've got cells. But if we actually want to make drugs, we need some type of molecular information. We need to tie all of this down to what's happening in the genome. What is the cell doing? What are the mechanistic principles of the biology? So then we get spatial transcriptomics; that's spatially resolved RNA. DNA is transcribed into RNA, which is translated into proteins. So we get the RNA in a spatially resolved pattern for the same cells that we're seeing in all of these other layers. Now you have between 1,000 and 19,000 different genes, and again, these are all image layers that are spots of where those RNAs are and in which cells.
RJ Haneke
And this one works a little bit similarly to how we talked about proteins, where you have a segment of RNA and then you have a fluorescent probe. And usually there's some sort of combinatorial thing: if you see these four colors at this amplitude, then that means this gene, because they're right next to each other, or something like that.
Ron Alpa
For the detection method, you're basically binding a probe to each one of those RNAs, and then you're cycling it, and it takes weeks to run one of those assays. So you're cycling the machine; it'll cycle across each species, it'll amplify, and you'll get a signal for each RNA species. At this point, you have basically this very rich data layer where you have the tissue, you have the cells, and you have the molecular information, and you can use all of that to train the model. You can think of it as essentially the central dogma, if you will. And we also have DNA; we genotype, just so we understand the genomic alterations in these tumors.
RJ Haneke
All right, so you get this stack of images, basically, that you can train models on, with an understanding of the expression of genes and the proteins being expressed at the time the sample is taken, all in the image information. And then you can train your models with that.
Dan Baer
Yeah, I mean, the spatial transcriptomics is particularly dense, because if you think there are, let's say, 20,000 genes in the genome, we're now running assays that detect nearly all of them in a single sample. So you can think of one of those data points as an image, except instead of being an RGB image with three color channels, all of a sudden it has 20,000 color channels. So it's a very meaty computer vision problem to look at those data, figure out what makes patient A different from patient B, and then go from that to which drug is going to work in which patient.
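The "image with 20,000 channels" framing can be made concrete with a toy data structure. This is an illustrative, sparse stand-in for one spatial-transcriptomics sample; the shapes, gene indices, and counts are all made up:

```python
# Toy spatial-transcriptomics sample: an "image" whose channels are
# genes rather than colors. Most spots detect only a small fraction of
# the ~20k genes, so a sparse dict is a natural representation.
N_GENES = 20_000   # roughly the number of genes in the human genome
H, W = 4, 4        # real slides have vastly more spots than this

sample = {}                          # (row, col) -> {gene_index: count}
sample[(1, 2)] = {42: 3, 1337: 1}    # spot (1,2) detected two gene species

def gene_channel(sample, gene, h, w):
    """Materialize one gene's 'color channel' as a dense h x w grid."""
    return [[sample.get((r, c), {}).get(gene, 0) for c in range(w)]
            for r in range(h)]
```

A vision model would consume many such per-gene channels stacked together, which is what makes this so much heavier than an RGB problem.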
RJ Haneke
And so you have a hot take about virtual cells. I want to understand: you have this big pile of data, where every single sample carries a massive data set with it, and then you have many, many samples. How do you turn that into useful knowledge?
Ron Alpa
Maybe start with: what is a virtual cell? Everyone's always asking that question. I think there are really two ways to think about it. One is: we want to be able to simulate all the biochemical processes in a cell. We want this sort of comprehensive foundation model where we understand, if some signal from outside the cell interacts with the cell, here are the millions of intracellular chemical reactions that are going to happen, and you could predict them from the model. That's one view. I think that's interesting, sort of an interesting intellectual pursuit, but I don't think we have all the modalities of data that you would need to solve that problem. I tend to see the virtual cell problem as something more practical. We're trying to make drugs that work in patients. So from a virtual cell perspective, really what we want to do is understand cell biology in some heuristic that's useful for making drugs. The heuristic could be a way to understand drug targets, or a way to map your cell-level biology up to patient-level biology. The way we've designed these first virtual cell models is really just to simulate the biology of a cell in some context: the input being, let's say, the cell in some context, and the output being the transcriptome or the proteome in that context. These types of input-output relationships allow us to essentially design experiments. So really the simplistic thing we're doing is: the model can simulate the biology of a cell, or many cells, in different contexts, and you can run simulations in that regime.
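That "practical virtual cell" framing is just an input-output interface: (cell, context) in, predicted expression out. The sketch below uses a lookup table as a stand-in for a trained model; the cell types, contexts, gene names, and numbers are all invented for illustration:

```python
# Toy interface for a context-conditioned virtual cell model.
def predict_transcriptome(cell_type, context, model):
    """Map (cell type, tissue context) to a predicted expression
    profile. 'model' here is a dict standing in for a trained network."""
    return model.get((cell_type, context), {})

toy_model = {
    ("T_cell", "inflamed_tumor"): {"GZMB": 8.1, "IFNG": 5.4},
    ("T_cell", "immune_desert"):  {"GZMB": 0.3, "IFNG": 0.1},
}

# "Designing an experiment" then means comparing simulated outputs
# across contexts before running any wet-lab assay:
inflamed = predict_transcriptome("T_cell", "inflamed_tumor", toy_model)
desert = predict_transcriptome("T_cell", "immune_desert", toy_model)
```

The real model would be learned from the multimodal data described earlier; the point is only that the contract is a simulation you can query, not a full biochemical reconstruction.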
Dan Baer
Yeah, I think most of the things that people are calling virtual cell models right now are focused on single-cell gene expression, so transcriptomics data, RNA data, and they're largely geared toward the problem of predicting what's going to happen to the transcriptome, the set of genes being expressed, when you hit cells with either a small molecule, a drug, or a genetic perturbation. And typically this is cells grown in vitro, either cell culture or primary cells, something like that.
RJ Haneke
A genetic perturbation being where you knock out a gene, or add a gene, and see how that impacts the expression of the various RNAs.
Dan Baer
Exactly. And my view, and I think Ron shares it too, is that that may be of interest in some cases, but the problem we're really trying to solve is predicting what's going to happen in a patient. And modeling data that comes from a patient is, in my mind, much more likely to translate to what happens when you give a patient a drug than something that's happening in cell culture.
RJ Haneke
Is there other clinical data that you're pulling into the model? So far you're calling the context of the cell just the surrounding cells, but is there other "this drug caused a bad reaction" kind of stuff?
Dan Baer
Yeah, we're pulling in data from the entire patient, so not just the very local neighborhood of the cell. So far we haven't done much integration of, say, electronic health records or other information that one could get about the patient, and that's pretty intentional. We really want these models to learn basic biology, again, the central dogma; not just the central dogma, but the basic biology of genes, proteins, cells, tissue, in a self-supervised way, purely from the data that we're generating, and not be biased by what the doctor wrote about that patient. Because our thesis is that most of the therapeutically predictive and important information is not contained in the very small number of patients who have been treated with a given drug, and whatever the doctors thought was important to write down given the state of knowledge at that time. It's much more about trying to discover what's really there in patient biology than going based on the text that people have written about it.
Brandon Anderson
So you have this self-supervised model, you feed it a lot of data, and you end up with essentially some clusters of patients. Now how do you translate those clusters of patients into decisions? Like, you go to a pharma company and you say, we can repurpose this drug, or we suggest this subtype should be the focus of your phase two trial. What is the process for that? What data do they need to provide to you, and how do you translate your model's output?
Ron Alpa
It depends on what the problem is. Maybe I'll back up: one of the more interesting aspects of these models is that they are useful for a broad array of use cases, as we were talking about from the very beginning. So you, as the pharma company, could say: okay, I have this molecule, the target of the molecule is X, and I want to design my clinical trial. The molecule has seen zero patients so far; all I know is the target and some biology around the target. So we can run simulations using the models and our cohorts of patients. Let's say we look at lung cancer: we can run simulations around the target and ask, okay, in which sets of patients here would this target be important, across a cohort of lung cancers and colon cancers, across all of oncology? And we do see this sometimes: you might see that you probably don't want to put your target in lung cancer; maybe you want to put it in ovarian cancer, because it's not really important in lung cancer. Yeah.
Brandon Anderson
What are you simulating here? So, like, you say that this drug is expected to knock down this gene, and therefore you want to look for clusters where knocking down this gene inhibits tumor growth rather than enhancing it?
Dan Baer
I mean, that's certainly one way we could do it. There are other types of simulation where you might just want to ask, like, if there were an immune cell here, like a T cell, which is responsible for actually killing tumor cells, what would happen to it? What genes would it express, or what proteins would it express, in this particular patient's tumor microenvironment? And that's what we've called these virtual cell simulations. We have a model called Octo virtual cell that does this. And that can give quite powerful answers to the question of, are these drugs going to work in these patients? Because you might find that, as Ron was saying, the thing that this drug targets is just not important in this particular patient's tumor, in that it's not going to have any effect on the T cells or the macrophages or some other cell type there. Then there's the type of simulation you alluded to, where you can ask the model what would happen to this patient's tumor if you were to knock down this particular target gene or its protein product. And you might be looking for cases where the model predicts that removing that gene or that protein is going to have a large effect: either increase the immune system function, its ability to fight that tumor, or decrease the tumor's ability to grow, or some other readout that you think is correlated with clinical success. I just want to call out that maybe the simplest use case is the one where there's a company that has a drug, they've given it to some patients, and we know some of those patients responded. Then it just becomes a question of: does the space of patients that the model has learned via self supervision tell us that all of the responsive patients are in one of these clusters and not the other nine clusters, or something like that? If we know that, then there's a pretty straightforward hypothesis that this is the right cluster.
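To make the in-silico knockout idea concrete, here is a minimal sketch of the kind of query being described: compare a model's predicted readout for a patient's tumor with and without a target gene. Everything here is hypothetical and illustrative — the function names and the linear "model" are stand-ins, not Noetic's actual API or architecture.

```python
import numpy as np

def toy_model(expression):
    # Stand-in for a trained virtual cell model: the readout (e.g. predicted
    # T cell activity) is a weighted sum of gene expression. Real models are
    # nonlinear transformers trained on multimodal spatial data.
    weights = np.array([0.8, 0.1, -0.3, 0.05])
    return float(weights @ expression)

def simulate_knockout(expression, gene_index):
    # Zero out the target gene and report the change in the readout.
    perturbed = expression.copy()
    perturbed[gene_index] = 0.0
    return toy_model(perturbed) - toy_model(expression)

patient = np.array([2.0, 1.0, 0.5, 3.0])  # toy expression vector
effect = simulate_knockout(patient, gene_index=0)
# A large negative effect suggests the target matters in this patient;
# an effect near zero suggests the drug's target is unimportant there.
```

Running such a query across a whole cohort is what lets you ask "in which patients is this target important?" before the drug has ever been given to anyone.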
RJ Haneke
So in that scenario, what would you collect about those patients? You have one cohort that responded and one that didn't.
Dan Baer
Yeah. So this is getting back to something Ron mentioned earlier, which is this type of data called H and E, hematoxylin and eosin. It's the standard pathology stain that makes these pinkish and purplish looking images. Right now, what we do is we've built models that are trained on all of the multimodal data we generate. But once they're trained, at inference time, all they need is an image of H and E. And that could be something that we generate in our lab, or it could just be a digital image that they have from a trial that was run years ago. The reason that is so powerful and flexible is, again, because H and E is the lingua franca of pathology and especially oncology. So almost every patient who's been given a clinical stage drug is going to have that.
RJ Haneke
You can look at the two cohorts, the responders and the not responders, and say these H and Es live in this, this part of the latent space and these H and Es do not.
Dan Baer
Yeah, exactly. And one way we've gone even further than that is, given the H and E, the models can say: I predict that these genes are expressed at this location in this patient. So not only do we have these clusters, these embeddings that say all of the responders to this drug are over here and all of the non responders are over there, but we can actually see, okay, for the responders, these are the genes that are predicted to be expressed much more highly in the responder cluster versus the non responder cluster. That adds a major level of interpretability, because we can see things like, okay, good: the responders are actually expressing the protein target of this drug. We would be worried if that weren't the case, but we can see it is. On the other hand, we also see that the biology is very, very complicated, which kind of explains why these simple biomarkers, like looking at a single gene or a single protein, just don't capture what is predictive of therapeutic response.
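The responder-enrichment idea can be sketched in a few lines: given an embedding-derived cluster label for each patient's H and E image and a responder flag, check whether responders concentrate in one cluster. The data here is invented purely for illustration.

```python
import numpy as np

# Cluster assignment (from embedding the H&E images) and response flag
# for eight hypothetical patients.
clusters  = np.array([0, 0, 0, 1, 1, 1, 2, 2])
responded = np.array([1, 1, 1, 0, 0, 1, 0, 0])

def response_rate_by_cluster(clusters, responded):
    # Fraction of responders within each cluster.
    return {c: responded[clusters == c].mean() for c in np.unique(clusters)}

rates = response_rate_by_cluster(clusters, responded)
best_cluster = max(rates, key=rates.get)
# If one cluster's response rate is far above the others, membership in
# that cluster becomes a candidate patient-selection hypothesis.
```

The interpretability step described above then asks which predicted genes differ between the high-rate cluster and the rest.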
Ron Alpa
Yeah.
RJ Haneke
So I have like a million directions I want to go here. The H and E. That actually gives you a pathway to a diagnostic then as well.
Ron Alpa
Exactly. Yeah. Right.
RJ Haneke
Yeah, yeah. And so you can imagine, after the drug hopefully makes it to the market, a doctor says, oh, you have cancer, I'm very sorry, we're going to do an H and E stain of your tumor. Then we're going to put it in the model, and it says, oh, you know, this drug won't work for you, but this one will.
Ron Alpa
That's right. And we're using the same approach today. We're looking at many different mechanisms from different collaborations that we have in place; one of them we've announced with a company called Agenus. These are all different mechanisms, but the input is still H and E, and some of the same indications. So using H and E, we're asking whether drug A works in some sets of patients, whether drug B works in other sets of patients. And you can take that to its natural progression and say, well, okay, if you can use that same input, just H and E, for experimental drugs, why not use it also for drugs that are on the market already? In a sense, the same assay can be very predictive across many different cancers and many different potential therapeutics.
RJ Haneke
There are lots of models out there that take H and Es and go to gene expression, open source, whatever. And I've read in your Twitter feed that you feel you have a data moat.
Brandon Anderson
Right.
RJ Haneke
And so why is Noetic's model better?
Dan Baer
Sure. I mean, I think the scale of data that we've trained these models on is pretty different from a lot of what's out there. The reality is there's just not that much of this kind of paired H and E plus other data modalities. There are some data sets generated by academic labs, where they might have maybe a hundred or a few hundred patients worth of data with paired spatial transcriptomics, and that might even be an overestimate. In comparison, we're generating these data at multiple patients per slide, individual patients distributed across multiple slides. We've generated now more than 100 million cells of spatially resolved spatial transcriptomics, all paired with H and E and protein as well. That's at least an order of magnitude larger than any of the other data sets that we've seen out there, and I think that makes a pretty enormous difference. We've seen with our own models that if you drop down to 40% or 10% of that data used in training, the models get a lot worse. And they especially get worse at generalizing to types of cancer beyond the ones they've been trained on. So I think that's a big piece of it. I also think that the algorithmic side is important. We've developed custom architectures specifically for training on this multimodal data. Again, my background is in computer vision, and specifically in self supervised learning there. So we've tried to develop self supervised learning approaches for these data that are really adapted to solving this problem of figuring out what is different in one patient versus another, and then simulating what would happen if you were to knock down a particular gene or protein. This is why we call these world models: we're trying to build models that can simulate what's going to happen if you take a particular action. 
I think that's another big differentiator for these models. And then again, the interpretability as well is probably a third one.
Brandon Anderson
It's funny, because you were just talking about how one of the other strategies people take for this is to do perturbations on cells and then watch the response. And your strategy here is that you can simulate this sort of counterfactual perturbation without even having to collect the data to do that.
Dan Baer
Well, yeah, there's a big piece that we haven't talked about yet, which is that we actually are running perturbation experiments, except they're in vivo perturbations using a platform based in mouse. It's called PerturbMap. Ron, if you want to describe any of it. But basically, it's a platform for generating highly multiplexed knockouts of individual genes. The same kind of CRISPR knockouts that people are doing for individual cells in vitro, except when we knock out a gene in a cancer cell, that cancer cell gets injected into a mouse. It's barcoded, so we know which gene was knocked out, and it's being injected alongside roughly 100 other cells with different genes knocked out. So you end up with mice that have barcoded tumors carrying a hundred different genetic perturbations. We can actually use that to validate our models and ask whether what the models are predicting in humans via simulation is actually borne out when you do these perturbations in a mouse system. Sorry, there's a lot to go into there.
RJ Haneke
Barcode.
Dan Baer
Yeah. So, sorry, barcoding. This is a technology in which an individual gene is knocked out with CRISPR, but this also introduces a set of protein tags in that cell that get expressed. It's a combinatorial code. So gene X might have proteins A, B, and C; gene Y, when it's knocked out, has proteins D, E, and F. And we can tag those proteins, or label them with antibodies, so that when we go and look in the mouse, we know exactly which gene was knocked out based on which of those protein tags were expressed.
RJ Haneke
So you knock out a gene, but you also add a gene that has the barcode proteins encoded on it.
Dan Baer
Yeah, exactly.
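The combinatorial decoding step just described can be sketched as a simple lookup: each knocked-out gene carries a unique combination of protein tags, and observing which tags stain positive identifies the gene. The gene and tag names below are made up for illustration.

```python
# Map each combination of protein tags to the gene knocked out in that
# cell. frozenset makes the combination order-independent and hashable.
BARCODES = {
    frozenset({"A", "B", "C"}): "geneX",
    frozenset({"D", "E", "F"}): "geneY",
    frozenset({"A", "D", "F"}): "geneZ",
}

def decode(observed_tags):
    # Return the gene whose tag combination matches the observed stains.
    return BARCODES.get(frozenset(observed_tags), "unknown")

# A tumor staining positive for tags A, D and F maps back to geneZ.
```

With a combinatorial code, a small panel of tags covers many genes: choosing 3 tags out of 10 already yields 120 distinct barcodes.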
Ron Alpa
And, I mean, the system's designed so that everything we're doing here is tissue level, in vivo. The tumors that came from humans are in the form of whole tissue. And then in this mouse system, you have hundreds of tumors in the lungs of a mouse. If you look at these images, it's a mouse lung with literally hundreds of tumors in it. And each tumor has a distinct biology that's driven by the biology of the gene that's been knocked out. We can capture the biology of each tumor in a spatially resolved way. So what you can see is, okay, we have a bunch of tumors in humans. Certain tumors in humans, let's say, don't have immune cells in them, so those tumors are very aggressive and don't respond to immune therapies. You can generate those same tumors in this mouse system, and again, they don't have immune cells in them, and you can do it genetically. So you can start to map the causal gene relationships between these different immune, or just broadly tumor, genotypes or biological profiles, if you will, to what you see in the human. And then you can treat those mice with drugs and see how hundreds of tumors in a single mouse respond to treatment with one drug. Or you can treat, let's say, 50 different knockouts across a panel of mice with 50 different drugs, and you can start to build this intersectional pharmacology and genetics experiment.
Brandon Anderson
On Twitter and in various places, I've heard you say Noetic is no cell lines, no mouse models. Maybe you even said that a few minutes ago.
Ron Alpa
And then we just said we have a mouse model, and we're injecting cell lines into lungs, not under the skin.
Dan Baer
So yes.
Ron Alpa
So, you know, fundamentally we think it's really important to build models that are trained on human data, and we are sourcing all these tumors to build human centric models. That is true. But from the very beginning, we have asked this question: let's say we want to develop a drug, and let's say the FDA, and I know things have changed a little bit with the FDA, but let's say the FDA wants you to have some data in an animal that says your new mechanism works in some animal system. What do you do? You're kind of stuck, because you've now generated arguably the best data that you can in the human system, and then the FDA says, well, cool, but does it work in mouse? How does it work in the mouse? And then you have to back into a system that doesn't translate. So from the very beginning of the company, this has been a question. Probably at the same time we started generating the human data, we started building this mouse platform with the aim of drawing connectivity between these two systems. We wanted a platform that allows you to map a diversity of human tumors, because we know that if we just run a mouse model with one tumor, that tumor has no connectivity. In the mouse system we want diversity of tumors, and we want to see a mapping of diverse tumor biology to the tumor biology that we're seeing in the human, across many different mutations. So we licensed this system and have been building it, so you can see many different perturbations that produce a lot of the tumor biologies, plural, that you see in the human. And then we also want to be able to get from this mouse system to biologically relevant targets or genes in the human as well. 
One of the fundamental problems in mouse systems is that we share many genes with mice, but there are a lot of genes and biological processes we don't share with mice, as is obvious. And so oftentimes you run into these when you're developing drugs: okay, you have a target, you know some biology that works really well in mice, and maybe that doesn't even exist in humans, or maybe that pathway is quite useless in humans. So one of the things we've started to develop, that we'll share more about soon, is a way to use one of these models to essentially infer human biology from the mouse directly. We're in silico humanizing the mouse. So all the outputs in terms of the transcriptome from the mouse are in the form of the human genes, and when we read out this mouse system, we're reading it out in the form of human biology.
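As a much simpler point of reference for what "reading out the mouse in human genes" means at the data level, here is a toy re-expression of a mouse readout in human gene symbols via an ortholog map. To be clear, this lookup is only an illustration of the output format: the approach described above is model-based inference of human biology, not a name mapping, and the map below is a tiny stand-in for curated ortholog databases.

```python
# Toy mouse-to-human ortholog map (mouse symbols are conventionally
# capitalized like Trp53; human symbols are all caps like TP53).
MOUSE_TO_HUMAN = {"Trp53": "TP53", "Cd8a": "CD8A", "Pdcd1": "PDCD1"}

def humanize(mouse_profile):
    # Re-key a mouse expression readout by human gene symbol, dropping
    # mouse genes that have no counterpart in the map.
    return {
        MOUSE_TO_HUMAN[g]: v
        for g, v in mouse_profile.items()
        if g in MOUSE_TO_HUMAN
    }

readout = humanize({"Trp53": 4.2, "Cd8a": 1.1, "Xlr4a": 0.3})
# Xlr4a has no entry in the toy map, so it is dropped from the output.
```

The genes without one-to-one orthologs are exactly the cases where a lookup fails and a learned model has something to add.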
Brandon Anderson
How do you validate that? I mean, that's a pretty impressive claim, if you can do it. But, man, it seems like a tricky validation task.
Ron Alpa
In my experience, both here at Noetic and at my previous employer, Recursion, a lot of what you're looking for when you're building these types of models is whether the models are recognizing biology that you know to be true. So, for example, in the human context, we know that 12% of patients with lung cancer respond to immune checkpoint inhibitors. Do the models recognize those patients? Can they recover those patients without training?
RJ Haneke
Cold?
Dan Baer
Yeah, yeah.
Ron Alpa
And we see that. And when you go look at those patients, we see the underlying features of those patients map to what we know about those patients in the clinic. In the mouse system, we have control genes. So we asked: if you look at the mouse tumor embedding space, do the tumors that should be really cold look really cold from the human model's inference? Cold in the sense that,
Dan Baer
they don't have immune cells.
Ron Alpa
No T cells. Yeah, yeah. And then hot in the sense of lots of immune cells. So we try to build systems where you have these handholds, and then the more of these examples that you know to be true that you see working, the more confidence you have. Obviously, when you're into the regime of something very new, it's still uncertain to some degree.
RJ Haneke
So the bridge is sort of the bridge between the mouse and the human is you build a world model on the human, you build the world model on the mouse and then you say what are the parallel structures in the two latent spaces? Is that kind of the intuition here?
Dan Baer
That's one thing that we're doing. But actually this is even simpler, which is that we've trained models on human H and E, spatial transcriptomics, et cetera, and then are just inferencing them on mouse H and E, which is easy to generate. And apparently mouse H and E looks enough like human H and E that the models treat it as perfectly valid H and E and make predictions about: is this immune hot, immune infiltrated, versus cold, versus fibrotic, versus some other tumor phenotype? And those predictions are accurate. These are some of the controls that Ron mentioned. We know that in mice and humans alike, if you knock out the tumor cells' ability to present antigens to immune cells, those tumors are very cold; immune cells are nowhere near them. And that's exactly what we see in the mouse, and exactly what the in silico humanized models predict. Then there are other examples where, again, we're recovering the biology that we expect to see there. And then there are findings that are novel but also make total biological sense. For instance, we have done knockouts in the mouse of, let's say, half a dozen genes that are all in the same pathway. So you might predict that knocking down those genes is going to produce the same phenotype, because they're all in the same pathway.
Ron Alpa
And what is a pathway?
Dan Baer
Yeah, so a pathway is like protein A signals to protein B signals to protein C. And you know, there's like a chain of events that leads to, to the cell having some behavior, you know, changes in its metabolism, its growth, et cetera. So these are, I don't know if you've ever seen these crazy looking protein signaling diagrams that, you know, make you want to stay away from biology. But you know, like, you know, people
Ron Alpa
have, you know, worked these down
Dan Baer
a lot and they know that these two proteins interact physically and signal to each other and so forth. And so, you know, one of some
RJ Haneke
chain of those interactions: this protein binds this protein, and that causes it to upregulate a gene that causes another protein to be formed, and so on, until you get to some phenotype, meaning the cell changes the way it looks or behaves.
Dan Baer
Exactly. And so, based on decades of biological literature doing experiments on these, there's a very strong biological prior that if you hit gene A, gene B, gene C, and they're all in the same pathway, you should get similar phenotypes. This is kind of how old school genetics was done. And we see that with these in silico humanized mouse models, which is amazing to me as a biologist: you have a model that's trained on human data, then you show it some mouse histology, and it's able to say these five different tumor genotypes all look like they have the same phenotype. And lo and behold, those are five genes in the same pathway.
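The pathway check just described can be sketched quantitatively: if knockouts of same-pathway genes produce similar predicted phenotypes, their phenotype embeddings should have high pairwise similarity. The vectors and gene names below are invented for illustration; a real analysis would use model-predicted expression profiles per knockout.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two phenotype embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy predicted-phenotype embeddings for three knockouts.
phenotypes = {
    "geneA": np.array([1.0, 0.1, 0.0]),
    "geneB": np.array([0.9, 0.2, 0.1]),   # same pathway as geneA
    "geneC": np.array([0.0, 0.1, 1.0]),   # unrelated pathway
}

same_pathway = cosine(phenotypes["geneA"], phenotypes["geneB"])
different    = cosine(phenotypes["geneA"], phenotypes["geneC"])
# If the model recovers pathway structure, same-pathway knockouts should
# score much higher than unrelated ones: same_pathway >> different.
```

Grouping all pairwise similarities by "same pathway" versus "different pathway" turns the old-school genetics prior into a concrete validation metric.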
RJ Haneke
So, switching gears a little bit, because we want to talk about models on the Latent Space podcast. Recently there was an interesting blog post about the Tario model, a transformer based model. Do you want to talk about that?
Ron Alpa
Sure.
Dan Baer
Yeah. So this is a new model architecture that we developed after the first virtual cell model, OctoVC. Tario is just a different transformer architecture, and one major difference between it and our prior models, I guess since this is a model podcast, gets into the self supervised learning objective. For a while, including with OctoVC, we were training models on what's called the masked autoencoding loss function or objective: you have a piece of data, you chunk it up into small chunks, you mask out some of those chunks, and the training task is that the model has to predict the masked out chunks from the revealed chunks. Like BERT? Yeah, exactly like BERT.
Brandon Anderson
What are the chunks? Because this is multimodal, and I would imagine the different channels contain wildly different levels of information. And I remember seeing something like 99% masking in OctoVC, if I'm not mistaken.
Dan Baer
Yeah, yeah.
Brandon Anderson
So I was like, that was kind of surprising, because when you have 19,000 channels, in some of the channels most of the signal is fairly sparse.
Ron Alpa
Yeah.
Brandon Anderson
Then it seems like either there's huge redundancy in your data, or you really risk just throwing the baby out with the bathwater.
Dan Baer
Yeah. What are the chunks? That totally depends on which modalities we're talking about. For spatial transcriptomics, one chunk or one token might be the level of expression for a particular gene at a particular spatial location. For multiplex protein images, it might be the image patch for that particular protein at a particular location, and so on. And for histology images, those are usually just patches of the image, pretty standard vision transformer style. As for the masking, and the maybe surprising result that you can, and actually need to, mask out large amounts of the data to get the model to learn anything interesting: if you ran the hypothetical where you only mask out, say, 10% of the image, more like BERT in language modeling, what do the models learn? They learn these kind of boring behaviors, like how to continue an edge a little bit between two regions of an object. So they can learn that task very well, but they don't end up learning anything about the holistic structure of the image data. We found pretty early on at Noetic that the same thing was true with these multimodal transformers. If you mask out a lot of it, the model has to learn the actually pretty strong correlations between where protein A is expressed and where protein B is expressed, and forcing the models to learn those is really what gives them this predictive power.
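The high-ratio masking setup described here can be sketched in a few lines: tokens are chunks of multimodal data, and a high mask ratio hides most of them so the model must reconstruct the hidden ones from the visible few. This is only a sketch of the masking step, not of Noetic's actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(n_tokens, mask_ratio):
    # Choose which tokens the model may see. Returns a boolean array that
    # is True at visible positions; the reconstruction loss is computed
    # only on the masked (False) positions.
    n_masked = int(round(n_tokens * mask_ratio))
    masked_idx = rng.choice(n_tokens, size=n_masked, replace=False)
    visible = np.ones(n_tokens, dtype=bool)
    visible[masked_idx] = False
    return visible

visible = mask_tokens(n_tokens=100, mask_ratio=0.9)
# Only ~10 of 100 tokens remain visible; with 99% masking it would be 1.
```

The design intuition is that with only a handful of visible tokens, trivial local continuation is impossible, so the model is forced onto the long-range cross-channel correlations.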
RJ Haneke
And Tario, though, is an autoregressive model.
Dan Baer
Yeah, exactly. So yeah, that was going to be the punchline. Prior models, including OctoVC, were of this masked autoencoding style of training objective. Tario is an autoregressive model, which, if you think about it, is a particular choice of masked autoencoding, except instead of randomly masking out parts of the data, you're always asking the model to predict the next token in a sequence. We know that this is something that scales very well with LLMs: training on the next token prediction task. And it's still an open question how you get models of other data modalities to scale the way that LLMs have scaled. Tario was not actually our first attempt, but one of our subsequent attempts, to bring that autoregressive next token prediction task into modeling spatial transcriptomics data. We found that when we used this architecture and this task, we started to see much better scaling behavior, where bigger models, especially at longer context lengths, were really outperforming the smaller models at shorter context lengths.
RJ Haneke
Because they can see further into the image?
Dan Baer
Yeah, that's probably a big part of it, I think. There's actually a pretty subtle but very interesting result in that blog post with Tario, which is that you only really see the benefits of using larger models when you're looking at longer context lengths.
Ron Alpa
And here we.
Dan Baer
Longer context really means, again, that you're seeing more tissue at once, more area at once. And I'm not super deep into the language modeling literature, so I don't know if there's an analogous result with language models where you only see these scaling behaviors at longer context. So it could be that what we're finding here is that with patient data, you really do need to incorporate more of the patient's spatial context to get the models to learn these more complicated nonlinear patterns in the spatial transcriptomics and take advantage of scale.
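The autoregressive alternative described above can be sketched via the causal attention mask that encodes it: order the spatial tokens into a sequence and always predict token t from tokens before t. This is a generic transformer construction, not Tario's specific architecture.

```python
import numpy as np

def causal_mask(seq_len):
    # mask[i, j] is True when position i may attend to position j.
    # Lower-triangular: each token sees itself and everything before it,
    # nothing after — so every prediction is "next token from the past".
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

m = causal_mask(4)
# Longer context length means a larger mask: each prediction conditions
# on more previously-seen tissue, which is where the scaling result above
# says the larger models start to pull ahead.
```

Contrast with masked autoencoding: there the visible set is a random subset, while here it is always the strict prefix of the sequence.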
Brandon Anderson
Is it possible part of this is because you have some number of low expression genes, and the behavior is driven entirely by under-modeling of those low expression genes?
Dan Baer
Yeah, it's definitely possible that the more context you have, the more likely you are to catch these low expression but highly predictive genes, et cetera. I would guess it's a combination of that and larger area. We've done some experiments comparing models with the same amount of context but over smaller or larger areas, and there definitely seems to be an advantage to looking at larger regions of tissue as well.
RJ Haneke
I want to hear about this: you did a big deal recently, you got a lot of press, and I think you have the distinction of being one of the only AI-for-bio tooling companies that is actually making money. So can you tell us whatever you can disclose about that? We'd love to hear.
Ron Alpa
Yeah. So we were really excited to announce a deal with GSK where we licensed them OctoVC, our virtual cell foundation model. We announced that back in January. It's a $50 million deal; it includes an upfront payment, milestones, and then, separate from that, an annual model licensing fee. I think this was an attractive deal for both parties, for us and for GSK, because the deal focuses on models that we've trained already on lung cancer and colon cancer, and allows us to provide them with access to the models. GSK has one of the top AI teams in biopharma, so they know how
Dan Baer
to use these types of capabilities.
Ron Alpa
They can use them for their internal use. They can also use them to fine tune on their data. That was a really big sell for GSK as well, because GSK, and every pharma, is sitting on mountains and mountains of so-called translational data: the types of data that we're training the models on, coming from clinical trials and pathology specimens, across many different therapeutics. Everyone's sitting on a lot of this data, and it's been very hard to unlock. All of a sudden, GSK can use our models both to do simulations and to do therapeutic discovery, but they can also fine tune the models on their data, and in a way the model then becomes GSK's version of the model. This was super exciting. It was at least the first announced foundation model licensing deal in the space. And frankly, it was one we've been trying to do for a long time, even before Noetic. I think a lot of companies have been trying to do these types of deals, and it's just been historically slow: slow for adoption on the pharma side, and slow to demonstrate a very clear value proposition for different types of capabilities. What's unique about this deal is that it doesn't look like a software licensing framework for, let's say, a small amount of money and a number of seats. It looks like a real business development deal in the industry, with a very significant multimillion dollar cash upfront, near term payment. But the substrate of the deal is not a molecule; it's not doing therapeutic discovery work together. The substrate is actually a model, which is what really made this pretty unique.
RJ Haneke
Why do you think there's appetite for this suddenly? It seems like almost whiplash. It seems like only a year or two ago that bio was dying, and now suddenly there's this deal. Boltz is getting a ton of attention. There's so much attention on Isomorphic, and
Ron Alpa
people are AI-pilled to some extent. Increasingly, maybe not totally, but increasingly, people in pharma across the industry are seeing the value of different capabilities. They're able to use some of the open source capabilities and demonstrate the value to themselves internally. And if you look at a pharma company, these companies are working on dozens and dozens of programs. So, and these are frankly just my opinions, I think pharma increasingly wants to be able to access models not just for one collaboration where you and I are working together on this one program. They want to be able to access the technology across the whole pipeline. And I think that's going to create a driving force for not just bespoke, project driven licensing, but actual broad licensing, where a pharma can access the technology across many different therapeutic programs.
Dan Baer
Yeah, and I think also, with the structure prediction models, protein structure prediction and binding prediction models, there is this massive public data set, there are increasing amounts of data, and people can generate data to augment that. So there's enough data that people can train very good models, but maybe not just on the data that any one biopharma company has. And I think the same is true, but even more so, for the types of models that we are building, which are foundation models at the patient biology level. These companies may have a lot of data, but it's scattered, it's siloed, and pulling everything together to train an actual foundation model may not be as easy as it sounds within a single company. Whereas we have just said, you know what, we're going to generate enough data ourselves to actually train a real foundation model. And that's the nice thing about being a startup here: we can make the bet that you actually do benefit from generating all of this data in a uniform way, very high quality, et cetera, and then use that to develop and train the models. My opinion is that you need data at that scale before you can even think about developing models that actually work. You can't do the AI R&D, or build the algorithms, until you have a good enough data set to tell you whether your favorite algorithmic idea is actually working or not. That's a major advantage for us: we have enough data to see whether my idea, or someone else's idea, about how to build a model is actually leading to improvements.
Ron Alpa
Yeah, I mean, this is a good point. Sometimes people ask me, well, why doesn't GSK just generate your data? Well, we just generated data for years; there was no model. It was like, how many years?
Brandon Anderson
Like, how long?
Ron Alpa
Like two years, maybe a year and a half at least, before we had the first trained models working.
Dan Baer
We had the first. So, I mean, certainly, yeah, the OctoVC model we trained in 2024. So that's like two years after. Yeah.
Ron Alpa
So we.
Brandon Anderson
So you're four years in?
Ron Alpa
So this is year four. And so we basically opened the lab, we hired a team, we got all the instruments, we started sourcing tumor samples. There was no prior here that any of this would work. Like, zero.
Brandon Anderson
Pretty crazy. Like, just going for it.
Ron Alpa
And like, we just started generating data, sourcing human tumors, processing them. We built this whole processing pipeline to get the tumors into, like, these arrays and formats. And it takes weeks. It literally takes two weeks for a machine to run a couple slides on the spatial transcriptomics. So you've got these two-week runs where you're processing two slides, and we're just churning data for months, and we didn't even have enough data to train a model for at least a year and a half. And then you're building processing pipelines, you have to align all the data, you've got to post-process it off the machine. So we sort of just built all this, and then, let's say 18 months later: hey, I wonder, can we train a model off of this? And it wasn't obvious. There wasn't like, oh, we're going to train this on some off-the-shelf open source architecture. Dan and the team have done a ton of work.
Dan Baer
Yeah, there wasn't really anything major to go off of. I mean, there were transformers developed for single cell data, but incorporating spatial data into that was, you know, again, there just weren't really datasets out there that people had been able to develop on. So we do a lot of custom model building, and I enjoy that. I think people here enjoy that a lot.
Ron Alpa
We're hiring, by the way.
Brandon Anderson
Yeah, really unique innovation.
RJ Haneke
Sorry, who are you looking for? Like what kind of people?
Dan Baer
Anybody excited about doing ML research on again, this kind of alien landscape of data where you really have to figure out what's working from first principles. And obviously the work we do should have very, very large impact. So definitely not restricted to people who have a biology background. You know, people who just like tackling very challenging machine learning problems and are open to learning the minimum amount of biology necessary to make progress. I think would be great candidates.
Brandon Anderson
Talking to you guys reminds me a lot of the Leash Bio folks. And I know that both of you are part of the Recursion mafia.
Dan Baer
I'm not, actually. Well, yeah.
Brandon Anderson
We're working on getting them on the show in the future too.
Ron Alpa
So.
Brandon Anderson
Yeah, yeah, we're looking forward to that. But it's interesting, because both of you seem to have really similar philosophies, in that you have deep convictions that you're just going to start collecting data before you know it's going to work, and you're going to just brute force it. Go, go, go. And eventually it will work and you'll see signal. I think that's really impressive. I wonder, is there something about Recursion, something in the water, which has led to this sort of thinking of, we're going to commit to doing things at scale, and it may not work at first, you have to hit a certain point before it will?
Ron Alpa
I mean, we failed a lot at the beginning. At Recursion, you mean? Yeah, yeah. I said we had to build it from first principles, and we really did. We spent many years trying to figure out what the data should look like. Ian, myself, we were all involved in kind of platform development: how to design these datasets, how to design the experiments. Iterative cycles over the years, seeing things that did work, things that didn't work. And so coming out of Recursion, I think what a lot of folks there had was an understanding of what are the things we need to think about. So that even if I want to design a totally different dataset today: what are the things that we learned, that we had to learn through trial and error over all those months, that we would try to insert into our new approach? And I don't know that everything I've predicted at Noetik in terms of how to generate the dataset has been important necessarily. But I know that we could start at the very beginning and say, okay, let's make sure we do these 10 things; every one of these 10 things was important before, so let's at least make sure we do them. I don't know that all 10 things are important for us today, but I would presume many of them are, and it lets you sort of leapfrog that process of trial and error a little bit. Certainly we still have trial and error, but hopefully we're not having to solve, you know, 15 problems. Maybe we're only solving three or four problems overall.
Brandon Anderson
So for small biotech startups in the AI space who are collecting their own data, building their own data moat, do you have any advice or suggestions about how to be more successful there?
Ron Alpa
I think you sort of need to think ahead to: okay, what am I trying to do on the machine learning side, and what is the right data for solving this problem? Oftentimes I see a lot of companies are like, okay, I want to generate X dataset, I'm just going to generate X dataset and do machine learning on that. But that might not be the right dataset; you might not have designed it the right way. It doesn't follow that any dataset is a machine learning dataset, or that that dataset solves the problem you're trying to solve. So for me it's really, even early on it was: okay, what problem are we trying to solve, and then what are the data that are going to help solve that problem? Rather than going from the data directly to trying to solve it.
Dan Baer
Sorry, I also had a quick piece of advice, which is: pay attention to where the technology is and where it's changing rapidly. I finished my PhD in 2016. I did a lot of looking at spatial RNA via this technique called in situ hybridization, the same technique that is at the base of what we're doing. I could look at maybe two genes at a time on a single sample, and that took me a full week of manual work. And I came to Noetik five or six years later, and all of a sudden there are platforms where you can look at a thousand genes, or 20,000 genes, at once. It's a single machine that can run this assay. It's expensive, but it's data beyond the wildest dreams of Dan Baer in 2016. And that is only improving, rapidly. So I think it's important to see what the technology of today allows, and also where it's going, in terms of deciding what data to generate.
RJ Haneke
What does that pitch look like? So, I'm going to generate data for a year and a half, and then I spend $50 million?
Ron Alpa
It wasn't 50, it was maybe closer to 10. But yeah, I mean, you have to do that. If you're going into a regime where there's no data and you want to do something different, then there's no shortcut to it, right? You're going to have to generate the dataset, and you're not going to know the answer until it's there. And that's why a lot of companies are not going into the spaces where there are no datasets, because I think it can be challenging to do that.
Brandon Anderson
Yeah, I mean, I think a lot of smaller biotech AI startups will try this pattern where they first either start with a public open source dataset, or they will try a pilot: internally collect a small amount of data and see if something works or it doesn't. And oftentimes there's almost like a critical point where, below it, you're just not going to get any signal. And you have to have conviction that you need to collect up to a certain point before you start really deriving something fundamentally valuable. Yeah.
Ron Alpa
I mean, imagine trying to train a foundation model on not enough data, and then that's sort of your clinical trial. Right?
Brandon Anderson
GPT-1, GPT-2, GPT-3. You know, like, there was a clear progression there. With each one of them, you could see there was something which worked with scale, and there was this insight of, oh, we're going to scale this. Sometimes with biological data, the process of collecting lots of data is just very expensive to begin with. You can't just take something off the shelf and expect that you're going to hit the threshold of, you know, GPT-3-like usefulness.
Ron Alpa
Yeah, yeah.
Brandon Anderson
So, yeah, it takes some conviction.
Dan Baer
It definitely takes conviction. I think it also takes sort of a scientific belief that there's a lot out there that we just don't know yet, and that you're not going to capture the biology you need by having, right now, an agent that reads all of the biological literature, because again, that's just a tiny slice of what's out there. I don't know if it's a great analogy or if I'm going to botch the history here, but in astronomy it required Tycho Brahe collecting this enormous amount of astronomical data at his observatory, which then was the substrate for Kepler figuring out the first laws of planetary motion, and then that was superseded by Newton's laws and so forth. I sometimes don't know how you even get started without this large repository of really high quality data. And maybe there's a tragedy-of-the-commons problem here of who's going to generate that data and who's going to capture the value of it. But I'm very glad that we're taking that bet, and we're seeing it pay off.
Ron Alpa
Yeah, I mean, this is not my expertise, but, hypothetically speaking, how much of the PDB do you need to train on?
Brandon Anderson
I mean, there were some people who argued that, yeah, you can get some pretty good models with, I think, 1% of it. And there were people going back in the 1990s who argued that the PDB was already complete, in the sense that if you had a sufficiently smart algorithm, you could have done a pretty reasonable job of protein folding even back then.
Ron Alpa
Interesting.
Brandon Anderson
So you don't need a lot to get a pretty big boost. But the community was sort of independently collecting PDB data for quite some time without necessarily being convicted that this was going to lead to solving protein folding.
Dan Baer
Yeah.
Brandon Anderson
But then, most of those structures were also quite useful in and of themselves. So maybe that's the counterpoint: oftentimes just knowing a protein structure was very helpful on its own.
Ron Alpa
Right. And we did see, with some reasonable dataset, a transition from, like, the early data. But how many samples did we need? I'm guessing probably on the order of a few hundred before there was real signal.
Dan Baer
Yeah, there was definitely a moment, very soon after I joined, where the dataset just about doubled in size overnight, because there was a huge bolus, and the models immediately got a lot better at that point. And now we run these more controlled experiments of seeing what happens if you train on 10% of the data versus 40% versus 100%, or what happens if you hold out all of the pancreatic cancer or all of the breast cancer. So we have a much better idea of what kind of diversity and scale we need now. I guess I would say, if we were sticking to cancer, maybe we're not that far off. If we end up generating a few hundred patients in a bunch of major and some minor indications, which we're going to do this year, maybe that's enough to generalize to kind of all cancer, because there is a lot of shared biology in cancer and immune cells across different tissues and different mutations and so forth. But if you think about all of the disease biology that there is for a model to learn, maybe that's another order of magnitude.
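[Editor's note: the ablations Dan describes can be sketched as a toy harness. Everything below is illustrative and hypothetical — a simulated cohort, a deliberately simple nearest-centroid stand-in for the real model, and made-up function names — not Noetik's actual pipeline.]

```python
import numpy as np

def make_toy_cohort(n_per_indication=60, n_genes=20, seed=0):
    """Simulate per-patient expression profiles for a few indications,
    with a binary 'responder' label driven by a shared signal so that
    cross-indication generalization is possible."""
    rng = np.random.default_rng(seed)
    X, y, ind = [], [], []
    signal = rng.normal(size=n_genes)                 # shared response axis
    for name in ("breast", "lung", "pancreatic"):
        offset = rng.normal(scale=0.5, size=n_genes)  # tissue-specific shift
        for _ in range(n_per_indication):
            label = int(rng.integers(0, 2))
            x = offset + label * signal + rng.normal(scale=1.0, size=n_genes)
            X.append(x); y.append(label); ind.append(name)
    return np.array(X), np.array(y), np.array(ind)

def centroid_accuracy(Xtr, ytr, Xte, yte):
    """Nearest-centroid classifier: a stand-in for whatever model is
    being ablated; the harness, not the model, is the point here."""
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1)
            < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return float((pred == yte).mean())

def fraction_ablation(X, y, fractions=(0.1, 0.4, 1.0), seed=0):
    """Train on a growing fraction of the cohort, evaluate on a fixed
    held-out quarter: the 10% / 40% / 100% comparison."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    test, train = order[: len(X) // 4], order[len(X) // 4 :]
    return {f: centroid_accuracy(X[train[: max(2, int(f * len(train)))]],
                                 y[train[: max(2, int(f * len(train)))]],
                                 X[test], y[test])
            for f in fractions}

def leave_one_indication_out(X, y, ind):
    """Hold out every sample from one indication at a time and test
    whether a model trained on the rest still transfers."""
    return {name: centroid_accuracy(X[ind != name], y[ind != name],
                                    X[ind == name], y[ind == name])
            for name in np.unique(ind)}
```

On the toy cohort, accuracy grows with the training fraction, and the leave-one-indication-out split directly measures the "shared biology across tissues" question Dan raises.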
Brandon Anderson
But I mean, even being able to solve all cancer biology would be a pretty impressive.
Ron Alpa
Yeah, to cure cancer would be great.
Brandon Anderson
Oh, solve all cancer biology. I did not say cure cancer. Those are two different things.
Ron Alpa
But, yeah, at least, I mean, just take one drug. If you could look at one drug mechanism across the whole of oncology, that's incredibly powerful. I mean, imagine what Merck has done with Keytruda. Merck has run hundreds of trials with Keytruda, it might even be over a thousand trials, in different populations to find all these different indications: okay, the subset of ovarian cancers, the subset of lung cancers, the subset of colon cancers. That's all been done by enrolling trials. If you can look at that biology from model embeddings, and at least have a very well defined starting point, then if I'm going to run a trial, it doesn't have to be as broad as it would need to be if I didn't have any answer. That can be a really powerful tool across a diversity of mechanisms.
Dan Baer
Yeah, maybe just one last point, going back to the virtual cell hot takes. If your goal is to build an actual mechanistic model of an individual cell, and then build up from one cell to an entire tissue, and then tissue to patient and so forth, you might need a lot more data and a lot more data modalities than just gene expression or something like that. But we're taking much more of a top-down approach: we're trying to first solve the problem of what is determining heterogeneity among actual patients, and which of that variability is predictive of drug response. And my intuition is that you don't need to model the mechanism at the subcellular level to solve that problem of which patient should get which drug, or which targets are important in which patients. I saw a similar debate play out in neuroscience and computational neuroscience, where for a long time people were trying to build biophysical models of individual neurons, which they were then going to stitch together into models of the brain and so forth. And what actually ended up working, in terms of building computational models of the brain and behavior, is this abstraction: we're just going to treat individual neurons as linear-nonlinear units, put them together in neural networks connected by linear weight matrices, stack a bunch of layers together, and build neural network models of the brain that abstract away all of the details of what a neuron is doing biophysically. Those are now by far the most predictive models of how a given neuron is going to respond to real-world stimuli in a real brain. And my bet is that the same is going to be true for these models too: by modeling at the level of functional tissue, where you have a bunch of cells interacting in a disease context, you're going to get to the problem of predicting patient-level behavior much faster than by trying to first model a cell and then stitch a bunch of those cells together.
Ron Alpa
Yeah, that makes sense to me. It's a good analogy. I like it.
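[Editor's note: the "linear-nonlinear units connected by weight matrices" abstraction Dan describes can be written down generically in a few lines of numpy. All names and layer sizes below are made up for illustration; this is not a model from the episode.]

```python
import numpy as np

def ln_layer(x, W, b):
    """One linear-nonlinear stage: a linear weight matrix followed by a
    pointwise nonlinearity (ReLU here), abstracting away all the
    biophysical detail of what a real neuron does."""
    return np.maximum(0.0, W @ x + b)

def ln_network(stimulus, weights, biases):
    """Stack LN layers: each layer's output feeds the next through its
    own linear weight matrix, exactly the 'stack a bunch of layers'
    picture from the conversation."""
    h = stimulus
    for W, b in zip(weights, biases):
        h = ln_layer(h, W, b)
    return h

# Toy sizes: 16-dim stimulus -> 32 hidden units -> 8 model 'neurons'.
rng = np.random.default_rng(0)
sizes = [16, 32, 8]
weights = [rng.normal(scale=0.3, size=(sizes[i + 1], sizes[i]))
           for i in range(len(sizes) - 1)]
biases = [np.zeros(sizes[i + 1]) for i in range(len(sizes) - 1)]
response = ln_network(rng.normal(size=16), weights, biases)
```

In the neuroscience work Dan alludes to, the weights are fit (or learned on a task) and the final layer's activations are compared against recorded neural responses; here they are random, since only the shape of the abstraction is being shown.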
RJ Haneke
Do you have any call to action for the listeners?
Ron Alpa
Yeah, I would say, one: everyone should be excited about biology. A lot of my hot takes on X recently are just that. I feel like there's a huge amount of enthusiasm in sort of the mainstream tech ecosystem, and people aren't really following a lot of what's happening in the biology space. But at the same time, you're hearing frontier labs saying we're going to cure cancer, and people should actually look at the folks working on curing cancer, or working on aging, or working on other areas of biology. These are really exciting problems. There are real, significant ML problems in the space. So one call to action is, I would love for people to just be more stoked about learning about applications of machine learning in the biological sciences, and about solving some of these hard problems, because I think these are the problems that are going to massively impact humanity in the next 10 years. And we're really at the very beginning. Maybe we're in the first inkling of the ChatGPT moment for bio, but it's very much just the very beginning.
Dan Baer
You want to catch it while you can.
Ron Alpa
Yeah, yeah.
Dan Baer
In line with that: really dig in and learn more about the details. A lot of the time it's presented as, we have these protein folding models, we have these binding models, we have AI-for-science agents that are reading all of the literature and automating these computational biology workflows. And I think it's important to realize that there are a lot of problems in AI for biology, AI for biochemistry, et cetera, and they're all very important, but solving any one of them is not going to solve the problem of how we develop better therapeutics. We're focused on a pretty particular slice of that process, which is, again, translating things that we know work well in some patients into actual successful drug trials where we know exactly which patients to give them to. And that requires building foundation models at a particular level: the patient level. But people should not be under the impression that this is all going to be solved immediately because AI agents, like LLMs, are going to just read the literature and figure out what the right drug is. There are a lot more data to generate, a lot more ML problems to solve, and a need to translate those methods into actual successful drugs. And there are a lot of different places to contribute.
Ron Alpa
There's a lot to do. Yeah, great. Thank you very much.
Guests: Ron Alpa (Co-founder & CEO, Noetik), Dan Baer (VP of AI, Noetik)
Hosts: RJ Haneke & Brandon Anderson
Date: April 20, 2026
This episode of Latent Space explores the fusion of foundational AI models and experimental biology to solve one of cancer drug development’s biggest problems: the shocking 95% failure rate in clinical trials. Featuring Ron Alpa and Dan Baer from Noetik, the conversation takes a deep dive into how multimodal data collection, custom transformer architectures, and self-supervised learning are reshaping personalized medicine—by identifying which drugs will work for which patients, and why traditional preclinical models have systematically failed. The episode is rich with technical insights and pragmatic reflections on the state of AI-driven bio discovery.
Noetik chose to generate its own data from human tumor samples, building a proprietary lab pipeline for multimodal data—images, protein stains, spatial transcriptomics, and genotyping.
Their data is “orders of magnitude” larger and more controlled than academic datasets, supporting ambitious self-supervised learning goals.
On the core thesis:
“Most drugs fail... because we’re bad at selecting which patients those drugs are going to work in.”
— Ron Alpa (01:40)
On intentional data generation:
“In bio you really need to be intentional about the data that you generate and how you generate it and have some foresight around ... what are the models we’re going to want to train?”
— Ron Alpa (11:27)
On the data moat:
“At least an order of magnitude larger than any of the other data sets that we’ve seen out there. And ... if you drop down to 40% or 10% of that data used in training, the models get a lot worse.”
— Dan Baer (39:21)
On perseverance:
“We just started generating data ... there was no model. It was like, how many years? Like two years, maybe a year and a half at least before we had the first trained models working.... there was no prior here that any of this would work. Like, zero.”
— Ron Alpa (66:35–66:49)
On bio data scaling:
“Imagine trying to train a foundation model on not enough data and then that’s... your clinical trial. Right?”
— Ron Alpa (74:40)
On the analogy to neuroscience:
“What actually ended up working ... is this abstraction. [We] just treat neurons as linear-nonlinear units... Those are now by far the most predictive models ... My bet is that the same is going to be true for these models, too.”
— Dan Baer (79:55)
On the future of AI in biology:
“Maybe we’re in the first inkling of the ChatGPT moment for bio, but it’s like very much just the very beginning.”
— Ron Alpa (82:24)
Listen to the full episode and explore notes at:
https://latent.space