Summary

Practical AI in Healthcare – S1E32

"AI-Driven Drug Discovery at Sanofi"

Guest: Matt Truppo, PhD – Global Head of Computational and AI Strategy, Sanofi
Hosts: Steven Labkoff, MD & Leon Rozenblit, JD, PhD
Date: April 12, 2026

Episode Overview

This episode dives deep into the practical, large-scale implementation of AI in pharmaceutical R&D at Sanofi, focusing especially on discovery and early research. Matt Truppo details how his teams have operationalized AI-powered tools to radically accelerate drug discovery, target identification, and molecular design, while also openly sharing challenges and key lessons from making AI work in a major pharma enterprise.

Key Discussion Points & Insights

1. Sanofi’s AI Integration Philosophy

[02:30 - 03:21]

The hosts describe Sanofi as unusual among large pharma companies: AI is being integrated “across the board” in discovery and research, not just as isolated pilots.
Timelines for bringing drugs to market have been “stubbornly consistent and long,” and costs keep rising; Truppo argues that this complexity calls for tools, such as AI, that can navigate complexity well.

“The complexity necessitates using tools that can navigate complexity really well. And what we're finding … is how specifically we're doing that across the whole value chain of R&D.”
— Matt Truppo ([02:30])

2. Matt Truppo’s Origin Story & Motivation

[03:50 - 04:58]

Truppo’s interdisciplinary background (chemical engineering, bioengineering, chemistry) primed him to work at innovation intersections.
He’s a technology adopter, not an AI native, with a career-long focus on using new tech to accelerate biomedical research.
Saw the inflection point of modern AI in pharma roughly 18 months to two years ago, when new capabilities could finally deal with the sector’s “complex data sets, very large volumes of data.”

3. AI for Target Identification—A Groundbreaking Production Use

[08:39 - 14:11]

Sanofi has moved AI-driven target identification from the theoretical to routine, scalable production.
Developed TIE (Target ID Engines), which use multi-modal evidence—genetic, transcriptomic, pathway, clinical, safety, novelty data—integrated via machine learning and knowledge graphs.
These engines centralize and standardize data so scientists can rapidly compare potential targets with apples-to-apples evidence.

“If you get the target wrong, nothing else you do matters ... The way we approached this was to try to consolidate and centralize ... the data sets that are used ... and the ability to interrogate those models’ information for scientists across the R&D spectrum.”
— Matt Truppo ([08:51])

Architecture includes:
- Multi-layered data integration (internal and external data)
- Semi-supervised models + knowledge graphs
- Natural language interface (an LLM overlay, not the ground-truth reasoning layer), enabling non-experts to “talk” to the data and models directly
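
As a rough illustration of what “combining multi-modal evidence in a scoring framework” can look like, here is a minimal Python sketch. It is not Sanofi's method: the modality names, the weights, the `TargetEvidence` class, and the example targets are all hypothetical, and a plain weighted sum stands in for whatever semi-supervised models and knowledge graphs the real engines use. It does show the property Truppo describes later in the episode, that a score can be broken down into per-modality contributions.

```python
from dataclasses import dataclass

# Hypothetical modality weights; the episode does not disclose how evidence is weighted.
WEIGHTS = {
    "genetics": 0.30,
    "transcriptomics": 0.20,
    "pathway": 0.15,
    "clinical_safety": 0.20,
    "novelty": 0.15,
}


@dataclass(frozen=True)
class TargetEvidence:
    """Evidence for one candidate target (illustrative structure only)."""
    name: str
    scores: dict  # modality name -> evidence strength in [0, 1]


def score_target(evidence, weights=WEIGHTS):
    """Return (total score, per-modality contributions) for one target."""
    contributions = {
        modality: weight * evidence.scores.get(modality, 0.0)
        for modality, weight in weights.items()
    }
    return sum(contributions.values()), contributions


def rank_targets(candidates, weights=WEIGHTS):
    """Sort candidate targets by total score, highest first."""
    return sorted(candidates, key=lambda t: score_target(t, weights)[0], reverse=True)


# Made-up candidates with made-up evidence values.
candidates = [
    TargetEvidence("TARGET_A", {"genetics": 0.9, "transcriptomics": 0.6,
                                "pathway": 0.5, "clinical_safety": 0.8, "novelty": 0.2}),
    TargetEvidence("TARGET_B", {"genetics": 0.4, "transcriptomics": 0.8,
                                "pathway": 0.7, "clinical_safety": 0.6, "novelty": 0.9}),
]

for target in rank_targets(candidates):
    total, parts = score_target(target)
    print(target.name, round(total, 3), parts)
```

Keeping the contributions alongside the total is the design point: a scientist can see which evidence type drove a ranking instead of trusting a single opaque number.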

“It lets scientists ask natural language research questions and that's really what we hope—to drive utility across a broader swath of the organization, rather than just the hardcore informatics and geneticists.”
— Matt Truppo ([13:22])

4. Upskilling Workforce: The “Bilingual” Approach

[14:49 - 16:27]

Major investment in “AI literacy”: Sanofi developed internal courses, starting with executives, then rolling out more widely.
Aims for a “bilingual workforce” where everyone speaks both clinical/scientific and digital/AI languages—crucial for adoption and success.

5. Results: AI Dramatically Accelerates Discovery

[17:08 - 19:25]

In the first 12 months after deploying these AI tools, Sanofi identified and moved 10 novel drug targets into its pipeline, which Truppo calls a first in his career for both number and speed.
Extended the engine to “multi-TIE” for multi-specific target combinations (important for Sanofi's multi-specific biologics): screened 30 million target pairs in a few days, added about a dozen target pairs to the pipeline, and contributed to a 40% increase in the early-stage preclinical portfolio.

“It's the first time in my career … that there are 10 novel targets that to our knowledge no one else is working on. … We ended up increasing overall the size of our early stage preclinical portfolio by 40% in part due to these efforts.”
— Matt Truppo ([17:28], [18:27])
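
To give a feel for the scale of “30 million target pairs,” the short Python sketch below works backwards from the pair count to the size of the candidate pool. It assumes the pairs are unordered combinations drawn from a single pool of targets, which the episode does not state, so the pool size is an illustration and not a reported figure.

```python
import math


def unordered_pairs(n_targets: int) -> int:
    """Number of distinct unordered pairs among n targets: C(n, 2) = n(n-1)/2."""
    return math.comb(n_targets, 2)


def targets_needed(pair_count: int) -> int:
    """Smallest pool size n such that C(n, 2) >= pair_count."""
    # Start from the floor of the positive root of n(n-1)/2 = pair_count, then step up.
    n = (1 + math.isqrt(1 + 8 * pair_count)) // 2
    while math.comb(n, 2) < pair_count:
        n += 1
    return n


pool = targets_needed(30_000_000)
print(pool, unordered_pairs(pool))   # 7747 30004131
print(math.comb(pool, 3))            # triples drawn from the same pool
```

Under this assumption, a pool of under 8,000 candidate targets already yields about 30 million pairs, and moving from pairs to triples multiplies the search space by several thousand again, which is the growth Truppo alludes to.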

AI delivers value by
- Handling immense data volumes
- Weighing disparate modalities
- Uncovering novel biomarkers: insights not possible by “just talking about data in a room”

6. AI for Small Molecule Drug Design ("AI Auto Lead")

[22:15 - 27:06]

Building a fully automated closed-loop (“design-make-test-analyze”) pipeline, integrating AI predictive models, classic QSAR, structure data, and automated lab robotics.
The “brains” (AI) are live—selecting molecules that are then manually synthesized (pending automation hardware).
Results so far: of 200 AI-suggested compounds in the first program, 75% could be synthesized and about one-third were biologically active, with binding in roughly the 10 nanomolar range; these are exceptionally high rates in pharma R&D.

“The practical results are that the first program we ran through these brains where there were 200 compounds that were suggested for synthesis, ... 75% of those were able to be synthesized in the lab. … 1/3 were biologically active around 10 nanomolar range in terms of activity, of binding.”
— Matt Truppo ([25:57])
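
The quoted percentages translate into compound counts as in the Python sketch below. The quote does not say whether “1/3” refers to all 200 suggested compounds or only to those that were synthesized, so both readings are shown; the function and key names are illustrative.

```python
from fractions import Fraction


def funnel(suggested: int, synth_rate: Fraction, active_fraction: Fraction) -> dict:
    """Turn the quoted rates into approximate compound counts under two readings."""
    synthesized = round(suggested * synth_rate)
    return {
        "suggested": suggested,
        "synthesized": synthesized,
        # Reading 1: one-third of the compounds that were actually made.
        "active_if_of_synthesized": round(synthesized * active_fraction),
        # Reading 2: one-third of everything the models suggested.
        "active_if_of_suggested": round(suggested * active_fraction),
    }


result = funnel(200, Fraction(3, 4), Fraction(1, 3))
print(result)
```

Either way, the first program would have produced on the order of 50 to 67 active compounds from a single round of 200 suggestions.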

7. Human-AI Collaboration—Autonomy and Safeguards

[27:06 - 29:04]

Some preclinical steps (compound suggestion, synthesis) can be automated.
Human scientists always make key “move forward” decisions, especially in later stages with safety/clinical implications.
About 50% of Sanofi’s small molecule pipeline expected to be suitable for this AI-driven, automated approach (rest limited by chemistry/automation constraints).

8. Data as the Ultimate Advantage (“The Moat”)

[29:35 - 34:41]

Curated, proprietary biological data is the biggest persistent competitive advantage. Sanofi leverages huge, unique antibody datasets from past acquisitions (e.g., Kymab, Ablynx) to train proprietary large language and protein models, which reportedly outperform public tools like AlphaFold on key structure-prediction metrics.
The quality and structure of data—“highly curated”—trumps model architecture alone.
Major efforts to selectively curate and annotate legacy data where it confers real competitive benefit.

“Companies that are curating biological data for decades ... are going to have a persistent structural advantage. In a sense, what's emerging is that a competitive moat is going to consist of access to highly curated data.”
— Leon Rozenblit ([32:38])

9. Specialization: Nanobodies and Modular Biotherapeutics

[35:08 - 37:58]

Sanofi’s specialty in nanobody technologies allows for modular, multi-targeting drugs with new functionalities.
Example: A clinical-stage drug with five different binding modules (“beads on a string”), enabled by AI design and unique data assets.
Proprietary AI models are starting to deliver de novo nanobody designs, though with more promise than ubiquity—field is working toward routine “zero-shot” design but not there yet.

10. Bringing AI from Molecules to Models—The Next Frontier

[38:37 - 40:48]

Using AI not just to design molecules, but to predict animal and human “PK/PD” (pharmacokinetics/dynamics), toxicity, and even simulate digital patient twins to forecast clinical trial outcomes.
Still “marching” toward the full vision: integrated modeling from in silico through to the clinic and beyond.

“The holy grail would be, can you take all of this information ... and run your digital patient twin model in such a way that you predict what the clinical trial result would be with higher confidence before you even synthesize those molecules?”
— Matt Truppo ([39:50])

11. Candid Look at Challenges

[41:42 - 48:49]

Data integration is the hardest, longest-running challenge, especially after M&A (“Sanofi is a company of hundreds of acquisitions ... data doesn’t get integrated on its own”).
Explainability problems: Without transparency, trust—and adoption—is low; explainable AI is prioritized, especially in higher-risk steps.
Change management: Upskilling and adoption at scale remain tough—tools must be “ready,” and widespread adoption requires more than technical wins.
Workflow handoffs: Many breaks in “automating” the process remain; connecting outputs across teams end-to-end is a key future focus.

“You could have a great model and it accelerates a part of your workflow ... but then the output of that is still manually handed over to the next team ... We need to start connecting each of those aspects ... that's where agentic AI systems come in.”
— Matt Truppo ([45:46])

12. Failure Modes and Pragmatic Risk Management

[47:04 - 48:49]

Mindless KPIs (e.g., “deploy 20 models”) can mislead; focus needs to be on real outcomes, not just numbers.
Not every idea can or should be funded; impact and technological readiness must align.
Some failures and risks are familiar from prior technological shifts—core lesson: pick “bets” with strategic ROI and feasible timelines.

13. Closing Advice to Other Pharma Leaders

[49:57 - 50:41]

Data foundations are everything:

“Get your house in order in terms of data foundations ... it’s the not sexy, boring sounding thing, but is absolutely critical and foundational to everything else you’re going to build.”
— Matt Truppo ([49:57])
Partner smartly: Don’t go solo—focus on your unique capabilities, partner for others (data, talent, tech).

Notable Quotes & Moments

“Just putting the letters AI in front of something obviously doesn’t solve the problem. … I have never seen a technology impact more spaces of life in general as rapidly as this one.”
— Matt Truppo ([05:35])
“The success rate is in large part determined by the target you selected at the very beginning ... if you get the target wrong, nothing else you do matters.”
— Matt Truppo ([08:51])
“Trying to make your workforce bilingual to understand these things and be comfortable with them.”
— Matt Truppo ([14:49])
“We ended up increasing overall the size of our early stage preclinical portfolio by 40% in part due to these efforts. So it was a huge impact.”
— Matt Truppo ([18:27])
“You give physical form to the AI ... The outcome is intended to be acceleration of small molecule drug discovery ... an accelerated closed-loop design, make, test, analyze cycle.”
— Matt Truppo ([23:22])
“Curated, proprietary biological data is the biggest persistent competitive advantage ... a competitive moat is going to consist of access to highly curated data.”
— Leon Rozenblit ([32:38])
“The holy grail would be ... predict what the clinical trial result would be with higher confidence before you even synthesize those molecules. We are not there yet, but ... we’re slowly marching to that state.”
— Matt Truppo ([39:54])

Suggested Key Timestamps

02:30 – Why and how AI is integrated across Sanofi R&D
08:39 – Launch and architecture of AI-powered target identification
13:22 – Natural language interface for scientists
14:49 – Organization-wide AI literacy program
17:08 – Real-world result: 10 novel targets in 12 months
22:15 – Automated AI drug design, AI Auto Lead
29:35 – The true advantage is data; how curated data creates the “moat”
35:08 – Nanobodies: Modular, multi-specific biotech drugs with AI
38:37 – Using AI to predict toxicity/PKPD and build digital patient twins
41:42 – Data, explainability, change management: the hard parts
49:57 – Field note advice for other pharma leaders

Tone & Style

Candid, open, highly scientific but accessible; mix of “nerdy” technical depth and real-world business pragmatism.
The hosts are impressed, occasionally awe-struck, but keep an honest critical lens and draw out “field lessons.”

Summary for New Listeners

This episode is a must-listen for anyone serious about how AI is already transforming drug discovery in pharma, not as hype but with measurable, practical outcomes. Sanofi’s success, per Matt Truppo, comes not just from deploying algorithms but from building a data-driven, AI-literate organization, partnering smartly, and keeping a relentless focus on integrating people, process, and technology. The future, as shown here, is about connecting every layer of data and expertise to make drug innovation faster, better, and more accessible, while being candid about the vast, ongoing challenges.

Transcript

[00:04]
A
Welcome to Practical AI in Healthcare, the podcast that cuts through the noise to spotlight real world solutions delivering real world value. From patient care to clinical research, from life sciences to patient engagement, we focus on what's truly moving the needle in healthcare. No hype, no theory, just practical insights where AI is making a true impact. Welcome aboard and let's get to it.
[00:35]
A
Hello and welcome to this week's edition of Practical AI in Healthcare. My name is Dr. Steven Labkoff and I'm here, as I am every week, with Dr. Leon Rozenblit. How you doing, Leon?
[00:43]
C
I'm great, Steve. Excited to be here with our wonderful guest.
[00:47]
A
Yes, speaking of a wonderful guest. You know, for the past 30 episodes or so, we've been hearing a constant pattern that has to do with finding AI and the real value that people are finding in it. And this week we have the first of a two-part episode. It's going to be with Matt Truppo, who is the global head of computational and AI strategy at Sanofi. He and his team are doing some really interesting things in both discovery and R and D and we're going to hear about those things today. And what's really amazing, frankly, is that Sanofi is allowing Matt to be very open and have this discussion. Most pharma companies don't allow for that type of open discussion. They consider things like this to be a business advantage. But Matt and Sanofi are being very open today and they're going to talk about some of the use cases that are really finding value today. We'll be focusing in on discovery and for that end of research. And then later we'll get into the development piece in the next episode. So Matt, welcome to the podcast.
[01:55]
B
Thank you so much, Steve. Thank you, Leon. It's great to be here.
[01:58]
A
So, to get us off the ground here, you know, you told us something which really caught my attention in our pre-work when we spoke to you, which is that, you know, unlike some pharmas that are out there, you guys have been integrating AI pretty much across the board in almost every aspect of the things you're doing. I know that's aspirational in many places, but you guys are rolling up your sleeves and doing it. And before we get to our first real question, which is usually our hero story question, maybe you can unpack that a little bit for us.
[02:30]
B
Sure. No, happy to. And I think really it comes down to what is the challenge we're trying to face and what are the best tools to try to address those challenges. And it sounds kind of simple to say, but the challenge we're trying to face in pharmaceutical R and D is what we do is super complicated and really borne out in the fact that the timelines have been stubbornly consistent and long for our industry and the cost has been ever increasing to get these insights and medicines to patients. So that complexity necessitates using tools that can navigate complexity really well. And what we're finding and what we'll, you know, get into a little bit today is how specifically we're doing that across the whole value chain of R and D, starting from target ID, moving into molecular design, moving into even predicting how we might do clinical trials and monitoring those trials.
[03:21]
A
Wow. So I'm looking forward to this. I mean, as you know, I've spent most of my career in the life sciences. I've been with Pfizer, with AZ, and with BMS. And I have to say that I haven't seen anybody who's as comprehensive about this approach as what you're about to talk about. So just before we get into the meat and potatoes, we'd like to ask our guests their origin story. How did they get where they are? What was it that gave them their superhero cape, if you will?
[03:50]
B
Sure, sure. So I'm a little bit of a scientific mutt, I say. So I'm a chemical engineer, a bioengineer and a chemist by academic training. And that was very intentional. So I've always worked and been interested in the interface of disciplines where you're really likely to find innovation. And it also keeps it quite interesting. You meet folks of different backgrounds with different proclivities, different talents, and they just think about how to solve problems in different ways. So I love bringing that together. This is my 25th year in large pharma R and D, so I've been in the space for quite some time, been really fortunate to work on great teams that have brought a host of medicines to market to impact patients' lives. And really for me, what's grounded and directed my career and the different paths I've taken is the adoption of any technology that can accelerate the transformation of fundamental biomedical research into meaningful therapies that save and improve human lives. I'm not a digital or AI native. It's something that I've kind of learned a little bit along the way. There are people who are way smarter at this stuff than me, but again, I'm happy to adopt and learn anything I need to try to accelerate medicines to patients.
[04:59]
A
So it's interesting. You and I share that kind of common thread as well. I'm also not an AI native, but I saw very acutely that when the AI was hitting the scene in 2023 that it was going to be a game changer in healthcare and in the life sciences. And I took a dive in there as well. What I'd love to get into here is, you know, when did you decide or when did you see that this was going to really change? Because AI has been around a long time. You know, I started as an informatician close to 35 years ago. And we were doing AI back then, but not like we are now. So when did the bug bite you?
[05:36]
B
Yeah, so it was really, I'd say around 18 months to two years ago that a couple of things came together. And again, one of them is the problem statement. I talked a little bit about the complexity, the cost. We're on a truly, I would say, unsustainable journey and patients demand and need better. We can't continue to raise the cost and the time to bring these important therapies to patients. And it's not that we haven't tried. So due to the complexity of the disease, and that leads to a concomitant increase in the complexity of the chemical matter we need to actually invent, the drugs we need to invent, it means that the bar keeps being raised with respect to efficacy ceilings, how the drugs work, and durability of response. So we've kind of, for the last many decades, been running just to stand still in terms of keeping up with the patient need. So that for me was one wakeup call and then that led to, okay, well, where can you find answers to try to address some of these problems? There's the continual steady improvement that we all are used to and do. But then as you said just a few years ago, capability came on the scene that really transformed the way we could think about how we deal with complex issues, complex data sets, very large volumes of data. And that's something that we deal with in R and D, pharmaceutical R and D all the time. And these are these AI tools. And I do want to say, and we'll get into this as we go, but just putting the letters AI in front of something obviously doesn't solve the problem. Right. That's a bit of a fallacy that sometimes we can fall into. But I will say I have never seen a technology impact more spaces of life in general as rapidly as this one. So the overall tools and capabilities are clearly amenable to many spaces. It's really about figuring out which ones you want to tackle first, second and third.
[07:29]
C
So Matt, I love that you're starting off with a problem-focused perspective. And that pragmatic lens is going to be really important because what you're about to describe isn't theoretical. You've got proof points, right? And you're building stuff that's really fascinating. And I will start with the one that I think is going to surprise our audience the most because it surprised me: you're actually using AI deeply for target identification in a production environment. And for those of us in our audience who aren't as close to life sciences, target identification is a really deep problem. It's figuring out which molecule is going to do something useful. And the combinatorial space of discovery is enormous. Right. It's a library of Babel problem. Like how do you find something in this combination that would potentially work? And there are some deep approaches and we've seen, you know, AI-powered solutions that have made this easier, but so far they've been either theoretical or a little bit closer to the research side. And you're, you know, you're starting to do this in the R and D pipeline further down the pipe. Tell us what you built and, you know, tell us how you did it.
[08:39]
B
Sure, sure. And, and I'm, I'm obviously representing on this call, like many teams that have actually done the real work in building these things and just phenomenal results so far. So I'll tell you a little bit.
[08:50]
C
You're just a humble country ch. We get it.
[08:52]
B
I am indeed. I am indeed. So, Leon, you make a really good point that target selection, target identification, it's something classically large pharma has not been the greatest at. Typically this comes out of decades of laboratory work in academic labs and it still in large part comes from there. It's a really important part of the process though. And the reason is if you get the target wrong, nothing else you do matters. You can spend a decade and hundreds of millions and it won't actually change anything about the success. The success rate is in large part determined by the target you selected at the very beginning of the project. So the way we approached this was to try to consolidate and centralize a lot of the work that's done, both in terms of the data sets that are used, but also in terms of the models and then the ability to interrogate those models information for scientists across the R and D spectrum. So I'll talk about a couple of examples and one is in the tools that we call TIE or target ID engines. These are AI driven target identification and prioritization engines. They are centralized in a target disease and systems biology group which invented these capabilities within Sanofi. And they integrate multimodal evidence, meaning genetic association data, transcriptomics, both bulk and single cell transcriptomics. They integrate known biological pathways, they integrate safety signals for biology that's been observed in clinic, and they also integrate novelty assessments. So is anyone working on this already? Is this completely novel? What information do we have in the external world? So this data, which is all multimodal, is combined in a scoring framework to support target selection and credentialing. The centralization of both the data sets and the actual models and activities allows us to rapidly compare targets for a given disease area and also allows us to have easy evidence review all in one place. 
So you're comparing apples to apples, in other words. And the way that this has been approached is the semi supervised machine learning model. So it incorporates knowledge graph, it incorporates informatics, it incorporates test sets that are withheld from the training set to sort of validate that this model is doing what we want it to do. And in that way we're able to generate confidence about novel targets. And it really comes from that confidence from recapitulating known targets of interest. Right. If we can recapitulate known targets with WBT data, then it gives you some confidence that the system might be working. But the proof will really be in the clinic. The proof will really be when these things reach the clinic. Do they end up resulting in. Yes, these targets were indeed successful for the disease area.
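
The validation idea described in this answer, withholding known targets and checking that the model ranks them highly, can be sketched in a few lines of Python. This is only an illustration of the general technique: the function name, the target labels, and the choice of recall-at-k as the metric are assumptions, not details from the episode.

```python
def recall_at_k(ranked_targets, held_out_known, k):
    """Fraction of held-out known targets that appear in the top k of a ranking."""
    known = set(held_out_known)
    if not known:
        raise ValueError("held_out_known must not be empty")
    top_k = set(ranked_targets[:k])
    return len(top_k & known) / len(known)


# Hypothetical model output: candidate targets, best first.
ranked = ["T7", "T2", "T9", "T4", "T1", "T5", "T3", "T8", "T6"]

# Hypothetical known targets that were withheld from training.
held_out = {"T2", "T4", "T6"}

print(recall_at_k(ranked, held_out, k=4))  # two of the three known targets are in the top 4
```

A high recall on withheld known targets is indirect evidence only; as Truppo says, the real proof comes when novel targets reach the clinic.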
[11:38]
C
So, Matt, you've described or alluded to a really sophisticated multimodal architecture. We're using knowledge graphs to pull together transcriptomics, proteomics, spatial transcriptomics, genetics data. Can you give us just a little bit more on the architecture that enabled this to happen? Right. I mean, I understand some of this may be proprietary. So give us just a little bit, just give us a taste. Right. You know, I think that this is a sufficiently nerdy audience that would appreciate knowing a little bit more.
[12:10]
B
Sure, sure. So, and I'll, I'll keep it to the publicly, of course, publicly appropriate savings. But so I think there are multiple layers that had to be brought together technologically to get this to work well. And I'm happy to talk about where it doesn't work well as well because there's a long way to go. So the first piece was bringing together those multimodal data sets into one environment. So these are things that these are data that are obtained on the outside, it's data that's obtained on the inside from our own clinical studies. That connection is a challenging one to be able to actually pull back clinical data in a way that is appropriate, in a way that is approved, in a way that the use cases are predetermined so that we can pull that back for secondary use. That all has to be sorted out prior to doing anything with the data to begin with. So once that data is in the appropriate environment where you can start developing and building models I mentioned, it is not just a single architecture for the models that are being built. So in some cases those models are knowledge graphs, which I mentioned. In some cases it's semi supervised learning where we are training based on known results and outcomes in the clinic and then we are making sure that the model is reinforced, to behave, to learn those behaviors. As it builds models out, there's another layer as well which we've moved to recently, which incorporates a large language model, kind of like the ChatGPT and Claude and things that we're all used to dealing with sort of in normal life now as a way to interact and talk with that data. So this is an overlay layer. It is not the ground truth reasoning layer, but it's an overlay layer to talk to that data and to talk to those models so that non experts can gain insights from the tools as well. 
So it lets scientists ask natural language research questions and that's really what we hope to drive utility across a broader swath of the organization rather than the hardcore informatics and geneticists.
[14:12]
A
So Matt, I guess the question that's coming to my mind is to do all this and one of the things that has caught my attention and it's, you know, as I think you know, Leon and I are both on a program at the DCI network and in order for all this to work and what's really impressed me about the fact you're doing it across the board is that the degree of AI literacy, the ability for your staff to understand how to interact and work with these things, is mission critical. And I have to tell you that that's something that I didn't see in previous places I've been. There was a lot of attention on the models, there was a lot of attention on the tech, not a lot of attention on the people. How are you guys handling that piece of the equation?
[14:49]
B
Yeah, that's a great question. And actually we took an extremely proactive approach with our human resources, what we call people and culture, colleagues as well as our digital organization to a couple of years ago, say, how do we start the transformation of becoming a bilingual workforce where the other language is the digital component and not everyone has to be an expert in it, but we all have to be literate enough to understand where the pitfalls, where the success is, what questions to ask along the way. So we actually developed a couple series of courses first starting with the executives in the company to actually train and say. And I was part of a test group for the very first one of those courses, which was a lot of fun as that was being developed. It's always good to be your own guinea pig. And that has been deployed to hundreds of executives at the company and now is being pushed out in other forms, meaning online tools to thousands of employees. So that was a real part of the ongoing transformation of making your workforce bilingual to understand these things and be comfortable with them.
[15:55]
A
And it's amazing that you guys have accomplished that because, you know, I've been in meetings with a lot of senior executives across multiple functions. Now, I worked much more on the commercial side, and also on the R and D side. But wherever I'd been, you know, many of the senior leaders did not have that perspective and there wasn't much effort to help get them there. So, you know, kudos to Sanofi for thinking that way and taking the bull by the horns and getting everybody trained up and getting a literacy thing going.
[16:27]
C
Yeah, I agree. The literacy piece is fundamental. I actually want to come back to it. There's one other thing you mentioned. I want to put a stake in the ground for potentially further exploration. So Steve and I both worked in data integration for decades and we just recognize this is an incredibly deep, complex problem. But one interesting thing that's been coming out with modern AI architectures is the, the impedance mismatch between different data sets is getting kind of semi magically reduced as you put more semantic layers on top of it. So I want to come back and ask, like, is that something you're experiencing? But, but I actually want to push this back to pragmatics.
[17:09]
B
Right.
[17:09]
C
I mean, the big thing that you told us that just blew my mind is you got 10 plus novel targets identified in 12 months. I mean, dude, help us understand the magnitude of that. What would that have looked like before?
[17:21]
B
Yeah, yeah. So an important part of this is actually tracking real world success and how does that translate to value the return on investment.
[17:28]
A
Right.
[17:28]
B
Because you could go, as I said, in any number of directions and we'll get some right, we'll get some wrong, but you want to be able to track and monitor. So we were able to put, after these tools were deployed, within the first 12 months, 10 novel targets into our pipeline. We're working on those now. It's the first time in my career that I've seen, in such a short span of time, that being achieved internally within a pharmaceutical R and D organization, that there are 10 novel targets that to our knowledge no one else is working on. So that was pretty impressive. But then beyond that, Leon, I would say that we've moved into a space where we're now coupling this with what are we uniquely good at in the lab, in the wet lab. And one of the things we're uniquely good at is multi-specific targeting drugs with biologics design. And so we've then developed a second layer of this target ID engine called multi-TIE. And this is co-target combinations using relational biology to say, in order to get more durable response, in order to break efficacy ceilings, in order to reach more patients, what multiple target combinations should we go after to have more effective drugs and to give a quick sort of preview on what that looks like. We were able to screen 30 million target pairs because you can imagine that just gets exponentially more difficult as you deal with more targets. 30 million target pairs in a few days and put another dozen target pairs into our system. We ended up increasing overall the size of our early stage preclinical portfolio by 40% in part due to these efforts. So it was a huge impact.
[19:02]
C
So we talked about breaking the efficacy ceiling, like using the AI to rank order target combinations. Can you give our audience a sense of what it is that the AI sees that the traditional approaches missed? Is it a matter of speed and the ability to process just huge numbers, or is it picking up on more subtle signals than you could as a human team?
[19:25]
B
Yeah, I think it's a bit of all of the above. And it is important to be able to differentiate where the confidence is coming from. So the output of these target ID engines is a scoring function. But then you look into that scoring function and see which components are leading to a higher or lower score. So one aspect you mentioned is the big data piece. It's just a lot of data to crunch. So that's something that AI is uniquely good at, being able to sort through very, very large data sets. The other piece, which I think is even more profound, is the multimodal data aspect of it: how do you effectively weight all of the different data sources as you're generating confidence that this is a target that is related to this specific disease and one that we should go after? To be able to do that in a methodical way, and in a way where you're generating more confidence based on looking at legacy multimodal data sets of drugs that have moved into the clinic and whether they've succeeded or failed, I think that's elucidating a space that is very hard for us to do as individual experts looking at the different types of data and then coming together and just talking about it in a room. So that, I think, is a bit unique. And then the last piece that I would say is identifying novel biomarkers which may not have been seen before, but which not only give evidence that this particular target is useful in disease A, B or C, but also parlay into later, when you talk about how we will actually run a clinical trial and how we will look at results of that particular drug in human beings: can we use this biomarker as a way to get evidence to support whether it's working or not? So those are a couple of elements that we've seen come together that are unique to AI systems.
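Editor's sketch: the idea of a scoring function whose components can be inspected can be illustrated with a weighted average over evidence sources. This is not Sanofi's engine; the `Evidence` class, the weights, and the strengths below are invented for illustration, and a plain weighted mean is an assumption standing in for whatever learned weighting the real system uses.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str      # evidence modality, e.g. "genetic" (illustrative label)
    strength: float  # normalized support for the target-disease link, 0..1
    weight: float    # relative trust placed in this modality (hypothetical)


def score_target(evidence):
    """Return (score, contributions) so the score can be traced to its parts."""
    total_weight = sum(e.weight for e in evidence)
    if total_weight == 0:
        return 0.0, {}
    contributions = {
        e.source: e.weight * e.strength / total_weight for e in evidence
    }
    return sum(contributions.values()), contributions


if __name__ == "__main__":
    example = [
        Evidence("genetic", strength=0.8, weight=2.0),
        Evidence("transcriptomic", strength=0.5, weight=1.0),
        Evidence("safety", strength=1.0, weight=1.0),
    ]
    score, parts = score_target(example)
    print(f"score = {score:.3f}")
    for source, part in sorted(parts.items(), key=lambda kv: -kv[1]):
        print(f"  {source}: {part:.3f}")
```

Returning the per-source contributions alongside the total is the point of the sketch: a reviewer can see which modality is driving a high or low score.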
[21:14]
A
So, you know, you're using the AI to identify targets, and then once you have a target, you've got to design a drug to sort of, you know, put a key in a hole, so to speak. And I know that for much more than the last two years, tools have been out there to do things like that, AlphaFold being, you know, first and foremost among them. But there are also tools that are being brought in today around AI simulation and things like that. Can you speak to some of the things that you're doing that are feeding into that aspect of stuff? Because that side of things, you know, has been around a lot longer than the large language models. But I'm guessing that the large language models are adding a layer to this that just didn't exist before and are making it either more efficient or more effective or both in the design steps here.
[22:02]
B
Yeah, yeah, absolutely. And I'm happy to take this in whatever direction you want. I'll leave it up to you, Steven. We can go to small molecule drug design. We can go to large molecule drug design first. Let's start with small.
[22:14]
A
Let's start with small for now.
[22:16]
B
Okay, let's start with small. So this is a space where, as you said, modeling and simulation have been used for decades. Right, so what's new? I think what's new is: can you get to a state? And what I'll talk about is an effort that we call AI auto lead. This is the idea that you can leverage AI not only in spots to predict certain biophysical properties of a molecule, and we'll talk about those, but also to orchestrate a 24-hour running lab. And this is something we're building now, where the goal is to shorten the time cycles by removing white space and by increasing probability of success by about 25 to 40% in our small molecule design.
[23:03]
A
You're seeing that level, huh? That's amazing.
[23:06]
B
So I'll say that we have not achieved it yet. And I'll tell you what we've achieved so far and what we're marching toward. That's the ultimate goal. We hope that that will be deployed in first quarter of 2027. So it's not that far away from when the lights go on. But I'll tell you what we've achieved so far. And we think that this can be applied to about 50% of our small molecule discovery pipeline. And I can talk to why we think that's the case, but the idea is you give physical form to the AI. So what is AI missing? It's missing the arms and legs and hands. The outcome is not to be like a Terminator movie, though. In this case the outcome is intended to be acceleration of small molecule drug discovery. So the concept is an accelerated closed-loop design, make, test, analyze cycle. And this is not a new concept, but I can tell you at least how we're approaching it. The goal, as I said, is you reduce the number of handoffs all the way from ideation to the chemical synthesis of those series to the biological test and the results interpretation. And typically there's either a handoff or a break in that cycle. There are usually handoffs between chemical synthesis and biological tests. There are handoffs between results interpretation and the ideation of the next round. What we want to do is close the loop on all of that. So what we're doing is we identify hits, and then after those hits are identified, in classical methods as well as in silico screening methods, you go into lead optimization. This is typically where it takes a lot of time. You iterate and iterate and iterate on molecular design, and that's where this AI auto lead comes into play. So the AI layer begins the process of lead optimization. And this includes predictive models that are AI based, but also classical ones: classical QSAR models, informatics models, structure-based information from cryo-EM studies, things like that.
So this is not just AI models, but it is the orchestration of many different models by AI in order to answer questions of what molecules we should make first. And again, it goes back to this: not every tool needs to be or even should be AI. Then an automated scheduler will start synthesizing those compounds from reagents that will be available in a chemical store already present in the automated lab. This is the piece that's not built yet, but it is designed and it is being constructed as we speak. And we'll go live in first quarter 2027. What's interesting is that while we're waiting and building the hardware, we've been testing the brains. We've been able to build the brains of this system that I just described. And we've tested the brains by allowing the brains of the system to tell us which series to make. And then we make them manually, because we're building the automation, and we see what happens. And the practical results are that in the first program we ran through these brains, there were 200 compounds suggested for synthesis, out of millions screened in silico, and 75% of those were able to be synthesized in the lab. So a 75% hit rate for what was generated de novo by the AI and could actually be physically made. Out of those 150 compounds, one third were biologically active, around the 10 nanomolar range in terms of binding activity. And those 50 compounds have now moved forward in that specific program and are progressing in preclinical work and future work around those programs. So the brains are starting to work, and now the next step is to couple it up to the body, couple it up to the machines.
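Editor's sketch: the numbers quoted for the first program form a simple attrition funnel, worked through below with exact fractions. The stage names and the `funnel` helper are illustrative; only the counts and rates (200 suggested, 75% synthesized, one third active) come from the conversation.

```python
from fractions import Fraction


def funnel(start, stages):
    """Apply successive pass rates to a starting count; return (stage, count) pairs."""
    counts = [("suggested", Fraction(start))]
    for name, rate in stages:
        counts.append((name, counts[-1][1] * rate))
    return counts


# Rates as quoted in the episode: 75% could be made, one third of those were active.
stages = [("synthesized", Fraction(3, 4)), ("active", Fraction(1, 3))]
result = funnel(200, stages)
overall = result[-1][1] / result[0][1]

if __name__ == "__main__":
    for name, count in result:
        print(f"{name}: {count}")
    print(f"overall yield from suggestion to active compound: {float(overall):.0%}")
```

Read end to end, one in four of the AI-suggested compounds ended up as an active compound that moved forward.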
[26:36]
A
That's absolutely stunning. I mean, the rates you just described and the pace are unbelievable. Now, a lot of the things that you described in terms of that white space in between, a lot of that was human-to-human interaction in the past. And are you saying that you're eliminating the humans in the loop here and you're letting the machines take over all of the synthesis, or at least the primary synthesis? Is that kind of what you're saying here? And is that AI driven, or how? Unpack that a little more for me, if you don't mind.
[27:07]
B
Yeah, and it is a good question, because this is sometimes where people get a little nervous about it in terms of where are humans in the loop? Where are they making decisions to progress? I think there are a couple of key tenets, and we actually have an AI charter within the company for how we use it and what the guidelines are. So a couple of things in this space. One thing to remember is that this is a preclinical space. We're talking about, in this case, synthesizing chemicals, right? So the risk of letting the AI actually say "let's automatically start generating the new round of lead compounds" is very, very low. Right. There are no safety risks to human beings in that sense. So this is a space where we can let the AI start to take on some decisions autonomously about what to do next. The other thing to say is that when we ask whether we are taking humans out of the loop, there are elements where we're going to allow the AI to continuously run, but the human being will always be in the loop with respect to decisions on whether we move this molecule forward, based on the evidence, to the next stage of governance. That is always a human decision. This is a system by which we can generate more qualitative, higher quality evidence that is done in a systematic way in order to support those decisions. And I will also say that our goal is really to increase both the volume of what we're able to handle and the speed at which we're able to progress. So I mentioned before, we think about 50% of our small molecule pipeline might be able to be handled this way. What's limiting that is that the automation, in the way that we're designing and building it, can't handle every type of synthetic step that you need to access the chemical diversity. But when we've done an analysis of our historical pipeline of small molecule drug discovery, we think it could probably handle about half.
And later on you'll build in the missing pieces of unique chemical reactions that you have to go off deck for at the moment. So we're doing this in a stepwise fashion.
[29:04]
C
So, Matt, the results you're describing are really fascinating and the protein engineering stuff is just striking. You're cutting development time in half and the predictions are outperforming AlphaFold, which is sort of a key metric in the industry. I'm tempted to ask what the leverage point is, or at least what you think the leverage point is. Is it the data? Is it the way it's organized? Is it proprietary data sets? What do you think is helping you achieve this unprecedented and impressive level of pragmatic success?
[29:35]
B
Yeah. So I'll switch from small molecule to large molecule, as you mentioned, with AlphaFold, design and things like this, and put it in the context of our biologics AI moonshot initiative, which is focused specifically on the discovery, engineering and development of large molecule drugs. And it is reliant on the data for sure. There's another piece that I'll mention as well, but starting with the data. One of the things that every company has to deal with, when you start to move into this world of AI and you need access to AI-ready data sets, is what do you do with your data? What do you do if you're a large pharma company with decades of historical data that is not in an AI-ready format? Do you try to boil the ocean and collect it all, or do you go on a strictly go-forward basis, where you capture new data in a structured way? Or are you selective about how you look at legacy data? And I think that the approach we've taken is a very pragmatic one, which is we've been very selective about how we look at legacy data. And I think it makes sense, because there are many areas where you may have no competitive advantage in looking at your old historical data and it's just kind of wasting time, effort and resources. And there are areas where you may have an advantage. One area where we saw we had an advantage, to answer your question specifically, is in biologics and large molecule design. We were sitting on hundreds of millions of antibody paired VH/VL sequences, and collected with that was affinity data and epitope data. All of those antibodies were generated in the same way and collected in the same way. Bioinformatics analysis was run in the same way. This was a result of our acquisition of Kymab many years ago. Similarly, with our acquisition of Ablynx, which is where we brought in nanobody capability for basically VHH large molecule drugs, we were also sitting on hundreds of millions of sequences that were fairly unique to the industry.
So we did go back and curate and annotate those data sets, and then train and work with external partners that had very large protein language models. We could then fine-tune those models on our proprietary data sets that we thought had value and that we thought were relevant, particularly in the binding regions, the CDR regions, of proteins for drugs. And indeed we found that was what was able to increase our ability to deliver predictions that were more spot on, more accurate for structure-guided drug design, for epitopes, et cetera. And then the other piece that I'll mention, and we could get into more detail if you'd like, is that that's just one piece. That's structure. How do you pull together the rest? Structure does not make a drug. Right. There are a lot more variables. And that's the other piece that we're pulling together now, and I'm happy to go into more detail on what we're going to do in 2026 to pull all of it together into one dashboard for large molecule drug design.
[32:38]
C
So let me try a thesis on you first. I mean, I just want to synthesize for our audience what I'm hearing that's really intriguing me. Right. So we know that AI in pharma isn't just better algorithms. It's not having the best model, it's having the right data. And what I'm seeing is a confirmation of something Steve and I suspected, which is that companies that have been curating biological data for decades, investing in that curation, and have the capability to bring the data together are going to have a persistent structural advantage. In a sense, what's emerging is that a competitive moat is going to consist of access to highly curated data. What do you think of that? Is that consistent with the way you're seeing the world?
[33:18]
B
It is, it is. And I think what's going to be interesting is that you used the words, I think very appropriately, "highly curated data." So when you think about something like how you mine the external world of scientific literature, there are multiple layers to that. Right. And it's public that we've worked with partners to curate those data sets and make them available to folks internally as they look at experiment ideation. And what you see, if you compare two models, is that a model that was trained on 30 million publication abstracts versus a model that was trained on 30 million publications' full text and images has a very different ability to answer the question that you need. They both look on the surface like they're answering the question. But as you dig into more detail about very specific understanding of what was done in the experiment, one can be more biased based on how the authors wrote the abstract, and the other can be more grounded in truth based on the data in a publication. So these are really important factors to consider. And ontology goes into that. How do you name things? Does the system understand that the same protein can be named three different ways by three different researchers? And how does it pull that together to say that, when you're linking nodes and edges in a knowledge graph, this is actually the same node? So all of this is really important.
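Editor's sketch: the ontology point, that one protein written three ways must resolve to one knowledge-graph node, can be shown with a small alias table. The `ALIASES` mapping, the relation name, and the example triples are hypothetical illustrations, not a real ontology or Sanofi's graph; production systems would typically rely on curated identifier resources rather than a hand-written dictionary.

```python
# Hypothetical alias table: lower-cased surface forms mapped to one canonical ID.
ALIASES = {
    "il13": "IL13",
    "il-13": "IL13",
    "interleukin 13": "IL13",
    "interleukin-13": "IL13",
    "tslp": "TSLP",
    "thymic stromal lymphopoietin": "TSLP",
}


def canonical(name: str) -> str:
    """Map a surface form to its canonical node ID; unknown names pass through."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, name.strip())


def build_graph(triples):
    """Build an adjacency map keyed by canonical node, merging synonymous subjects."""
    graph = {}
    for subject, relation, obj in triples:
        graph.setdefault(canonical(subject), set()).add((relation, canonical(obj)))
    return graph


if __name__ == "__main__":
    # Three mentions by three hypothetical authors, two of them the same protein.
    triples = [
        ("IL-13", "associated_with", "asthma"),
        ("Interleukin 13", "associated_with", "atopic dermatitis"),
        ("thymic stromal lymphopoietin", "associated_with", "asthma"),
    ]
    for node, edges in build_graph(triples).items():
        print(node, sorted(edges))
```

Without the normalization step, "IL-13" and "Interleukin 13" would become two separate nodes and the evidence attached to each would never be combined.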
[34:41]
C
Great perspective. I want to give you a chance to come back and drill a little bit deeper into the details that you wanted to talk about last time. But also let me seed it with a question: you're also applying what you're doing to nanobodies, which is unique to Sanofi. Right. It's sort of a specialty of yours. Are you seeing anything different there? Is there something about the specialization that's allowing you to go faster or perhaps slowing you down?
[35:08]
B
Yeah, yeah. I mean, with reference to nanobodies, the unique aspect that I found really valuable in the nanobody platform is the ability to stitch them together like beads on a string and get functionality in a more straightforward way than perhaps you can with other modalities. And so a good example is our IL-13/TSLP binder lunsekimig, which is in the clinic now. This is a nanobody-based drug. You'll often see it referred to as a bispecific. It's actually a pentaspecific. There are five binders like beads on a string in one protein sequence. It expresses well, so cost of goods should be okay. The purity looks good. So the things that classically plague these super complex Franken-molecules are well controlled. And those five binders achieve different aims. They hit IL-13 and TSLP, two different targets, and they hit multiple epitopes. So you can actually fine-tune not only the affinity of a single binder, but now you have multiple epitopes binding simultaneously. You also have half-life extension from a binder that attaches to human serum albumin. So you can start to have functionality across five different functional units in a single drug that's expressed as one protein. That's remarkable. And that's in the clinic now.
[36:26]
C
Right.
[36:26]
B
And we're getting readouts over time. So you can imagine that with that level of ability, two things happen. One, it's extremely versatile to apply to many different targets and combinations of targets. The other is the complexity gets to be kind of crazy. And we talked about how target identification complexity gets exponentially large. It's the same with protein design space. It gets exponentially large, including the linkers in between the nanobodies themselves. So we have built our own proprietary models in terms of de novo design of nanobodies, which is very cutting edge and not fully there yet. We have a single nanobody in our pipeline that actually was done with a zero-shot de novo design. That is an outlier. I hope that someday that's where the field goes, and that's what we are working on. But that is an outlier. At the moment, the models are just not good enough to be ubiquitous across all different targets, in areas that haven't been seen before. But what has been a very successful approach across multiple different targeting drugs has been the ability to use next-generation protein language models specifically trained on our proprietary data to do things like address, let's say, the binding affinity and the specific epitope, plus target-agnostic models for things like yield and stability, and polyreactivity to look at purity. So these types of models we have been able to apply across a wide swath of our portfolio.
[37:58]
A
So, you know, I'm just enamored by this discussion. I mean, you're bringing together so many different parts of my career and you're accelerating it in a way that many of the folks I've worked with literally only dream about. And you guys are making it happen in real time. Basically, the next step of where this all goes is, once you've got a molecule, once you've got it designed, you've got to test it in animal models. You need to understand PK/PD. I don't want to dwell on this one for too long, but how is it affecting those pieces when you're actually going from the lab into some in vivo models, and how is it working in that stage?
[38:38]
B
Yeah, so that's where the rubber meets the road.
[38:41]
A
Right.
[38:41]
B
You go from in silico to lab based, cellular based, and then you get into in vivo models that you believe you have designed well and you hope are relevant to human biology. And when you start to get into that stage, there are a couple of approaches that we're taking, and it's really to address what the biggest failure modes of drugs are. So for example, once you get into the clinic, 30% or so of drugs fail due to tox. Right. Unexpected tox. And that's unexpected tox as in it wasn't predicted previously in silico, it wasn't seen in animal studies. And so this is truly unexpected tox. So a couple of things we're addressing there: we're building specialist models to look at things like cardiotox. Can we simultaneously detune, let's say, a small molecule drug for hitting the hERG receptor while we are designing that drug to hit the target receptor? And can we do this in real time, in every round of lead engineering? There are other areas where we're taking a similar approach, to hepatotoxicity. With respect to biologic drugs, things like immunogenicity are important in terms of cytokine release. So there are elements that go into the tox picture that we're trying to strip out, the ones that we have the most data on and that are the biggest culprits for drugs failing as you get into the clinic. And can we de-risk that earlier by looking for signs of that in silico and then as we go into relevant animal models? So that's one of the ways that we're approaching it. And then, what we'll probably get into in the part two section, but just to give the preview, is that this data then needs to feed forward into how we're building digital patient twins. The holy grail, and we're not there yet, but the holy grail would be: can you take all of this information, both the in silico predicted data as well as the wet lab data, and run your digital patient twin model in such a way that you predict what the clinical trial result would be with higher confidence before you even synthesize those molecules? We are not there yet, but you can see how we're slowly marching to a state where you want to link all these things together in order to try to get to that sort of nirvana.
[40:49]
A
That's just, this is amazing. This whole conversation, I'm just listening and at various times getting goosebumps, thinking, my God, this is the kind of stuff that we've been dreaming about. And all of a sudden, you know, we're approaching what in my world feels like Star Trek. I mean, this is the kind of thing that the computers were talking about in 1966 when Star Trek came on board. How do you synthesize a new vaccine, you know, just in data alone? And you guys are actually out there doing it everywhere, from the preclinical work to the tox, the PK/PD. It's actually stunning. I'm going to hand it off to Leon and go to the next phase of the discussion.
[41:36]
C
And my job here is going to be to pop the optimism balloon, because we need to talk about what's not working. Right? I mean, you know, let's be fair, right? You guys have had tremendous technical and practical successes. No doubt worth celebrating. But we all know that new technologies don't work the way you want them to. So you offered kindly to talk about where the AI is working for you. Where are the gaps that you're seeing?
[42:01]
B
Yeah, and there are many gaps. And it's important to ground ourselves in what's successful and what's not, so we can see where we actually want to invest in the areas that are not successful and how long it will take to get there. So a couple of places where you see things start to fall down. The biggest piece is grounded in the data and the foundations of that data. And, as you both know very well with your backgrounds, the model is only as good as the data it's trained on. And you can only even train that model if the data is in one place. So that's the biggest challenge to start with: you have all these great ideas of where to apply interesting modeling tools, but you need to have high quality data that is ready for use in those tools, in a place where it can be found and in a place where it can be appropriately used. So getting those data foundations in place is something we've been working on for the last several years. And it's something that is always a continual journey. And I'll give a simple example. If you make an acquisition and you bolt on a company, or you acquire a product at various stages of discovery or development, data comes with it, but it doesn't get integrated on its own. So now how do you integrate the data from that company? Sanofi is a company of hundreds of acquisitions over the years. That's been one of the biggest challenges: how do you actually integrate the data in a way that you can use it to build appropriate models? So that piece is always the first piece you have to solve, and it's always one that can hold up timelines for developing the tools that you want. And you have to be selective about what you go after first, second, third. I talked about some of the things that we intentionally went after. Other areas where it falls down: if you are working with AI methodologies that are not explainable, that can be a big challenge.
And the way I like to tell it to people is, if you could show me it was 99.9% right all the time, but nobody knew how it worked, or it wasn't explainable, and it was in a space where there was very low risk, we'll just use it. I mean, that would be great. If it's a low risk space and it's always working well, then that's okay. It's probably better than what we're doing now. But the reality is that we're not there with these systems. So I think what's important is to build explainability into it, at least from the perspective that you have to know what data sets it's using to answer a question. You have to be able to follow a logical train of thought to be able to convince folks that, yes, this is an option worth pursuing. And so that's another piece. And then another one that I'll talk about is the change management piece, because this actually fits into the change management you need with any new technology. Getting folks to adopt it at the right time is really critical. Adopt it when it's ready, not before. And adopt it quickly when it is ready, to be able to effect the transformation very quickly. And so we talked about some of the ways we're doing that, trying to get folks comfortable and to understand where these new technologies fit and where they don't, through upskilling the workforce. But really the next step is how do you fundamentally get widespread adoption of these tools once they're ready to deploy? And that's something that does remain a challenge. You'd think that, oh, a tool works, it should easily be adopted, but that's not necessarily the case if you have 10 years and you're used to a certain system. So these are all things that, you know, we have to navigate and work on together. And then maybe the last thing I would say on where it falls down: there's still a lot of white space in terms of handoffs, of data, as we mentioned, but also of results and output of models.
So you could have a great model and it accelerates a part of your workflow for a group of, let's say, 50 people. But then the output of that is still manually handed over to the next team to start their portion of the workflow, if that makes sense. We need to start connecting each of those aspects of the workflow so that the information can be fed into the next step, whether that is wet lab work or further in silico prediction, in an automated way. And that's where agentic AI systems come in. That's where we're trying to push with partners next. But that is still a gap.
[46:15]
C
So, Matt, an honest and detailed answer. The two things I wanted to press on, or at least highlight for the audience: one, boy, am I glad you mentioned that data integration and curation is an ongoing challenge.
[46:28]
B
Right.
[46:29]
C
If that piece is not done... it's an assumption of training AI models that somebody's done that dirty work. So that's a really good thing for all of us to remember, and perhaps AI can help with that dirty work. Then the other is that you started, I think, alluding to differences in failure modes, some of which involve patient safety; those absolutely require different handling. What are some of the big failure modes that you're seeing when things go wrong in the work that you're doing? What does that look like? In what way do they go wrong? And how do you manage the different kinds of risk and the different levels of risk?
[47:05]
B
Yeah, yeah. So in terms of the practical failure modes of what can go wrong as you develop these tools, it's interesting because it's not going to be unique or new to AI. You're going to hear similarities to other capabilities and technologies that have been deployed over the years. So one is if there's too much of a focus on a deliverable that is misaligned to the actual goal and intent. In other words, if you see a KPI of "deploy 20 AI models in one ecosystem for protein engineers to use," okay, maybe that was a good KPI when someone sat down and thought of it. But is that relevant to your outcome? What are you actually trying to drive as an outcome? Because if you hit 20 models, but those 20 models are inferior to existing legacy models, obviously it wasn't a good KPI. So we see things like that all the time, where you have to be really critical and refine where you're placing your bets and where you're investing, so that when you get to the end, you don't just have an outcome which meets an arbitrary KPI but doesn't actually move the needle on what you're trying to achieve. And the reason I mention that one is that, because there are so many areas where you can apply these tools, we are getting ideas from the entire organization and we can't possibly fund, you know, even a fraction of all of the ideas that are coming. So one of the hardest things is to pick which of these ideas actually has both the biggest return on investment for the entire company, not just for one group, and also aligns with where the technology is on the edge of actually delivering it. So you need to have the interface and interconnection of both in order to select that idea to move forward. So those are some of the failure modes that we're seeing, just practically, in how you develop these things.
[48:49]
A
So I think we need to start wrapping up, and I want to just sort of tease the next episode, because we've only scratched half of the R&D pipeline. We've only scratched the research side of things, from discovery up through just getting into animal models. We're going to talk next time about development, operations, medical, because, you know, people may think that pharma is one business, but pharma is really like a string of pearls where one thing connects to the other, connects to the other. And they're almost like independent businesses that have to, you know, work in synchrony, sometimes in serial, sometimes in parallel. And when it gets into the clinical trial stage, things get even more complicated than what we've been discussing today. And I'm really looking forward to that. I spent more of my career in that phase of the world in pharma than I have in the discovery and the preclinical work. So I'm really stoked for the next time we get together and have that conversation. I'm going to hand it back to Leon to bring us home.
[49:43]
C
So we like to ask a field notes question as we close. Right. So if you were advising another pharma company that's starting their AI journey in R&D, what would you tell them in 30 seconds or less about where to get started?
[49:57]
B
I would say get your house in order. In terms of data foundations, that's going to be the most critical place to start. It is the not sexy, boring sounding thing, but is absolutely critical and foundational to everything else you're going to build. So that's the first place I would start. And then I would say really focus on what you want to do internally versus what you want to partner with others because you won't be able to do everything yourself. We can't do everything ourselves. We shouldn't do everything ourselves. And in that vein, you look at where do you have something unique or does the partner have something unique? It could be a data set, it could be talent, it could be a certain capability. And that's where you marry those together and that's where you'll find the biggest impact in being able to drive your initiatives forward successfully.
[50:41]
C
A great 30-second answer, Matt. Thank you so much for joining us. It was a terrific start and I'm really excited for part two of our conversation, where we're going to switch to the D side of the R&D pipeline and find out what you guys have been doing there. That sounds at least as exciting. So with that, I just want to thank you. Thanks, Steve, and thanks to our audience, and I look forward to seeing all of you again on part two of Practical AI in Healthcare.
[51:13]
A
As many of our listeners know, Leon and I work very closely with the DCI Network, Division of Clinical Informatics, at Beth Israel Deaconess Medical Center in Boston. This June, the network is hosting Patient Powered Digital Health 2026. The conference will bring together patients, innovators, industry leaders, healthcare providers and policymakers to shape the next generation of real-world, patient-centered solutions. The meeting will run from June 22nd to the 24th in Boston at Harvard Medical School. We've arranged for our listeners to get a discount on registration to the meeting. If you register between now and May 15th and use promo code PracticalAI June no spaces, you'll receive 30% off your registration fee. You can learn more at dcinetwork.org patients2026. In addition, we're always looking for sponsors. If you or your company are interested in becoming a sponsor, please reach out to Adminci Network. See you in Boston. Thank you for joining us this week on Practical AI in Healthcare. If you're ready to go beyond buzzwords and hype and explore how AI is truly transforming healthcare, stay tuned for more conversations that get us to what works. Until next time, stay pract.