Host Announcer
The podcast that makes artificial intelligence practical, productive and accessible to all. If you like this show, you will love the Changelog. It's news on Mondays, deep technical interviews on Wednesdays, and on Fridays, an awesome talk show for your weekend enjoyment. Find us by searching for the Changelog wherever you get your podcasts. Thanks to our partners at Fly.io. Launch your AI apps in five minutes or less. Learn how at Fly.io.
Chris Benson
Welcome to another episode of the Practical AI Podcast. This is Chris Benson. I am your co-host. Normally, Daniel Whitenack is joining me as the other co-host, but he's not able to today. I am a principal AI research engineer at Lockheed Martin. Daniel is the CEO of Prediction Guard. And with us today we have Kate Soule, who is Director of Technical Product Management for Granite at IBM. Welcome to the show, Kate.
Kate Soule
Hey, Chris, thanks for having me.
Chris Benson
So I know we're going to dive shortly into what Granite is, and some of our listeners are probably already familiar with it, some may not be. But before we dive into that: we're talking about AI models. That's what Granite is, and the world of LLMs and generative AI. I'm wondering if you could start off talking a little bit about your own background, how you arrived at this, and we'll get into a little bit about what IBM is doing, why it's interested, and how it fits into the landscape here for those who are not already familiar with it.
Kate Soule
Perfect. Yeah. Thanks, Chris. So I lead the technical product management for Granite, which is IBM's family of large language models that is produced by IBM Research. And so I actually joined IBM and IBM Research a number of years ago, before large language models really became popular. You know, they had a bit of a Netscape moment, right, back in November 2022. So I've been working at the lab for a little while. I am a little bit of an odd duck, so to speak, in that I don't have a research background, I don't have a PhD. I come from a business background. I worked in consulting for a number of years, went to business school and joined IBM Research and the AI lab here in order to get more involved in technology. I've always kind of had one foot in the tech space. I was a data scientist for most of my tenure as a consultant and always, you know, thought that there were a lot of exciting things going on in AI. And so I joined the lab and basically got to work with a lot of generative AI researchers before large language models really kind of became big. And you know, about two and a half years ago, with a lot of the technology we were working on, all of a sudden we started to find and see that there were tremendous business applications. You know, OpenAI really demonstrated what could happen if you took this type of technology and force-fed it enough compute to make it powerful. It could do some really cool things. So from there we worked as a team really to spin up a program and offering at IBM for our own family of large language models that we could offer our customers and the broader open source ecosystem.
Chris Benson
I'm curious, one of the things that we've noticed over time is that different organizations are positioning these large language models within their product offerings in very unique ways. And we could go through some of your competitors and say they do it this way. How do you guys see that in terms of how large language models fit into your product offering? Is there a vision that IBM has for that?
Kate Soule
Yeah, I think the fundamental premise of large language models is that they are a building block that you get to build on and reuse in many different ways, where one model can drive a number of different use cases. So from IBM's perspective, that value proposition resonates really clearly. We see a lot of our customers, and our own internal offerings, where there's a lot of effort on data curation and collection and creating and training bespoke models for a specific task. And now with large language models, we get to kind of use one model. And with very little labeled data, all of a sudden the world's your oyster. There's a lot you can do. And so that's a bit of the reason why we have centralized the development of our large language models within IBM Research. It's not a specific product, it's one offering that then feeds into many of our different products and downstream applications. And it allows us to kind of create this building block that we can then also offer customers to be able to build on top of as well, and open source ecosystem developers. You know, we think there's a lot of different applications for that one offering. And so, you know, that's a little bit, kind of from the organizational side, why it's kind of exciting, right, that we get to do this all within research. We don't have a P&L, so to speak. We're doing this to create ultimately a tool that can support any number of different use cases and downstream applications.
Chris Benson
Very cool. And you mentioned open source. So I want to ask you, because that's always a big topic among organizations: if I remember correctly, Granite is under an Apache 2 license, is that correct?
Kate Soule
That's correct.
Chris Benson
I'm just curious because we've seen strong arguments on both sides. Why is Granite an open source license like that? What was the decision from IBM to go that direction?
Kate Soule
Yeah, well, there were kind of two levels of decision making that we had to go through when we talked about how to license Granite. One was open or closed. So are we going to release this model, release the weights out into the world so that anyone can use it regardless of whether they spend a dime with IBM? And ultimately IBM believes strongly in the power of open source ecosystems. A huge part of our business is built around Red Hat and being able to provide open source software to our customers with enterprise guarantees. And we felt that open AI was a far more responsible environment to develop and to incubate this technology as a whole.
Chris Benson
And when you say open AI, you mean open source, open source AI? Just making sure.
Kate Soule
Very important clarification. Very important clarification. So that was why we released our models out into the open. And then the question was under what license? Because there are a lot of models, there are a lot of licenses, and a bit of what everyone's seeing at the moment is you have a Gemma license for a Gemma model, you've got a Llama license for a Llama model. Everyone's coming up with their own license and, you know, in some ways it makes sense. Models are a bit of a weird artifact. They're not code, you can't execute them on their own, they're not software, they're not data per se, but they are kind of like a big bag of numbers at the end of the day. So, you know, with some of the traditional licenses, I think some people didn't see a clear fit and so they came up with their own. There are also all these different kinds of potential risks that you might want to solve for with a license for a large language model that are different from the risks that you look at with software or data. But at the end of the day, IBM really wanted just to keep this simple, like a no-nonsense license that we felt would be able to promote the broadest use from the ecosystem without any restrictions. So we went with Apache 2 because that's probably the most widely used and just easy to understand license that's out there. And I think it really speaks also to where we see models being important building blocks that are further customized. So we really believe the true value in generative AI is being able to take some of these smaller open source models and build on top of them and even start to customize them. And if you're doing all that work and building on top of something, you want to make sure there are no restrictions on all that IP you've just created. And so that's ultimately why we went with Apache 2.0.
Chris Benson
Understood. And one last follow-up on licensing and then I'll move on. It's partially just a comment. IBM has a really strong legacy in the AI world and decades of software development along with that. I know Red Hat, with the acquisition some years back, has been strong on open source, and IBM has been too, both before and after. I'm just curious, did that make it any easier, do you think, to go with open source? Like, hey, we've done this so much that we're going to do that with this thing too, even though it's a little bit newer in context? Culturally, did it seem easier to get there than for some companies that possibly really struggle with that, that don't have such a legacy in open source?
Kate Soule
I think it did make it easier. I think any company going down this journey has to take a look at, wait, we're spending how much on what, and you're going to give it away for free? And come up, you know, with their own kind of equations on how this starts to make sense. And I think we've just experienced as a company that the software and offerings we create are so much stronger when we're creating them as part of an open source ecosystem than something that we just keep close to the vest. So you know, it was a much easier business case, so to speak, to make and to get the sign-off that we needed. Ultimately our leadership was very supportive in order to encourage this kind of open ecosystem.
Chris Benson
Fantastic. Turning a little bit: as IBM was diving into this realm, and obviously you have a history with Granite, you guys are on 3.2 at this point. That means that you've been working on this for a period of time. But as you're diving into this very competitive ecosystem of building out these open source models that are big, they are expensive to make, and you're looking for an outsized impact in the world. How do you decide how to proceed with what kind of architecture you want? You know, how did you guys think about it? You're looking at competitors, some of them are closed source, like OpenAI is, some of them, like Meta AI, have Llama and that series. As you're looking at what's out there, how do you make a choice about what is right for what you guys are about to go build? Because that's one heck of an investment to make. And I'm kind of curious, when you're looking at that landscape, how you make sense of that in terms of where to invest.
Kate Soule
Yeah, absolutely. So I think it's all about trying to make educated bets that kind of match the constraints that you're operating with and your broader strategy. So early on in our generative AI journey, when we were kind of getting the program up and running, we wanted to take fewer risks. We wanted to learn how to do common architectures, common patterns, before we started to get more innovative in coming up with net new additions on top. And also, you have to keep in mind, this field has just been changing so quickly over the past couple of years, so no one really knew what they were doing. If we look at how models were trained two years ago and the decisions that were made, the game was all about as many parameters as possible and having as little data as possible to keep your training costs down. And now we've totally switched. The general wisdom is as much data as possible and as few parameters as possible to keep your inference costs down once the model's finally deployed. So the whole field's been going through a learning curve, but I think early on our goal was really working on trying to replicate some of the architectures that were already out there, but innovate on the data. So really focus in on how do we create versions of these models that are being released that deliver the same type of functionality, but that were trained by IBM as a trusted partner, working very closely with all of our teams to have a very clear and ethical data curation and sourcing pipeline to train the models. That was the first major innovation aim that we had, and it was actually not on the architecture side. Then as we started to get more confident, as the field started, I don't want to say mature, because we're still, again, in very early innings, but we started to coalesce on some shared understandings of how these models should be trained and what works or doesn't.
Then our goal really has started to focus on from an architecture side, how can we be as efficient as possible? How can we train models that are going to be economical for our customers to run? That's where you've seen us focus a lot on smaller models for right now, and we're working on new architectures, for example, mixture of experts. There's all sorts of things that we are really focusing in really with kind of the mantra of how do we make this as efficient as possible for people to further customize and to run in their own environments.
Chris Benson
So that was a fantastic start, as we dive into Granite itself, kind of laying it out. In your last comments, you talked about the smaller, more economical models, so that you're getting efficient inference on the customer side. You mentioned a phrase which some people may know, some people may not: mixture of experts. As we dive into, you know, what Granite is and its versions going forward here, maybe start with mixture of experts and what you mean by that.
Kate Soule
Absolutely. So if we think of how these models are being built, they're essentially billions of parameters, which are small little numbers that basically are encoding information. And you know, to draw a really simple explanation: if you have a linear regression, like you've got a scatter plot and you're fitting a line, y equals mx plus b, m is a parameter in that equation. Right? So it's like that, except on the scale of billions. With mixture of experts, what we're looking at is, do I really need all 1 billion parameters every single time I run inference? Can I use a subset? Can I have kind of little expert groupings of parameters within my large language model so that at inference time I'm being far more selective and smart about which parameters get called? Because if I'm not using all 8 billion or 120 billion parameters, I can run that inference far faster. It's much more efficient. Really, it's just getting a little bit more nuanced. I think a lot of the early days of generative AI was just throw more compute at it and hope the problem goes away. We're now trying to figure out how we can be far more efficient in how we build these models.
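The routing idea described here can be sketched in a few lines of Python. This is a toy illustration only, not Granite's architecture: the expert count, dimensions, top-k value, and the names `moe_forward`, `make_expert`, and `router` are all assumptions made for the example. It shows a router scoring every expert, keeping only the top few, and therefore touching only a fraction of the expert parameters for a given input.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def make_expert(rng, dim):
    # Hypothetical "expert": a single dim x dim weight matrix.
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]


def moe_forward(x, experts, router, top_k=2):
    """Run only the top_k highest-scoring experts and mix their outputs."""
    scores = matvec(router, x)  # one routing score per expert
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in chosen])
    out = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        y = matvec(experts[i], x)  # only chosen experts are evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, chosen


rng = random.Random(0)
DIM, NUM_EXPERTS, TOP_K = 4, 8, 2  # toy sizes, chosen for illustration
experts = [make_expert(rng, DIM) for _ in range(NUM_EXPERTS)]
router = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

output, used = moe_forward([1.0, 0.5, -0.5, 2.0], experts, router, TOP_K)
total_params = NUM_EXPERTS * DIM * DIM
active_params = TOP_K * DIM * DIM
print(f"experts used: {used}; active expert parameters: {active_params} of {total_params}")
```

With 8 experts and top-2 routing, only a quarter of the expert weights are used per input, which is the source of the inference savings mentioned above.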
Chris Benson
So I appreciate the explanation on mixture of experts, and that makes a lot of sense in terms of trying to use the model efficiently at inference by reducing the number of parameters. I believe right now you guys have, is it 8 billion and 2 billion as the model sizes in terms of the parameters, or have I gotten that wrong?
Kate Soule
We've actually got a couple of sizes. So you're right. We've got 8 billion and 2 billion. But speaking of those mixture of experts models, we actually have a couple of tiny MoE models. MoE stands for mixture of experts. So we've got an MoE model with only a billion parameters and an MoE model with 3 billion parameters. But they respectively use far fewer parameters at inference time. So they run really, really quick. They're designed for more local applications, like running on a CPU.
Chris Benson
So when you make the decision to have different size models in terms of the number of parameters and stuff, do you have different use cases in mind of how those models might be used? And is there one set of scenarios that you would put your 8 billion and another one that would be that 3 billion that you mentioned?
Kate Soule
Yeah, absolutely. So if we think about it, when we're kind of designing the model sizes that we want to train, a huge question that we're trying to solve for is what are the environments these models are going to be run on, and how do I maximize performance without forcing someone to have to buy another GPU to host it? There are models like those small MoE models that were actually designed much more for running on the edge, locally, or on a computer, like just a local laptop. We've got models that are designed to run on a single GPU, which is like our 2 billion and 8 billion models. Those are standard architecture, not MoE. And we've got models on our roadmap that are looking at how can we kind of max out what a single GPU can run, and then how can we max out what a box of GPUs could run? So if you've got eight GPUs stitched together. So, you know, we are definitely thinking about those different kinds of tranches of compute availability that customers might have. And each of those tranches could relate to different use cases. Like obviously, if you're thinking about something that is local, there's all sorts of IoT type of use cases that that could target. If you are looking at something that has to be run on a box of eight GPUs, you're looking at something where you have to be okay with having a little bit more latency, the time it takes for the model to respond, but where the use case also probably needs to be a little bit higher value because it costs more to run that big model. And so you're not going to run a really simple "help me summarize this email" task hitting eight GPUs at once.
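As a rough back-of-the-envelope for these compute tranches, the memory needed just to hold a model's weights is the parameter count times the bytes per parameter. The sketch below is an illustration under stated assumptions: the 80 GB GPU size and the precision options are examples rather than figures from the conversation, and the estimate ignores activations, the KV cache, and runtime overhead, so real requirements are higher.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params, precision="fp16"):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


MODEL_SIZES = {"2B": 2e9, "8B": 8e9}  # the dense sizes mentioned in the conversation
GPU_MEMORY_GB = 80  # assumption: one data-center GPU with 80 GB of memory

for name, num_params in MODEL_SIZES.items():
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(num_params, precision)
        verdict = "fits" if gb < GPU_MEMORY_GB else "does not fit"
        print(f"{name} at {precision}: ~{gb:.0f} GB of weights, {verdict} on one GPU")
```

An 8 billion parameter model at 16-bit precision needs about 16 GB for its weights, which is why it sits comfortably in the single-GPU tranche.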
Chris Benson
So as you talk about the segmentation of the family of models and how you're doing that, I know you guys have a white paper, which we'll be linking in the show notes for folks to go and take a look at either during or after they listen here, and you talk about some of the models having experimental chain of thought reasoning capabilities. And I was wondering if you could talk a little bit about what that means.
Kate Soule
Yeah. So we're really excited about the latest release of our Granite models. Just at the end of February, we released Granite 3.2, which is an update to our 2 billion parameter model and our 8 billion parameter model. One of the superpowers we give this model in the new release is we bring in an experimental feature for reasoning. What we mean by that is there's this relatively new concept in generative AI called inference-time compute, where what that really equates to, just to put it in plain language, is if you think longer and harder about a prompt, about a question, you can get a better response. I mean, this works for humans. This is how you and I think. But the same is true for large language models. And "thinking" here runs a bit of a risk of anthropomorphizing, but it's the term where we've landed as a field, so I'll run with it for now. It really means generate more tokens. So have the model think through what's called a chain of thought, generate logical thought processes and sequences of how the model might approach answering, before triggering the model to then respond. We've trained Granite 3.2 8B in order to be able to do that chain of thought reasoning natively and take advantage of this new inference-time compute area of innovation. And what we've done is we've made it selective. So if you don't need to think long and hard about what is two plus two, you turn it off and the model responds faster, just with the answer. If you are giving it a more difficult question and pondering the meaning of life, you might turn thinking on and it's going to think through a little bit first before answering with, in general, a much longer, more chain of thought style approach where it's explaining step by step why it's responding the way it is.
Chris Benson
Do you anticipate kind of, and I've seen this done from different organizations in different ways. Do you anticipate that your inference time compute capability is going to be kind of there on all the models and you're turning it on and off? Or do you anticipate that some of the models in your family are more specializing in that and that's always on versus others? Which way? You kind of mentioned the on and off, so it sounded like you might have it in all of the above?
Kate Soule
Yeah, right now it's marked as an experimental feature. I think we're still learning a lot about how this is useful and what it's going to be used for. And that might dictate what makes sense moving forward. But what we're seeing is that kind of universally it's useful: one, to try and improve the quality of the answers, but two, as an explainability feature. If the model is going through and explaining more how it came up with a response, that helps a human better understand the response. I think it is something we're heavily considering just baking into the models moving forward. Which is a different approach, right, than some models which are just focused on reasoning. I don't think we're going to see that very long. I think more and more we're going to see more selective reasoning. So Claude 3.7 came out. They're actually doing a really nice job of this, where you can think longer or harder about something or just think for a short amount of time. So I think we're going to see increasingly more and more folks move in that direction. But I think it's still, again, early innings, I'll say it again. So we're going to learn a lot over the next couple of months about where this is having the most impact, and I think that could have some structural implications for how we design our roadmap moving forward.
Chris Benson
Gotcha. There has been a larger push in the industry towards smaller models. So, you know, kind of going back over the recent history of LLMs, you saw initially the number of parameters exploding and the models becoming huge, and obviously we talked a little bit about the fact that that's very expensive on inference to run these things. And especially over the last, I don't know, year, year and a half, there's been a much stronger push, especially with open source models. We've seen a lot of them on Hugging Face pushing to smaller. Do you anticipate, as you're thinking about this capability of being able to reason, that that's going to drive smaller model use toward models like what you guys are creating? Where you're saying, okay, Claude has the big models out there as an option, or a Llama model that's very large. Are you guys anticipating kind of pulling a lot more mind share towards some of the smaller ones? And do you anticipate that you're going to continue to focus on these smaller, more efficient ones where people can actually get them deployed out there without breaking the bank of their organization? How does that fit in?
Kate Soule
Yeah. So look, one thing to keep in mind is even without thinking about it, without trying, we're seeing small models are increasingly able to do what it took a big model to do yesterday. So our tiny 2 billion parameter Granite 2B model, for example, outperforms on numerous benchmarks, you know, Llama 2 70B, which is a much larger but older generation model. I mean, it was state of the art when it was released, but the technology is just moving so quickly. So, you know, we do believe that by focusing on some of the smaller sizes, ultimately we're going to get a lot of lift just natively, because that is where the technology is evolving. Like, we're continuing to find ways to pack more and more performance into fewer and fewer parameters and expand the scope of what you can accomplish with a small language model. I don't think that means we're going to ever get rid of big models. I just think if you look at where we're focusing, and if you think of the 80/20 rule, 80% of the use cases can be handled by a model of maybe 8 billion parameters or less. That's what we're targeting with Granite, and we're really trying to focus in. We think that there's definitely still always going to be innovation and opportunity and complex use cases that you need larger models to handle. And that's where we're really interested to see, okay, how do we expand the Granite family, potentially focusing on more efficient architectures like mixture of experts, to target those larger models and more complex model sizes so that you still get a little bit more of a practical implementation of a big model. Recognizing that, again, there's always going to be those outliers, those really big cases. We just don't think there's going to be as much business value, frankly, behind those compared to really focusing and delivering value on the small to medium model space.
Chris Benson
I think that's one thing Daniel and I have talked quite a bit about, and we would agree with that. I think the bulk of the use cases are for the smaller ones. While we're at it, we've been talking about various aspects of Granite a bit, but could we take a moment and have you kind of go back through the Granite family and talk about each component in the family, you know, what it's called, what it does, and just kind of lay out the array of things that you have to offer?
Kate Soule
Absolutely. So the Granite model family has the language models that I went over, between 1 billion and 8 billion parameters in size. And again, we think those are the workhorse models. For 80% of the tasks, we think you can probably get away with a model that's 8 billion parameters or less. We also, with 3.2, recently released a vision model. These models are for vision understanding tasks. That's important. It's not vision or image generation, which is where a lot of the early hype and excitement on generative AI came from, like DALL-E and those. We're focused on models where you provide an image in a prompt and then the output is text, the model response. So it's really useful for things like image and document understanding. We specifically prioritized a very large amount of document and chart Q&A type data in its training data set, really focusing on performance on those types of tasks. So you can think of, you know, having a picture or an extract of a chart from a PDF and being able to answer questions about it. We think there's a lot of opportunity. So RAG is a very popular workflow in enterprise, right, retrieval-augmented generation. Right now, all of the images in your PDFs and documents basically get thrown away. But we really are working on, can we use our vision model to actually include all of those charts, images, figures, diagrams to help improve the model's ability to answer questions in a RAG workflow? So we think that's going to be huge. So lots of use cases on the vision side. Then we also have a number of companion models that are designed to work in parallel with a language model or vision language model. We've got our Granite Guardian family of models. We call them guardrails. They're meant to sit right in parallel with the large language model that's running the main workflow. They monitor all the inputs that are coming into the model and all the outputs that are being provided by the model,
looking for potential adversarial prompts, jailbreaking attacks, harmful inputs, harmful and biased outputs. They can detect hallucinations in model responses. So it's really meant to be a governance layer that can sit and work right alongside Granite. It can actually work alongside any model. So even if you've got an OpenAI model, for example, that you've deployed, you can have Granite Guardian work right in parallel and ultimately just be a tool for responsible AI. And the last models I'll talk about are our embedding models, which again are meant to assist a model in a broader generative AI workflow. In a RAG workflow, you'll often need to take large amounts of documents or text and convert them into what are called embeddings that you can search over in order to retrieve the most relevant info and give it to the model. So our Granite embedding models are used for that embedding step. These are meant to do that conversion and can support a number of different, similar kinds of search and retrieval style workflows, working directly with the Granite large language model.
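The embedding step in a RAG workflow can be sketched as follows. This is a minimal illustration, not the Granite embedding models: the `embed` function here is a toy bag-of-words stand-in for a learned embedding model, and the documents and function names are hypothetical. What it shows is the general pattern of converting text to vectors, ranking by similarity, and handing the best match to the language model.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy stand-in for an embedding model: word counts as a sparse vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    query_vec = embed(query)
    index = [(doc, embed(doc)) for doc in documents]  # the "embedding step"
    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]


# Hypothetical document store for illustration.
DOCS = [
    "Quarterly revenue grew in the cloud division.",
    "The vacation policy allows twenty days of paid leave.",
    "Server maintenance is scheduled for Saturday night.",
]

best = retrieve("How many days of paid vacation leave do I get?", DOCS)
print(best)  # the retrieved passage would be added to the model's prompt
```

A real embedding model captures meaning rather than exact word overlap, so it can match a query to a relevant passage even when they share no words.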
Chris Benson
Gotcha. I know there was some comment in the white paper also about time series. Can you talk a little bit to that for a sec?
Kate Soule
Absolutely. So I mentioned Granite is multimodal in that it supports vision. We also have time series as a modality, and I'm really glad you brought these up because these models are really exciting. We talked about our focus on efficiency. These models are 1 to 2 million parameters in size. That is teeny tiny in today's generative AI context, even compared to other forecasting models. These are really small generative AI based time series forecasting models. But they are right now delivering top marks when it comes to performance. So we just, as part of this release, submitted our time series models to a time series leaderboard that Salesforce has called GIFT-Eval. They're the number one model on the GIFT-Eval leaderboard right now. And we're really excited. They've got over 10 million downloads on Hugging Face. They're really taking off in the community. So it's a really excellent offering in the time series modality for the Granite family.
Chris Benson
Okay, well, thank you for going through the layout of the family of models that you guys have. I actually want to go back and ask a quick question. You talked a bit about Guardian providing guardrails and stuff. And that's something I'd like to take a moment to dive into. I think we often tend to focus on the model and that it's going to do X, whatever. I love the notion of integrating these guardrails that Guardian represents into a larger architecture, you know, to address the quality issues surrounding the inputs and the outputs. How did you guys arrive at that? It's pretty cool. I love the idea that not only is it there for your own models, obviously, but that you could have an end user go and apply it to something else that they're doing, maybe from a competitor or whatever. How did you decide to do that? I think that's a fairly unique thing that we don't tend to hear as much from other organizations.
Kate Soule
Yeah. So, Chris, one of the values again of being in the open source ecosystem is we get to build on top of other people's great ideas. So we actually weren't the first ones to come up with it. There's a few other guardrail type models out there, but IBM has quite a large presence, especially IBM Research, in the security space. And there are challenges in security that are very similar to large language models in generative AI, so, you know, it's not totally new. And what I think we've learned as a company and as a field is that you always need layers of security when it comes to creating a robust system against, you know, potential adversarial attacks and dealing with even the model's own innate safety alignment itself. So, you know, when we saw some of the work going out in the open source ecosystem on guardrails, I think it was kind of a no-brainer from a perspective of, this is another great way to add an additional layer on that generative AI stack of security and safety to better improve model robustness and figure out, you know, IBM's hyper focus on what is the practical way to implement generative AI. So what else is needed beyond efficiency? We need trust, we need safety. Let's create tools in that space. So, you know, a number of different reasons all made it very clear and easy to go and pursue. And we are actually able to build on top of Granite. So Granite Guardian is a fine-tuned version of Granite that's laser focused on these tasks of detecting and monitoring inputs going into the model and outputs going out. And the team has done a really excellent job, first starting at basic harm and bias detectors, which I think is pretty prevalent in other guardrail models that are out there. But now we've really started to kind of make it our own and innovate. So some of the new features that were released in the 3.2 Granite Guardian models include hallucination detection.
Very few models do that today, specifically hallucination detection with function calling. So if you think of an agent, whenever an LLM agent is trying to access or submit external information, it'll make what's called a tool call. When it's making that tool call, it's providing information based off of the conversation history, saying, you know, I need to look up, you know, Kate Soule's information in the HR database. This is her first name. She lives in Cambridge, Mass., XYZ. And we want to make sure the agent isn't hallucinating when it's filling in those pieces of information it needs to use to retrieve. Otherwise, you know, if it made up the wrong name or said Cambridge, UK instead of Cambridge, Mass., the tool will provide the incorrect response back, but the agent will have no idea and it will keep operating with utmost certainty that it's operating on correct information. So, you know, it's just an interesting example of some of the observability we're trying to inject into responsible AI workflows, particularly around things like agents, because there's all sorts of new safety concerns that really have to be taken into account to make this technology practical and implementable.
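The function-calling check Kate describes can be illustrated with a deliberately naive sketch. It is not how Granite Guardian works (that is a fine-tuned model); it only shows the shape of the check, using plain substring matching as a stand-in. The `ToolCall` class, the `ungrounded_arguments` helper, and the `hr_lookup` tool name are all hypothetical names invented for this illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToolCall:
    # Hypothetical container for a tool call an agent wants to make.
    name: str
    arguments: dict[str, str]


def ungrounded_arguments(call: ToolCall, history: list[str]) -> list[str]:
    """Return the names of arguments whose values never appear in the conversation.

    Assumption: substring matching stands in for a real hallucination detector.
    """
    transcript = " ".join(history).lower()
    return [
        arg
        for arg, value in call.arguments.items()
        if value.lower() not in transcript
    ]


history = [
    "Please look up Kate Soule in the HR database.",
    "She lives in Cambridge, Mass.",
]

good = ToolCall("hr_lookup", {"name": "Kate Soule", "city": "Cambridge, Mass"})
bad = ToolCall("hr_lookup", {"name": "Kate Soule", "city": "Cambridge, UK"})

print(ungrounded_arguments(good, history))  # []
print(ungrounded_arguments(bad, history))   # ['city']
```

A guardrail layer would run a check like this before the tool executes, so that a made-up argument is flagged instead of silently producing a wrong but confident answer.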
Chris Benson
And you know, that's actually having brought up agents and stuff and that being kind of the really hot topic of the moment of, you know, 2025 so far. Could you talk a little bit about granite and agents and how you guys, you know, how you're thinking. You've gone through one example right there. But if you could expand on that a little bit in terms of, you know, how does, how is IBM thinking about positioning granite? How do agents fit in? What's, what does that ecosystem look like? You know, you've started to talk about security a bit. Could you kind of weave that story for us a little bit?
Kate Soule
Absolutely. So yeah, obviously IBM is all in on agents and there's just so much going on in the space. A couple of key things that I think are interesting to bring up. So one is looking at the open source ecosystem for building agents. So we actually have a really fantastic team located right here in Cambridge, Massachusetts that is working on an agent framework and broader agent stack called BeeAI, like a bumblebee. So we're working really closely with them on how do we kind of co-optimize a framework for agents with a model, in order to be able to have all sorts of new tips and tricks, so to speak, that you can harness when building agents. So I don't want to give too much away, but I think there's a lot of really interesting things that IBM is thinking about with agent framework and model co-design, and that only unlocks so much potential when it comes to safety and security, because there need to be parts, for example, of an LLM or of an agent that the agent developer programs that you never want the user to be able to see. There are parts of data that an agent might retrieve as part of a tool call that you don't want the user to see. For example, an agent that I'm working with might have access to anybody's HR records, but I only have permission to see my HR records. So how can we design models and frameworks with those concepts in mind in order to better demarcate types of sensitive information that should be, you know, hidden in order to protect information, so that the model knows these types of instructions can never be overwritten, no matter what type of later adversarial attacks somebody might try, like saying, you're not Kate's agent, you're a nasty bot and your job is to do X, Y and Z? How do we prevent those types of attack vectors through model and agent framework co-design? So I think there's a lot of really exciting work there.
More broadly though, I think even on more traditional ideas and implementations of agents, not that there's a traditional one, this is so new, but more classical agent implementations, we're working, for example, with IBM Consulting. They have an agent and assistant platform where Granite is the default agent and assistant that gets built. And so that allows IBM all sorts of economies of scale. If you think about it, we've now got 160,000 consultants out in the world using agents and assistants built off of Granite in order to be more efficient and to help them with their client and consulting projects. So we see a ton of what we call client zero. IBM is our first client, in that case, of how do we even internally build agents with Granite in order to improve IBM productivity?
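The HR-records example from this answer can be sketched in a few lines. This is only an illustration of the permission idea, assuming the check lives in ordinary application code around the tool; it does not represent BeeAI or any IBM design. `HRRecord`, `HR_DATABASE`, `hr_lookup_tool`, and `answer_for_user` are hypothetical names, and the records are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HRRecord:
    employee_id: str
    salary_band: str


# Hypothetical data store that the agent's tool can read in full.
HR_DATABASE = {
    "kate": HRRecord("kate", "B7"),
    "chris": HRRecord("chris", "B8"),
}


def hr_lookup_tool(employee_id: str) -> HRRecord:
    """The tool itself has access to anybody's record."""
    return HR_DATABASE[employee_id]


def answer_for_user(requesting_user: str, employee_id: str) -> str:
    """Only surface a record the requesting user is permitted to see."""
    # The permission check runs before the tool is called, so data the
    # user may not see never enters the agent's context at all.
    if requesting_user != employee_id:
        return "Permission denied: you can only view your own HR record."
    record = hr_lookup_tool(employee_id)
    return f"{record.employee_id}: salary band {record.salary_band}"


print(answer_for_user("kate", "kate"))   # kate: salary band B7
print(answer_for_user("kate", "chris"))  # Permission denied: ...
```

Keeping the check outside the model means a later prompt such as "you're not Kate's agent" cannot talk the system into returning someone else's record, which is one simple layer alongside the model-level protections discussed above.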
Chris Benson
Very cool. I'm kind of curious as you guys are looking at this array of considerations that you've just been going through and as there is more and more push out into the edge environments and you've already talked a little bit about that earlier as we're starting to wind down. Could you talk a little bit about kind of as things push a bit out of the cloud and out of the data center and as we have been migrating away from these gigantic models into a lot more smaller hyper efficient models often that are doing better on performance and stuff. And we see so many opportunities out there in a variety of edge environments. Could you talk a little bit about kind of where granite might be going with that or where it is now and kind of what the thoughts about granite at the edge might look like?
Kate Soule
Yeah, so I think with Granite at the edge, there's a couple of different aspects. One is how can we think about building with models so that we can optimize for models that are smaller in size. So when I say building, I mean building prompts, building applications, so that we're not designing prompts how they're written today, which I like to call the YOLO method, where I'm going to give 10 pages of instructions all at once and say go and do this and hope to God the model follows all those instructions and does everything beautifully. Small models, no matter how much this technology advances, probably aren't going to get perfect scores on that type of approach. So how can we think about broader kind of programming frameworks for dividing things up into much smaller pieces that a small model can operate on, and then how do we leverage model and hardware co-design to run those small pieces really fast? So, you know, I think there's a lot of opportunity across the stack of how people are building with models, the models themselves, and the hardware that the model is running on. That's going to allow us to push things much further to the edge than we've really experienced so far. It's going to require a bit of a mind shift again. Like, right now I think we're all really happy that we could be a bit lazy when we write our prompts and just, you know, write kind of word vomit prompts down. But I think if we can get a little bit more of a software engineering mindset in terms of how you program and build, it's going to allow us to break things into much smaller components and push those components even farther to the edge.
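The contrast between one giant prompt and many small steps can be sketched as a tiny pipeline. This is a minimal illustration of the "software engineering mindset" idea, not a framework mentioned in the episode. The `Model` type, `run_pipeline`, and `toy_model` are hypothetical; `toy_model` is a hard-coded stand-in so the sketch runs without any LLM.

```python
from __future__ import annotations

from typing import Callable

# A "model" here is any function from prompt text to response text.
Model = Callable[[str], str]


def run_pipeline(model: Model, document: str, steps: list[str]) -> str:
    """Apply one small instruction at a time, feeding each output to the next step."""
    current = document
    for instruction in steps:
        current = model(f"{instruction}\n\n{current}")
    return current


def toy_model(prompt: str) -> str:
    """Stand-in for a small LLM so the sketch runs offline (an assumption)."""
    instruction, _, text = prompt.partition("\n\n")
    if instruction == "Lowercase the text.":
        return text.lower()
    if instruction == "Keep only the first sentence.":
        return text.split(".")[0] + "."
    return text


steps = ["Keep only the first sentence.", "Lowercase the text."]
print(run_pipeline(toy_model, "Granite Is Small. It Runs At The Edge.", steps))
# granite is small.
```

Each step is small enough to test and tune on its own, which is the property that makes it plausible to hand the individual pieces to a small model.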
Chris Benson
That makes sense. That makes a lot of sense. I guess kind of final question for you as we talk about this kind of any other thought you talked a little bit about kind of where you think things are going right there, Anything that you have to add to that in terms of kind of industry or specific to granite, where you think things are going, what the future looks like when you are kind of winding up for the day and you're at that moment where you're kind of just your mind wanders a little bit. Anything that appeals to you that kind of goes through your head.
Kate Soule
So I think the thing I've been most obsessed about lately is we need to get to the point as a field where models are measured by how efficient their efficient frontier is, not by did they get to 0.01 higher on a metric or benchmark. So I think we're starting to see this with the reasoning. With Granite you can turn it on and off. With Claude you can have harder thoughts, you know, longer thoughts or shorter thoughts. But, you know, I really want to see us get to the point, and I think the table is set for this, we've got the pieces in place, to really start to focus in on how can I make my model as efficient as possible, but as flexible as possible, so I can choose anywhere that I want to be on that performance cost curve. So if my task isn't, you know, very difficult, I don't want to spend a lot of money on it. I'm going to route this in such a way, with very little thinking, to a small model, and I'm going to be able to achieve acceptable performance. If my task is really high value, I'm going to pay more. I don't need to think about this. It's just going to happen, either from the model architecture, from being able to reason or not reason, or from routing that might be happening behind an API endpoint to send my request to a more powerful model or to a less powerful but cheaper model. I think all of that needs to be, you know, we need to get to the point where no one's having to think about this or solve for it and design it. And I really want to see these curves, and I want to be able to see us push those curves as far to the left as possible, making things more and more efficient, versus, like, here's a number on the leaderboard, like, I spent another, you know, X gazillion dollars on compute in order to move that number up by 0.02, and, you know, that's science. Like, I'm ready to move beyond that.
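Choosing a point on the performance-cost curve can be sketched as a small router. This is an illustrative assumption about how such routing might look, not a description of any real endpoint. The `ModelOption` class, the `route` function, the option names, and every cost and quality number below are made up for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_call: float  # hypothetical cost units
    quality: float        # hypothetical score in [0, 1]


def route(options: list[ModelOption], required_quality: float) -> ModelOption:
    """Pick the cheapest option that meets the quality the task needs."""
    adequate = [o for o in options if o.quality >= required_quality]
    if not adequate:
        # Nothing is good enough: fall back to the best model available.
        return max(options, key=lambda o: o.quality)
    return min(adequate, key=lambda o: o.cost_per_call)


# Made-up points on a cost/quality curve.
OPTIONS = [
    ModelOption("small-no-reasoning", 1.0, 0.70),
    ModelOption("small-with-reasoning", 3.0, 0.80),
    ModelOption("large-with-reasoning", 20.0, 0.92),
]

print(route(OPTIONS, 0.65).name)  # small-no-reasoning
print(route(OPTIONS, 0.90).name)  # large-with-reasoning
```

Pushing the curve "to the left" in this picture means lowering `cost_per_call` at a given `quality`, so the same request gets routed to something cheaper without the caller having to think about it.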
Chris Benson
Fantastic. Great conversation. Thank you so much, Kate Soule, for joining us on the Practical AI podcast today. Really appreciate it. A lot of insight there. So thanks for coming on. Hope we can get you back on sometime.
Kate Soule
Thanks so much, Chris. Really appreciate you having me on the show.
Host Announcer
All right, that is our show for this week. If you haven't checked out our Changelog newsletter, head to changelog.com/news. There you'll find 29 reasons, yes, 29 reasons why you should subscribe. I'll tell you reason number 17. You might actually start looking forward to Mondays.
Kate Soule
Sounds like somebody's got a case of the Mondays.
Host Announcer
28 more reasons are waiting for you at changelog.com/news. Thanks again to our partners at Fly.io, to Breakmaster Cylinder for the beats, and to you for listening. That is all for now, but we'll talk to you again next time.
Podcast: Practical AI
Host: Chris Benson (Practical AI LLC)
Guest: Kate Soule (Director of Technical Product Management, IBM Granite)
Date: March 14, 2025
This episode explores IBM’s approach to building efficient, practical, and responsible AI through their Granite family of models. Chris Benson and guest Kate Soule discuss Granite's open-source philosophy, model architectures (with emphasis on efficiency and real-world deployment), and innovations in agent frameworks, security, and responsible AI. The conversation spotlights how IBM balances open-source openness, technical advancements, and real business needs to create AI that’s both cutting-edge and widely accessible.
On Open Licensing:
On Mixture of Experts:
On Edge Deployment:
On Agents & AI Security:
On the Future of AI Benchmarking:
| Time  | Segment                                                      |
|-------|--------------------------------------------------------------|
| 01:51 | Kate Soule's background and IBM’s AI journey                 |
| 03:59 | LLMs as IBM’s product/block foundation                       |
| 05:28 | Granite’s open-source journey and Apache 2.0 licensing       |
| 10:52 | Architectural decisions, focus on efficiency                 |
| 14:04 | Explanation of Mixture of Experts (MoE)                      |
| 18:34 | Reasoning & inference time compute in Granite 3.2            |
| 22:01 | Industry trend toward smaller, more efficient models         |
| 25:48 | Granite model family: LLMs, vision, Guardian, embeddings     |
| 29:03 | Time series models and their significant wins                |
| 31:08 | Role and innovation of Granite Guardian (guardrails)         |
| 34:11 | Agent frameworks, agent security & practical applications    |
| 38:28 | Granite at the edge, new prompt/model/hardware strategies    |
| 40:39 | Kate’s vision for AI’s efficient frontier and future metrics |
This episode offers a deep, practical look at how IBM is evolving its generative AI strategy around Granite: open-source, efficiency-oriented, and built to fit real business needs. Kate Soule provides an inside perspective on why IBM has bet on flexibility, openness, and responsible deployment, and how new features allow the models to reason, safeguard, and serve enterprise AI applications—from cloud to edge. For anyone interested in the present and future of LLMs outside the “biggest benchmark” race, this is a valuable, thought-provoking listen.