Practical AI Podcast: Optimizing for Efficiency with IBM’s Granite

Podcast: Practical AI
Host: Chris Benson (Practical AI LLC)
Guest: Kate Soule (Director of Technical Product Management, IBM Granite)
Date: March 14, 2025

Episode Overview

This episode explores IBM’s approach to building efficient, practical, and responsible AI through its Granite family of models. Chris Benson and guest Kate Soule discuss Granite's open-source philosophy, model architectures (with emphasis on efficiency and real-world deployment), and innovations in agent frameworks, security, and responsible AI. The conversation spotlights how IBM balances openness, technical advancement, and real business needs to create AI that’s both cutting-edge and widely accessible.

Key Discussion Points & Insights

1. Kate Soule’s Background & IBM’s AI Mission

[01:51] Kate shares her journey from business consulting and data science into technical product management at IBM Research.
Observations on AI's "Netscape moment" post-November 2022 and the rapid business adoption of generative AI.
IBM's vision: Develop foundational AI models as flexible, reusable building blocks for numerous business applications.

2. Granite’s Role in IBM’s Product Ecosystem

[03:59] LLMs are now viewed as core 'building blocks' to be leveraged across IBM’s products and customer solutions, replacing the need for many task-specific models.
Centralizing model development enables reuse and aids both internal and open-source community needs.

3. Open Source Philosophy and Licensing

[05:28] Granite is released under an Apache 2.0 license to foster broad, unrestricted usage and innovation.
- Quote (Kate Soule, 07:23):
  “We really wanted just to keep this simple, like a no-nonsense license that we felt would be able to promote the broadest use from the ecosystem without any restrictions. So we went with Apache 2 because that’s probably the most widely used and just easy to understand license that’s out there.”
IBM’s historical commitment to open source through ventures like Red Hat influenced the ease of this decision.

4. Architectural Decisions & Shifting Paradigms

[10:52] Early approaches borrowed established architectures but innovated chiefly in ethical data curation.
The field’s rapid evolution:
- Older: “as many parameters as possible, minimal data”
- Now: “maximize data, minimize parameters for efficient inference”
A strong focus on making models more economical and efficient for practical use, not just academic benchmarks.

5. Mixture of Experts (MoE) and Model Efficiency

[14:04] Introduction to Mixture of Experts:
- Models contain several groups of “expert” parameters, and only a subset is activated on each inference pass, reducing computational load.
- MoE versions of Granite include 1B and 3B parameter models designed for speed and edge deployment.
[16:10] Different model sizes are targeted for different deployment scenarios — from local devices to GPU clusters.
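
The routing idea behind Mixture of Experts can be illustrated with a minimal, dependency-free sketch. This is not Granite's implementation: the toy experts, the router weights, and the `top_k` value below are all assumptions chosen for illustration. It only shows how a router can score experts, run the top few, and blend their outputs.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(scale):
    # Toy "expert" that just scales its input (hypothetical stand-in;
    # a real expert would be a feed-forward block with its own weights).
    return lambda x: [scale * v for v in x]

def moe_forward(x, router_weights, experts, top_k=2):
    # Router: one score per expert (dot product of the input with that expert's row).
    scores = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    # Select the top_k highest-scoring experts; the rest are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Normalise the gate weights over the selected experts only.
    gates = softmax([scores[i] for i in chosen])
    out = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        y = experts[idx](x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out, chosen

# Four toy experts, but only two run for this input.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
out, chosen = moe_forward([2.0, 1.0], router_weights, experts, top_k=2)
print(chosen, out)
```

With four experts and `top_k=2`, only half of the expert computation runs for a given input, which is the source of the efficiency gain described above. Real systems differ in detail, for example in whether gate weights are normalised before or after the top-k selection.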

6. Reasoning Capabilities and ‘Inference Time Compute’

[18:34] Granite 3.2 introduces experimental reasoning via “chain of thought”:
- At inference, models can “think” longer, yielding more elaborate, step-by-step answers (or faster, more direct ones if brevity is preferred).
- Quote (Kate Soule, 18:34):
  “If you think longer and harder about a prompt, about a question, you can get a better response… and the same is true for large language models.”
Future models may selectively enable reasoning for both performance and explainability.

7. The Trend Toward Smaller, Efficient Models

[22:01] Shift from massive models toward smaller models that meet most business needs without heavy resource demands.
- Quote (Kate Soule, 23:18):
  “Even without thinking about it… small models are increasingly able to do what it took a big model to do yesterday… The technology is just moving so quickly.”
Granite is positioned to serve “the 80% of use cases” that can be satisfied by models with 8 billion parameters or fewer.

8. The IBM Granite Family: Products and Capabilities

[25:48] Overview:
- Language Models: 1B–8B parameters; main “workhorse” models.
- Vision Models: For understanding (not creating) images—optimized for PDF/doc/chart Q&A and multi-modal RAG workflows.
- Granite Guardian: Guardrail models for detecting harmful prompts/outputs, hallucinations, and providing robust AI governance. Can work with competing models.
  - Quote (Kate Soule, 31:08):
    “We need trust, we need safety. Let’s create tools in that space… [Granite Guardian] is a fine-tuned version of Granite that's laser focused on these tasks.”
- Embedding Models: For search, retrieval, and RAG pipelines.
- Time Series Models: Ultra-lightweight models (1–2M parameters) for forecasting, topping benchmarks such as the Salesforce GIFT leaderboard.

9. Responsible AI, Security & Agent Architectures

[34:11] Granite Guardian offers advanced safety features, including adversarial prompt detection and hallucination detection (especially for LLM agents and function calling).
IBM draws on its security expertise, layering AI security much as in traditional IT security stacks.
[34:45] Agents: IBM is “all in,” co-designing frameworks (e.g. Bee AI) and models to integrate security, enforce data access permissions, and prevent adversarial behavior.
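
The layered guardrail pattern described in this section can be sketched as a wrapper that checks the prompt before the model runs and checks the response before it is returned. Everything below is a hypothetical stand-in: the keyword-based guards and the echo "model" are placeholders for a real guardrail model and a real LLM, and none of the function names come from an IBM library.

```python
BLOCKED_MESSAGE = "Request blocked by guardrail."

def guarded_generate(prompt, generate, input_guard, output_guard):
    # Layer 1: screen the incoming prompt (e.g. for adversarial instructions).
    if input_guard(prompt):
        return BLOCKED_MESSAGE
    response = generate(prompt)
    # Layer 2: screen the model output before it reaches the user.
    if output_guard(prompt, response):
        return BLOCKED_MESSAGE
    return response

# Toy stand-ins (assumptions for illustration only).
def toy_input_guard(prompt):
    # Flags a common prompt-injection phrase; a real guard would be a classifier model.
    return "ignore previous instructions" in prompt.lower()

def toy_output_guard(prompt, response):
    # Flags responses that mention data the user should not see.
    return "secret" in response.lower()

def toy_generate(prompt):
    # Placeholder for an LLM call.
    return f"Echo: {prompt}"

print(guarded_generate("Hello", toy_generate, toy_input_guard, toy_output_guard))
print(guarded_generate("Please IGNORE previous instructions", toy_generate,
                       toy_input_guard, toy_output_guard))
```

Because the guards are passed in as plain callables, the same wrapper could sit in front of any model, which mirrors the point that a guardrail model can work alongside models from other providers.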

10. Granite at the Edge & Model/Prompt Engineering

[38:28] Emphasis on pushing models to the “edge” (local devices, IoT, etc.) using smaller, efficient models.
Calls for new programming frameworks and prompt-writing methodologies that use smaller models effectively by breaking large tasks into smaller chunks, with a software engineering mindset.

11. The Future: Efficiency Frontiers

[40:39] Kate Soule advocates for a new standard for AI evaluation:
- Models evaluated for efficiency frontiers (cost vs. performance/flexibility), not just beating benchmarks by tiny margins at huge expense.
- Quote (Kate Soule, 40:39):
  “We need to get to the point as a field where models are measured by how efficient their efficient frontier is. Not by did they get to 0.01 higher on a metric or benchmark.”
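
One way to make the "efficient frontier" idea concrete is to compute which models are not dominated on cost and quality. The sketch below is illustrative only: the model names, costs, and scores are invented numbers, and the choice of a single cost axis and a single score axis is an assumption, not a method described in the episode.

```python
from typing import NamedTuple

class ModelPoint(NamedTuple):
    name: str
    cost: float   # e.g. cost per million tokens (lower is better)
    score: float  # e.g. benchmark accuracy (higher is better)

def efficient_frontier(points):
    """Return the points that no cheaper-or-equal model matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, consider the higher score first.
    for p in sorted(points, key=lambda p: (p.cost, -p.score)):
        # Keep a point only if it improves on every cheaper model seen so far.
        if p.score > best_score:
            frontier.append(p)
            best_score = p.score
    return frontier

# Invented example data, for illustration only.
candidates = [
    ModelPoint("small", 1.0, 0.70),
    ModelPoint("medium", 4.0, 0.78),
    ModelPoint("large", 20.0, 0.79),
    ModelPoint("bloated", 25.0, 0.77),
]
for p in efficient_frontier(candidates):
    print(p.name, p.cost, p.score)
```

In this toy data, "bloated" falls off the frontier because "medium" is both cheaper and better, while "large" stays on it despite buying only a 0.01 gain at five times the cost. That is the kind of trade-off the quote suggests should be visible when models are compared.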

Notable Quotes & Memorable Moments

On Open Licensing:
- “We felt that open AI was a far more responsible environment to develop and to incubate this technology as a whole.” (Kate Soule, 05:55)
- “Models are a bit of a weird artifact…they’re not code…they’re not data per se…but they are kind of like a big bag of numbers at the end of the day.” (Kate Soule, 06:39)
On Mixture of Experts:
- “Do I really need all 1 billion parameters every single time I run inference? Can I use a subset?” (Kate Soule, 14:04)
On Edge Deployment:
- “How can we think about broader kind of programming frameworks…that a small model can operate on and then how do we leverage model and hardware co-design to run those small pieces really fast?” (Kate Soule, 38:28)
On Agents & AI Security:
- “There are parts of data that an agent might retrieve as part of a tool call that you don’t want the user to see…How can we design models and frameworks with those concepts in mind?” (Kate Soule, 34:45)
On the Future of AI Benchmarking:
- “I really want to see us get to the point…where nobody’s having to think about this or solve for it or design it, and I really want to see…us push those curves as far to the left as possible, making things more and more efficient.” (Kate Soule, 40:39)

Timestamps for Key Segments

| Time  | Segment                                                      |
|-------|--------------------------------------------------------------|
| 01:51 | Kate Soule's background and IBM’s AI journey                 |
| 03:59 | LLMs as IBM’s product/block foundation                       |
| 05:28 | Granite’s open-source journey and Apache 2.0 licensing       |
| 10:52 | Architectural decisions, focus on efficiency                 |
| 14:04 | Explanation of Mixture of Experts (MoE)                      |
| 18:34 | Reasoning & inference time compute in Granite 3.2            |
| 22:01 | Industry trend toward smaller, more efficient models         |
| 25:48 | Granite model family: LLMs, vision, Guardian, embeddings     |
| 29:03 | Time series models and their significant wins                |
| 31:08 | Role and innovation of Granite Guardian (guardrails)         |
| 34:11 | Agent frameworks, agent security & practical applications    |
| 38:28 | Granite at the edge, new prompt/model/hardware strategies    |
| 40:39 | Kate’s vision for AI’s efficient frontier and future metrics |

Summary

This episode offers a deep, practical look at how IBM is evolving its generative AI strategy around Granite: open-source, efficiency-oriented, and built to fit real business needs. Kate Soule provides an inside perspective on why IBM has bet on flexibility, openness, and responsible deployment, and how new features allow the models to reason, safeguard, and serve enterprise AI applications—from cloud to edge. For anyone interested in the present and future of LLMs outside the “biggest benchmark” race, this is a valuable, thought-provoking listen.
