Latent Space: The AI Engineer Podcast
Episode: Artificial Analysis: Independent LLM Evals as a Service
Guests: George Cameron & Micah Hill-Smith
Host: Latent.Space
Release Date: January 8, 2026
Episode Overview
This episode is a deep dive into Artificial Analysis (AA), a startup that has quickly become an industry leader in independent benchmarking and evaluation of large language models (LLMs) and AI agents. Founders George Cameron and Micah Hill-Smith discuss their journey building an independent, transparent, and highly influential "Gartner of AI," detailing the practices, philosophy, and technical rigor behind AA’s benchmarking and insights as a service, plus their broader impact on the ecosystem.
From the need for objective model evaluation and the nuances of benchmarking practice, to public and private business models, fresh evals like the Omniscience (hallucination) Index and the GDPVAL agentic benchmark, and trends in cost, hardware, and openness, this episode offers a comprehensive crash course for anyone seeking a technical and business perspective on the present and future of AI model evaluation.
Table of Contents
- The Genesis of Artificial Analysis
- Business Model & Independence
- Technical Evolution of Benchmarking
- Key Benchmarks & New Indices
- Industry Impact & Openness
- Trends: Cost, Hardware, and Agentic Evolutions
- The Future of Benchmarks and AI Development
- Notable Quotes & Moments
1. The Genesis of Artificial Analysis <span id="genesis"></span>
**Origins (04:40)**
- AA was born out of necessity: both founders, working on AI projects (e.g., a legal AI assistant), found a lack of independent, apples-to-apples LLM benchmarks, especially concerning trade-offs between cost, accuracy, and speed.
- “The more you go into building something using LLMs, the more each bit of what you're doing ends up being a benchmarking problem.” – Micah (04:47)
**From Side Project to Essential Service (06:06, 06:45)**
- Initially a side project, AA’s website gained immediate traction due to its practical value.
- Community amplification—especially by swyx—helped AA quickly reach a broader audience.
**The Early Landscape (07:01, 08:05)**
- Before AA, evaluation was fragmented (papers copying results, unreliable comparisons, inconsistent formats).
- Standard open-source tools (e.g., EleutherAI's evaluation harness) existed but were used inconsistently.
2. Business Model & Independence <span id="business-model"></span>
**Revenue Streams (01:24, 02:53)**
- AA operates with around 20 people.
- Two main customer groups:
- Enterprises: Subscription to benchmarking insights (standardized reports, model/provider comparisons).
- AI Companies: Private, custom benchmarking for AI products/technologies.
- Crucially, no payment for public website inclusion—ensuring independence.
**Maintaining Integrity (13:21, 14:23)**
- Careful policies to prevent vendor manipulation, including "mystery shopper" evaluations.
- “We laser focus on having the best independent metrics… no one can manipulate them.” – Micah (13:21)
3. Technical Evolution of Benchmarking <span id="technical-evolution"></span>
**Benchmarking Challenges (08:05, 10:33, 11:28)**
- Model responses vary with formatting, sampling randomness, and prompting style; reliably parsing outputs proved non-trivial.
- Nuances include randomizing multiple-choice answer order, controlling for variance, and standardizing eval methodology.
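Those nuances can be sketched in a few lines. This is an illustrative harness fragment, not AA's actual methodology: the function names, repeat count, and use of a simple mean/stdev over repeated shuffled runs are all assumptions for the example.

```python
import random
import statistics

def shuffle_choices(question, choices, correct_idx, rng):
    """Randomize answer order while tracking where the correct option lands."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    return [choices[i] for i in order], order.index(correct_idx)

def repeated_accuracy(grade_fn, items, repeats=8, seed=0):
    """Run the eval several times with reshuffled options; report mean and stdev.

    items: (question, choices, correct_index) tuples.
    grade_fn(question, shuffled_choices) -> chosen index (the model under test).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        correct = 0
        for question, choices, answer_idx in items:
            shuffled, new_idx = shuffle_choices(question, choices, answer_idx, rng)
            correct += grade_fn(question, shuffled) == new_idx
        scores.append(correct / len(items))
    return statistics.mean(scores), statistics.stdev(scores)
```

Randomizing answer order defeats position bias (models that disproportionately pick "A"), and the stdev over repeats makes run-to-run variance visible instead of hiding it in a single number.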
**Scale & Cost (09:39, 12:37, 13:03)**
- Early costs were manageable (hundreds of dollars across a handful of models).
- Now, costs have risen substantially due to more models, deeper/evolved evals, and statistical repeats for confidence.
**Handling Lab Incentives (14:34, 15:58)**
- Discussed "Goodhart’s Law": labs target benchmarked metrics, sometimes at the expense of real-world utility.
- Importance of evolving benchmarks to prevent overfitting and to better reflect user needs.
4. Key Benchmarks & New Indices <span id="key-benchmarks"></span>
**Artificial Analysis Intelligence Index (20:47–22:12, 23:02)**
- Composite metric built from a curated, evolving set of evals (Q&A, agentic, long-context, use-case datasets).
- “Current best single number to look at for how smart the models are.”
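As a rough illustration of how a composite index like this works: the categories and weights below are invented for the example; AA's actual eval mix and weighting are curated and evolve over time.

```python
# Hypothetical category weights -- illustrative only, not AA's real mix.
WEIGHTS = {"knowledge_qa": 0.3, "agentic": 0.3, "long_context": 0.2, "use_case": 0.2}

def composite_index(scores, weights=WEIGHTS):
    """Combine per-eval scores (0-100) into a single weighted index.

    scores: mapping from eval category to that category's score.
    Raises if any weighted category is missing, so a model can't
    climb the index by skipping its weakest evals.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing eval scores: {sorted(missing)}")
    return sum(weights[k] * scores[k] for k in weights)
```

The key design point such an index encodes is that every included eval pulls the headline number, so the choice of components and weights is itself an editorial opinion (a theme the hosts return to later in the episode).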
**Historical Progress (23:17, 24:31)**
- Major leaps in model capability; OpenAI’s dominance has given way to strong, competitive, and often open alternatives.
- Charts show an explosion of quality and variety over the last two years.
Notable New Indices
**Omniscience Index (Hallucination Metric) (28:26–32:47)**
- Measures models’ ability to say “I don’t know” vs. giving incorrect answers.
- “We’re pretty convinced this is an example where it makes most sense to penalize incorrect over uncertain… it’s strictly more helpful to say ‘I don’t know.’” – Micah (28:52)
- Anthropic’s Claude models lead in low hallucination rates.
- 10% of dataset public, rest held out to minimize contamination.
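The penalization idea can be sketched as a simple scoring rule. The episode does not spell out AA's exact formula, so the correct-minus-incorrect percentage below is an assumption for illustration; the point it demonstrates is only that abstaining scores strictly better than answering wrongly.

```python
def omniscience_score(answers):
    """Score factual QA so a wrong answer costs more than abstaining.

    answers: iterable of "correct", "incorrect", or "abstain" verdicts.
    Returns a value in [-100, 100]: (correct - incorrect) as a percentage,
    with abstentions counting as zero rather than as errors.
    """
    n = correct = incorrect = 0
    for a in answers:
        n += 1
        correct += a == "correct"
        incorrect += a == "incorrect"
    if n == 0:
        raise ValueError("no answers to score")
    return 100 * (correct - incorrect) / n
```

Under plain accuracy, a model that guesses on everything it doesn't know looks identical to one that says "I don't know"; a rule like this separates them, which is exactly the behavior the index is trying to reward.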
**Critical Point Eval (33:31–34:19)**
- Hard, research-level physics problems that show where models can, and sometimes must, “hallucinate” to explore.
- Models perform poorly (top score: 9%), but are evaluated for creative exploration, not just correctness.
**GDPVAL Agentic Benchmark (39:59–44:47)**
- Real-world, white-collar tasks spanning 44 occupations (with files and multimedia), moving beyond simple Q&A.
- AA built a reference agent harness and a Gemini-judged, pairwise Elo scoring system.
- Found that models such as GPT-4 perform better in AA’s own eval harness than in web chatbots.
- “The agentic harness is very minimalist… context management, web search, code exec… it works very well.” – George (49:17)
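A judge-driven, pairwise Elo system like the one described can be sketched with the standard Elo formulas. The K-factor and update rule below are textbook Elo, not AA's exact implementation; the "judge" (Gemini, per the episode) simply supplies the 1 / 0.5 / 0 outcome for each head-to-head task comparison.

```python
def expected(r_a, r_b):
    """Expected win probability of A over B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, outcome, k=32):
    """Update two model ratings after one judged comparison.

    outcome: 1.0 if the judge preferred A's task output,
             0.0 if it preferred B's, 0.5 for a tie.
    Returns the new (rating_a, rating_b) pair; the update is zero-sum.
    """
    delta = k * (outcome - expected(r_a, r_b))
    return r_a + delta, r_b - delta
```

The appeal of pairwise judging over absolute grading is that the judge only has to say which of two long, open-ended deliverables is better, which is an easier and more reliable call than assigning each one a standalone score.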
**Openness Index (52:07–58:04)**
- Scores model openness not just by license/weights, but transparency of data, methodology, training code.
- The top-scoring models are currently AI2’s open-weight models.
- Bonus points for full data/method disclosures; recognizes industrial impact of open contributions.
5. Industry Impact & Openness <span id="industry-impact"></span>
**Openness vs. Performance Trade-Off (54:43)**
- “Obviously you can be super open but dumb.”
- Meta’s Llama licensing quirks, and Nvidia’s under-the-radar role in enabling open models (56:18).
**Community & Ecosystem (16:26, 18:10, 19:08)**
- AA’s time at AI Grant: insight, mentorship, exposure to "power users" at the application frontier.
6. Trends: Cost, Hardware, and Agentic Evolutions <span id="trends"></span>
**Smiling Curve: The Paradox of AI Costs (58:26–61:49)**
- “The cost of intelligence has been falling dramatically over the last couple of years… but it is possible to spend much more on AI inference now than it was a couple of years ago.”
- A slide shows a 100–1000x drop in the cost of GPT-4-level intelligence, yet total industry spend (Nvidia, HF0 clients) surges due to new use cases, bigger models, and agentic workflows.
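The paradox is simple arithmetic: per-token prices can collapse while total spend still grows, because token volume grows faster. All numbers below are invented for illustration; only the 100–1000x price-decline figure comes from the episode.

```python
# Hypothetical figures for one product's monthly inference bill.
old_price_per_mtok = 30.0   # $/1M tokens, early GPT-4-era pricing (assumed)
new_price_per_mtok = 0.10   # $/1M tokens after a ~300x decline (assumed)
old_tokens = 1e9            # tokens/month, single-shot chat use (assumed)
new_tokens = 2e12           # tokens/month, agentic multi-step workflows (assumed)

old_spend = old_tokens / 1e6 * old_price_per_mtok  # $30,000/month
new_spend = new_tokens / 1e6 * new_price_per_mtok  # $200,000/month

price_drop = old_price_per_mtok / new_price_per_mtok  # 300x cheaper per token
spend_growth = new_spend / old_spend                  # yet ~6.7x more spent
```

In this toy example the price falls 300x but token volume rises 2000x, so the bill still grows several-fold: the "smiling curve" in one line of arithmetic.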
**Hardware Trends (62:03–64:11)**
- Blackwell GPUs bring huge gains, especially for sparse, large models.
- Ongoing hardware and software advances will further lower cost/token and enable larger, more capable models.
**Sparsity and Model Scaling (64:11–65:54)**
- Debate over how sparse models can reasonably become; open-weights models already run at 3–5% active parameters.
**Reasoning vs. Non-Reasoning Models & Token Efficiency (66:01–70:56)**
- Old dichotomy breaking down; continuous spectrum as models adapt token/cost to problem complexity.
- “What you want is more tokens for hard cases, less for easy—but some models are better at adjusting than others.”
- Multi-turn benchmarks (agentic evals) likely to grow in importance.
7. The Future of Benchmarks and AI Development <span id="future"></span>
**Multimodal & New Arenas (71:03–72:17)**
- Expanding to speech, image, and video benchmarking.
- Soliciting community-curated eval categories to shape what gets measured (and therefore, solved).
**What’s Next? (73:06–76:07)**
- Trends toward continually smarter, more “personable” and less hallucination-prone models will persist.
- V4 of the Intelligence Index will integrate GDPVAL for agentic tasks, Critical Point for hard reasoning, and omniscience/hallucination rates.
**Long-Term Constant:**
- “The demand for AI intelligence and smarter AI intelligence is going to be insatiable.” – George (73:56)
- “Truths that don’t change are the best thing to build and plan around.” – Host (73:35)
8. Notable Quotes & Memorable Moments <span id="quotes"></span>
**The Mission:**
"We want to be who enterprises look to for data and insights on AI—to help them with their decisions about models and technologies for building stuff. And then... do private benchmarking for companies throughout the AI stack." – Micah (01:24)
**On Benchmarking Complexity:**
"The more you go into building something using LLMs, the more each bit... ends up being a benchmarking problem." – Micah (04:47)
**Why True Independence Matters:**
"No one pays to be on the website... there's no use doing what we do unless it's independent AI benchmarking." – Micah (01:24)
**On Hallucinations:**
"It’s strictly more helpful to say ‘I don’t know’ instead of giving a wrong answer to factual knowledge questions." – Micah (28:52)
**On Industry Trends:**
"It is truly the case that we've had this 100x to 1000x decline in the cost of GPT-4-level intelligence... yet on the right-hand side, because the multipliers are so big, you can still spend enormously more today." – Micah (60:55)
**Personality & Forward-Looking Benchmarks:**
"We'll keep benchmarking raw intelligence, but we also want to... explore models more deeply—hallucination, behavior, personalities—to help people make more nuanced decisions." – George (74:49)
**Values-Driven Design:**
"Every index you push encodes some kind of opinion or value." – Host (56:49)
"It is hard to weight for the materiality of the contribution to open source… but we want to recognize that opening all the data and code is still a very useful exercise." – Micah (55:36)
Timestamps for Key Segments
| Topic | Timestamps |
| ----- | ---------- |
| The origin story, need for independent benchmarks | 04:40–06:45 |
| Independence & anti-shenanigan policies | 13:21–14:23 |
| Technical benchmarking challenges | 08:05–11:47 |
| Business model & client engagements | 01:24–02:53 |
| Evolution of the Intelligence Index | 20:47–22:12 |
| The “Smiling Curve” of AI cost trends | 58:26–61:49 |
| New hallucination & knowledge indices | 28:26–32:47 |
| GDPVAL and general agentic benchmarks | 39:59–44:47 |
| Hardware & sparsity trends | 62:03–65:54 |
| Reasoning, token efficiency, multi-turn evals | 66:01–70:56 |
| Charting industry openness | 52:07–58:04 |
| Future of benchmarks | 73:06–76:07 |
| Reflections and community impact | 77:09–78:12 |
Tone & Style
The episode is deeply technical but balanced with wit and reflection on industry culture (“I know many smart people who are confidently incorrect.” – George, 33:03). Candid discussions of open source, community dynamics, and the ongoing horse race between labs add a grounded, practical feel.
Summary Takeaways
- Artificial Analysis stands at the heart of the AI ecosystem with transparent, rigorous, continually evolving model evaluation, shaping not only how labs build but how enterprises adopt LLMs.
- The future will see more holistic benchmarking—agentic, open, and trend-driven—reflecting both technical needs and the values of the open AI community.
- Listeners gain a rare window into the intersection of technical benchmarking, business, and the meta-trends driving AI’s rapid evolution.
For a deeper dive—including interactive charts and reports—visit artificialanalysis.ai and latent.space.
