Podcast Summary: Personalized AI Language Education — with Andrew Hsu, Speak
Latent Space: The AI Engineer Podcast
Date: July 11, 2025
Host(s): Alessio (A), swyx (B)
Guest: Andrew Hsu (C), CTO & Co-founder of Speak
Episode Overview
This episode explores how foundation AI models are revolutionizing language learning, featuring Andrew Hsu of Speak—one of the leading AI-powered language education companies. Andrew shares the story of building Speak, its technical and product evolution, and broader reflections on the future of human learning with AI. In a lively and candid conversation, the hosts delve into:
- The genesis, growth, and product philosophy behind Speak
- Technical and product challenges scaling AI-powered language learning
- The role of speech and real-time interaction in modern edtech
- Personal stories, global expansion, and what the future holds for learning with AI
Key Discussion Points & Insights
1. Founding Story and Early Days of Speak
- Andrew’s background as a Thiel Fellow (00:16)
- Met his co-founder through the fellowship.
- "For me it was life changing. I had a very unusual path where I did finish college...I was 19 at the time and in grad school..." [00:23-01:22]
- Inspiration: "We were just so convinced, fundamentally, that speech models were going like this, language models were going like this, and in the five to ten year span they would become superhuman." [02:17]
- Early vision: Build a pure software, AI-powered language tutor—no human in the loop.
2. Product Focus – Speaking as the Core Modality
- Speak’s focus is on speech and real conversational fluency, not passive forms.
- Early pain points and perseverance: "It took much, much longer than we expected to build a great product and find good PMF. The first few years were very painful and I think without this really compelling vision of the future, we would have quit." [02:17-03:25]
- Never pivoted from the original vision.
3. Platform Evolution and Onboarding
- Custom speech recognition models before LLMs/Whisper (pre-2022).
- Their "magic onboarding" leverages conversational AI for personalized onboarding:
- "We wanted it to feel more like you were talking with a tutor...we would use that later to personalize the experience." [06:25]
- Tradeoff: Speaking is a higher barrier but creates more engaged trial users.
Notable:
- Lessons learned from both state-machine and open-ended LLM-prompt approaches to onboarding.
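The two onboarding styles mentioned above can be contrasted in a minimal sketch. All names here are hypothetical illustrations, not Speak's actual implementation:

```python
# Illustrative contrast: a deterministic state machine vs. an open-ended
# LLM prompt for conversational onboarding.

ONBOARDING_STEPS = ["greet", "ask_goal", "ask_level", "wrap_up"]

def state_machine_onboarding(answers):
    """Fixed flow: one scripted question per step; the user's answers
    become a profile used later to personalize the experience."""
    return dict(zip(ONBOARDING_STEPS, answers))

def llm_onboarding_prompt(history):
    """Open-ended flow: hand the whole transcript to an LLM and let it
    choose the next question (the model call itself is stubbed out)."""
    transcript = "\n".join(f"{who}: {text}" for who, text in history)
    return (
        "You are a friendly language tutor running onboarding.\n"
        "Conversation so far:\n" + transcript + "\n"
        "Ask the single most useful next question."
    )
```

The state machine is predictable and easy to instrument; the LLM prompt feels more like "talking with a tutor" at the cost of control, which is the tradeoff the episode describes.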
4. AI-Native Approach and Differentiation
- Gen 3 language learning: From Rosetta Stone (Gen 1) to Duolingo (Gen 2, game-like) to Speak’s LLM-powered, role-play practice (Gen 3).
- "LLMs and AI now enable Gen 3 of language learning, which is something that is very AI native, very focused on functional fluency..." [11:00]
- Users want real-world fluency—"We try to get you to just repeat and drill and drill and drill, almost like you're in a gym until it's automatic, because that's what speaking is, right?" [11:00]
5. Market Strategy and Growth
- Initial (and still dominant) success in South Korea (largest English app: "I think like, 6% of the Korean population has tried us..." [13:55])
- Expansion to other Asian markets, and now broader international scope and US entry ("Spanish, French, several more languages coming this year" [14:37])
- Revenue scale: "Well over 50 million ARR...mostly consumer." [14:37]
- Recent B2B expansion: "It was like very much a side bet / experiment at first and then it just started working." [14:46]
Technical Deep Dives
6. Speech Technology and Product Craft
- Built their own ASR and content pipeline before OpenAI/Whisper.
- Focused on latency-sensitive, user-friendly speech recognition:
- "For the core recording loop in many of our lessons...it's extremely fast. So we're very latency sensitive." [04:58]
- Discussed the nuances of building multidisciplinary teams (consumer + ML, including a Slovenia engineering office) [21:20]
- Challenges of remote engineering culture early on.
7. The Whisper/LLM Inflection (2022)
- Whisper as the 'superhuman' speech recognition threshold:
- "Whisper was really that magic moment for us...we got access to the model...we all closed our eyes and none of us had any idea. And the model got it right. So, I mean, superhuman." [23:38]
- LLMs enabled richer, feedback-oriented tutoring vs. simple 'listen and repeat.'
- Continuous product evolution maps closely to advances in model capability:
- "As the frontier of model intelligence improved, it would just unlock things on our roadmap that were locked before..." [26:21]
- Active exploration of real-time voice tutors and more immersive role-play.
8. Scaling Across Languages & Content Generation
- Growing from English/Korean pairing to "40 more countries." [13:55]
- AI-generated curriculum: "[We] have a tutor agent, we have a curriculum writing agent, we have a giant LLM-based pipeline..." [29:09]
- Still important to keep human review in the loop. "[AI] allows you to do a hundred x in the same amount of time. We still need human review...but the hope is that this will allow us to launch a hundred x more courses." [35:00]
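The pipeline shape described above, agents drafting content in bulk with a human-review gate before launch, can be sketched as follows. The agent function is a placeholder for an LLM call; all names are invented for illustration:

```python
# Hypothetical sketch: AI agents draft lessons at scale, but every lesson
# passes through a human-review gate before it ships.

def curriculum_agent(topic):
    """Drafts a lesson for a topic (stand-in for an LLM call)."""
    return {"topic": topic,
            "drills": [f"Sentence-pattern drill: {topic}"],
            "approved": False}

def build_course(topics, review_fn):
    """Generate drafts in bulk, then keep only human-approved lessons."""
    drafts = [curriculum_agent(t) for t in topics]
    for d in drafts:
        d["approved"] = review_fn(d)  # human in the loop
    return [d for d in drafts if d["approved"]]
```

The leverage claim in the quote maps to this shape: generation is cheap and parallel, while the reviewer's approval remains the bottleneck, and the gate, by design.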
Notable Segment:
- Speak is grounded in real-world, functional fluency, not test-passing or textbook language:
- "We don't teach vocabulary and grammar, we teach sentence patterns and we try to get you to just repeat and drill..." [11:00]
- Focus on current, conversational language, not just standard or "textbook" English. [35:38]
9. Measuring and Personalizing Fluency
- Building a multidimensional, knowledge-graph-based user model.
- Goal: A holistic "Speak Score" reflecting practical speaking ability (not just theoretical knowledge).
- "The idea is that eventually...everything will fold up into a number that we call the Speak score. That is a very sort of holistic measure of just like, how good are you at Spanish?" [31:38]
- Product structure allows for personalized learning paths, adjusting to user strengths and weaknesses. [56:11]
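The idea of folding a multidimensional user model into a single number can be sketched as a weighted aggregate over per-concept mastery estimates. The concepts and weights below are invented for illustration; Speak's actual model is not public:

```python
# Hedged sketch: collapse a per-concept mastery map (the "knowledge
# graph" leaves) into one holistic, Speak-Score-style number.

def speak_score(mastery, weights):
    """mastery: concept -> 0..1 mastery estimate;
    weights: concept -> relative importance for overall fluency.
    Returns a 0-100 score as a weighted average."""
    total = sum(weights.values())
    raw = sum(mastery[c] * weights[c] for c in weights) / total
    return round(100 * raw, 1)
```

A real system would update the mastery estimates from speech interactions and likely use something richer than a linear average, but the "fold everything up into one number" structure is the same.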
10. Technical Constraints: Real-Time, Accent, Multimodal Models
- Real-time roleplays require careful design for cost/hardware scaling; latency is a primary technical constraint [47:57].
- Accent and pronunciation: Balance between encouraging users to speak (even with mistakes) and providing targeted accent/pronunciation coaching. [37:47]
- Discussed future potential of multimodal tutors, with dynamically generated audio/video/UI and context-aware teaching experiences. [44:07]
Memorable Quotes & Moments
- "If you talk to all of our users in Asia, they don't want a translator. The reason they are trying to learn English is to make themselves a better person, to connect with other people." – Andrew, [15:23]
- "The first interaction when you have a fresh open of the app should feel pretty futuristic. It should feel like, okay, this is like the new AI native next-gen way of learning." – Andrew, [07:37]
- "Our main teacher in the app is like a mini celebrity. People come up to her on the street as she's just walking around Seoul and recognize her from the app, which is really cool." – Andrew, [57:28]
- On knowledge graphs: “[It’s] a bit more custom than that because it's a bit more domain specific around the way that we conceptualize the vocabulary, you know, and the sentence patterns and so on. So it's more specifically around like language learning concepts, if you will.” — Andrew, [54:38]
Notable Timestamps
- 00:16 – Andrew discusses Thiel Fellowship & early founder story
- 02:17 – Speak's AI vision and genesis
- 06:25 – Personalized “magic onboarding” and conversational UX
- 13:55 – Speak's scale and growth metrics in Asia and revenue
- 14:46 – Launch and success of B2B side
- 23:38 – Whisper model release as an inflection point
- 29:09 – Large-scale AI content/curriculum generation
- 31:38 – "Speak score" and multidimensional assessment
- 35:00 – AI as leverage for human content teams
- 37:47 – Accent, pronunciation training, and language authenticity
- 44:07 – Future of multimodal, real-time tutors and UI evolution
- 47:57 – Real-time product, technical constraints, latency, inference costs
- 56:11 – Knowledge graphs for personalized learning in any domain
Final Reflections
- Andrew believes "AI will reinvent how people learn anything," not just language. [27:44]
- Despite rapid progress, real-world adoption is slower than expected: "If you go to another state outside of the Bay Area...how much their life has materially changed, it's like pretty close to zero. Real world inertia is enormous." [62:13]
- Encourages more builders to create new, native AI-powered consumer experiences.
- Speak’s future: Scaling globally, broadening into more domains, but always prioritizing learner agency, privacy, and connection.
For detailed notes and more episodes: Latent Space podcast
