Podcast Summary: Personalized AI Language Education — with Andrew Hsu, Speak
Latent Space: The AI Engineer Podcast
Date: July 11, 2025
Host(s): Alessio (A), swyx (B)
Guest: Andrew Hsu (C), CTO & Co-founder of Speak
Episode Overview
This episode explores how foundation AI models are revolutionizing language learning, featuring Andrew Hsu of Speak—one of the leading AI-powered language education companies. Andrew shares the story of building Speak, its technical and product evolution, and broader reflections on the future of human learning with AI. In a lively and candid conversation, the hosts delve into:
- The genesis, growth, and product philosophy behind Speak
- Technical and product challenges scaling AI-powered language learning
- The role of speech and real-time interaction in modern edtech
- Personal stories, global expansion, and what the future holds for learning with AI
Key Discussion Points & Insights
1. Founding Story and Early Days of Speak
- Andrew’s background as a Thiel Fellow (00:16)
- Met his co-founder through the fellowship.
- "For me it was life changing. I had a very unusual path where I did finish college...I was 19 at the time and in grad school..." [00:23-01:22]
- Inspiration: "We were just so convinced, fundamentally, that speech models were going like this, language models were going like this, and in the five to ten year span they would become superhuman." [02:17]
- Early vision: Build a pure software, AI-powered language tutor—no human in the loop.
2. Product Focus – Speaking as the Core Modality
- Speak’s focus is on speech and real conversational fluency, not passive forms.
- Early pain points and perseverance: "It took much, much longer than we expected to build a great product and find good PMF. The first few years were very painful and I think without this really compelling vision of the future, we would have quit." [02:17-03:25]
- Never pivoted from the original vision.
3. Platform Evolution and Onboarding
- Custom speech recognition models before LLMs/Whisper (pre-2022).
- Their "magic onboarding" leverages conversational AI for personalized onboarding:
- "We wanted it to feel more like you were talking with a tutor...we would use that later to personalize the experience." [06:25]
- Tradeoff: Speaking is a higher barrier but creates more engaged trial users.
Notable:
- Lessons learned from both state-machine and open-ended LLM-prompt approaches to onboarding.
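The two onboarding styles mentioned above can be contrasted in a minimal sketch. All names here are hypothetical illustrations, not Speak's actual implementation:

```python
# Illustrative contrast: a deterministic state machine vs. an open-ended
# LLM prompt for conversational onboarding.

ONBOARDING_STEPS = ["greet", "ask_goal", "ask_level", "wrap_up"]

def state_machine_onboarding(answers):
    """Fixed flow: one scripted question per step; the user's answers
    become a profile used later to personalize the experience."""
    return dict(zip(ONBOARDING_STEPS, answers))

def llm_onboarding_prompt(history):
    """Open-ended flow: hand the whole transcript to an LLM and let it
    choose the next question (the model call itself is stubbed out)."""
    transcript = "\n".join(f"{who}: {text}" for who, text in history)
    return (
        "You are a friendly language tutor running onboarding.\n"
        "Conversation so far:\n" + transcript + "\n"
        "Ask the single most useful next question."
    )
```

The state machine is predictable and easy to instrument; the LLM prompt feels more like "talking with a tutor" at the cost of control, which is the tradeoff the episode describes.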
4. AI-Native Approach and Differentiation
- Gen 3 language learning: From Rosetta Stone (Gen 1) to Duolingo (Gen 2, game-like) to Speak’s LLM-powered, role-play practice (Gen 3).
- "LLMs and AI now enable Gen 3 of language learning, which is something that is very AI native, very focused on functional fluency..." [11:00]
- Users want real-world fluency—"We try to get you to just repeat and drill and drill and drill, almost like you're in a gym until it's automatic, because that's what speaking is, right?" [11:00]
5. Market Strategy and Growth
- Initial (and still dominant) success in South Korea (largest English app: "I think like, 6% of the Korean population has tried us..." [13:55])
- Expansion to other Asian markets, and now broader international scope and US entry ("Spanish, French, several more languages coming this year" [14:37])
- Revenue scale: "Well over 50 million ARR...mostly consumer." [14:37]
- Recent B2B expansion: "It was like very much a side bet / experiment at first and then it just started working." [14:46]
Technical Deep Dives
6. Speech Technology and Product Craft
- Built their own ASR and content pipeline before OpenAI/Whisper.
- Focused on latency-sensitive, user-friendly speech recognition:
- "For the core recording loop in many of our lessons...it's extremely fast. So we're very latency sensitive." [04:58]
- Discussed the nuances of building multidisciplinary teams (consumer + ML, including a Slovenia engineering office) [21:20]
- Challenges of remote engineering culture early on.
7. The Whisper/LLM Inflection (2022)
- Whisper as the 'superhuman' speech recognition threshold:
- "Whisper was really that magic moment for us...we got access to the model...we all closed our eyes and none of us had any idea. And the model got it right. So, I mean, superhuman." [23:38]
- LLMs enabled richer, feedback-oriented tutoring vs. simple 'listen and repeat.'
- Continuous product evolution maps closely to advances in model capability:
- "As the frontier of model intelligence improved, it would just unlock things on our roadmap that were locked before..." [26:21]
- Active exploration of real-time voice tutors and more immersive role-play.
8. Scaling Across Languages & Content Generation
- Growing from English/Korean pairing to "40 more countries." [13:55]
- AI-generated curriculum: "[We] have a tutor agent, we have a curriculum writing agent, we have a giant LLM-based pipeline..." [29:09]
- Still important to keep human review in the loop. "[AI] allows you to do a hundred x in the same amount of time. We still need human review...but the hope is that this will allow us to launch a hundred x more courses." [35:00]
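The pipeline shape described above, agents drafting content in bulk with a human-review gate before launch, can be sketched as follows. The agent function is a placeholder for an LLM call; all names are invented for illustration:

```python
# Hypothetical sketch: AI agents draft lessons at scale, but every lesson
# passes through a human-review gate before it ships.

def curriculum_agent(topic):
    """Drafts a lesson for a topic (stand-in for an LLM call)."""
    return {"topic": topic,
            "drills": [f"Sentence-pattern drill: {topic}"],
            "approved": False}

def build_course(topics, review_fn):
    """Generate drafts in bulk, then keep only human-approved lessons."""
    drafts = [curriculum_agent(t) for t in topics]
    for d in drafts:
        d["approved"] = review_fn(d)  # human in the loop
    return [d for d in drafts if d["approved"]]
```

The leverage claim in the quote maps to this shape: generation is cheap and parallel, while the reviewer's approval remains the bottleneck, and the gate, by design.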
Notable Segment:
- Speak is grounded in real-world, functional fluency, not test-passing or textbook language:
- "We don't teach vocabulary and grammar, we teach sentence patterns and we try to get you to just repeat and drill..." [11:00]
- Focus on current, conversational language, not just standard or "textbook" English. [35:38]
9. Measuring and Personalizing Fluency
- Building a multidimensional, knowledge-graph-based user model.
- Goal: A holistic "Speak Score" reflecting practical speaking ability (not just theoretical knowledge).
- "The idea is that eventually...everything will fold up into a number that we call the Speak score. That is a very sort of holistic measure of just like, how good are you at Spanish?" [31:38]
- Product structure allows for personalized learning paths, adjusting to user strengths and weaknesses. [56:11]
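The idea of folding a multidimensional user model into a single number can be sketched as a weighted aggregate over per-concept mastery estimates. The concepts and weights below are invented for illustration; Speak's actual model is not public:

```python
# Hedged sketch: collapse a per-concept mastery map (the "knowledge
# graph" leaves) into one holistic, Speak-Score-style number.

def speak_score(mastery, weights):
    """mastery: concept -> 0..1 mastery estimate;
    weights: concept -> relative importance for overall fluency.
    Returns a 0-100 score as a weighted average."""
    total = sum(weights.values())
    raw = sum(mastery[c] * weights[c] for c in weights) / total
    return round(100 * raw, 1)
```

A real system would update the mastery estimates from speech interactions and likely use something richer than a linear average, but the "fold everything up into one number" structure is the same.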
10. Technical Constraints: Real-Time, Accent, Multimodal Models
- Real-time roleplays require careful design for cost/hardware scaling; latency is a primary technical constraint [47:57].
- Accent and pronunciation: Balance between encouraging users to speak (even with mistakes) and providing targeted accent/pronunciation coaching. [37:47]
- Discussed future potential of multimodal tutors, with dynamically generated audio/video/UI and context-aware teaching experiences. [44:07]
Memorable Quotes & Moments
- "If you talk to all of our users in Asia, they don't want a translator. The reason they are trying to learn English is to make themselves a better person, to connect with other people." – Andrew, [15:23]
- "The first interaction when you have a fresh open of the app should feel pretty futuristic. It should feel like, okay, this is like the new AI native next-gen way of learning." – Andrew, [07:37]
- "Our main teacher in the app is like a mini celebrity. People come up to her on the street as she's just walking around Seoul and recognize her from the app, which is really cool." – Andrew, [57:28]
- On knowledge graphs: “[It’s] a bit more custom than that because it's a bit more domain specific around the way that we conceptualize the vocabulary, you know, and the sentence patterns and so on. So it's more specifically around like language learning concepts, if you will.” — Andrew, [54:38]
Notable Timestamps
- 00:16 – Andrew discusses Thiel Fellowship & early founder story
- 02:17 – Speak's AI vision and genesis
- 06:25 – Personalized “magic onboarding” and conversational UX
- 13:55 – Speak's scale and growth metrics in Asia and revenue
- 14:46 – Launch and success of B2B side
- 23:38 – Whisper model release as an inflection point
- 29:09 – Large-scale AI content/curriculum generation
- 31:38 – "Speak score" and multidimensional assessment
- 35:00 – AI as leverage for human content teams
- 37:47 – Accent, pronunciation training, and language authenticity
- 44:07 – Future of multimodal, real-time tutors and UI evolution
- 47:57 – Real-time product, technical constraints, latency, inference costs
- 56:11 – Knowledge graphs for personalized learning in any domain
Final Reflections
- Andrew believes "AI will reinvent how people learn anything," not just language. [27:44]
- Despite rapid progress, real-world adoption is slower than expected: "If you go to another state outside of the Bay Area...how much their life has materially changed, it's like pretty close to zero. Real world inertia is enormous." [62:13]
- Encourages more builders to create new, native AI-powered consumer experiences.
- Speak’s future: Scaling globally, broadening into more domains, but always prioritizing learner agency, privacy, and connection.
For detailed notes and more episodes: Latent Space podcast
