Artificial Intelligence Masterclass
Episode: 5 Tips and Misconceptions about Finetuning GPT-3
Host: David Shapiro
Date: December 31, 2024
Overview
In this episode, David Shapiro demystifies the process of finetuning GPT-3, sharing five core tips and addressing common misconceptions. He emphasizes the power and versatility of GPT-3's base models and advocates for pragmatic, thoughtful adoption of finetuning, turning to it only when truly necessary. Drawing from hands-on experience and community insights, Shapiro keeps the tone conversational, occasionally humorous, and always focused on practical utility. He champions multidisciplinary teamwork and highlights the importance of language skills and creativity in leveraging large language models.
Key Discussion Points and Insights
1. Start with “Plain Vanilla” GPT-3 (Prompt Engineering First)
- Timestamp: [01:31 – 05:00]
- Shapiro advises against immediately jumping into finetuning. Instead, he urges listeners to first master basic prompt engineering.
- “GPT3 is not like any other tool you’ve ever used, NLP or otherwise. It’s not like an SVM, it’s not like a regression model. It’s not even like other neural networks.” [01:54]
- Many users underestimate GPT-3’s out-of-the-box capabilities and overestimate the need for specialized data or finetuning.
- Example: Instead of writing code to scrape dates from text, just ask GPT-3 directly for the dates.
- Encourages adopting a mindset shift from traditional machine learning approaches: “Some people just, they have their, their old school data science mindset, like, oh, I need…data set. I need to, you know, come up with rules. And I’m like, just ask it.” [02:13]
- Highlight: Team Composition Matters
- Shapiro recommends building teams that include people with strong language backgrounds (e.g., English majors, philosophers), not just computer scientists.
- “If you want to have a dynamite team using large language models, make sure you’ve got someone who understands language on your team. Maybe hire a librarian, an English major, a philosopher.” [04:21]
- Observes a notable split: those with humanities backgrounds “get it” faster because they think in language and narrative.
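The "just ask it" approach from earlier in this segment can be sketched as follows. The sample text and prompt wording are illustrative assumptions, and the API call is left as a comment so the sketch stays runnable offline:

```python
# A sketch of "just ask it": instead of hand-writing regexes or scraping
# code to pull dates out of text, send the raw text plus a plain-English
# instruction to the model. Sample text and wording are hypothetical.
text = "The launch slipped from March 3rd to April 10th, 2023."

prompt = f"{text}\n\nList all dates mentioned in the text above:"

# In practice this prompt would be sent to the GPT-3 completions endpoint;
# the base model handles the extraction with no task-specific training.
print(prompt)
```

The point is the mindset shift: the "dataset and rules" step disappears entirely, because the instruction itself is the program.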
2. Building Finetuning Datasets Is 100x More Effort Than Prompt Engineering
- Timestamp: [05:00 – 06:30]
- Creating datasets for finetuning is time-consuming and labor-intensive relative to prompt engineering.
- “Building fine tuning data sets is 100 times more effort than prompt engineering. For that reason alone, start with plain vanilla GPT3. It’ll carry you way farther than you think it will.” [05:10]
- Shapiro recommends: Only consider finetuning after extensive experience with prompt engineering—and even then, question if it’s truly necessary.
3. Use Natural Language Separators for Demarcation
- Timestamp: [06:30 – 08:56]
- When building finetuning datasets, use clear, meaningful natural language to indicate where the prompt ends and the response begins.
- “In the OpenAI documentation, they just use like hashtag, hashtag, hashtag. And while that can work, it’s semantically meaningless. So what I usually do is I will add like just a couple words ... that helps teach GPT-3 what its task actually is.” [06:50]
- Benefits:
- Explicit, human-readable instructions help GPT-3 more easily understand and adapt to multiple tasks within the same dataset.
- Allows for dynamic task switching at inference time without needing multiple finetuned models: “By using a natural language separator… you can actually switch tasks without having to switch between different fine tuned models.” [08:28]
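A minimal sketch of this idea follows. The passage text, separator wording, and file name are illustrative assumptions, not taken from Shapiro's actual datasets; the key move is ending each prompt with a short natural-language instruction rather than a meaningless "###":

```python
import json

# Hypothetical raw examples: (passage, expected answer) pairs.
raw_examples = [
    ("The meeting is on March 3rd and the deadline is April 10th.",
     "March 3rd; April 10th"),
    ("We launched on 2021-06-15.",
     "2021-06-15"),
]

# A semantically meaningful separator that also states the task, so one
# finetuned model can host several tasks and switch between them at
# inference time just by changing this instruction.
SEPARATOR = "\n\nExtract all dates from the passage above:\n"

records = []
for passage, dates in raw_examples:
    records.append({
        "prompt": passage + SEPARATOR,
        # Leading space and trailing newline follow the era's OpenAI
        # finetuning conventions for completions.
        "completion": " " + dates + "\n",
    })

with open("finetune_dates.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

At inference time, appending the same instruction to new input reproduces the training-time demarcation, which is what makes task switching possible without a second model.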
4. Leverage GPT-3 Itself to Generate Synthetic Data
- Timestamp: [08:56 – 13:00]
- Utilize GPT-3 to synthesize example data for finetuning (e.g., simulate conversations).
- “It takes me an hour flat to make a new fine tuning data set because it’s just a couple of scripts to take in some raw material and massage those into prompts ... and then you’re ready to go.” [11:37]
- Clarifies common misconception: massive datasets are unnecessary.
- “There was one person on the forum who thought that he needed 200,000 samples to fine tune GPT3 for a good chatbot. I said no, you need 200, not 200,000.” [12:14]
- In real use, “It’s like 18 cents to fine tune with 200 samples... if you use DaVinci it’s $1.80.” [13:53]
- Most tasks—chatbots, Q&A—work with small (e.g., 200 samples) datasets if well designed and cleaned.
- Publicly shares his datasets for community testing and improvement.
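The "couple of scripts" workflow and the cost figures above can be sketched offline. The topics, template wording, tokens-per-sample figure, and per-token price below are illustrative assumptions, and the GPT-3 call is replaced by a canned stub so the sketch runs without an API key:

```python
# Massage raw material (here, a topic list) into prompt/completion pairs,
# then sanity-check the quoted training cost. All names and numbers in
# this sketch are assumptions, not figures from Shapiro's scripts.
topics = ["resetting a password", "tracking an order", "cancelling a subscription"]

def synthesize_dialogue(topic: str) -> str:
    """Stand-in for a GPT-3 completions call that would generate a sample
    chat about `topic`; canned text keeps the sketch runnable offline."""
    return f"Customer: I need help {topic}.\nAgent: Happy to help with that."

records = [
    {"prompt": f"Write a support chat about {topic}.\n\nChat:\n",
     "completion": " " + synthesize_dialogue(topic) + "\n"}
    for topic in topics
]

# Rough cost check for a 200-sample dataset: assuming ~300 tokens per
# sample and the era's DaVinci finetune-training rate of $0.03 per 1K
# tokens, 200 * 300 / 1000 * 0.03 = $1.80, the figure quoted above.
est_cost = 200 * 300 / 1000 * 0.03
print(f"Estimated DaVinci training cost: ${est_cost:.2f}")
```

Scaling the same arithmetic to the mistaken 200,000-sample plan from the forum would multiply the cost by a thousand, which is one more reason small, well-designed datasets win.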
5. Finetuning Increases Consistency but Reduces Creativity
- Timestamp: [13:53 – End (~15:00)]
- Finetuning models for a particular style or perspective increases output consistency, but at the cost of creative flexibility.
- “Plain vanilla GPT3 is really creative. It’s more creative than any 10 humans. But fine tuning… it’ll kind of lose those other perspectives.” [14:32]
- For creative and open-ended tasks, prompt engineering is preferred.
- For tasks requiring rigid consistency or fixed formats, finetuning is the best approach.
Notable Quotes and Memorable Moments
- On team composition: “So, hire humanities.” [04:37]
- On prompt engineering vs. traditional ML: “You don’t have to worry about the math…honestly, [CS algorithmic thinking] won’t help you with using GPT3.” [03:48]
- On data requirements: “You need 200, not 200,000.” [12:20]
- On model behavior: “Fine tuning tends to increase consistency at the cost of creativity…so just keep that in mind.” [14:41]
Timestamps for Important Segments
- [01:31] — Episode intro and core premise
- [02:00] — Why prompt engineering comes first
- [04:20] — Humanities vs. CS mindsets in GPT-3 usage
- [05:10] — The true cost and effort of dataset creation
- [06:50] — Semantic demarcation in dataset design
- [08:30] — Task separation during inference
- [11:37] — Using GPT-3 to generate datasets and share results
- [12:14] — Real dataset size and cost breakdown
- [14:32] — Trade-offs: consistency vs. creativity
Summary Takeaways
- Start with the default GPT-3 and leverage prompt engineering to its fullest—only consider finetuning after reaching its limits.
- Prioritize team members who understand language; humanities and philosophy backgrounds add unique strengths.
- Use meaningful language separators in datasets to help models distinguish tasks and instructions.
- Exploit GPT-3’s generative powers to synthesize datasets. Small, well-crafted datasets (around 200 samples) often suffice.
- Recognize the trade-off: finetuning enhances reliability and consistency but narrows creative response diversity.
For practical examples and public datasets referenced by David Shapiro, check the episode description for resources and repository links.
