Transcript
Brian McCullough (0:04)
Welcome to the Techmeme Ride Home for Monday, January 27th, 2025. I'm Brian McCullough. Today it's one of those days where there's only one story. Maybe you saw that tech stocks got obliterated today. I'm here to tell you why: it's solely because of DeepSeek and Chinese AI tech generally, how this tech is making people think twice about the AI boom, what DeepSeek did that is different, and how this could affect all of Silicon Valley. Here's what you missed today in the world of tech.

Here's why the stock market is having a bit of a crash this morning. Nvidia down more than 8%. Meta and Microsoft both down. ASML down almost 10%. Japanese chip companies falling. Crypto also falling. It's all because of DeepSeek. We spoke about DeepSeek last month with Simon Willison. To sum it up most succinctly, DeepSeek was apparently able to train an AI model at 3% of the cost of cutting-edge models from the likes of OpenAI. So why do you need to buy 100,000 H100s from Nvidia when maybe you only need 3,000? See, a lot of the eye-popping capex spending from the likes of every tech player in the world was predicated on the idea of scale. The only way to get smarter AI was to throw more and more compute at it, which meant more and more GPUs and more and more data centers. I mean, this was the whole premise behind the Stargate announcement. But what this is making people think is: what if that is no longer true? Then all of this spending could be pulled back all at once, and thus the crash. But not only that, DeepSeek has jumped to the top of the app store charts. It's suddenly seeing rapid adoption in the AI community, and DeepSeek is Chinese tech. And not only that, it's open source tech, not proprietary. If this cheaper tech, which is open source, is just as good, then that would mean that the ginormous valuations for the likes of OpenAI and Anthropic and the rest might not be warranted, suggesting a bubble would pop in VC funding.
While it remains to be seen if DeepSeek will prove to be a viable, cheaper alternative in the long term, initial worries are centered on whether US tech giants' pricing power is being threatened and if their massive AI spending needs reevaluation, said Jun Rong Yip of IG Asia. That a small and efficient AI model emerged from China, which has been subject to escalating U.S. trade sanctions on advanced Nvidia chips, is also challenging the effectiveness of such measures. Certainly US tech players seem to be taking this seriously. Marc Andreessen called DeepSeek "one of the most amazing and impressive breakthroughs," and Meta has reportedly set up four war rooms to analyze DeepSeek's tech, two focusing on how High-Flyer cut training costs and one on what data High-Flyer might have used. But back to the stock market fallout, quoting the Financial Times: "It's DeepSeek for sure," said one Tokyo-based fund manager of the selling on Monday, adding that investors were rapidly assessing whether hardware spending on AI could ultimately be a lot lower than current estimates. AI investment by large-cap US tech companies hit $224 billion last year, according to UBS, which expects the total to reach $280 billion this year. OpenAI and SoftBank announced last week a plan to invest $500 billion over the next four years in AI infrastructure. End quote. That's a ton of very stimulative spending in the economy that could, again, potentially dry up if the status quo is upended. Who or what is DeepSeek, the single AI lab that is crashing the stock market and roiling Silicon Valley? DeepSeek is a Chinese AI lab that started as a deep learning research branch of Chinese quant hedge fund High-Flyer. They've released several different models, all of which seem to be just as capable as the highest-end AI models produced by the recent flurry of Western AI startups.
Again, crucially, while their models seem to be cutting edge, their costs, in terms of the money and compute needed to train them, are believed to be a fraction of what Western models cost. One model reportedly cost $6 million to train, as opposed to the hundreds of millions of dollars that has become table stakes for other AI tech. Now, this has not been without controversy. The assumption is that these Chinese models, along with others from the likes of ByteDance which have shown similar cost-versus-performance improvements, were able to make this breakthrough because US-led export controls over GPUs and other technology may have spurred DeepSeek to innovate. In other words, they engineered their way around the roadblocks put up to slow them down, necessity being the mother of invention, or at least of innovation around efficiency in this case. Though some have also suggested they might have copied the work of others. For example, DeepSeek V3 sometimes identifies itself as ChatGPT when asked which model it is, leading some to speculate that its training data sets may contain text generated by ChatGPT. There are also censorship concerns. DeepSeek's latest AI model, R1, seems to stick to Chinese government restrictions on sensitive topics like Tiananmen Square, Taiwan, and the treatment of Uyghurs in China. But with DeepSeek apps topping the app stores, the suggestion is that none of this may matter. The AI community could naturally gravitate toward using models that are far cheaper to operate. Quoting VentureBeat: "The implications for enterprise AI strategies are profound. With reduced costs and open access, enterprises now have an alternative to costly proprietary models like OpenAI's. DeepSeek's release could democratize access to cutting-edge AI capabilities, enabling smaller organizations to compete effectively in the AI arms race." Why is this having such an impact on people's assumptions?
Let's use Nvidia as the prime example of the potential implications here. Quoting Jeffrey Emanuel: "Perhaps most devastating is DeepSeek's recent efficiency breakthrough, achieving comparable model performance at approximately 1/45th the compute cost. This suggests the entire industry has been massively over-provisioning compute resources. Combined with the emergence of more efficient inference architectures through chain-of-thought models, the aggregate demand for compute could be significantly lower than current projections. The economics are compelling: when DeepSeek can match GPT-4-level performance while charging 95% less for API calls, it suggests either Nvidia's customers are burning cash unnecessarily or margins must come down dramatically. The fact that TSMC will manufacture competitive chips for any well-funded customer puts a natural ceiling on Nvidia's architectural advantages. But more fundamentally, history shows that markets eventually find a way around artificial bottlenecks that generate supernormal profits."

Say goodbye to weak erections with Joy Mode's Sexual Performance Booster. This all-natural supplement is designed to enhance blood flow, giving you firmer, more reliable performance. Joy Mode features just four simple, proven ingredients. Unlike prescription options that involve doctor visits and managing refills, Joy Mode provides a straightforward, effective solution. Simply mix a pack with water and feel the effects within 45 minutes. But Joy Mode isn't just about better sex: daily use supports healthier blood vessels, boosts heart health, and enhances your athletic performance. Join over 200,000 men who trust Joy Mode to boost their performance without the side effects. Start a subscription and save up to 30%. Keep your performance at its best without any interruptions. Take control with Joy Mode. Get a boost anytime, anywhere, and never miss a beat.
If you are looking to take your game to the next level, visit tryjoymode.com and use code RIDE at checkout for 20% off single purchases and 30% off subscription orders. That's T-R-Y-J-O-Y-M-O-D-E dot com, and use code RIDE for 20% off single purchases and 30% off subscription orders.

I like having a secret weapon when I go into some sort of a business negotiation situation. An ace in the hole, if you will. An advantage in my back pocket. That's how Mack Weldon thinks about clothing: as a secret weapon. Timeless, classic style that's infused with performance fabrics and hidden details to give you secret confidence in how you look. Mack Weldon has become my go-to business attire. Some guys just want to look good without calling attention to themselves. Mack Weldon apparel gives you understated good looks for understated confidence. They're not flashy, just classic. Always in style and made from the world's most comfortable performance materials, Mack Weldon clothes are designed to fit your style and the demands of modern life. They look like regular clothes but feel like the latest in modern comfort. They're the go-to choice for guys who want to look great without even trying. Get timeless looks with modern comfort from Mack Weldon. Go to mackweldon.com and get 25% off your first order of $125 or more with promo code BRIAN. That's M-A-C-K-W-E-L-D-O-N dot com, promo code B-R-I-A-N.

But how exactly did DeepSeek outpace OpenAI and others at a fraction of the cost? First, open source, as we've been saying, but there are other details too. Quoting VentureBeat: "With Monday's full release of R1 and the accompanying technical paper, the company revealed a surprising innovation: a deliberate departure from the conventional supervised fine-tuning (SFT) process widely used in training large language models. SFT, a standard step in AI development, involves training models on curated data sets to teach step-by-step reasoning, often referred to as chain of thought (CoT).
It is considered essential for improving reasoning capabilities. However, DeepSeek challenged this assumption by skipping SFT entirely, opting instead to rely on reinforcement learning (RL) to train the model. This bold move forced DeepSeek R1 to develop independent reasoning abilities, avoiding the brittleness often introduced by prescriptive data sets. While some flaws emerged, leading the team to reintroduce a limited amount of SFT during the final stages of building the model, the results confirmed the fundamental breakthrough: reinforcement learning alone could drive substantial performance gains. Little is known about the company's exact approach, but it quickly open sourced its models, and it's extremely likely that the company built upon open projects produced by Meta, for example the Llama model and the ML library PyTorch. To train its models, High-Flyer secured over 10,000 Nvidia GPUs before US export restrictions, and reportedly expanded to 50,000 GPUs through alternative supply routes despite trade barriers. This pales compared to leading AI labs like OpenAI, Google, and Anthropic, which operate with more than 500,000 GPUs each. The journey to DeepSeek R1's final iteration began with an intermediate model, DeepSeek R1-Zero, which was trained using pure reinforcement learning. By relying solely on RL, DeepSeek incentivized this model to think independently, rewarding both correct answers and the logical processes used to arrive at them. This approach led to an unexpected phenomenon: the model began allocating additional processing time to more complex problems, demonstrating an ability to prioritize tasks based on their difficulty. DeepSeek's researchers described this as an "aha moment," where the model itself identified and articulated novel solutions to challenging problems. This milestone underscored the power of reinforcement learning to unlock advanced reasoning capabilities without relying on traditional training methods like SFT.
End quote. And more from Jeffrey Emanuel, quote: DeepSeek has made profound advancements not just in model quality, but more importantly in model training and inference efficiency. By being extremely close to the hardware and by layering together a handful of distinct, very clever optimizations, DeepSeek was able to train these incredible models using GPUs in a dramatically more efficient way. How in the world could this be possible? How could this little Chinese company completely upstage all the smartest minds at our leading AI labs, which have 100 times more resources, headcount, payroll, capital, GPUs, et cetera? Wasn't China supposed to be crippled by Biden's restrictions on GPU exports? Well, the details are fairly technical, but we can at least describe them at a high level. It might have just turned out that the relative GPU processing poverty of DeepSeek was the critical ingredient to make them more creative and clever, necessity being the mother of invention and all. A major innovation is their sophisticated mixed-precision training framework that lets them use 8-bit floating point numbers (FP8) throughout the entire training process, whereas most Western labs train using full-precision 32-bit numbers. The bit width basically specifies the number of gradations possible in describing the output of an artificial neuron. FP8 lets you store a much wider range of numbers than you might expect: it's not limited to 256 different equal-size magnitudes like you'd get with regular 8-bit integers, but instead uses clever math tricks to store both very small and very large numbers, though naturally with less precision than you'd get with 32 bits. DeepSeek cracked this problem by developing a clever system that breaks numbers into small tiles for activations and blocks for weights, and strategically uses high-precision calculations at key points in the network.
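To make the tiles-and-blocks idea concrete, here is a minimal plain-Python simulation of block-wise quantization. This is a sketch of the general technique, not DeepSeek's actual kernels or real FP8 arithmetic; the block size, level count, and sample values are all assumptions for illustration.

```python
def quantize_blockwise(values, block_size=4, levels=256):
    """Store each block as (one full-precision scale, small integers).

    A per-block scale means a block of tiny activations keeps fine
    resolution even when another block holds huge values. This is a
    plain-Python simulation of the idea, not real FP8 math.
    """
    quantized = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0   # avoid a zero scale
        ints = [round(v / scale * (levels - 1)) for v in block]
        quantized.append((scale, ints))
    return quantized

def dequantize_blockwise(quantized, levels=256):
    out = []
    for scale, ints in quantized:
        out.extend(i / (levels - 1) * scale for i in ints)
    return out

# Tiny activations and huge weights in the same tensor:
vals = [0.001, 0.002, -0.0015, 0.0005, 900.0, -850.0, 120.0, 64.0]
restored = dequantize_blockwise(quantize_blockwise(vals))
# With a single global scale of 900, every value in the first block
# would round to 0; per-block scales recover them to within about 1%.
```

The design point is the same one the quote makes: the scale factors live at high precision, so the cheap low-precision integers only ever describe values within one block's narrow range.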
Unlike other labs that train in high precision and then compress later, losing some quality in the process, DeepSeek's native FP8 approach means they get the massive memory savings without compromising performance. When you're training across thousands of GPUs, this dramatic reduction in memory requirements per GPU translates into needing far fewer GPUs overall. Another major breakthrough is their multi-token prediction system. Most transformer-based LLMs do inference by predicting the next token, one token at a time. DeepSeek figured out how to predict multiple tokens while maintaining the quality you'd get from single-token prediction. Their approach achieves about 85 to 90% accuracy on these additional token predictions, which effectively doubles inference speed without sacrificing much quality. The clever part is they maintain the complete causal chain of predictions, so the model isn't just guessing; it's making structured, contextual predictions. A third advance is how they compress the attention mechanism's key-value data. The brilliant part is this compression is built directly into how the model learns. It's not some separate step they need to do; it's built directly into the end-to-end training pipeline. This means that the entire mechanism is differentiable and able to be trained directly using the standard optimizers. All this stuff works because these models are ultimately finding much lower-dimensional representations of the underlying data than the so-called ambient dimensions, so it's wasteful to store the full KV indices, even though that is basically what everyone else does. You end up wasting tons of space by storing way more numbers than you need, and avoiding that waste gives a massive boost to the training memory footprint and efficiency, again slashing the number of GPUs you need to train a world-class model.
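That "effectively doubles inference speed" figure is easy to sanity-check with back-of-the-envelope arithmetic. The toy model below is my simplification, not DeepSeek's published analysis: it counts the expected number of tokens emitted per decoding step when each extra predicted token is accepted with some probability.

```python
def expected_speedup(extra_tokens: int, acceptance_rate: float) -> float:
    """Expected tokens emitted per decoding step.

    The model always emits the ordinary next token, then proposes
    `extra_tokens` more; each extra token is kept only if all earlier
    extras were kept, with per-token probability `acceptance_rate`.
    (A toy model of speculative acceptance, not DeepSeek's exact scheme.)
    """
    total = 1.0            # the ordinary next token always counts
    keep_prob = 1.0
    for _ in range(extra_tokens):
        keep_prob *= acceptance_rate
        total += keep_prob
    return total

# One extra predicted token at the quoted 85-90% acceptance rate:
print(expected_speedup(1, 0.85))  # -> 1.85
print(expected_speedup(1, 0.90))  # -> 1.9
```

At the quoted 85 to 90% acceptance for one extra token, you land at roughly 1.85 to 1.9 tokens per step, which is where the "roughly doubles" claim comes from.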
And it can actually end up improving model quality, because the compression can act like a regularizer, forcing the model to pay attention to the truly important stuff instead of using the wasted capacity to fit noise in the training data. So not only do you save a ton of memory, but the model might even perform better. At the very least, you don't get a massive hit to performance in exchange for the huge memory savings, which is generally the kind of trade-off you are faced with in AI training. Another very smart thing they did is to use what is known as a mixture-of-experts (MoE) transformer architecture, but with key innovations around load balancing. As you might know, the size or capacity of an AI model is often measured in terms of the number of parameters the model contains. A parameter is just a number that stores some attribute of the model: either the weight or importance a particular artificial neuron has relative to another one, or the importance of a particular token depending on its context and the attention mechanism, et cetera. Meta's latest Llama 3 model comes in a few sizes, for example a 1 billion parameter version (the smallest), a 70 billion parameter model (the most commonly deployed one), and even a massive 405 billion parameter model. This largest model is of limited utility for most users, because you would need tens of thousands of dollars worth of GPUs in your computer just to run it at tolerable speeds for inference, at least if you deployed it in the native full-precision version. Therefore, most of the real-world usage and excitement surrounding these open source models is at the 8 billion parameter or highly quantized 70 billion parameter level, since that's what can fit in a consumer-grade Nvidia 4090 GPU, which you can buy now for under $1,000. So why does any of this matter? Well, in a sense, the parameter count and precision tell you something about how much raw information or data the model has stored internally.
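The memory math behind those hardware claims is simple: the weights-only footprint is the parameter count times the bits per parameter. Here is a quick sketch; the precisions are illustrative assumptions, and a real deployment also needs memory for the KV cache and activations on top of this.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory (decimal GB) to hold just the model weights,
    ignoring KV cache, activations, and framework overhead."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# Illustrative precisions (assumed, not official deployment configs):
print(weight_memory_gb(405, 16))   # -> 810.0  (FP16 405B: far beyond one GPU)
print(weight_memory_gb(70, 16))    # -> 140.0  (FP16 70B)
print(weight_memory_gb(70, 2.5))   # -> 21.875 (aggressively quantized 70B)
print(weight_memory_gb(8, 16))     # -> 16.0   (FP16 8B)
```

This is why the 8B model, and only heavily quantized versions of the 70B model, squeeze into the 24 GB of a consumer RTX 4090, while the 405B model needs a multi-GPU server even before you account for serving overhead.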
Note that I'm not talking about reasoning ability, or the model's IQ, if you will. It turns out that models with even surprisingly modest parameter counts can show remarkable cognitive performance when it comes to solving complex logic problems, proving theorems in plane geometry, SAT math problems, et cetera. Okay, look, as I said, this whole day is about DeepSeek. And here's more of why. Quoting Axios: "This could be an extinction-level event for venture capital firms that went all in on foundational model companies, particularly if those companies haven't yet productized with wide distribution. The quantums of capital are just so much more than anything VC has ever before disbursed, based on what might be a suddenly stale thesis. If nanotech and web3 were venture industry grenades, this could be a nuclear bomb. Investors I spoke to over the weekend aren't panicking, but they're clearly concerned, particularly that they could be taken so off guard. Don't be surprised if some deals in process get paused. There's still a ton we don't know about DeepSeek, including if it really spent as little money as it claims. And obviously there could be national security impediments for US companies or consumers, given what we've seen with TikTok. But bottom line: the game has changed." And finally, let's end with Joe Weisenthal taking the contrarian view just a bit. That is, maybe if AI is racing down to becoming a commodity, that could be a good thing. Suddenly everyone is talking about Jevons paradox. This is usually discussed with respect to energy markets: basically, when you get more energy efficient, you don't use less of the energy source; you just use your efficiency gains to do new things, and demand keeps booming. This is certainly the hope if you're Nvidia or any company that builds underlying AI infrastructure: that everyone will use the DeepSeek breakthroughs and just race even faster, with no effect on total demand for compute. We'll see.
As I'm typing this, Nvidia has opened down about 13%. Certainly investors aren't taking much comfort in Jevons paradox right now. One of my favorite Tracy Alloway lines is that it's only a crisis when you can't throw money at the problem. COVID was a crisis because money alone wasn't enough to address it. The supply chain shocks were a crisis because money alone couldn't fix the problem. There's no guarantee here that just throwing more money at US tech companies will be enough to keep them competitive in AI, let alone chips, if it's perceived that they're falling behind. Human capital, talent, takes years and years to develop. Getting the incentives right is not something where you can snap your fingers overnight and make things happen. These are big, slow-moving things. Nothing more for you today. Talk to you tomorrow.
