Transcript
Brian McCullough (0:04)
Welcome to the Techmeme Ride Home for Monday, January 27th, 2025. I'm Brian McCullough. Today it's one of those days where there's only one story. Maybe you saw that tech stocks got obliterated today. I'm here to tell you why: it's solely because of DeepSeek and Chinese AI tech generally, how this tech is making people think twice about the AI boom, what DeepSeek did that is different, and how this could affect all of Silicon Valley. Here's what you missed today in the world of tech.

Here's why the stock market is having a bit of a crash this morning. Nvidia down more than 8%. Meta and Microsoft both down. ASML down almost 10%. Japanese chip companies falling. Crypto also falling. It's all because of DeepSeek. We spoke about DeepSeek last month with Simon Willison. To sum it up most succinctly, DeepSeek was apparently able to train an AI model at 3% of the cost of cutting-edge models from the likes of OpenAI. So why do you need to buy 100,000 H100s from Nvidia when maybe you only need 3,000? See, a lot of the eye-popping capex spending from the likes of every tech player in the world was predicated on the idea of scale. The only way to get smarter AI was to throw more and more compute at it, which meant more and more GPUs and more and more data centers. I mean, this was the whole premise behind the Stargate announcement. But what this is making people think is: what if that is no longer true? Then all of this spending could be pulled back all at once, and thus the crash. But not only that, DeepSeek has jumped to the top of the app store charts. It's suddenly seeing rapid adoption in the AI community, and DeepSeek is Chinese tech. And not only that, it's open source tech, not proprietary. If this cheaper tech, which is open source, is just as good, then that would mean that the ginormous valuations for the likes of OpenAI and Anthropic and the rest might not be warranted, suggesting a bubble would pop in VC funding.
While it remains to be seen if DeepSeek will prove to be a viable, cheaper alternative in the long term, initial worries are centered on whether US tech giants' pricing power is being threatened and if their massive AI spending needs reevaluation, said Jun Rong Yip of IG Asia. That a small and efficient AI model emerged from China, which has been subject to escalating U.S. trade sanctions on advanced Nvidia chips, is also challenging the effectiveness of such measures. Certainly US tech players seem to be taking this seriously. Marc Andreessen called DeepSeek "one of the most amazing and impressive breakthroughs," and Meta has reportedly set up four war rooms to analyze DeepSeek's tech, two focusing on how High-Flyer cut training costs and one on what data High-Flyer might have used. But back to the stock market fallout, quoting the Financial Times: "It's DeepSeek for sure," said one Tokyo-based fund manager of the selling on Monday, adding that investors were rapidly assessing whether hardware spending on AI could ultimately be a lot lower than current estimates. AI investment by large-cap US tech companies hit $224 billion last year, according to UBS, which expects the total to reach $280 billion this year. OpenAI and SoftBank announced last week a plan to invest $500 billion over the next four years in AI infrastructure. End quote. That's a ton of very stimulative spending in the economy that could, again, potentially dry up if the status quo is upended. Who or what is DeepSeek, the single AI lab that is crashing the stock market and roiling Silicon Valley? DeepSeek is a Chinese AI lab that started as a deep learning research branch of Chinese quant hedge fund High-Flyer. They've released several different models, all of which seem to be just as capable as the highest-end AI models produced by the recent flurry of Western AI startups.
Again, crucially, while their models seem to be cutting edge, their costs, in terms of the money and compute needed to train them, are believed to be a fraction of what Western models cost. One model reportedly cost $6 million to train, as opposed to the hundreds of millions of dollars that has become table stakes for other AI tech. Now, this has not been without controversy. The assumption is that these Chinese models, along with others from the likes of ByteDance which have shown similar cost-versus-performance improvements, were able to make this breakthrough because US-led export controls over GPUs and other technology may have spurred DeepSeek to innovate. In other words, they engineered their way around the roadblocks put up to slow them down, necessity being the mother of invention, or at least of innovation around efficiency in this case. Though some have also suggested they might have copied the work of others. For example, DeepSeek V3 sometimes identifies itself as ChatGPT when asked which model it is, leading some to speculate that its training data sets may contain text generated by ChatGPT. There are also censorship concerns. DeepSeek's latest AI model, R1, seems to stick to Chinese government restrictions on sensitive topics like Tiananmen Square, Taiwan, and the treatment of Uyghurs in China. But with DeepSeek apps topping the app stores, the suggestion is that none of this may matter. The AI community could naturally gravitate toward using models that are far cheaper to operate. Quoting VentureBeat: "The implications for enterprise AI strategies are profound. With reduced costs and open access, enterprises now have an alternative to costly proprietary models like OpenAI's. DeepSeek's release could democratize access to cutting-edge AI capabilities, enabling smaller organizations to compete effectively in the AI arms race." Why is this having such an impact on people's assumptions?
Let's use Nvidia as the prime example of the potential implications here. Quoting Jeffrey Emanuel: "Perhaps most devastating is DeepSeek's recent efficiency breakthrough, achieving comparable model performance at approximately 1/45th the compute cost. This suggests the entire industry has been massively over-provisioning compute resources. Combined with the emergence of more efficient inference architectures through chain-of-thought models, the aggregate demand for compute could be significantly lower than current projections. The economics are compelling: when DeepSeek can match GPT-4-level performance while charging 95% less for API calls, it suggests either Nvidia's customers are burning cash unnecessarily or margins must come down dramatically. The fact that TSMC will manufacture competitive chips for any well-funded customer puts a natural ceiling on Nvidia's architectural advantages. But more fundamentally, history shows that markets eventually find a way around artificial bottlenecks that generate supernormal profits."

Say goodbye to weak erections with Joy Mode's Sexual Performance Booster. This all-natural supplement is designed to enhance blood flow, giving you firmer, more reliable performance. Joy Mode features just four simple, proven ingredients. Unlike prescription options that involve doctor visits and managing refills, Joy Mode provides a straightforward, effective solution. Simply mix a pack with water and feel the effects within 45 minutes. But Joy Mode isn't just about better sex: daily use supports healthier blood vessels, boosts heart health, and enhances your athletic performance. Join over 200,000 men who trust Joy Mode to boost their performance without the side effects. Start a subscription and save up to 30%. Keep your performance at its best without any interruptions. Take control with Joy Mode. Get a boost anytime, anywhere, and never miss a beat.
If you are looking to take your game to the next level, visit tryjoymode.com and use code RIDE at checkout for 20% off single purchases and 30% off subscription orders. That's T-R-Y-J-O-Y-M-O-D-E dot com, and use code RIDE for 20% off single purchases and 30% off subscription orders.

I like having a secret weapon when I go into some sort of a business negotiation situation. An ace in the hole, if you will. An advantage in my back pocket. That's how Mack Weldon thinks about clothing: as a secret weapon. Timeless, classic style that's infused with performance fabrics and hidden details to give you secret confidence in how you look. Mack Weldon has become my go-to business attire. Some guys just want to look good without calling attention to themselves. Mack Weldon apparel gives you understated good looks for understated confidence. They're not flashy, just classic. Always in style and made from the world's most comfortable performance materials, Mack Weldon clothes are designed to fit your style and the demands of modern life. They look like regular clothes but feel like the latest in modern comfort. They're the go-to choice for guys who want to look great without even trying. Get timeless looks with modern comfort from Mack Weldon. Go to mackweldon.com and get 25% off your first order of $125 or more with promo code BRIAN. That's M-A-C-K-W-E-L-D-O-N dot com, promo code B-R-I-A-N.

But how exactly did DeepSeek outpace OpenAI and others at a fraction of the cost? First, open source, as we've been saying, but there are other details too. Quoting VentureBeat: "With Monday's full release of R1 and the accompanying technical paper, the company revealed a surprising innovation: a deliberate departure from the conventional supervised fine-tuning (SFT) process widely used in training large language models. SFT, a standard step in AI development, involves training models on curated data sets to teach step-by-step reasoning, often referred to as chain of thought (CoT).
It is considered essential for improving reasoning capabilities. However, DeepSeek challenged this assumption by skipping SFT entirely, opting instead to rely on reinforcement learning (RL) to train the model. This bold move forced DeepSeek R1 to develop independent reasoning abilities, avoiding the brittleness often introduced by prescriptive data sets. While some flaws emerged, leading the team to reintroduce a limited amount of SFT during the final stages of building the model, the results confirmed the fundamental breakthrough: reinforcement learning alone could drive substantial performance gains. Little is known about the company's exact approach, but it quickly open sourced its models, and it's extremely likely that the company built upon open projects produced by Meta, for example the Llama model and the ML library PyTorch. To train its models, High-Flyer secured over 10,000 Nvidia GPUs before US export restrictions, and reportedly expanded to 50,000 GPUs through alternative supply routes despite trade barriers. This pales compared to leading AI labs like OpenAI, Google, and Anthropic, which operate with more than 500,000 GPUs each. The journey to DeepSeek R1's final iteration began with an intermediate model, DeepSeek R1-Zero, which was trained using pure reinforcement learning. By relying solely on RL, DeepSeek incentivized this model to think independently, rewarding both correct answers and the logical processes used to arrive at them. This approach led to an unexpected phenomenon: the model began allocating additional processing time to more complex problems, demonstrating an ability to prioritize tasks based on their difficulty. DeepSeek's researchers described this as an "aha moment," where the model itself identified and articulated novel solutions to challenging problems. This milestone underscored the power of reinforcement learning to unlock advanced reasoning capabilities without relying on traditional training methods like SFT.
End quote. And more from Jeffrey Emanuel, quote: DeepSeek has made profound advancements not just in model quality, but more importantly in model training and inference efficiency. By being extremely close to the hardware and by layering together a handful of distinct, very clever optimizations, DeepSeek was able to train these incredible models using GPUs in a dramatically more efficient way. How in the world could this be possible? How could this little Chinese company completely upstage all the smartest minds at our leading AI labs, which have 100 times more resources, headcount, payroll, capital, GPUs, et cetera? Wasn't China supposed to be crippled by Biden's restrictions on GPU exports? Well, the details are fairly technical, but we can at least describe them at a high level. It might have just turned out that the relative GPU processing poverty of DeepSeek was the critical ingredient to make them more creative and clever, necessity being the mother of invention and all. A major innovation is their sophisticated mixed-precision training framework that lets them use 8-bit floating point numbers (FP8) throughout the entire training process, whereas most Western labs train using full-precision 32-bit numbers. The bit width basically specifies the number of gradations possible in describing the output of an artificial neuron. FP8 lets you store a much wider range of numbers than you might expect: it's not limited to 256 different equal-size magnitudes like you'd get with regular 8-bit integers, but instead uses clever math tricks to store both very small and very large numbers, though naturally with less precision than you'd get with 32 bits. DeepSeek cracked this problem by developing a clever system that breaks numbers into small tiles for activations and blocks for weights, and strategically uses high-precision calculations at key points in the network.
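To make the tiles-and-blocks idea concrete, here is a minimal plain-Python simulation of block-wise quantization. This is a sketch of the general technique, not DeepSeek's actual kernels or real FP8 arithmetic; the block size, level count, and sample values are all assumptions for illustration.

```python
def quantize_blockwise(values, block_size=4, levels=256):
    """Store each block as (one full-precision scale, small integers).

    A per-block scale means a block of tiny activations keeps fine
    resolution even when another block holds huge values. This is a
    plain-Python simulation of the idea, not real FP8 math.
    """
    quantized = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0   # avoid a zero scale
        ints = [round(v / scale * (levels - 1)) for v in block]
        quantized.append((scale, ints))
    return quantized

def dequantize_blockwise(quantized, levels=256):
    out = []
    for scale, ints in quantized:
        out.extend(i / (levels - 1) * scale for i in ints)
    return out

# Tiny activations and huge weights in the same tensor:
vals = [0.001, 0.002, -0.0015, 0.0005, 900.0, -850.0, 120.0, 64.0]
restored = dequantize_blockwise(quantize_blockwise(vals))
# With a single global scale of 900, every value in the first block
# would round to 0; per-block scales recover them to within about 1%.
```

The design point is the same one the quote makes: the scale factors live at high precision, so the cheap low-precision integers only ever describe values within one block's narrow range.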
Unlike other labs that train in high precision and then compress later, losing some quality in the process, DeepSeek's native FP8 approach means they get the massive memory savings without compromising performance. When you're training across thousands of GPUs, this dramatic reduction in memory requirements per GPU translates into needing far fewer GPUs overall. Another major breakthrough is their multi-token prediction system. Most transformer-based LLMs do inference by predicting the next token, one token at a time. DeepSeek figured out how to predict multiple tokens while maintaining the quality you'd get from single-token prediction. Their approach achieves about 85 to 90% accuracy on these additional token predictions, which effectively doubles inference speed without sacrificing much quality. The clever part is they maintain the complete causal chain of predictions, so the model isn't just guessing; it's making structured, contextual predictions. A third advance is how they compress the attention mechanism's key-value data. The brilliant part is this compression is built directly into how the model learns. It's not some separate step they need to do; it's built directly into the end-to-end training pipeline. This means that the entire mechanism is differentiable and able to be trained directly using the standard optimizers. All this stuff works because these models are ultimately finding much lower-dimensional representations of the underlying data than the so-called ambient dimensions, so it's wasteful to store the full KV indices, even though that is basically what everyone else does. You end up wasting tons of space by storing way more numbers than you need, and avoiding that waste gives a massive boost to the training memory footprint and efficiency, again slashing the number of GPUs you need to train a world-class model.
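That "effectively doubles inference speed" figure is easy to sanity-check with back-of-the-envelope arithmetic. The toy model below is my simplification, not DeepSeek's published analysis: it counts the expected number of tokens emitted per decoding step when each extra predicted token is accepted with some probability.

```python
def expected_speedup(extra_tokens: int, acceptance_rate: float) -> float:
    """Expected tokens emitted per decoding step.

    The model always emits the ordinary next token, then proposes
    `extra_tokens` more; each extra token is kept only if all earlier
    extras were kept, with per-token probability `acceptance_rate`.
    (A toy model of speculative acceptance, not DeepSeek's exact scheme.)
    """
    total = 1.0            # the ordinary next token always counts
    keep_prob = 1.0
    for _ in range(extra_tokens):
        keep_prob *= acceptance_rate
        total += keep_prob
    return total

# One extra predicted token at the quoted 85-90% acceptance rate:
print(expected_speedup(1, 0.85))  # -> 1.85
print(expected_speedup(1, 0.90))  # -> 1.9
```

At the quoted 85 to 90% acceptance for one extra token, you land at roughly 1.85 to 1.9 tokens per step, which is where the "roughly doubles" claim comes from.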
And it can actually end up improving model quality, because the compression can act like a regularizer, forcing the model to pay attention to the truly important stuff instead of using the wasted capacity to fit noise in the training data. So not only do you save a ton of memory, but the model might even perform better. At the very least, you don't get a massive hit to performance in exchange for the huge memory savings, which is generally the kind of trade-off you are faced with in AI training. Another very smart thing they did is to use what is known as a mixture-of-experts (MoE) transformer architecture, but with key innovations around load balancing. As you might know, the size or capacity of an AI model is often measured in terms of the number of parameters the model contains. A parameter is just a number that stores some attribute of the model: either the weight or importance a particular artificial neuron has relative to another one, or the importance of a particular token depending on its context and the attention mechanism, et cetera. Meta's latest Llama 3 model comes in a few sizes, for example a 1 billion parameter version (the smallest), a 70 billion parameter model (the most commonly deployed one), and even a massive 405 billion parameter model. This largest model is of limited utility for most users, because you would need tens of thousands of dollars worth of GPUs in your computer just to run it at tolerable speeds for inference, at least if you deployed it in the native full-precision version. Therefore, most of the real-world usage and excitement surrounding these open source models is at the 8 billion parameter or highly quantized 70 billion parameter level, since that's what can fit in a consumer-grade Nvidia 4090 GPU, which you can buy now for under $1,000. So why does any of this matter? Well, in a sense, the parameter count and precision tell you something about how much raw information or data the model has stored internally.
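The memory math behind those hardware claims is simple: the weights-only footprint is the parameter count times the bits per parameter. Here is a quick sketch; the precisions are illustrative assumptions, and a real deployment also needs memory for the KV cache and activations on top of this.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory (decimal GB) to hold just the model weights,
    ignoring KV cache, activations, and framework overhead."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# Illustrative precisions (assumed, not official deployment configs):
print(weight_memory_gb(405, 16))   # -> 810.0  (FP16 405B: far beyond one GPU)
print(weight_memory_gb(70, 16))    # -> 140.0  (FP16 70B)
print(weight_memory_gb(70, 2.5))   # -> 21.875 (aggressively quantized 70B)
print(weight_memory_gb(8, 16))     # -> 16.0   (FP16 8B)
```

This is why the 8B model, and only heavily quantized versions of the 70B model, squeeze into the 24 GB of a consumer RTX 4090, while the 405B model needs a multi-GPU server even before you account for serving overhead.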
Note that I'm not talking about reasoning ability, or the model's IQ, if you will. It turns out that models with even surprisingly modest parameter counts can show remarkable cognitive performance when it comes to solving complex logic problems, proving theorems in plane geometry, SAT math problems, et cetera. Okay, look, as I said, this whole day is about DeepSeek. And here's more of why. Quoting Axios: "This could be an extinction-level event for venture capital firms that went all in on foundational model companies, particularly if those companies haven't yet productized with wide distribution. The quantums of capital are just so much more than anything VC has ever before disbursed, based on what might be a suddenly stale thesis. If nanotech and web3 were venture industry grenades, this could be a nuclear bomb. Investors I spoke to over the weekend aren't panicking, but they're clearly concerned, particularly that they could be taken so off guard. Don't be surprised if some deals in process get paused. There's still a ton we don't know about DeepSeek, including if it really spent as little money as it claims. And obviously there could be national security impediments for US companies or consumers, given what we've seen with TikTok. But bottom line: the game has changed." And finally, let's end with Joe Weisenthal taking the contrarian view just a bit. That is, maybe if AI is racing down to becoming a commodity, that could be a good thing. Suddenly everyone is talking about Jevons paradox. This is usually discussed with respect to energy markets: basically, when you get more energy efficient, you don't use less of the energy source; you just use your efficiency gains to do new things, and demand keeps booming. This is certainly the hope if you're Nvidia or any company that builds underlying AI infrastructure: that everyone will use the DeepSeek breakthroughs and just race even faster, with no effect on total demand for compute. We'll see.
As I'm typing this, Nvidia has opened down about 13%. Certainly investors aren't taking much comfort in Jevons paradox right now. One of my favorite Tracy Alloway lines is that it's only a crisis when you can't throw money at the problem. COVID was a crisis because money alone wasn't enough to address it. The supply chain shocks were a crisis because money alone couldn't fix the problem. There's no guarantee here that just throwing more money at US tech companies will be enough to keep them competitive in AI, let alone chips, if it's perceived that they're falling behind. Human capital, talent, takes years and years to develop. Getting the incentives right is not something where you can snap your fingers overnight and make things happen. These are big, slow-moving things. Nothing more for you today. Talk to you tomorrow.
