
Loading summary
A
Hey everyone.
B
I'm super excited to be sitting down with Sebastian Raschka. He's the author of the Build a Large Language Model from Scratch and Build a Large Reasoning Model from Scratch book and video series. As an LLM research engineer, he's bridged academic teaching at the University of Wisconsin, Madison with the Hyper Practical working for the AI development platform Lightning AI. But what I love about Sebastian is that he has zero appetite for AI hype and can dive right into what you can actually do with these tools. I want to ask him what impact AI is having on coders? Is it really going to make them obsolete? What real capabilities can we expect AI to develop from here? And who should be actually building an LLM from scratch? Let's find out. Sebastian, thanks so much for joining today. For those who don't know, Sebastian is the author of Building an LLM from Scratch bulk book as well as video series on YouTube. Sebastian really gets deep into the technical aspects of LLMs and how we can actually create these and build our own and get beyond the hype. Sometimes we hear about of AI, but just before we get into who should be doing that, what that looks like, I wanted to zoom out a little bit and Sebastian, maybe you can tell me a little bit about in your view, the state of LLMs in 2026, how are the capabilities advancing? What do you see on the horizon for this technology?
A
Yeah, thanks for first, thanks for inviting me on the podcast to talk about LLMs. It's one of my favorite topics so I think we will have a lot of fun in this episode. But yeah, you began with a very broad question here. Like the state of LLMs in 2026. I would say 2025 was particularly interesting because there was at the beginning of 2025 deep seq and then this new paradigm. We can maybe get into this in more detail later. But the reinforcement learning with verifiable rewards which is like a technique to develop reasoning capabilities in LLMs. Reasoning is also in quotation marks. It's a broad topic. It's like the reasoning in LLMs. I would say we shouldn't take it too literal like how humans reason, but it is like a set of techniques that make LLMs better at solving complex tasks. And 2025 was pretty dominated by this idea of developing these so called reasoning sometimes called thinking models. So everyone like from OpenAI to Google Cloud, Grok, all the open weight LLMs they all have now like different variants of LLMs like the regular instruct or the regular variant and then the thinking variant we can maybe talk also later a bit about the trade offs here. But so I think this was like 2025. We will see this kind of like continue in 2026 because these techniques, they are still relatively new and people are currently, I would say in the first iteration or version of these techniques where okay, this works and now it's like let's hone in on that and let's make that even better, add some tips and tricks to that and like really like exploit that type of mechanism. So we will see more of that at the same time I think also I would say a lot of progress came from the inference scaling side. So that basically two paradigms for LLMs. One is like the training and then is the usage, the inference. And there's also like a trade off. You can spend a lot of money in training which is very expensive and then you get that one model and it gets maybe used for a few months and then it gets replaced by the next model or you may also even update this. But yeah, training is very expensive and it gives you longer training like the scaling laws. Training on more data gives you better models, essentially better performance. And it is expensive though. So you can also spend money extra compute. And I mean extra compute usually costs extra money because you need more resources during inference. That means after the training when you use that LLM so you can say okay, instead of let's say I have a user and the user has a query, instead of just giving the first answer, you can have maybe three answers. Like if it's a math question, you have the LLM with different settings running three times and then you take the highest scoring one or the majority vote, but that will be three times as expensive. So that's one version of inference scaling. There are other techniques, for example generating longer outputs and that sometimes helps the LLM also to think through a problem like in quotation marks think coming back to the reasoning. And so I would say to answer your question before we go into technical details, 2026, I would say it is still on that trajectory of we can still make a lot of progress in training, especially maybe not so much that pre training because that's not where the low hanging fruit is anymore, but more like the reinforcement learning for the reasoning capabilities and at the same time more clever inference scaling techniques like more of that. So they're like these two things and I think this is going to continue. I don't see anything right now surprising on the horizon. But this is always like that in AI. If someone knew a new architecture that is much better than the current status quo. That's like a trillion dollar idea. Basically. We would. So if someone has something like that, it wouldn't be something someone had shared already. So it's always gonna be a surprise if something like that lands. But I don't see any indicators or any, anything that, you know, gives me like, like the confidence that there will be something really, really different. It will be more like honing in on these things, basically.
B
I think that's really interesting. And so much of, you know, your outlook, Sebastian, comes back to this notion of reasoning or inference and you know, use the word thinking in quotation marks. But the reason this is so interesting to me is it sounds like, you know, there's not going to be, in your view, a huge, you know, step up in 2026. From what we see right now, we're going to continue to see performance improvements. But the reason I find the reasoning piece so interesting is so much of what you hear in the media and so much of the noise coming out of, you know, Silicon Valley and From, you know, CEOs, you know, in any industry is these sort of, you know, grandiose statements about what they will use AI for, what AI can do and how transformative it is. You know, I'm curious in your mind if you can, you know, share a little bit more about what are some of the better use cases for this and maybe, you know, debunk some of the things that this technology is just still not going to be dope to be able to do at the end of 2026. And I think specifically when you mentioned reasoning about, I call it the Strawberry problem and you may have heard of this example, but the issue right now that if you ask a lot of the leading LLMs, how many Rs are there in Strawberry, it can't accurately answer that question. It says, oh, there's two Rs, there's one in straw and one in berry, which is obviously not right. And, and sort of lays bare some of the reasoning limitations here. So what will this be good for at the end of the year? What will it still not be good for at the end of the year?
A
Yeah, so I should also. Well, yeah, it's also a broad question, like not a super broad question, but like, I wanted to preface this with another point to add on, on my previous answer with the reasoning capabilities. It's also a spectrum, like. Well, what I'm trying to say is the same LLM with, let's say different inference, scaling can have different capabilities. I mean, or so basically you can have an LLM and you can. So coming also back to the Strawberry problem, you can use it in the high power reasoning mode, but it might make mistakes on simple tasks like that. It's called like overthinking. But it's also something, I mean again I don't want to say LLMs think like humans, but it is also something humans think suffer from. Like, you know, I know a lot of things. I, you know, I'm a researcher, I can do a lot of things, but I don't know, ask me at 11pm what like a simple math question like 21 times. I don't know, 11 or something like that and I will give you a wrong number. Maybe because I'm tired, my brain doesn't work anymore or in other ways, like I sometimes make really dumb stupid mistakes. But it doesn't mean I can't do these other things. And I think for LLMs that's also true where the counting the R in strawberries, it's almost like, well it's not like you're not evaluating it in the real use case you care about. I would say like this is something where you would not use the LLM and you can actually now I think that's also one of the big drivers of progress in 2025 and going to be in 2026 is tool use. Like the LLMs can use tools and so they don't have to do everything from memory like counting the R's and Strawberry instead of the LLM trying to do that. And I think the limitation here is usually around the tokenization and like basically how LLMs work. But well, instead of having the LLM tell you based on its own internal compute, the LLM can do like a tool call. It could use a Python interpreter like read this as a string and then just use string finding like the letter finding in the string and then it gives you the accurate answer. And so I think we also have to think about it like this, that there are different modes how we use LLMs and you know, your mileage may vary and problems, you know that well. So coming back Also again to 2025 again there was OpenAI, Google and some others that participated in these Math Olympiad type of competitions and they got really, really, really good results. I think. Well, what they call gold level performance. But at least for ChatGPT, OpenAI they didn't use a model that is publicly available. So they used like some custom version. And this usually also involves inference scaling. So for example, the same is true for deep seq math version 2 they had like a paper and they had that same LLM and they cranked up the self refinement steps where the LLM evaluates its own outputs and refines those and has multiple tries at each problem and it boosted the performance significantly. So what I'm also here trying to say is like it depends a bit on how you use the LLM. It's like a small LLM can be really good at being, you know, efficient and cheap at a certain problem but it's not going to solve all your tasks. You can special specialize the LLM more for complicated things but then it might fail at another task. And right now I think, well when you go to chatgpt.com for example, this is like a general purpose model and it has some modes where it has this auto mode like thinking, non thinking, deciding what is the right one. But it is still trying to be you know, like a jack of all trades in that sense that some people use it for summarizing emails, some people mean, you know, ask medical questions, some people use it for coding. And so well it's just like at that it's pretty good. It does a little bit of everything but it's not super specialized. And I think with LLMs right now the most, the biggest use case and the most promising or utility wise the biggest use case right now is coding for example. It's really good at coding and we'll see how well it performs at other things and but yeah, I think I'm diverging here. That was like a question to at the end of the 2026 year what is like some of the tasks? I would say coding for sure it's maybe the boring answer but it's a text problem. It's pretty easy. Not pretty easy but it's pretty approachable for LLMs
B
if you work in it. Infotech Research Group is a name you need to know no matter what your needs are. Infotech has you covered. AI strategy, covered. Disaster recovery, covered. Vendor negotiation covered. Infotech supports you with the best practice, research and a team of analysts standing by ready to help you tackle your toughest challenges. Check it out at the link below and don't forget to like and subscribe. I completely agree with you. From everything I've seen this seems to be one of the low hanging fruit areas where it seems like, you know, we can, we can, you know, dramatically improve, you know, productivity with developers. Here you had a quote I came across in your blog where you said that I still write most of the code I care about by myself without AI. Is that still true? And what do you think the implications are in terms of how developers and development teams use LLMs or don't use LLMs for what they're trying to accomplish?
A
This is, to a large extent, still true. So, but, so I, yeah, I also use LLMs for coding in different. I would say in different ways. The other week, I wrote a program, like, I can do Python coding. I'm a scientific coder. I can use Pytorch, Python, some other languages to, you know, like, for scientific computing. But I'm not really a web designer and I. Well, I can't really build apps. You know, like, if you ask me how to build an iPhone app or macOS app, I have no idea. I've never done that before. But for myself, I automated things all my life, like, I don't know, 15, 20 years ago. I usually write scripts that do something to help me, like rename files or these types of things. For example, for my blog post, I have a workflow where I have all these images in one PDF and then I usually export it, and I have a script that converts it to different crops. It automatically and converts to different file formats. But it's still, like, a bit tedious. I have to go to the location of my script, type the commands and everything. So I thought, okay, I can just make my life easier here and develop a macOS app, like a native app, where I can just drag and drop the file in and it performs the cropping and the conversion and everything for me. And that is some. I used an LLM for that. It took a few hours, maybe one or two hours to just get it right the way I wanted it. But it's something I would have not been able. Would have not been able to do otherwise. Like, it's like, it's kind of like magic. I have Now a native macOS app that does that for me, and I can do that for a lot of things, like everyday lives, things on my computer. For that I would like just use the LLM. I don't really care about how it does it. I mean, I can see, okay, it works, and if it doesn't work, I can give it some more prompts. I have no idea how SwiftUI works. I could probably figure it out, but I always wanted to learn it. But that's something where I don't have time to learn it. There are so many other things that are more important each day than just learning how to do that, because it's not, you know, my main job Essentially. But then for things I care about, for my, let's say, research experiments there, I usually write most of the code myself just to figure out, like, just to think also through the problem and to, through the shortcomings, just to get an idea like, okay, this is. I have a very good idea what I want to do, but I don't often, let's say. Or there are cases where I make mistakes and I usually use an LLM to like just get a second opinion. Like, hey, does it look okay? Like it's something where back in the day I had a colleague or something. Or you do PRs on GitHub, open source projects where there are other people chiming in and it is like another layer between that before you share it with other people. You do like a sanity check with LLMs to make your work better in a sense, like to, you know, just have a, like, kind of like a proofreader second pair of eyes in that sense. It's really good at that sometimes. Also it has suggestions to make things more stable sometimes. I also, I have multiple experiments. They have, let's say, different settings and I have written some code here and then I fixed it here. I now want to apply to all the other scripts, all the other experiments, and then I would use an LLM and say, hey, look at this. What I've done here. Now basically copy that over to all the other ones. I could do that manually, but sometimes it's tedious and this is something that I can easily review because I know the code and I can see what changes it made and oh, that looks okay. I just, you know, check. Okay, next, next, next. And in that sense, I do use LLMs for my coding workflow, but depending on the context. Well, I don't want it to do everything for me in a sense because that is. Then I have no idea what's going on anymore. I want to do it more like in a controlled type of way for the work I care about. That was basically what the quote was also about.
B
Yeah, it makes complete sense to me. And the reason I bring it up is in the context of. There's certainly some alarmist narratives out there that you may have heard. You know, the most extreme one basically saying that, well, you know, computer sciences or development as a whole discipline is going to be obsolete because we won't need to hire these people anymore. Everybody can vibe code and the machines can do it by themselves. And there's a more balanced version of that as well that just says developer productivity is going to be so radically Changed that one developer can have the same throughput as maybe 10 developers could a year ago. Do you see that as holding any water and holding any water in the next two years or is that just sort of fanciful thinking?
A
I think there is a kernel of truth in that, in a sense that it is true. I noticed that myself. The use cases I just described that it just goes faster if I tell, let's say the LLM apply my patch that I have here to the other files or something like that. So in that sense, 100%. Also, for example, for my own website, I added a dark mode button that I otherwise it would have taken me like weeks or months. I mean, I had it on my to do list for like years and I never got to it because I knew it's going to take a lot, a lot of work. And then LLM did that in one day basically. So in that sense, yeah, it is true. It kind of like makes things faster, but it still work. You know, it's like even like with this macOS app I described, it still took me a few hours to do that. And it's a very basic app. Um, and so what I'm trying to say, it's not making people developing code or designing apps or building apps obsolete because it's still work. You can't just say, I mean, maybe one day. But I don't even think that's true. Where you can say, okay, build xyz, it will build a version of that. But usually the first version is not the final version. So there are iterations, you have to test it, you have to use it, you have to tweak it and that is going to be still work. I think what will change is that what I'm hoping is with LLMs, apps like that or websites, they get better than they used to be. I'm hoping it's like people use LLMs to improve, but they would build otherwise, but not to just have more low quality work. I think that's what I'm hoping for the future. But everything is still work. I also noticed that for, I mean my own experiments, it's not just the code, it's just, it's also running the experiments, doing the comparisons, thinking of additional things to compare and that's all still work. I think the same is true. Like there are some people on the Internet saying like, software is basically free now or like, yeah, free in the sense that an LLM can do it. There's no value in, let's say open source projects anymore. And I don't think that's true because. Well, I think I would always take something that has been developed over many years and tested over something that the LLM gives me as a one shot solution basically because, yeah, people spend. I mean, I think the best of both worlds is if people use LLMs to improve things that are already there and build new things, but then iterate over it and just make it better than they would be otherwise, like adding more tests, making it more robust, patching bugs and that type of thing. And it's going to be still a lot of work to do all these things even if you have LLMs. Yep.
B
And it feels, that makes complete sense to me and it, it feels even more reasonable given the conversation we were having earlier about the fact that if we really want to push any of these LLMs to their limits, that probably won't be done through the generalizable, you know, just ChatGPT standard model. It's looking at, you know, some of these, you know, unique use models for unique use cases really. And as soon as you're starting to build out some of those, you need someone who actually understands what they're doing to be able to set those up appropriately. And so I mean, first of all, I want to feed that back to you because that was something I took away from our earlier conversation that it sounds like just trusting a singular LLM to be able to push the frontier in every given area is going to be less effective than being able to have more specific ones for specific tasks. Is that fair?
A
Yeah, that is a good characterization. And it comes back to older problem in deep learning. Like deep learning is the field of training neural networks, artificial neural networks. Because an LLM is essentially at the end of the day deep neural network. And the one problem is basically if you train it on one thing, it will forget other things. It's like, but it's the Same also again, LLMs work different from humans, but it's the same like for us, right? I mean if I just solve math problems every day, I will get really good at math. If I don't do math, then for a few years like do something else. You, you forget things, you know, like it's because you get new information. You don't, let's say hone in on your skills and then it's kind of like that. And the same is true for LLMs. So people, when they develop or pre train LLMs, they're very careful what goes into the pre training mix and also in which order. And then once you have that base model, you Fine tune it usually. And then in addition you have also often domain specific fine tuning. It's like an older paper but I think it was Code Llama which had like a nice graphic on that, how they developed the model in these different stages. And usually at the end the last stage is more like what you really care about. I mean if you want to develop for example a coding LLM you always have to have the coding data already in the pre training. You have to carry it through but, but at the end you will have like a specific phase where you just fine tune it on coding problems. But then it will probably get worse at math or Spanish or something like that. And that's the trade off. So you start off still with a generalist LLM but then you kind of specialize it and that trades off other skills. So it becomes worse at other things basically. So yeah, that's basically how it works.
B
Right. And I can wrap my head around that and it's, it makes the case really nicely for building some of these more specialized models. So I'm curious Sebastian. You know we've got the proprietary generalized models sort of on one side of the spectrum. We've got you know, completely building your own LLM on the other side of the spectrum and then in the middle and you know, feel free to reject my characterization here, we've got sort of the you know, custom GPTs or finding ways to customize some of their proprietary models. When in your mind does it make sense to be customizing a proprietary model versus building your own and you know who are the types of people and what are the use cases that make the most sense when we start to talking about building your own.
A
Yeah, so there are different levels of that. You can essentially start from scratch like just pre training, fine tuning your own model like that's like the most work and that I would say I would not recommend anyone doing except you are a company who, well whose goal it is to build LLMs essentially. Or you are a big company with a lot of money and really want to do something specialized. I remember it's a few years ago now I think Bloomberg had a pre trained a model from scratch and just focusing on their news headlines and writing news articles or something like that. Like at that scale maybe it makes sense but it's still, it's going to cost millions of dollars so it's you know, it's not cheap and some I think, well I think what we are going to see is big like fields like no finance law there. I think it does make sense if you want to develop because like lawyers can't just use ChatGPT for data privacy reasons and other reasons. And so I think like if like there is some like people get together in that field, there's a lot of money also in that field they could spend a few dozens or millions, hundreds of millions of dollars to develop an LLM like a base model for law type of things. But then it's like this general law model and then you still have to maybe fine tune it on your internal data at your company or something like that. But so yeah, one thing would be completely from scratch and like I said I would not recommend it because it's very expensive unless a select few players might want to do that. The second use case would be. Or the second variant would be you take an existing pre trained model and then you specialize it. And I think that is more feasible because there are a lot of LLMs out there in all different sizes. Like all the open weight models. For example deep sea really big model, quent 3 is very popular model or even from OpenAI the GPT oss model, an open weight model, it's free to use and you can then fine tune it but. But it's again not trivial so you will still end up spending tens to hundreds of thousands of dollars if you want to have a really good model there. Depending on the size it might vary, but it's also something as a hobbyist I think that totally out of reach unless you really really are passionate about something. But the problem is if you want to really build something competitive like that, well you have to have a big user customer or use case for that because it's going to cost and it will also at some point become obsolete. For example if I, I don't know when coin 4 comes out but like let's say I today use a Quin 3 base model that's from summer 2025 and I spend a lot of money and time to make it really good. I don't know, a few months there's Quinn 4 or some other model is much, much better. And then my model is completely obsolete and I have to start over again. But yeah, it does make sense still for certain things. And then so taking a base model, fine tuning it and the third use case would be taking an existing model and essentially just customizing it with a prompt. And I think that's what a lot of people do. They use an API where they don't even host the model and then with a prompt you can steer it in a certain way. It's not going to be perfect, but for a lot of use cases, it gets you a lot of like, bang for the buck because you don't have to train your own LLM basically. But again, there are limitations. For example, if you use an API, you have restrictions. You can't use your private data, or you shouldn't at least use your private data because data is public. There were a few instances on the news in 2026 where prominent people did that data leaked. So I think, yeah, it really, really depends on what your goal is. But if I, for example, today need something to translate, I don't Articles from one language language into the other, I wouldn't go out there and train my own LLM. I would just use like pick one of the popular ones like ChatGPT, Gemini, use a prompt for that and see how far it gets me. Basically. Yeah, right.
B
And I'm, you know, chuckling a little bit and I have to ask, I want to come back to something you said which is, you know, you're sort of steering people away from using, from building an LLM from scratch, which I'm chuckling at because, you know, it's something that you've obviously invested a lot of time teaching. And so for the people that you know, you're teaching this to, is it mostly people who are interested in learning this, either as hobbyists or they're interested in learning the fundamentals of how this works, so that then when they use LLMs, you know, in a more commercial setting, maybe it's more proprietary, they understand the basic building blocks. Who's typically interested in this?
A
Yeah, you bring up a good point, because it sounds paradoxical. On the one hand, I'm building these things from scratch and then I tell people not to use them. So I would say I'm not trying to steer people away from it. It's more. I want to set the expectations right. Like just what you said, like who are these people who should be doing it? And I think so. I also know from my experience how much work it is to build something from scratch. It actually better than something out there. So. So if there is someone, like there were some readers, for example, they stumbled upon my book and they are not coders, they are like from different fields. And the expectations, oh, I read the book and I will be able to, let's say there was actually a case with language translation, an LLM that translates documents for me better than ChatGPT. And that's not going to happen, basically, unless you spend, unless you, I mean, LLMs also, usually they're not trained by a single. They are usually trained by a large team. And so to answer your question, we can use an analogy. For example, like something else. For example, let's say you are interested in cars and you want to learn how, you know, you just, you're passionate about cars. You want to understand how the cars work, the motor works, everything works, the steering and everything. But you would not build a Ferrari. Like, you know, it would be very expensive as a single person, I mean, you, that's even, that's not even in reach. You would need a team, you need the factory, you need the design documents. You need a lot of time and money and millions of hundreds of millions of dollars to develop a Ferrari. Instead, you would maybe develop, you know, like a simpler, I don't car that would resemble a car from the 1980s or something like that. Something you can build in your garage. But building that car, you will understand how the Ferrari works. If Ferrari is essentially just a fancier version of that. And so the book is in a similar way, the goal is kind of like to understand how things work. It's for people. Well, I mean, people who maybe want to build these large ChatGPT types of models, education wise. Because right now, I mean, how would you learn that? You know, like, how would you to get hired at the company? You have to show usually that you have some skills already. And so you have to start somewhere. And that could be an entry point to learn how to build these LLMs. But it's also for people who don't even want to build LLMs. They just want to understand what are the limitations. Why does the LLMs struggle with strawberry? The number of letters there, like, what goes into the LLM? How does the whole workflow work? And one way would be you could explain everything conceptually, you can explain everything in words. And like, yeah, the LLM has text, it tokenizes it, it converts it into numbers and they go in and then there's some computation. But that's all very vague and could be misunderstood because there are a lot of like, things, details, you glance over. And I think the best way to really see how it works is, yeah, by actually doing it by going through the actual steps. And these steps also don't lie. At the end you have the working LLM. And so you know, okay, this is actually working. This is not made up. It's not like fantasy explanations. It's really concrete. It works. And that is essentially also part of the goal. At the end, you can develop your own LLM but then the disclaimer is it is a lot of work to get something that is really competitive, basically.
B
Right. Because the setting up the LLM in some ways is not even the hard part. There's the training, there's the pre training, there's the data set you have that like all of that is what, you know, separates your car you made in your garage versus the Ferrari. Right?
A
Yeah. And it's also a good point. In the, the data set you mentioned in the book. I'm only using public domain data from Project Gutenberg, like a simple example book that is a public domain hundreds of years old where because of copyright concerns, like to not just do that, but if you want to build a real big LLM, you need trillions of tokens. Like you need terabytes of data, basically. And that would be impossible for a human who buys the book to do because you would have to buy all the hard drives you have to buy, rent hundreds of GPUs and everything. And that would be really, yeah, not feasible. And in that case, the book is also data wise, focusing only on a very small data set. And the model will then learn how to write text that is similar to this book, basically. But it's not like going to be, you know, your next chatgpt, because that's impossible. It's like, I mean, not impossible, but you would, for a single person, it's kind of not feasible. You would spend a lot of money, a lot of time and you know, and so the goal is really helping explaining things.
B
Right. Let me, let me take this in maybe a slightly different direction. But the data set conversation got my wheels turning a little bit. One of the challenges for a lot of organizations trying to do this themselves is the basically marrying capabilities of an LLM with the quality of their current data that they may have sitting in their enterprise applications. And you know, I think it's easy when we talk about, you know, a book, for example, you ingest a book and you, you know, kind of tokenize the words and, you know, all the characters and all of that. When we think about, you know, whether it's data that's structured in a database, whether we think about it as, you know, the unstructured data that maybe people have around that or metadata. How capable are LLMs right now at making sense of that? And is that something that you see changing in the next year or two, or is that an inherent limitation of the, you know, of the structure of the LLM? And basically the reason I'm asking that is for organizations that are worried about the quality of their data. Is that a problem that's being misunderstood? Is it going to go away soon? Or is that just an inherent limit of this model that they're going to have to come to terms with sooner or later?
A
So let me just try to rephrase to see if I understood your question correctly. The limitation is working with your personal data data you have on.
B
That's right, with organizational, proprietary data, call it customer data or employee data or something like that.
A
Yeah. And so there are different ways you can work with that data. So usually the limitation of LLMs, it's still a limitation, but there are so many tricks now that this becomes more feasible. And the limitation was usually that you can only fit so much into the context of LLMs. And see, this is where a book like from Scratchbook would come in handy, because you would see or understand, like, what is the context? How does the LLM process the data? And there's a limitation to the size. If I crank that up, it's going to be really expensive or it's sometimes at some point exceeded. So I can't just put everything into the context. I have to be smart about it. And so traditionally, I mean, idea, in an ideal world, you would put everything into the context. But that, that doesn't work because it's too expensive. So people developed something called a Rack rag, which stands, I think, for retrieval, augmented generation. And so that's like a application layer around the LLM where you take the documents you have that you care about, and you chunk them up, put them into a database, and then the LLM you query the LLM, it produces like a vector embedding, like, let's say, a compressed version of your thing, like the query. And it looks for more like simple, like let's say dot product math. You look for similarity to other chunks in your database and then you retrieve that chunk and you hope that this is going to be relevant. So it's essentially like a smart lookup. But you could also, in simpler ways, think about it like that you have a query and then you chunk up your document into smaller parts, and then you go through it iteratively and try to find what is the most similar one. And then can the LLM use that to answer the question? So, for example, let's say you are in a law firm and you say, what was the case in 1983 where XYZ happened? And you try to pull that out, and then the LLM can use that as part of the answer. Basically, it's not Perfect because you are chunking the document and it's not the full context, you have always little chunks. But one of the, I wouldn't say breakthroughs, but one of the. It's because it's more like a continuous development. But one of the progress, parts of the progress we've seen in 2025 was that contexts are no longer, the supported context sizes are longer. So we are now, it really depends. But there are like, even like open weight LLMs like Nvidia Nemotron, they can do up to 1 million tokens. Of course it's going to be more expensive, you need more GPU power for that. But I think even like, you know, ChatGPT Online, the version, I think it can do 100,000, 200,000 tokens. And I think that's about the size, I mean, I may be wrong, but I think it's about the size of like one of the Harry Potter books, the first one or something like it's a long context. And so for many people this is actually sufficient. So you don't need any specific fancy application around the LLM to process that. You just put it in there. And there is the problem of, it's called like the needle in the haystack problem where what people found though, I mean there are multiple problems, but there's also something related to attention syncs where the LLM kind of focuses more on the beginning of the text that you put in. But then also the needle in the haystack problem is when people develop these long context supporting LLMs, they have, let's say you have a question like, I don't know, like some, some factual data you want to retrieve. It's kind of buried in these hundred thousand words. The LLM should find it. And the longer the context, the harder it is for the LLM to answer correctly because there's a lot of noise, it sometimes gets distracted. But I mean it's similar for us humans too. Like the more stuff you throw at it, the more complicated it is to figure it out. And so that's kind of like, like where people, I would say where companies made a lot of progress last year to make that better. So it kind of most of the time works and I would recommend for most people just try that first instead of building something. It's also like how I approach problems. You do the simplest thing first. You write down what performance you get and then you try to iterate and tweak it and try other things and see if it's better. But before using the most complicated Thing always try to, you know, do the obvious simple thing and then maybe that gets you already most of the way there and then you can iterate later and see if it's worth the effort to iterate there. If it's worth spending three months to build something around it that can do it maybe 1% better. One limitation though is that the case you described, you may not want to do that with ChatGPT because, well, your data will be online. I think beginning of 2026. There was like a case where, I mean, based on the news I read that a government employee uploaded some sensitive documents that got leaked. I think they found out because it appeared in answers from other people. It was like a confidential, high security type of data. And so yeah, I think as far as I know, I mean, I don't want to say anything to like anything that's wrong, But I think ChatGPT for example, does use your data for training their models. They don't really like, you know, I don't think they specifically single out specific data and like publicize it or something, but it's implicitly used for the training. They try to anonymize everything but well, if you upload it to ChatGPT you have to be aware it might be part of the training data. And so you can't do it for everything. There are laws for certain fields where you can't just, you know, conscious share patient data like it's sensitive data and. But then again you can use a local LLM that runs locally, for example. I mean most LLMs that run locally support up to 160,000 tokens, which is like again the Harry Potter book. It's a lot of data and there's special ones like Nemotron 3 from Nvidia has 1 million tokens. So there's always something you can do locally that gets you most of the way there. And at the beginning, one more thing, let you interject. There's like a paper that I found really interesting at the beginning of 2026, let me see, I think it was called Recursive Language Models. So the title of the paper. So what they do is it's kind of like a clever trick. They have like the query and they want to answer, let's say a question like, similar to a rag setup where you want to, you have a lot of data to process a whole document base a whole folder of, let's say a lot of data basically that can't fit into the context. And so what they do is they parse so they instead of letting the LLM do everything. They parse the input into like a string in Python in a coding environment and then let the LLM come up with ways to like to chunk it up into sub problems. So for example, if your problem is, let's say summarize, I don't know, let's say summarize all the chapters in this gigantic book or something, you have 12 chapters. Instead of feeding the whole book into the LLM and trying to have all the summaries for the 12 chapters, what you say, what the LLM decides to do, oh, maybe I can just have one chapter each and process it in 12 different parallel, let's say execution loops. And then I just pull together each summary from all the 12 chapters and write an overall summary or something like that. So like just chunking it, it's not really like rocket science, it's really just using a coding environment where the LLM can use a tool to chunk up everything itself. But that gets you already most of the way there. And they did that also with ChatGPT. It doesn't have to be a local LLM. You can do it with APIs, with tool calls. And so there's a lot of, I would say in the field there are a lot of workarounds where you have limitations in the LLM itself, but you solve the limitations by doing clever tricks in the surrounding API layer or like the application basically.
B
And does that fall under the bucket in your mind of reasoning enhancements or is that something else?
A
I would say it's something else. It can be reasoning related. So if it's like a problem that requires good reasoning capabilities. But I would almost, if I had to group it into something, it's more like general inference scaling, for example, where I mean, it's not even inference scaling in that sense that it makes it more expensive. You're just chunking it up basically into separate, into separate sub calls. But like you said, I mean it could be related to reasoning if your query is a reasoning query. But here I think it's also a bit tricky because if you have like, let's say your task is to solve a math proof, like something really complicated, where you have a lot of sequential steps where you derive all the questions like the individual intermediate steps, I think that would not be a good case for the method because the method runs things in parallel, like sub calls and parallel kind of like independent of each other. And reasoning models, they usually benefit from this so called chain of thought where they think through a problem, which is more like sequential, right and the reason
B
I'm asking, and it probably gets a little bit outside of your field of expertise, Sebastian, but I'll ask anyway, is when we think probably in other, you know, quote, AI or automation applications that start to get outside of LLMs in the transformer model, but maybe have some overlap with them. You know, I'm thinking about, you know, agentic AI or some of these AI use cases where, you know, these different kind of, you know, AI systems work with each other to understand what a task, you know, what the outcome of a task should be and actually, you know, orchestrate to get it done. It seems like there's some overlap in terms of the processing here and being able to understand, you know, what's more likely and chunk it up, is that, you know, does that come into play or is that completely off base?
A
I do think it's a very good point. I do think it's kind of like related in a sense, like to understand the input, process it and present it in a way that can then be processed. In this particular case, instead of interacting with other agents, it's kind of like interacting with itself. But it could be other agents too. I mean, it's basically how can I divide and conquer my problem here? In a sense. But then it calls itself on the chunk problem, but it could be as well like delegating it to other types of models. And I would say that is one of the biggest progress drivers in the recent months. That, yeah, the tool calling, like using different tools for different things and stuff, trying to do everything itself. And I think that's also where a bit of, let's say the magic comes in when you use something like Gemini or ChatGPT or Claude. It's not just the LLM. I mean, it's just a hypothesis. But I do think if you take something like Deep SEQ or some other open weight model that is really good and you would put that into whatever like framework they have behind the scenes, let's say Gemini or ChatGPT. I think it would be almost like identically or similarly good. Like. So what I'm trying to say is I don't think the LLM itself is necessarily the differentiating factor anymore. They're all kind of like similarly good. But what is really important is how you format things like how to deal with context and how to. The history and then how like the previous conversation. The previous. Yeah, like the previous back and forth, how you process that, how you use tools and that that stuff where there's a lot of work that goes into making it really, you know, robust. I mean, even I, I don't know exactly because it's proprietary how they process input at, let's say, chatgpt. But when, if I sometimes type something, I make a typo and it's sometimes even like a relatively technical term. I have a typo in my prompt. But often I don't even care about fixing it because I know, okay, it's kind of dealing with that already. So it' instead of all delete, delete and fix that word, I just have it with a typo in there. And I can see based on the response, because often the response involves repeating part of the answer. I can see it fix the spelling of my word. So I don't know if it's necessarily like the LLM itself because it has all the different spellings or tokens or subtokens or there could even be like, you know, like processing layer, like fix obvious typos because that enhances the performance of the LLM. So. Because then instead of having to have a huge vocabulary or subtokens for all the different ways someone can misspell a certain word, you can just have like a dictionary fix, like, you know, like a simple fix, and make it the nest work for the LLM itself. So I think there's a lot of magic like that happening behind the scenes to improve the performance. And I think that's why you see that the performance of something like ChatGPT or Gemini is better than something you would run locally, where most of the tools that run LLMs locally, they kind of like run it bare bones, without much stuff around it, basically.
B
And I'm glad you brought that up because it still feels like there's so much tied to the quality and the clarity of the prompt and there's some cosmetic fixes it can do for you. But I'm curious on your thoughts on this, Sebastian, around performance benchmarking. Because if we're talking about, you know, variable output based on the clarity of what you're asking for, a lot of performance benchmarks are like, there's just. They seem like they're pretty clear. They're trying to do something quite clear. And, you know, to bring back a point you made earlier, it also feels like you said like some of these, you know, organizations are not using their publicly available models or there's some inference scaling that's happening.
A
It's.
B
That's, you know, not really there behind the scenes. You know, how much weight do you put into performance benchmarks at all right now? And how much should people be Looking at those as a measure of the capability of these tools going forward.
A
Yeah, that's a good question. I think that's one of the biggest problems in the field, how to evaluate models, let's say fairly and so. Well, there are different ways, like benchmarks. Why is there different types of benchmarks? Top of my head, I would say there are three or four ones. So let's see if I can come up with what I have in mind. So that one is basically more like the classic mmlu. So that is like a multiple choice benchmark. And so that one is basically, you know, like a trivia question almost or like, you know, who wants to be a Millionaire type of question. It's like a question and then there are A, B, C and D. And the model has to select one of these answers and people use that usually to test the knowledge. So does the, does the LLM know about world knowledge and math and like basic things? But it's ultimately not how you use an LLM. It's basically you would never give it all the solutions and say, give me A, B and C and D. You would, you know, ask Freeform, for example. But then the free form is really hard to evaluate programmatically because there are different ways to spell it. You can have it as a word, as a sentence. And so that's why they do A, B and C and D. And so like multiple choice. But then it has a lot of limitations. And so, yeah, I mean, I think that comes also like, it's like a minimum threshold. Like, I think an LLM should have a minimum score on these benchmarks to be okay. But then at some point if it, it's passing a minimal threshold, I don't think it matters if it's 90 or 95% or that basically also we have to keep in mind also some people run these benchmarks with and without tool use. And the GPT OSS model, open weight model had a nice chart about that. When you have a model and you allow it to use tools, then it gets much better performance than not. Like for example, if you ask a model who won, let's say the soccer World cup in 1998, if it can try to remember it, but it can also use the tool and look it up on the Internet, let's say on the official website, and then you increase the accuracy on these types of things. So that is one type of benchmark. And I think, well, it is like a minimum threshold. It should be able to do these things and answer these correctly. But it doesn't really tell you how, how the LLM performs when I actually use it and query it and prompt it in different ways because they can be also sensitive to the prompt format. So another way is like these so called leaderboards where they have like a website and then you can use different LLMs for the same prompt and then you can compare the answers and you say, oh, I prefer this answer over this answer. So that sounds like actually more related to what we would care about. Like let's say FGEMINI and chatgpt side by side. It answers a question. Okay, oh, I actually prefer this answer. And if you do that a lot with a lot of people, with a lot of pairwise comparisons, you can use a statistical model, like a Bradley Terry model to convert it into a ranking, like into numbers like 1, 2, 3, 4, 5, 6, you know, like a. So you can say, oh, this LLM is on top one. But this also has limitations because people really prefer also a certain style. Like people are sensitive to the answer style and not necessarily the correctness. So because if I ask a question to an LLM, I usually don't know the answer. Like if I have like a challenging math problem, otherwise I wouldn't ask really. Right. And so then you get different answers that say, and you say, oh, I prefer this one because it's, it's maybe nicer explained or I like the language better, but it doesn't mean it's more correct or something. So. So with leaderboards it's kind of like sensitive to a certain style. There was like an incident, also not incidents, but like a thing with llama4 last summer where, I mean, I don't know the full story because I'm not affiliated with the companies. I don't know the behind the scenes. But what was reported was that they used a different model on the leaderboards and it got really high leaderboard scores. But in reality it was not a good model because it's all like, you know, what is it called? Like, it's like there's no substance, let's say always behind it. It's more like the glamour, like it looks better than it really is behind the scenes than when you actually use it on hard tasks. And so it's always a bit challenging. Another way to evaluate models would be verification. For example, you can have math or code or something like that, like something that you can verify, say math and you have the correct answer, like it's a numeric answer. And there are tools where you can compare to numeric answers. You can usually how it works Is you tell the LLM, hey, I have this problem, you know, and then explain and then write the intermediate, sorry, write out whatever you need and then put the final answer in a box, like answer box. It's usually the latex boxed format. And then you can programmatically retrieve this answer and compare to the reference answer. And then you can use a calculator, like you can use a calculator to say, oh, these numbers are the same or different. Like there are different tools, like, you know, WolframAlpha. I use Sympy. It's like an open source program or a library for Python where you can symbolically compare solutions. And that is really, I mean, this is accurate, right? I mean, if the model follows the instructions and puts the answer in the prompt, I can really, with a lot of certainty there may be some parsing errors, but like 99% of the time you have a fair comparison. But the problem is it's clear kind of limited to get to math or code. With the code compiles, it's harder to evaluate the whole answer at its whole. You can only evaluate the last answer point. And so I think I listed three. So one more that comes to mind, one way to evaluate is using judges, LLM judges. And so it's basically using another LLM and then you provide a rubric like evaluate, I don't know if the answer is correct. Evaluate the intermediate steps, if they make sense. And you give a different criteria and you say, given these criteria, evaluate this answer, given that other reference answer and you can say, give me the answer, like the quality of the answer on a level between 0 and 10 where 10 is best. And then you can do that for different LLMs and average the numbers and say, oh, this model gets 8.5 on average. This model gets 9.5 on average. That is a numeric way you can kind of evaluate freeform answers. But then again, there's always the catch. The catch is, while you're using a different LLM for that, and the LLM might not correctly always evaluate things or fairly, it might have a bias towards a certain answer style. And so a long story short, each of these different methods to evaluate LLMs has its shortcomings. And the best way is to look at all of them together in context, not like a single one, and try to see, well, what are the weaknesses and strengths of the different LLMs on these benchmarks. And that's really hard. And in the end they all look kind of similar when you see like releases. And really you have to use it basically and see, oh, that works for me. That doesn't work for me. You know, it's like, it's really hard. It's one of the biggest problems right now to have a fair comparison.
B
Basically it makes sense and it's, you know, it's interesting because there's at the frontier, there's, you know, these small differences and okay, do I re, do I prefer this style slightly more than this other style versus is it getting the answer fundamentally wrong? And you know, do you use, you know, an LLM judge or like how do you, how do you best answer that question? I am curious, Sebastian. I have to imagine, you know, you've been doing this for a while and I have to imagine the interest in this from, you know, non technical people has, has increased in the last couple of years. What do you see as like, what do you personally see as some of the biggest misconceptions people have about LLMs and how they function and how they can, you know, get, get value and use out of them?
A
Oh, it's a good question actually. I think top of my head. Well, I, I don't think there's a huge misconception, but it comes more like down to expectations. Again, like what we mentioned earlier that well, it's really hard, it's really expensive to train LLMs. I think that's like the, the misconception is, oh, I just have to, you know, do XYZ and I'm, I can do that on a weekend basically. I think that's, that's it. I think it's also a person can understand everything on a basic level. That's not a problem. But I think the challenge compared to previous say problems in the field is that you need a whole team to do it. Like you need an expert in GPU infrastructure. You need the person, the researcher who implements the core architecture. You need to run experiments. It's like it's not usually something someone can do by themselves or on a weekend. I mean, understanding, yes, but doing this whole thing is a lot of work. Looking for example at, I mean, but most people don't share these types of details so. But there's a, I think it was a Llama two or three paper. They had like a very nice section on what it took to train that model back then. And I forgot all the let's say nitty gritty numbers. But they had like, they reported, okay, we trained it on so many thousands GPU of GPUs, but we had so and so many failures each day because it's also like Hardware failed like crashes, GPU crashes and then you might lose your whole model. And so you have to checkpoint it or you have to build in robustness to when one GPU fails, it doesn't crash your whole million dollar run. Right. So it's like there's a lot of that and this usually requires a whole team basically. It's not like a single person because you have to monitor everything all the time. And yeah, you can have notifications coming up and everything, but it's like a full time job for a lot of people to develop an LLM. Yeah, so I think that's thing that has changed compared to machine learning or deep neural network training before. And that's also why you see not that many academic LLMs anymore. For example. There are a few, but for example, back in the day, I mean with convolutional networks, image classifiers, everyone was able to do that by themselves in a small lab because university labs are usually, you know, like a handful of people. But now, yeah, the budget is, you have to have a huge budget, a whole team, a lot of time, a lot of expertise. It's a lot of things. And that's why this is now mostly restricted to companies basically that have resources basically.
B
So I'm going to play that back to you and let me know if I'm getting this right. But it sounds like just based on how much this field has evolved, how many resources the biggest players have, the amount of, of staff required, the amount of compute required, the amount of, you know, just cost required and just raw data required to do this that, you know, if you're looking to do this as a small shop or as a hobbyist, it is valuable to learn how to build an LLM from scratch. That's something that will help you personally or professionally, but it's probably not something in most cases that you're going to then implement, you know, and try and do at scale because it's just not as practical as it may have been five or 10 years ago. Is that fair?
A
Yes, that is fair. Also with a caveat. So I agree with you. But if you fine tune, so there's like the pre training and the fine tuning, if you use an open weight LLM that is already out there and you build on top of that, then it becomes easier. So also just to put in some numbers. So taking the deep SEQ model again because it's just such a popular model, the version 3 and R1 that came out December 2024 and January 2025 because they had some nice numbers in there. If you would rent the GPU they used, they put in the price for like I think $2 per hour. It would have cost them $5 million to train the Deep Seq version 3 model. And this is not including any cost for the like, like staff, like the salary and everything or like the building and that stuff, like the physical building, like the rent for the building to have people there. And it also doesn't include the failed runs because when you do something like that you will fail a lot of times until you find the right configuration. A little trial and error also. But if you only take the solution that worked and you run it, the pure GPU cost would be around $5 million, which is a lot of money. But then if you look at the fine tuning, the reasoning training they had, it was more like on the order, I forgot about like hundred, two hundred thousand dollars, something like that much, much, much lower and that is much more approachable. Now this is for 671 billion parameter model. It's a very, very big model. Now if you go down in size from 600 to let's say 20 billion, you could probably do something really good with a few thousand of dollars and then it again would make sense. But again it's not going to be something on the weekend. It usually requires a few weeks or a few months to get really good results. But once you are confident and can do it, you can then later on swap out the element, can repeat the procedure. So once you, you learn the workflow, it's actually easy to. Yeah, once you get going it's kind of easy to apply your skills in a sense on other LLMs. And with that I wanted to say is yeah, you can do interesting things. And there are also APIs now, I mean again not affiliated with any of the companies, but there is, let me think, I think it's called Thinking Labs, Thinking Machines. There's like a company from one of the OpenAI co founders that has an API where you can fine tune and customize LLM so you don't have to have the GPUs yourself. You can use like a ChatGPT API. You can use an API they have, but for fine tuning you can give it the data and the settings and then run it on cloud machines without having to worry about managing let's say GPU failures and that type of thing. But again I think here it really, really helps to learn how to build it from scratch first to understand what you're doing with these different settings. For example, in my book in the reason build a reasoning model from scratch book. I have a chapter on reinforcement learning with verifiable rewards, which is essentially the deep SEQ reinforcement learning. And I'm only running it on data set. It's called Math500, where there are. Sorry, Math500 is a test set that is popular for benchmarking. But I have like the training set, it is 12,000 math problems that are not overlapping with this and I'm training on that. And just to give you some numbers, it takes about. So there are 12,000, about 12,000 problems. It takes about a day to train for 500 steps. So on one GPU though, only. But if you would do 12,000, it would be 10, about 20 days or something to just train it. And it's a small model, but doing that on a few steps, you understand what is verifiable rewards? What is happening there? What am I comparing against? What are the reference answers? What is it checking? What are the different settings? There's something like number of rollouts, batch size. There are different settings in the trpo, like clip ratios and everything, and like the epsilon clip ratio. There are a lot of little tweaks and knobs. And by building it from scratch, you know what all these types of things mean. And then once you understood, oh, that's what I'm doing, that's what's happening here, then you can, for example, go to an API and say, oh, I'm actually confident I know what the setting is. I used it before. I understand it's just a knob I have to tune. I will try this setting and this setting and you know, like where it helps a lot to build from scratch, to just get that intuition before you have a production system.
B
That's great. So just as we start to wind down the conversation, I did want to ask you, Sebastian, what's your kind of takeaway advice for technology leaders who may be interested in learning more about LLMs or ensuring their teams better understand LLMs? What's kind of the main takeaways you would give them in terms of how they should move forward into this space?
A
Yeah, well, shameless pluck here. I would say doing coding an LLM from scratch, I mean, not from scratch without any template, but for example, my book or something where you do have a guide that guides you through it. And if you are comfortable with, with Python, Pytorch, this is something you could technically do on a weekend or maybe four or five days, and then that gives you like the foundation and then there's a lot of jargon out there, mixture of experts and like different attention mechanisms, group query attention, multi latent attention. But it's all derived from the original GPT model. So once you understand the core building block, it kind of like demystifies all the other things. They are all basically build on top of that or like flavors of that. And I think like, I think it is important, I mean that's what I also like to do is to build a foundation of something and then it's always easier to look something up. That is. And you can see how it evolved from there compared to starting. Okay, I have no idea how anything works. What is this mixture of experts? And I think so that is a good point because I think I have actually there were some prominent people asking me, for example about advice on mixture of experts. Would that be something worth investing in? Like more like a big picture investment as a person who let's say is not building LLMs. And for example the misconception would be, oh, mix of experts. I can have, I can train different LLMs and then I can combine them together. I can have a math LLM and I have a Spanish LLM and then I don't have to train all, everything together. And I think that sounds very plausible. But then if you. It's not how mixture of experts works though, it's a very different thing. And I think by building the foundation it helps you really demystify these misconceptions basically like by. So mixture of experts is essentially a module in the LLM that is like a feed. It's called a feed forward layer. It's just like basically a weight matrix. It's like a classic multi layer perceptron. And if you have that connection, you know, you can't just swap anything in. It has to be trained end to end. And the experts are also more implicit. You can't tell, okay, this is just doing math and this is doing Spanish. It's more like, yeah, it's maybe this one is stronger at math because it gets more activated when there's a math problem. But it's not like this discrete distinction, it's more fuzzy. And I think these are things that are really hard to understand by not looking at the fundamental architecture building blocks. It's really like even right now I'm trying to explain it in a sense where it's more like big picture but big picture doesn't capture it. I think that's maybe my message is like, I think there are a lot of visualizations and big pictures that are very, they're not incorrect. But they're very fuzzy, they're very vague. And then they can lead to misunderstandings because they don't show the full picture. They try to do a big picture. I think really doing like, you don't have to learn about every nitty gritty detail like GPU optimizations. That's like a. I guess it's like a detail. If you don't really train on LLMs, you don't have to necessarily understand NVP4 like floating point 4 position and how that's implemented. But having a big picture you would understand. Okay, 4 bit precision is less than 16 bits. So in that sense it's cheaper. But it also approximates because you can't store as much information and like understanding and appreciating these type of nuances, I would say.
B
Yeah, right. And there's just no substitute to your point for like getting your hands a little bit dirty and.
A
Yeah.
B
Seeing how it actually operates in an environment.
A
Yeah. Because I think that's like the. It doesn't lie, it's the truth. It's like, it's not hand wavy, it's really concrete. And this way you don't have then these types of. I wouldn't say knowledge gaps because. Well, even if you build something from scratch, you don't build everything in all different directions from scratch. You focus on the core. But the core itself, it is the true core. It's not, you know, like a vague concept anymore. And so it does. It's like almost like self verifying, you know, it's. Yeah, it's like when you have math equations and math. Sorry, but like when you derive things from first principles, you. Like there are certain formulas, you can just use them and memorize them or you can derive them from first principles and then you can see if you do that from first principles. Oh, this formula, I mean, you don't have to do it all the time. It's just you do it once but then you know, okay, this is actually rooted in something and these are the parts. And this is why it is. Because I derived it this way. It's not just a fantasy formula that someone came up with. There is like a reasoning or like a. Yeah. Process behind it basically. And then. Yeah. I think that answers a lot of questions than people would have.
B
Well, and it, it feels like this is a space where there's so much misinformation or so much hype and there's so much opportunity for people to misunderstand how it works that there's a lot of benefit to being able to just actually go to the source and see it for yourself.
A
Yeah. One more example just came to mind when you were saying things about the fundamental misunderstanding of certain things. For example, there was this model. I think it's very cool research. It's called the hierarchical reasoning model. And there was also something called the tiny reasoning models, and that came out last year in 2025, and I think they even won the ARC benchmark. ARC is like a logic puzzle type of benchmarks. And it got huge. A lot of hype was huge in every, like, in the media and everywhere, basically. But I think it would have helped if people, like, build something like that or like, follow the paper, because it is a really, really cool model, but it's not an LLM. They compared it to ChatGPT. It's a tiny, like, transformer model, and it is only working on this particular task. You can train it to do, let's say, sudoku. You can train it on these ARC benchmark puzzles, but you can't say, translate my sentence from Spanish to English because it can only do that one thing. And I think here it will help if people, yeah, they understand a bit like, what is the architecture, how is it trained? And then you see these limitations, basically. It doesn't mean it's. It's a bad model. It's actually pretty impressive for its size. But then it wouldn't be fair to compare it to an LLM because LLM can do a lot of things and it is a general model where this is very, very, very small, specific. And I think these are things where if you understand the fundamentals, you can kind of, like, escape this type of hype. You can kind of. Oh, this is. Yeah, this is clearly just a news headline. They're trying to get attention with something and. And I think there's a lot of that, like you said, there's a lot of this type of hype where, you know, it sounds good, it gets like, I guess, hype and clicks and everything. But often, yeah, if it's too good to be true, it's often too good to be true. There's usually some catch, and it's easier to find that catch if you know the fundamentals, essentially.
B
Well said, Sebastian. I wanted to say a big thank you today for coming on to the show. It's been a really interesting and insightful conversation. And, yeah, I was excited how deep you could take us on some of these topics.
A
Yeah, thanks so much for inviting me. I had a lot of fun talking about all these technical things. You know, that's what I do for work, what I do for a hobby. And I had a lot of fun. And so yeah, thanks for inviting me here. And yeah, it was fun.
B
If you work in it, Infotech Research Group is a name you need to know no matter what your needs are. Infotech has you covered. AI strategy, covered disaster recovery, covered vendor negotiation, covered. Infotech supports you with the best practice, research and a team of analysts standing by ready to help you tackle your toughest challenges. Check it out at the link below. And don't forget to like and subscribe.
Podcast Summary: Digital Disruption with Geoff Nielson
Episode: LLMs in 2026: What’s Real, What’s Hype, and What’s Coming Next
Date: February 23, 2026
Guest: Sebastian Raschka – LLM Research Engineer, Author
Host/Interviewer: Geoff Nielson, Info-Tech Research Group
This in-depth conversation explores the rapidly evolving landscape of large language models (LLMs) as we enter 2026. Host Geoff Nielson speaks with acclaimed LLM researcher and educator Sebastian Raschka to demystify current capabilities, ongoing industry trends, and common misconceptions, separating practical progress from pervasive AI hype. The discussion gives both a technical and practical perspective, yielding insights for technology leaders, developers, and anyone seeking to leverage (or understand) the next wave of AI disruption.
(Timestamps: 01:23–05:57)
(05:57–12:23)
(13:28–18:02)
(21:43–29:12)
(29:12–33:59)
(35:26–43:43)
(43:43–48:58)
(48:58–57:59)
(57:59–61:22)
(66:03–73:50)
On 'Reasoning' in LLMs
“Reasoning is in quotation marks…it shouldn’t be taken too literally, like how humans reason.” (Sebastian, 01:53)
On LLM Shortcomings
“Counting the R in ‘strawberry’... you’re not evaluating it in the real use case you care about.” (Sebastian, 07:37)
On Coding Automation Limits
“It’s not making people developing code…obsolete, because it’s still work. You can’t just say, ‘build XYZ’ and it will build a perfect version. Usually, the first version is not the final version. There are iterations, you have to use it, test it, and tweak it…and that is still work.” (Sebastian, 18:02)
On Model Benchmarking
“Each of these methods to evaluate LLMs has its shortcomings…in the end, they all look similar. You have to use it and see what works for you.” (Sebastian, 49:55)
On LLM Development Realities
“It’s not usually something someone can do by themselves…looking at Llama 2 or 3—thousands of GPUs, constant failures, checkpointing, monitoring. This is not a weekend project…it’s a lot of work, and that’s why now it’s mostly companies with resources doing it.” (Sebastian, 57:59)
Sebastian Raschka brings a pragmatic, deeply knowledgeable perspective, urging teams and leaders to get past AI hype and build essential understanding from first principles, while recognizing the real costs, limitations, and best practices for leveraging LLMs in 2026 and beyond. The future is likely to be shaped not by revolutionary new architectures in the short term, but by smarter training, usage, and orchestration—grounded in practical, demystified understanding.