
A
Not all tokens are created equal. And there is a way to look at token value. There are two key factors that impact token value. One is the intelligence embedded in the token, or how much intelligence does the token carry? And the other is how fast does it arrive?
B
Welcome to the Nvidia AI podcast. I'm Noah Kravitz. So I'm here with Shruti Kopakar. Shruti is a member of the accelerated computing team here at Nvidia, and she focuses on inference. And we're here to talk about tokenomics as data centers become AI factories and produce intelligence for the new industrial revolution. This word, tokenomics, has been floated about. It's a useful term, but maybe we can break it down with your help, Shruti, so that it's really something that business leaders can understand and take into practice.
A
Yes, absolutely. Well, first of all, thanks a lot for having me, Noah.
B
Thank you for.
A
I am very excited to dig into the economics of AI, or tokenomics. And as you said, it is a term that gets used quite a bit, and I welcome the opportunity to help define it, so to speak. So the way to think about tokenomics is it's about how tokens are valued, supplied, consumed, and monetized. And what that essentially maps to is, first, token utility, which is all about token value. Then there's token supply, and this is where your AI infrastructure decisions are, right? Thinking about what infrastructure to invest in that will maximize your token output while minimizing cost. Then there's token demand. This is where customers and organizations think through what is their number of users, how many use cases, what types of use cases. So really sort of mapping out the volume and velocity of the tokens that they need. And then finally there's token monetization, which is taking the tokens and turning them into business value. So those are sort of the four pillars for tokenomics, and it's super important to understand all four of those and how they relate to each other to be able to deploy AI successfully.
B
So let's start at the top then, with utility or value. How do you define the value of a token? Are all the tokens worth the same? Do they have differing values? Is there a better way to look at it? How do you approach that?
A
That's a really great question. And you're right, actually, that not all tokens are created equal. And there is a way to look at token value. There are two key factors that impact token value. One is the intelligence embedded in the token, or how much intelligence does the token carry. And the other is how fast does it arrive, which is essentially the interactivity. So to unpack that a little bit, the intelligence of the token is dependent on the model that produced the token. So more complex, more intelligent models will produce tokens that in general have much more intelligence. And then it also depends on the context that the model looks at. And generally speaking, the longer the context that you sort of allow the model to look at, the better the accuracy, the better the intelligence of the tokens. Now I say generally because there are cases where if the context increases too much, then the model quality, the output quality, can degrade. But I don't want to rabbit hole into that. Generally more context does kind of equate to better intelligence.
B
Right.
A
So that's one aspect. And then as I mentioned, how fast the token arrives is the token interactivity, which is essentially tokens per second per user. So it's the rate of token generation. And so if you look at token value as a spectrum, on the one hand you have these basic models with shorter context generating tokens at not that fast a speed. And then on the other extreme are these more complex, more intelligent models with much larger context, generating tokens really fast. And across that entire spectrum are your different use cases and how you map those use cases to the token value. So of course there is an absolute way in which to think about the token value, but there is also a relative way with respect to the use cases in how to think about it. And we can unpack that a little bit if you want.
B
So is it fair to say that the value of the token is tied to the task at hand as well?
A
Yes, and that's exactly sort of what I was trying to get at when I said that you have to think about mapping the right use case to the token value. So as an example, like I said earlier, tokens generated by more complex, intelligent models are more valuable. But that's in an absolute sense. Relatively speaking, your use case may not require that more complex, more intelligent model, and then that additional value is completely useless to you. One example of this is domain-specific applications, where in a very narrow context a post-trained, that is, fine-tuned, small language model, so a much smaller model, can give you just the value you need, in fact in some cases even better accuracy for that given task. So you don't always need the big, large models. And so relatively speaking you need to map where on the spectrum of token value your use case sits. The same thing is true for the interactivity piece: agentic applications absolutely need highly interactive tokens, but you may have applications like chat interfaces or enterprise search which don't need that level of interactivity. That is very critical when you are thinking through your AI deployment decisions of where to map your use case to what token value.
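One way to picture the mapping described above is a small selection routine: given what a use case needs in task quality and interactivity, pick the cheapest model tier that satisfies both. This is only a sketch. The tier names, the 1-10 `task_quality` score, the speeds, and the relative costs are all hypothetical placeholders, not figures from the conversation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelTier:
    name: str
    task_quality: int           # hypothetical 1-10 score on the target task
    tokens_per_sec_user: float  # interactivity: tokens per second per user
    relative_cost: float        # hypothetical cost per token, arbitrary units


# Hypothetical tiers spanning the token-value spectrum.
TIERS = [
    ModelTier("small-finetuned", 5, 40.0, 1.0),
    ModelTier("mid-general", 6, 60.0, 3.0),
    ModelTier("large-reasoning", 9, 150.0, 10.0),
]


def cheapest_tier(min_quality: int, min_tokens_per_sec: float) -> Optional[ModelTier]:
    """Return the lowest-cost tier meeting both requirements, or None if none does."""
    candidates = [
        t for t in TIERS
        if t.task_quality >= min_quality
        and t.tokens_per_sec_user >= min_tokens_per_sec
    ]
    return min(candidates, key=lambda t: t.relative_cost, default=None)


# A narrow enterprise-search use case does not need the largest model.
print(cheapest_tier(min_quality=5, min_tokens_per_sec=30.0).name)   # small-finetuned
# An agentic use case with high quality and interactivity needs does.
print(cheapest_tier(min_quality=8, min_tokens_per_sec=100.0).name)  # large-reasoning
```

The point of the sketch is the direction of the decision: the use case sets the requirements, and paying for value beyond them buys nothing.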
B
So when a business leader is thinking about demand and thinking about mapping out tokens to use cases, and the different use cases have different values associated with them, what's a good approach for someone who's looking at what their org is doing, what their different team members need, how do they start to get a handle on? Well, how many tokens are we going to need to produce and how many of each kind?
A
Yeah, so use cases are extremely important when thinking through token demand. And there are three layers in which you can think about this, with improving levels of forecasting accuracy, if you will. So the basic sort of back-of-napkin math is: look at how many users you have, how many requests or sessions the user is going to initiate in a given day or month, and then how many tokens you need per request or per session. Those three numbers put together will give you your base demand for a single day or a month or whatever is your time period of analysis. Now that is the base, and a very, very simplistic look at it. There are multipliers that you do need to account for that will dramatically change your understanding of your token requirement. And a couple of those are, number one, are you using reasoning models? As we know, reasoning models use thinking tokens which never get seen by the end user. And oftentimes when AI is deployed, you can actually set thresholds on how many thinking tokens are allowed per interaction. And so when you are estimating demand, you do need to think through: are we using reasoning models? What are our thresholds? What do we expect the peak and average to be on that use? So that's one. Second is agentic. Agentic is a huge multiplier because any use case, if you are deploying it in this sort of agentic workflow context, has multiple turns and loops that might happen, and that can increase your token demand significantly. And then finally, the last factor is something called the cache hit rate, or the KV cache hit rate. And for those listeners for whom this term might be new, the KV cache is sort of like the short-term memory of a model. And so anytime an input request comes into a model, it needs to process it.
But if it's already seen that input request before, then many times it actually gets stored in the cache, and then when it comes in again, it doesn't need to recompute it, it can just use those cached values. So those are some key factors to look at to get to a higher degree of accuracy when thinking about token demand. And then the final one is demand variability, which is how your demand changes in a day. Sometimes you may have products that get used quite a lot in the morning hours, but not so much in the evening, or vice versa. Same thing with seasonal variability. For example, retail providers or e-commerce will see a surge during the holidays when they're trying to push a lot of products out. So you do need to think through those. And then of course there is the user growth. So you started with a base number of users, but you as a business are trying to constantly drive up user growth. So you need to factor in how much you expect that to grow as you think through your token demand.
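The back-of-napkin estimate and its multipliers can be written down as a short calculation. The sketch below is an illustration under stated assumptions, not a forecasting method from the conversation: the function names and every number are hypothetical, thinking tokens are modeled as a fixed per-call budget, agentic loops as a count of LLM calls per request, and the KV cache hit rate as the fraction of input tokens that do not need to be recomputed.

```python
def daily_token_demand(
    users: int,
    requests_per_user: float,
    input_tokens: float,
    output_tokens: float,
    thinking_tokens: float = 0.0,      # reasoning budget per LLM call (assumed)
    llm_calls_per_request: float = 1,  # > 1 for agentic loops (assumed)
    cache_hit_rate: float = 0.0,       # share of input tokens served from KV cache
) -> float:
    """Tokens that must actually be computed per day."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    uncached_input = input_tokens * (1.0 - cache_hit_rate)
    per_call = uncached_input + thinking_tokens + output_tokens
    return users * requests_per_user * llm_calls_per_request * per_call


def provisioned_demand(
    average_demand: float,
    peak_to_average: float = 1.0,  # daily or seasonal variability
    monthly_growth: float = 0.0,   # expected user growth per month
    months: int = 0,
) -> float:
    """Scale average demand for peaks and for growth over a planning horizon."""
    return average_demand * peak_to_average * (1.0 + monthly_growth) ** months


# Base case: 1,000 users, 10 requests each, 500 input + 500 output tokens.
base = daily_token_demand(1000, 10, 500, 500)
# Same users, but reasoning, a 5-call agentic loop, and a 50% cache hit rate.
agentic = daily_token_demand(1000, 10, 500, 500, thinking_tokens=1000,
                             llm_calls_per_request=5, cache_hit_rate=0.5)
print(base, agentic)  # 10000000.0 87500000.0
```

With these made-up inputs the multipliers raise demand from 10 million to 87.5 million tokens per day, which is why the base estimate alone can be misleading.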
B
So demand of course leads us to supply. How do you start thinking about supply? And you've mapped out sort of your baseline and the conditions that you just outlined. Shruti, how do you go about then translating that into creating the supply necessary to get all these tasks done?
A
When it comes to token supply, that's where a lot of the AI infrastructure decisions lie. When you're making that decision, what you want is maximum token availability, token output, while minimizing your token cost. Now, when you think about cost or total cost of ownership, oftentimes organizations and decision makers can gravitate towards the easily available metrics, what I like to call input metrics, such as the cost per GPU hour or the flops per dollar, which is essentially how many floating point operations you are getting per dollar. These are input metrics because they don't tell you anything about the actual delivered token output, which is a function of much more than just flops or just the memory you have. It is a function of extreme co design. And so the metric that represents both your input and the output is cost per token. It's a very simple metric that tells you the cost that you're paying to generate one token. And it's essentially the cost of the GPU divided by how many tokens the GPU produces. So in a way it incorporates both the input and the output and gives you a sense of your true ROI from the AI infrastructure.
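The cost-per-token definition above reduces to one division. The sketch below assumes an hourly GPU price and a sustained delivered throughput as inputs; the function name and the two example systems are hypothetical and chosen only to show how an input metric and an output metric can point in opposite directions.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Cost of the GPU divided by the tokens it produces, per million tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical systems: B costs twice as much per hour as A,
# but delivers ten times the token throughput.
system_a = cost_per_million_tokens(gpu_cost_per_hour=2.0, tokens_per_second=1_000)
system_b = cost_per_million_tokens(gpu_cost_per_hour=4.0, tokens_per_second=10_000)

print(f"A: ${system_a:.3f} per million tokens")
print(f"B: ${system_b:.3f} per million tokens")
print(f"B is {system_a / system_b:.1f}x cheaper per token")
```

Judged on hourly price, system B looks twice as expensive; judged on cost per token, it is five times cheaper in this made-up case.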
B
It's interesting to hear you explain it because it sounds so simple. And I can understand coming from the other point of view, right, of what we're outlaying for the GPUs and the server racks and all the interconnectivity. And so we, you know, we count those costs. But looking at it from the other end, as you said, the output cost just makes so much sense because that's what you're trying to get at the end is the intelligence, the token. And so putting the price on that seems like a really kind of clear way to think about it. Does the cost per token metric vary at all or do you have to think about it differently depending on the use cases? As you were talking about before,
A
Cost per token is sort of the base metric. Now, of course, it will vary depending on all the other things, like the model, the context, the intelligence, basically, and then the interactivity. So any tokens that are generated by a more complex model or are more interactive are going to be costlier. Of course, that's just physics. Right. So, yes, it definitely does depend on the models, the context, as well as the interactivity. But you said it really well earlier that ultimately, if the business runs on the output, which is the tokens, it is kind of a fundamental mismatch if you are evaluating infrastructure based on the inputs, but your business runs on the output. And that's why cost per token starts to get at sort of the real ROI, because it measures both in many ways.
B
Yeah. So Shruti, as we think through input metrics and cost per token, is there an example that comes to mind that can really kind of bring this idea to life?
A
Yeah, absolutely. In fact, if you look at Nvidia Blackwell compared to Nvidia Hopper, and if you look at just merely the input metrics, which is the hourly GPU cost, that's 2x. So that's Blackwell being maybe 2x more expensive than Hopper. If you just look at flops per dollar, that's also 2x. So Blackwell does deliver 2x more flops per dollar. And that sounds like a huge advantage, which it is, but it also doesn't even scratch the surface of the true sort of benefit and value of Blackwell. And that's because Blackwell, when it comes to delivered output, delivers 50x more tokens per watt compared to Hopper.
B
50x. 50x. Fantastic.
A
So, with the same infrastructure footprint, the Blackwell NVL72 system delivers 50x more tokens than Hopper, and that translates to a 35x lower token cost.
B
Amazing.
A
Yeah. And so that really, I think, brings the point home on why not to just look at the input metrics, but to look at a metric like cost per token, which represents both what you're paying and what you're truly getting.
B
So I'm glad you mentioned. I was going to ask you to go back. You'd mentioned extreme co design. We've talked about it before. Obviously anyone familiar with the space has heard the term, but maybe you can dig in a little bit to what it means, particularly in this context.
A
Yes, I actually welcome the opportunity to talk about extreme co design because we get asked this question quite a lot. So often we get asked, why extreme co design? What does co design even mean? Is it just integration? And people may think that this is just splitting hairs or just semantics, but I do think that the distinction is important, because when you think about integration, you think about different parts, different sort of independent units that are then integrated post facto. Whereas co design is about designing multiple parts of the same system simultaneously, from the ground up, knowing that they are all optimized towards the same outcome, that of lowest token cost. That's why the word co design is extremely important. And the reason we call what Nvidia does extreme co design is because of the depth and breadth into which it extends. So is it co design across just compute? No, it's compute, memory, storage, networking, everything. I mean, the Vera Rubin platform has seven chips, but it goes even beyond that. There's all the software that sits on top, so everything from the CUDA kernels to the runtimes to the serving software, as well as all the way out to the ecosystem. Everyone from our silicon partners, our OEMs, the cloud providers that we work with, the various OSS frameworks that we work with. The co design extends beyond just what's in a system, what's in an AI factory, all the way out to the ecosystem. And that's one of the reasons why it's extreme. But you asked a more specific question, about what are some of the extreme co designs that help with the cost per token. And I think one important one, which I think you've actually discussed in a previous podcast, is the mixture of experts models, and how the Blackwell NVL72 is such a great fit for them because it helps with the inter-GPU communication.
And then there's all the software: Dynamo's disaggregated serving, coupled with any of the runtimes that we support, whether it is TensorRT-LLM, vLLM, or SGLang, doing a technique called wide expert parallel that greatly optimizes the inference performance and thereby reduces the cost per token for those mixture of experts models. So that's one great example. The other really good example is actually the Vera Rubin platform itself, which is built for the age of agentic AI.
B
Sure.
A
And to really understand why that extreme co design is required, maybe we can look at what an agentic workload is like. So thinking about an agentic workload, let's draw the parallel to the conversational workload. In a conversational setting, the user prompts something. Say you prompt something, and then the LLM says something, and then you say something else, and then it says something back. So you as a human are taking turns with the LLM, with the AI. In agentic, it's actually AI taking turns with AI as well as with software. Because the main agent can, based on the user input, decide to do some reasoning, then decide, oh, I need to do a tool call, so it calls some software. Then it might decide, oh, I need a sub agent or a specialized agent to go do some work. So it's going to take a turn with the specialized agent until that agent does its computation and comes back with the result. And this just keeps going.
B
We love Agentic for that.
A
That's right. And it's multi-turn in a way that has no user involvement other than the prompt the user gave, maybe, say, book a ticket to Miami, and then it goes through several turns to finally produce an outcome. And the number of turns involved in agentic is significantly higher than in conversational. So the number of LLM calls, the number of times the large language model is called, is also higher. And for that reason the token demand in general is also higher. And that's why extreme co design is so critical, because you are using up so many tokens, so you have to lower the cost per token. Latency is also really important, because on every turn even a couple of milliseconds more add up to potentially several seconds of delay for the end result. And then finally, coming back to the Vera Rubin platform, now that we've described the agentic workload, we can clearly see why the co design is required. To accelerate the LLM itself, the reasoning and the AI itself, you need the Rubin GPU, and you need the Groq 3 LPX solution to deliver that ultra low latency. You need the Vera CPU, because it's going to do all this tool calling or sandboxing for code generation and code testing. You need the CMX platform that we've talked about, which is the BlueField DPUs together with Spectrum-X, which allow the KV cache, or the short-term memory as we discussed, to be offloaded when needed so that it can be retrieved when required for a match with an incoming request. And so that's another example of co design, where being able to develop all of these from the ground up helps a lot.
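The two effects described above, per-turn latency adding up and context being reprocessed on every turn, can be made concrete with a toy model of an agentic session. This is a sketch under simplifying assumptions: the function name and all numbers are hypothetical, every turn adds the same number of new tokens to the context, and with KV cache reuse only tokens not seen on a previous turn need to be processed as input.

```python
def agentic_session(
    turns: int,
    prompt_tokens: int,
    tokens_added_per_turn: int,
    extra_latency_ms_per_turn: float,
    reuse_kv_cache: bool,
) -> dict:
    """Toy model of one multi-turn agent run."""
    context = prompt_tokens  # tokens the model must attend to this turn
    cached = 0               # tokens whose KV values are already stored
    input_tokens_processed = 0
    for _ in range(turns):
        if reuse_kv_cache:
            # Only the part of the context not yet cached is recomputed.
            input_tokens_processed += context - cached
            cached = context
        else:
            # The whole context is recomputed on every turn.
            input_tokens_processed += context
        context += tokens_added_per_turn  # model output, tool results, etc.
    return {
        "input_tokens_processed": input_tokens_processed,
        "added_delay_ms": turns * extra_latency_ms_per_turn,
    }


no_cache = agentic_session(3, 100, 50, 5.0, reuse_kv_cache=False)
with_cache = agentic_session(3, 100, 50, 5.0, reuse_kv_cache=True)
print(no_cache["input_tokens_processed"])    # 450
print(with_cache["input_tokens_processed"])  # 200
print(with_cache["added_delay_ms"])          # 15.0
```

Even at three turns the cached run processes less than half the input tokens, and the gap widens as the number of turns grows, while the added delay scales linearly with the turn count.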
B
Right, right. So we talked about extreme co design and you mentioned all the different pieces that go into it building, designing and building from the ground up. Software is a part of that. But maybe you could double click Shruti into how software plays a role and how important software really is.
A
Yes, absolutely. So software actually is the difference between what you get in the real world, the delivered token output and the actual token cost, versus what you see on a spec sheet. Software makes all the difference. All the things on the spec sheet, the system design, cannot be fully realized unless you have software that makes use of them and delivers really good output. The other important thing about software is that it cannot be piecemeal optimization. You need to have a robust software stack that can turn on and enable every single optimization. So it can do, say, NVFP4 quantization. It can also do MTP or speculative decoding. It can also do disaggregated serving, it can do wide expert parallel, KV cache offloading, KV-aware routing, and on and on and on. Right. To be able to stack all of those optimizations together is really important, because that is what gets you the 50x more throughput that we see with Blackwell and the 35x lower token cost. And so software is a huge, huge piece of that story, for sure. The other thing about software is that it never stops. Open source software especially, it never stops. And it's not just the Nvidia team that is building the software, it's the entire ecosystem. It's all the OSS frameworks, all of our partners, customers, the developer community, and every small optimization that they do is a drop that just keeps adding and adding to this massive ocean of advantage that is the Nvidia ecosystem. And so just as an example, on both vLLM and SGLang, which are inference runtimes, we've seen 8x more performance in just about six months. And that's huge, because from the same infrastructure footprint you're getting so much more token output, and that's driving down your token cost as well. So absolutely, software is a huge, huge piece of the puzzle.
B
So we're through three of the four pillars. The fourth one, perhaps the big one, is monetization. How do you talk about monetization? How does a business leader think about it? Okay, so I understand the importance of extreme co design. I understand the different value of tokens in different situations and different tasks. Agentic is wonderful, and requires more tokens. With all the things you've elucidated here, how does a business leader think about monetization of the tokens?
A
Right. So when it comes to monetizing tokens, there are various different ways in which you can go to market, but one of the best proxies is to just think through it as you're generating tokens and then you're selling the tokens. And so when you think about selling the tokens, how much do you sell them for? And it's sort of a classic exercise in figuring out your pricing, which is a, you need to think about what is the cost to produce the token, which is this lowest token cost that Nvidia is helping with. Right. But you do need to understand what is your token utility. And given that token utility and token value, what is your cost to produce the token? And you obviously want to charge more than that. Right? So there's that. Okay, so that's cost based pricing.
B
Right.
A
And then you also obviously have to think through value-based pricing, which is essentially how much is the willingness to pay, what is this sort of token utility, how valuable is it to the people who are going to pay for it? So you do need to take that into account. And then finally, before you think through the pricing, you also need to think through what the demand distribution is. Because ultimately there are revenue goals and profit margin goals that you are working towards. And so to land at a place that you like, you do need to think through where your bulk demand will be and how the demand will taper off. For tokens that don't have as much utility, there may not be many takers, but in the same way, for tokens that are highly valuable, there will be fewer people who are willing to pay the premium. You do need to account for that sort of demand distribution. And then with those three things, you can figure out what the pricing for each token can be and then deploy AI successfully. The key thing here, though, is that pricing the tokens is obviously just one proxy. There will be customers who are building value-added services on top of those tokens.
B
Sure.
A
So like customers building AI native products or something like that. And then in that case the process is similar, but you do need to think through then what is the additional value you are adding on top of just generating those tokens as well.
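The three pricing inputs described above (cost to produce, willingness to pay, and demand distribution) can be combined in a small search over candidate prices. The sketch below is a deliberately simple illustration, not a pricing method from the conversation: the function name, the customer segments, and all prices are hypothetical, and each segment is assumed to buy its full volume at any price up to its willingness to pay and nothing above it.

```python
from typing import Optional, Sequence, Tuple


def best_price(
    cost_per_million: float,
    segments: Sequence[Tuple[float, float]],  # (max price accepted, million tokens bought)
    candidate_prices: Sequence[float],
) -> Optional[Tuple[float, float]]:
    """Return (price, profit) maximizing profit, or None if no price beats cost."""
    best = None
    for price in candidate_prices:
        if price <= cost_per_million:
            continue  # cost-based floor: never sell at or below cost
        # Value-based ceiling: a segment buys only if the price is within reach.
        volume = sum(v for max_price, v in segments if max_price >= price)
        profit = (price - cost_per_million) * volume
        if best is None or profit > best[1]:
            best = (price, profit)
    return best


# Hypothetical demand distribution: bulk demand at low prices, few premium buyers.
segments = [(2.0, 100), (5.0, 30), (10.0, 5)]
print(best_price(1.0, segments, [2.0, 5.0, 10.0]))  # (5.0, 140.0)
```

In this made-up case the middle price wins: the lowest price captures the most volume but little margin, and the premium price has too few takers.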
B
So going back, Shruti, to something I was thinking about when you were explaining extreme co design. When you get to a point, kind of a sweet spot, where your infrastructure is humming and the cost per token is low, does that mean that ultimately you won't need as many GPUs to produce the number of tokens that you really need, or what happens in that kind of a scenario?
A
Right. This is a great question. And what we see here is the classic Jevons paradox, which is essentially, you would think that, okay, the GPUs are way more productive, they're generating so many more tokens, so do you need fewer of them? And the answer is absolutely no. And the reason is, as you see the efficiency, new use cases get unlocked. We have this thriving research community, data scientists, ML engineers, and they just figure out how to use up that efficiency, how to absorb that efficiency and do more with it.
B
Right. People aren't going to run away from intelligence, they want to use it.
A
That's right. And if you look at the macro picture that we've seen so far, it's very telling. So when generative AI became a thing and people were generating summaries and images, that was great. Then we lowered the cost per token, and instead of needing fewer GPUs, they needed more GPUs and more tokens. Why? Because of test-time scaling and reasoning. Our researchers figured out that by scaling at test time, we can generate better, more accurate, more intelligent responses. And that was valuable for the use cases. And so that happened. And it didn't just happen once. We are seeing that again now with agentic, now that we've figured out how to deploy these mixture of experts models and reasoning models efficiently and lowered the cost per token for those significantly. Now here comes another inflection point where it's like, hey, we've got more tokens, let's do more with them, of course. And so that's where the agentic revolution is happening. And so definitely it's Jevons paradox in action at the macro level. And I've also seen this play out at individual customers. So that's a great question.
B
So, Shruti, can we kind of ground this in some examples of how businesses, organizations that you've been working with are putting all of this into action and really extracting value from the tokens and using them to build.
A
Yeah, absolutely. So when you think about taking tokens and turning them into business value, there are four primary, I would say, business models, so to speak. Number one is what we just discussed, which is selling tokens directly, and a lot of Nvidia customers and partners are doing this. The examples are Fireworks, Baseten, Together AI, DeepInfra. There are just so many of them, and all of them are helping their end customers build valuable services on top of the tokens that they are selling. So that's number one. Number two is AI native companies who are building products from the ground up with AI permeating through them from day one. And those are customers like Perplexity, or Cursor, who have a coding engine, and many, many others. So that's the second model. Third is you might use AI to enhance your existing products and infuse AI through your existing products. And again, there are lots of different examples. We have Shopify, we have Airbnb, there is Adobe. In fact, a lot of them are doing both. They are building AI native capabilities, but they're also using AI to improve their existing products. For example, Adobe has built their Firefly family of models, and then they're using those models to infuse new capabilities into Photoshop. And then the final bucket is pretty much every organization today which is trying to improve their internal operations, their internal processes, and employee productivity by deploying AI. So they are not necessarily deploying external customer-facing products or services; these are internal to their own operations. Again, Nvidia is working pretty much with everyone across the board on something like that as well. So those are the four key ways. I'm sure that there are others that are more nuanced that I missed, but that's a useful framework to think about how to take the tokens and turn them into business value.
B
Right. So for the business leader who's listening and comes away from this with a better understanding of the pillars and how they relate, and really what tokenomics is, what the cost of a token is and how you ascribe value and everything, how do they get started putting this into practice? What advice would you leave them with for thinking about how to put this into action in their own organizations?
A
I think the best place to start is to first just think through what the final outcome is. And usually that starts with your customers, whether they are external customers or your own internal employees and internal processes. That's immaterial. You have to start back from the customer need, from the use case, because as we discussed, the use case actually dictates a whole lot. The user and the use case dictate what type of model you will use. They dictate what type of context lengths you might need to support. They dictate what type of interactivity you will need. So that's the intelligence and the interactivity, and then those factors are what dictate what type of infrastructure you need. And then of course we walked through the key metrics, such as cost per token, to use when making those infrastructure decisions, and that's the supply. So you essentially walk back from token utility and token demand, think through token supply, and then once you have a handle on all three of those, you think through your monetization strategy and then go to market, and then you fly.
B
Customer first, work back from that. Easy enough. Shruti, thank you so much for taking the time to join the podcast and really break down tokenomics in a way that I think listeners and viewers can extract so much value from, to talk about extracting value. It's a really comprehensive and yet really easy to follow account, start to finish, of how this all comes together. So thank you. Thank you again.
A
Yeah, thank you for having me.
Date: May 21, 2026
Host: Noah Kravitz
Guest: Shruti Kopakar, Accelerated Computing Team, NVIDIA, Focus on Inference
In this episode, Noah Kravitz speaks with Shruti Kopakar about the emerging discipline of "AI tokenomics"—how the generation, valuation, and monetization of tokens (i.e., LLM outputs) are reshaping business models in the age of AI factories and accelerated computing. Shruti unpacks the four critical pillars of AI tokenomics: utility/value, demand, supply, and monetization, guiding business leaders on how to maximize business value by understanding and managing tokens at every step of the AI deployment lifecycle.
Shruti breaks down "tokenomics" into a framework for understanding how AI outputs ("tokens") are valued, supplied, consumed, and monetized in a business context.
“So the way to think about tokenomics is it's about how tokens are valued, supplied, consumed, and monetized. And what that essentially maps to is token utility, token supply... token demand... and token monetization.”
— Shruti Kopakar [00:58]
“Not all tokens are created equal. There are two key factors that impact token value. One is the intelligence embedded in the token... [and] how fast does it arrive?”
— Shruti Kopakar [00:00], reiterated at [02:21]
“[In agentic workflows] there are multiple turns and loops that might happen that can increase your token demand significantly.”
— Shruti Kopakar [06:32]
Shifting to Output Metrics: ([10:00], [12:09])
Model Complexity Affects Costs: More intelligent and interactive tokens are costlier to produce.
Quote:
“If the business runs on the output, which is the tokens, it is kind of a fundamental mismatch if you are evaluating infrastructure based on the inputs.”
— Shruti Kopakar [12:09]
Example – Blackwell vs. Hopper:
“Blackwell, when it comes to delivered output, delivers 50x more tokens per watt compared to Hopper. ...That translates to a 35x lower token cost.”
— Shruti Kopakar [14:03]–[14:20]
Definition: Not just integrating components, but designing compute, memory, networking, storage, and software as a unified system optimized for lowest cost per token ([14:38]–[17:42]).
Scope: Extends from hardware (CPUs, GPUs, DPUs) through software layers, including CUDA, runtimes, serving infra, and ecosystem partners.
Examples:
Agentic Workloads:
“In agentic, it's actually AI taking turns with AI as well as with software.”
— Shruti Kopakar [18:49]
“Software actually is the difference between what you get in the real world, the delivered token output and the actual token cost versus what you see on a spec sheet.”
— Shruti Kopakar [21:10]
“There will be customers who are building value-added services on top of those tokens.”
— Shruti Kopakar [26:11]
“It's Jevons paradox in action at the macro level.”
— Shruti Kopakar [28:52]
“A lot of Nvidia customers and partners are helping their end customers build valuable services on top of the tokens that they are selling.”
— Shruti Kopakar [29:07]
“You have to start back from the customer need, from the use case, because as we discussed, the use case actually dictates a whole lot.”
— Shruti Kopakar [31:46]