All right, Matt, I've read the YouTube comments, and this time I want it so you do not cut me off with the music too fast. Okay? Good. Right. All right, let's go. This is this week's Better Offline monologue. A lot of you have been saying you want me to do something about Sora, and if I'm honest, I haven't wanted to, because I find the whole thing so utterly pathetic. A few weeks ago, OpenAI launched a half-baked social networking app attached to a compute-intensive video and audio generator, and people immediately began to do two things: freak out, and generate as many copyright violations as humanly possible. All because OpenAI's original plan was to ask copyright holders to opt out of having their content presented in these videos. Sora spent several days covered in Nazi SpongeBobs and Pikachus with guns before multiple Hollywood talent agencies, along with the estate of Martin Luther King Jr., intervened and complained, leading to OpenAI creating, to quote NPR, an opt-in policy allowing "all artists, performers and individuals the right to determine how and whether they can be simulated," with OpenAI blocking the generation of well-known characters on its public feed and offering to take down material not in compliance. It's unclear what happened with Nintendo, but I imagine one of their 70 million lawyers attacked. And now that we've got that out of the way, let's talk about Sora itself. I understand a lot of the people who listen work in film and TV, and they're kind of scared. And I understand that you've seen a few clips that look kind of sort of realistic, and that this, especially if you're in the creative arts, is quite terrifying, because your mind naturally assumes that these clips can be strung together into some sort of coherent whole. This isn't the case. Every single good (and I use the term loosely) Sora video is cherry-picked from many, many, many terrible generations. Every time you use Sora is random.
It doesn't matter how specific your prompt is or how many times you've used it. Sora is effectively a giant video and audio slot machine. You can never, ever guarantee that Sora will generate something useful, and as a result you can never really budget for using it. The human eye is remarkably demanding, and little visual inconsistencies between scenes will make people feel weird and uncomfortable. Imagine that extrapolated to 10 or 15 seconds at a time, and how difficult it will be to get something that makes visual sense, before you even have to think about things like: does this connect to the rest of the footage I'm using? Okay, so the majority of actual professionals who would use Sora would not be using the app. They'll be connecting directly to the model on OpenAI's API. It's just not done via a classical app interface. Then there's the problem of cost. This is where you really need to start worrying if you're building things with Sora. OpenAI offers two different Sora models: Sora 2, which they say is designed for speed and flexibility and is "ideal for the exploration phase," and that costs 10 cents per second; and then there's Sora 2 Pro, which is either 30 cents or 50 cents a second, depending on resolution, and which, I quote, "is the thing you go to for production quality outputs." So you're either spending $1, $3 or $5 for every 10 seconds of footage. And like every generative model, the longer you generate, the higher the likelihood of hallucinations, which in the case of Sora means bizarre animations, inconsistent details, or just flat out useless crap. Then there's the problem of time. OpenAI's own documentation says that a single render may take "several minutes." At the end of those several minutes, out pops a video that may or may not be of any use. OpenAI allows you to remix using more prompts, which allows some iterative development. But these remixes also cost money and also take several minutes.
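Those per-second rates turn into per-clip costs fast. Here's a minimal sketch of that arithmetic (the tier labels are my own shorthand for the pricing described above, not OpenAI's actual model identifiers):

```python
# Per-second pricing as described above: Sora 2 at $0.10/s,
# Sora 2 Pro at $0.30/s or $0.50/s depending on resolution.
# Tier names below are illustrative labels, not real API model IDs.
PRICE_PER_SECOND = {
    "sora-2": 0.10,
    "sora-2-pro-standard": 0.30,
    "sora-2-pro-high-res": 0.50,
}

def clip_cost(tier: str, seconds: int) -> float:
    """Dollar cost of one generated clip, usable or not."""
    return PRICE_PER_SECOND[tier] * seconds

# Ten seconds of footage: $1, $3, or $5 per attempt.
for tier in PRICE_PER_SECOND:
    print(f"{tier}: ${clip_cost(tier, 10):.2f} per 10-second attempt")
```

The key point the arithmetic makes: every attempt bills the same whether the output is usable or garbage, so the real cost per *usable* clip is this number multiplied by however many tries it takes.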
So let me walk you through a scenario. You're making a short film; let's just say it's 15 minutes long, which is 900 seconds. You ask Sora to generate a man putting on a hat. Your first eight generations, each taking four minutes and costing $5 apiece, take about 32 minutes and $40, and they don't really do the job, so you do two more, taking another four minutes apiece and 10 more dollars. You finally, on the next try, get something kind of useful, which costs you another $5, and then you realise you wanted him to wear a specific kind of hat. This happens all the time when directing stuff. There are minor changes you make that you realise, when you're finally in the moment, would look or sound or be better. That doesn't go so well with probabilistic models. So, shit, fuck, you gotta do something. So you remix him: another four minutes, another $5. Fuck, wrong hat. Four minutes, $5: right hat, but his hand blends through it for some reason. Okay, four minutes, $5. The hat's right, but when he puts it on, one of his eyes just blinks three times for some reason, so you can't use it. Okay, four minutes, $5. Looks kinda good. Different hat. Again, four minutes, $5. Hmm. You've now spent $80 and over an hour generating a man trying to put on a hat, and you're not really much closer to having useful footage. And because, as you remix it again and again, Sora keeps making these little errors, because that's how these models go, it's impossible to tell whether the next generation will be the one that works or whether Sora will spit out some new little fuck-up. So the more intricate something is, the more expensive it gets. But you know what? You can find money places. You can't find more goddamn time. I guess you could have a separate computer running more generations, but that's still going to cost a bunch of money. How many of these slot machines are you going to run at once? How many times are you going to allow them to edit?
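The scenario above can be sketched as a running tally, using the per-attempt price and render time from the scenario (every attempt bills the same whether or not it's usable):

```python
# Running tally for the hat scenario: each attempt is a 10-second clip
# at $5 (the high-res Pro rate) and roughly four minutes of render time.
COST_PER_ATTEMPT = 5      # dollars
MINUTES_PER_ATTEMPT = 4

# 8 failed tries + 2 more failures + 1 near-miss + 5 remixes
# chasing the right hat, as walked through above.
attempts = 8 + 2 + 1 + 5

total_cost = attempts * COST_PER_ATTEMPT
total_minutes = attempts * MINUTES_PER_ATTEMPT

print(f"{attempts} attempts: ${total_cost}, {total_minutes} minutes")
# 16 attempts: $80, 64 minutes -- with no usable footage at the end
```

And that's for one shot of one man putting on one hat, before any of it has to match the shot that comes next.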
How can you have a coherent vision when you've got multiple people generating things? You can't. But you know what? Perhaps the next generation will be great. Or perhaps it will be dog shit. You have no way to know, because that's the magic of generative AI. Yet these problems compound aggressively once you need any kind of visual consistency. The man now has to put the hat on and leave the house. How does the house look? Is the hat the same? Does he have wallpaper on his walls? Is there anyone else in the house? What kind of table? Two chairs, one chair, five chairs? How do you possibly keep all of these things consistent? You don't. You can't. That's part of what makes Sora so goddamn awful. It's built specifically to make you scared of it, to create superficially impressive clips that brain-dead Hollywood executives can claim are the future. Yet in a practical sense, it's impossible to budget or plan or guarantee anything about what Sora might do. And this is pretty much across the board for generative models that make video and audio. Now, I've heard from a few people that Sora is cheaper because it doesn't involve labour, which is something you could say only if you believed Sora would give consistent outputs, when really the only thing a probabilistic model like Sora can guarantee is inconsistency. Even by Hollywood accounting standards, a generative tool that will cost hundreds or thousands of dollars to generate 10 seconds of shitty footage that is impossible to coherently connect to more footage is a really terrible idea, and very inconsistent in its costs too. And like I said earlier, there's the issue of time. Every single entertainment product requires some sort of time budgeting, and it's impossible to say how long it will take Sora to generate something. OpenAI doesn't even specify what "several minutes" means, meaning you can't really plan a production using it.
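The budgeting problem can be made concrete with a toy model. Assume, purely for illustration, that each generation independently comes out usable with some probability p (the value of p here is my made-up assumption, not a published figure): the number of attempts per usable shot then follows a geometric distribution, so the *average* cost is knowable, but any individual shot can blow far past it, which is exactly what you can't tolerate in a production budget.

```python
# Toy model of budgeting against a slot machine: each attempt costs $5,
# and "succeeds" independently with assumed probability p. This is an
# illustration of the variance argument, not Sora's real success rate.
import random

COST_PER_ATTEMPT = 5  # dollars per 10-second attempt

def attempts_until_success(p: float, rng: random.Random) -> int:
    """Number of attempts needed before one usable generation."""
    n = 1
    while rng.random() >= p:
        n += 1
    return n

rng = random.Random(0)
p = 0.1  # assume 1 in 10 generations is usable (made-up number)
costs = [attempts_until_success(p, rng) * COST_PER_ATTEMPT
         for _ in range(10_000)]

print(f"average cost per usable shot: ${sum(costs) / len(costs):.2f}")
print(f"worst case in this run: ${max(costs)}")
# The mean lands near $5/p = $50, but the tail is long: some shots
# cost several hundred dollars, and you can't know in advance which.
```

The design problem this exposes: a line producer can budget a mean, but not a geometric tail, and the tail is where productions die.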
Sora isn't cheaper, Sora isn't easier, and Sora certainly isn't more efficient. But you need to remember also that generative video models have been around for over a year and they're not really seeing mass use. If this thing were capable of making anything truly useful, you'd see it everywhere right now. But you are seeing a little bit of it, and I do want to address that. You probably saw Kalshi's ad and heard that it cost $2,000 to make and took only a few days. But I really encourage you to look at the actual commercial itself. It's completely incoherent nonsense, each shot completely disconnected, with weird glitches and animations in the crowds. At one point towards the end, a woman is meant to say "OKC," but the "C" part does not map to her mouth. It looks really bad, and the only way you could get away with something like this is having these quick-hit shots. And also, please go and view the comments about this, where people just rip the fuck out of this thing. But nevertheless, it was made using Veo 3, Google's generative video model, and it apparently took 300 to 400 clips to get 15 usable shots, stitched together using traditional editing tools. Now, the reason this cost two grand is that it sucked. And the reason you're not seeing more advertisers do this is because it's impossible to make a coherent video out of this footage. I realise most commercials you see on TV may feel chaotic or kind of bland, but they're remarkably precise, and the generative shots used for the Kalshi commercial are chaotic and fail to convey any real meaning beyond a person yelling "Indiana" or "OKC." The only reason it cost so little was that one guy put several days of prompting into it, and the end result was shitty. And Kalshi didn't mind, because this was a publicity move. Kalshi put out the commercial specifically so the media would write it up. And they succeeded, because the media loves to feed on scary stories like "AI is going to replace human actors."
Since the Kalshi ad, PJ Ace, who made it, has made a few others, including a Popeyes rap ad where, again, go and look at the comments. I'm not linking to it, by the way; I don't want to send them any fucking traffic. But the Popeyes one has people just responding saying this looks like shit. What is this? It's incoherent, it's inconsistent. But the funniest one I found was David Beckham's IM8 health supplement ad, which ends with a shot of the bottle of the product with a bunch of garbled generative text. It does not appear that PJ Ace has got a ton more work from this, probably because the outputs kind of suck and brands really do not like inconsistent things. And also, a health supplement from David Beckham? Jesus Christ, just say it's a private equity firm. Anyway, to conclude, I also want to be clear that the rates for these videos are heavily subsidised by Big Tech, just like every other generative AI product. While Sora might cost 30 or 50 cents a second right now, once the AI bubble bursts, these prices will either skyrocket or these models will cease to exist for public consumption. The biggest clue I can give you is that Google only allows you to generate four or five Veo 3 videos a day on their $250-a-month Gemini Ultra plan. That suggests that Google's video costs are brutal, and that OpenAI is burning money by the bucketful to let you fuck around on the Sora app. I don't recommend you do that, but if you have, just know you're burning a hole in Clammy Sammy's pocket. I will add that you may worry about these models getting better. While they might become more nuanced in their ability to generate video in five or ten second bursts, generating longer or consistent videos is inherently impossible due to the probabilistic nature of transformer-based models. In simple terms, these things are rolling the dice every time.
The way you prompt them is what makes them generate; they don't have minds or thoughts, they're just rolling the dice every time on whatever you say and trying to interpret what you mean. Human beings, by the way, are extremely magical. I think you really underestimate how amazing people are. When we direct someone on a film set, even an assistant director, that person keeps the production moving and makes sure everyone gets what they need and pushes back on a director when something might be impractical. A director is a visionary, but an actor is also someone who takes an interpretation and is then directed to do different things. And that direction is not a fucking prompt. Move your elbow. Look this way, look that way. The things that happen on a film or TV set are inherently different to just plugging words into a model. And I get them. I get everyone in Hollywood who's scared right now. I get everyone in the creative arts who is scared right now. I feel for you. But these people are losing. This stuff does not work. It's inconsistent, it's incredibly expensive even at subsidised rates, and in the end I really, really believe that once the bubble pops, these things are going away. Thank you so much for listening. Reach out if you have any thoughts. I always love to hear from people: ez@betteroffline.com. I love getting your emails. I love getting your weird little missives on Reddit. I'm truly blessed and I love you all. I love how many of you listen. I love how communicative you are. It's been a big week with the Anthropic exclusive, and yeah, I'm gonna have Radio Better Offline next week as well.