
A
Okay. We're here in a studio, a remote studio, with Mikhail Parakhin, CTO of Shopify.
B
Thank you, thank you.
A
I don't even know if I should introduce you as CTO of Shopify. I feel like you have many identities. You led the Bing ML team, I guess, or the ads team, I don't know. People variously refer to you as, like, CEO or... I don't know what your previous role at Microsoft was.
B
That was... yeah, in my previous role at Microsoft I actually was the CEO of one of Microsoft's business units, which included, as I discussed, all the things that people like to laugh about, including Windows and Edge and Bing and ads and everything.
A
Yeah, yeah. What a wild time. You've obviously done a lot since you landed at Shopify. One of the reasons I reached out was because you started promoting more sort of internal tooling, primarily Tangle, but also a lot of people have seen and adopted Toby's qmd. And obviously I think Shopify has always been sort of leading in terms of engineering. I think it's just more recent that you guys have been more vocal about your sort of AI adoption. Is that true?
B
Well, I think AI tools in general are a fairly recent development, and Shopify, at this stage of its development, is developing AI in house, building tools that use AI, and, you know, interfacing with the wider AI community; we're on a sort of runaway trajectory. So as a natural byproduct, we talk about it more. Just yesterday, Andrej Karpathy famously tweeted about ways you can organize your agents to store data and then look it up, so that you don't have to re-research or lose context every time. And, a little bit tongue in cheek, I tweeted that, hey, we've done it much earlier, and Toby and I even have different approaches: Toby, of course, is a big fan of qmd, and I'm more of a SQLite fan. But yeah, very similar things that we'd already done here. The point is, we're a very dynamic, explosively growing company and we have to be at the forefront of AI adoption, obviously.
A
Yeah, yeah. Your team kindly prepared some slides, actually, that we were going to bring up on the screen. I think I can screen share, and then we can go through some of the shocking stats and maybe put some numbers to what exactly is going on. So here we have an internal AI tool adoption chart. What are we looking at here?
B
Yeah, these are very interesting statistics. This is the number of daily active workers, think of it as DAU, basically the active users of an AI tool as a percentage of all the people in the company, broken out by the different AI tools. And you can see two things here. One is that the green is the total, and you can see that it approaches nearly 100% by now; it's hard to do your job now without interacting deeply with at least one tool. The other interesting thing is, as many people commented, December was the phase transition, when suddenly models got good enough that everything took off and started growing. Many people noticed that small improvements accumulated into this big change in roughly the December time frame. The other thing I would claim you can see is that CLI-based tools, tools that don't require you to look at the code, are becoming more popular. You can see various versions of Claude Code and Codex and Pi and internal development tools taking off; blue is River, our internal agent for coding. Whereas tools that require an IDE, such as GitHub Copilot or Cursor, are not exactly shrinking, but they're not growing as fast. The red line is the IDE kind of tool, so you can see they're not experiencing as fast a growth.
A
As I understand it, basically every employee has their choice: use whatever tool you want. And then you're just kind of doing a daily survey or something?
B
Exactly. And the push is to get your job done. You can use any tool, and we effectively fund unlimited tokens for everybody. We do try to control the models that people use, but from the bottom, not from the top. We basically say, hey, please don't use anything less than Opus 4.6. Some people end up using GPT 5.4 extra high, some people use Opus 4.6; there are pluses and minuses, and going full 1-million-token context window versus not. But we try to discourage people from using anything less than that.
A
Yeah, yeah, got it, got it. I mean, the next chart here really shows the expansion and the sort of December 2025 inflection, right? People are using a lot of tokens. I think it's also really interesting that no one was kind of abusing it in 2025 compared to this year; there was almost no growth. I mean, it still probably grew 50%.
B
This is just a different scale. It's still exponential growth, just a different rate of expansion; there was an inflection point. And Sean, I would claim the super interesting part here is that you can see the distribution becoming more and more skewed. The top percentiles grow faster, so the people in the top 10th percentile, their consumption grows faster than the 75th, and so forth. So the distribution skews more and more towards the highest users, which... I don't know what it tells me. It feels not ideal, to be honest. Maybe it's okay, we'll see.
A
Why does it feel not ideal? Is it because of quantity over quality, or is the concern that, if you take it to the limit...
B
That if this rate of separation continues, yeah, there will be one person consuming all the tokens. Kind of strange.
A
Yeah, I mean, I think internal teaching and all that will help distribute things more widely. But in the early days, of course, the people who are more AI-pilled will obviously find more ways to use it than the people who are less AI-pilled, let's call it that. I'll just quickly pause from the... you know, we'll go back to the rest of the slides. But I just want to review: there are a lot of CTOs of large companies like yourself who are all considering some kind of token budget, right? I think it's something that Jensen Huang has been talking about, where if your $200K engineer is not using $100K of tokens every year, they're underutilizing coding agents. Of course Jensen Huang would say that. But it seems a very quantity-over-quality approach, and some people are basically asking: is this comparable to judging engineer quality by lines of code? Which we also know is kind of flawed, but better than nothing. So I don't know if you have a management take here on how to view this kind of metric.
B
I mean, you're baiting me. This is my favorite topic. If you let me, I'll probably talk for two hours on just this; I have a lot of things to say. I do think Jensen got a lot of bad press: oh, of course, you know, the cake seller says we don't eat enough cakes. But I actually think that's undeserved. I think he's actually right.
A
I do think he's directionally correct.
B
Yeah, he's directionally correct for sure.
A
Who knows what the right number is?
B
The thing that I do want to say, and this is something that we learned through trial and error and it's very important, is two things. One is that it's not about just consuming tokens. You can consume tokens and, in fact, the anti-pattern is running multiple agents, too many agents in parallel, that don't communicate with each other. That's almost useless compared to just fewer agents, and it burns tokens. What works is setting up the right critique loop, especially with the high-quality models, where one agent does something, the other one, ideally with a different model, critiques it and suggests ways to improve it, and the agent redoes it with this critique. It takes much longer, so people don't like it, because latency goes up: they have to wait until this debate has happened. But the quality of the code is much higher. And another thing, since you mentioned lines of code: lines of code are exploding for everybody right now, partially because AI is really more verbose, but partially just because AI can write a lot more code; it doesn't get tired. So you have to have a very strong narrow waist at PR review, otherwise the number of bugs will go through the roof. It's this unexpected consequence of sheer volume swamping everything. I would claim by now a good model writes code with, on average, fewer bugs than the average human, but since it writes so much more of it, more bugs will make it into production.
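A minimal sketch of the critique loop described here: one model drafts, a different model critiques, and the drafter revises against the critique. The `call_model` stub and the model names are placeholders, not Shopify's actual setup.

```python
def call_model(model: str, prompt: str) -> str:
    """Stub for whatever LLM client you use (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

def critique_loop(task: str, generator: str, critic: str, rounds: int = 2) -> str:
    # First draft from the generating model.
    draft = call_model(generator, f"Write code for this task:\n{task}")
    for _ in range(rounds):
        # A *different* model critiques the draft.
        critique = call_model(
            critic,
            f"Task:\n{task}\n\nCandidate code:\n{draft}\n\n"
            "List concrete bugs, risks, and improvements.",
        )
        # The generator redoes the work against the critique.
        draft = call_model(
            generator,
            f"Task:\n{task}\n\nPrevious code:\n{draft}\n\n"
            f"Reviewer critique:\n{critique}\n\n"
            "Rewrite the code, addressing every point.",
        )
    return draft

# e.g. critique_loop("parse RFC 3339 timestamps", "model-a", "model-b")
```

Latency goes up because the two models take turns serially, which is exactly the trade-off described above: tokens buy quality instead of parallel volume.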
A
So you still end up with more bugs.
B
Yeah, you have to have very rigorous PR reviews, also automated, of course, and you have to spend a lot of budget there. For me, actually, the important metric is the ratio of budget spent during code generation versus budget spent on expensive tokens, like GPT 5.4 Pro or DeepThink from Gemini, checking during PR reviews.
A
Yeah, totally. I noticed in your chart you didn't have any review tools. Do you just use, let's say, Claude Code to review, or do you have another set of review tools, like the Greptiles, the CodeRabbits? Devin also has a review tool. I don't know if you've tried those specialist review tools.
B
You're stepping on my toes a little bit right now, because in the graphs I was only showing public tools. I haven't found a good PR review tool that does what I think should be done, and partially, my thinking is, that's because it goes against both what people emotionally prefer and, frankly, even the business models that these companies run. At PR review time, you want to run the largest models. That means Codex or Claude Code is not going to cut it; you need Pro-level models if you really want to stem the tide of bugs going into production. And you need to spend a lot of time with the models taking turns, but you don't want a big swarm of agents. So in fact you end up in a different, dualistic world where you generate not that many tokens, in fact few tokens, but it takes a long time, because these are expensive models taking turns rather than many, many agents trying to do many things in parallel. That's why I feel like I haven't found good tools, so we are using our own for PR review for now.
A
Yeah, yeah. I mean, I think a lot of companies are building their own, especially to their needs, right? You also have a chart here, going back to the slides, on PR merge growth, where we're now at 30% month on month rather than 10%, and the estimated complexity is going up. This is productivity, because presumably there's more stuff going into the code base and more features getting worked on. I'm curious about the backlog. I actually don't mind a Pro-level model taking an hour or two hours to review my PR, because I've dealt with humans who take a week to review my PR, right? And I keep pinging them on Slack: hey, hey, review my PR. So I think there's some trade-off here where it still does make sense.
B
Exactly. That's exactly my point. On one hand, you can tolerate longer latencies at PR. On the other hand, right now the real problem is not spending time waiting for the PR. The real problem is that, since there's so much more code, the probability of at least some tests failing goes up, and then you have to find the offending PR, evict it, and retest without that PR, so the deployment cycle becomes much longer. So in terms of the overall time to deploy, it's a total time savings if you spend more time on a larger model, say thinking for an hour, because then you don't have to spend all that time during testing and rolling back the deployment.
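The "find the offending PR and evict it" step is essentially a bisection over the failing merge batch. A hypothetical sketch, assuming exactly one offender per batch and a `run_tests` callback standing in for CI:

```python
from typing import Callable, List

def find_offender(prs: List[str], run_tests: Callable[[List[str]], bool]) -> str:
    """Binary-search a failing merge batch for the one PR that breaks tests.

    Assumes exactly one offender: tests fail iff the offender is included.
    Each probe is a full (expensive) CI run, which is why so much deploy
    time disappears into this loop.
    """
    assert not run_tests(prs), "batch should be failing to begin with"
    lo, hi = 0, len(prs)            # offender index lies in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_tests(prs[:mid]):    # prefix passes: offender comes later
            lo = mid
        else:                       # prefix fails: offender is in the prefix
            hi = mid
    return prs[lo]
```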
A
Yeah, totally. That's still worth it. Don't look at the individual, look at the aggregate and look at the change in the aggregate system.
B
Exactly.
A
I'm kind of curious if this PR mentality and the CI/CD paradigm will be changed eventually. Obviously a lot of people want a new GitHub, but I even wonder if Git is the problem, right? Is that the bottleneck? Is the concept of a PR a bottleneck? Do you guys use stacked diffs? I don't know if it's like a merge queue, stacked-diff type of thing.
B
We use stacked PRs; we use Graphite, and we've worked with Graphite a lot. I think the overall CI/CD, and interaction with the code repository in general, is clearly the main issue and the bottleneck for us right now, and highest top of mind, I would say. We probably need a different metaphor, or a different whole design of how to process it in the new agentic world. I haven't seen anything dramatically better yet. I think everybody right now is just trying to keep their head above water, because there are so many PRs, and everybody's CI/CD pipelines start creaking: the times are increasing, the number of bugs slipping by is increasing, and you have to clamp down. So we are a little bit in this situation where we need to first stabilize that story and then start thinking, hey, what could be completely different in the new world? I know some people are working on that; I haven't seen anything super compelling yet. But clearly the old things that were designed for humans will need to be morphed into something new.
A
One other thing that I think about is that the merge conflict is basically a global mutex on the whole system, right? And in human organizations we do have something like that: it's the company standup. But other than that, it's actually fitting for us to be somewhat decentralized, somewhat plugged into one stream of information, but somewhat lossy. It's okay; not every delivery needs atomic consistency. We're not always dealing with a database.
B
This is a very good point, because since humans don't write code too fast, that global mutex is not too bad. Once you start writing code at the speed of machines, it becomes the bottleneck. Then what do you do? Maybe, and I can't believe I'm saying this, because I'm a lifelong opponent of microservices and I always thought they were a really bad idea, but now that you're saying it, maybe microservices will make a comeback, because then you can ship tiny things independently, and managing all that complexity automatically will be much easier. I don't know. We'll have to see.
A
Yeah, I mean, I don't know what the Microsoft or Shopify thing is, but I read this paper from Google where they have a monorepo that deploys into microservices, right? And then the other concept that I think about a lot is the Chaos Monkey concept from Netflix: being able to create this robust system where you have service discovery, you have the independent microservices, and there's probably going to be a fair amount of duplication; that's how an organic system scales. You have that... I don't know what you call it: slack, robustness, redundancy. These are not exactly the terms I'm looking for, but I can't really think of the words. Okay, I was going to go into Tangent and Tangle. So we sort of discussed the overall stats that Shopify has, but I think some pretty cool stuff that you guys are working on is your ML experimentation and your sort of auto-research training pipeline. Presumably you're much closer to this one because it's a sort of personal hobby of yours. How would you explain them together? I thought we had a slide that has the system diagram.
B
Yeah, Tangle first, and then Tangent as a thing on top of Tangle. And Tangle is, I'd claim, a third generation of systems for running any data processing, with a skew toward ML experiments, but really any sort of data processing task where you need to iterate and share, and you have scale, so you want maximum efficiency. You know how you would normally work: imagine you're a data scientist or an ML practitioner. You would get Jupyter notebooks, or maybe your Python scripts, and you would munch the data, and you produce those TSV files and put them in some JFS or something. Then you notice that, oh, it has these weird missing values; you go and write another script that replaces them with dashes. Then, oh, I need to filter bots, so you run some LightGBM model that removes the bots. And then you kind of get it into shape and you start experimenting, and you run multiple experiments, and then you're like, oh my God, this experiment is worse. You undo, and you cannot get back to the previous result. What did I do back then? You finally get everything working, then start throwing it over the fence to production; you replicate it, and those things don't work. Sometimes you don't notice that you forgot some feature naming and the features don't match. But then imagine you did everything, and six months later you have to repeat it, because now there's more data or you want to do another pass, and you're like, what did I do? This script crashes now, or the path has changed, and you spend another month just doing digital archaeology on your own history, right? Now multiply that by many, many teams. Now imagine you got an intern that you want to ramp up. Now you have to show that intern: oh look, here's the folder, there are the scripts, ask your Claude agent to figure it out. And then the agent does something, and you're like, ah, right, right, it was the wrong folder, I forgot to tell you; I actually had this other thing I forgot myself. And that's the daily life, we all know it, if you're a data scientist, a machine learning practitioner, or really any data-managing person.
A
Yeah, so I used to do this on the quant finance side in my hedge fund. We did this before Airflow, and then obviously Airflow came along, and then more recently Dagster, I would say, is in my mind what I would use for that shape of problem, where you have to materialize assets and create a pipeline.
B
And that's a very good segue, because Airflow is great, but Airflow is more about: you have something and you want to repeatedly run it in production on schedule. It's less about you as a team developing things and being able to share, and you grabbing the standard pipeline and saying, hey, I want to change this tiny little component in a huge sea of data processing, and I want to run 10 experiments on this, and I want to do hyperparameter optimization. All that is very hard to do with Airflow. It's very easy to do with Tangle. Tangle is everything about a group of people, and it might be agents too nowadays, running experiments cheaply, collaborating, sharing results. You don't need to understand things fully: you grab, you clone somebody else's experiment or somebody else's pipeline, change a small piece, run it, get it to production state, and then ship in one click. You don't have to port it into any other system to run in production; you can just run the same experiment, and it's fully production ready. And, as I said, it's a third-generation system. The original one, at least in my career, was Aether, which was the first to pioneer this type of approach. Then there was Nirvana at Yandex, which did a second take on this. And now this one aggregates the learnings from all of those, and Airflow as well, to get to a state where, when you try it, it feels kind of magical, because now everything is based on content hashes. Even if the version changed, if the output didn't change, nothing is rerun. It's very efficient: if multiple people start experiments that need the same sort of data preprocessing, it's not repeated multiple times; it's automatically done only once. If you start 10 experiments that all require some data preparation as the first step, you don't have to coordinate that; you don't even have to know that other people are starting it. There's very easy composability, any language you want to use, and it's very visual, so you can see it immediately, edit it easily, assemble small things with even just mouse clicks if you want, and share and clone. Also, it's deterministic, in the sense that when you rerun it a second time it will have exactly the same results; you will never have to do digital archaeology. Full versioning and everything is there as well.
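A minimal sketch of the content-hash idea, assuming a shared cache directory; in Tangle this would presumably be a shared service, and the function names here are illustrative:

```python
import hashlib
import json
import os

CACHE_DIR = ".step_cache"  # stand-in for a shared, org-wide cache service

def content_hash(*parts: str) -> str:
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()

def run_step(step_code: str, input_hashes: list[str], run) -> tuple[str, str]:
    """Run a pipeline step, keyed by the hash of its code plus its inputs.

    If anyone (any team, any experiment) already ran this exact step on
    these exact inputs, the cached result is returned instead of rerunning.
    """
    key = content_hash(step_code, *input_hashes)
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            cached = json.load(f)
        return cached["output"], cached["output_hash"]
    output = run()  # actually do the (expensive) work
    out_hash = content_hash(output)
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w") as f:
        json.dump({"output": output, "output_hash": out_hash}, f)
    return output, out_hash
```

Because downstream steps key off the output hash rather than the upstream code version, a new version of a step that happens to produce byte-identical output triggers no downstream reruns, which is the "version changed but output didn't" property described above.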
A
So people can... it's open source, go to the GitHub repo and check it out, and there's also a really good blog post about it. I think all this is really appealing. The thing that sells me the most is that development-to-production transition, right? A lot of people haven't really solved that strictly: we develop really, really well in Python notebooks, but then there's obviously not a production-ready process, so any way in which that is solved is very appealing. Then the other thing that you mentioned, which also raised my eyebrows, was content-based caching, which, as you mentioned, is very much an efficiency measure: recalculating only based on content addressing, which I think makes sense. It surprised me that the savings could be this much, but maybe I just haven't worked at your scale, where there's so much duplication that people just rerun because they changed a single ID upstream.
B
Yeah, but it's not only that you rerun. The main savings come from the fact that you run it, you got your job done, and you moved on, and then somebody else in some department you didn't know existed runs the same task, but on a newer version. Right now, in most organizations, you can't even find out about that, so you can't even measure that you're spending that time twice. Here, if everybody's in Tangle, that's detected automatically, and it's detected that the output is the same. For that person, all it looks like is that their experiment suddenly jumped forward, right? There's a network effect of multiple people helping each other. Yeah.
A
This is one of those things where it's designed to be a platform from the beginning rather than an individual developer's tool from the beginning, right? And everything streams down from there. That is the sort of Tangle orchestrator, and it manages jobs. We've seen a few versions of this, and this is obviously the unique approach that you guys have figured out. And then there's Tangent.
B
Yeah, and Tangent is basically an automatic auto-research loop that can help and effectively kind of do your work for you. Andrej Karpathy recently popularized it with auto-research; remember, he said he was speedrunning this. You know the story here. We're basically bringing the same capability into Tangle, so that Tangent, which is just an agent, can run multiple experiments, figure out what can be changed, and keep rerunning, keep modifying, until it maximizes some goal, some loss function, whatever you need to achieve. In general I would say, if you're not using an auto-research-like approach in whatever you do, literally whatever you do, then you're missing out. We saw it take off at Shopify like wildfire: anything where you can put measurements on it can be done dramatically better. Our HTML templatization, a completely new templatization reducing latency for Liquid themes. Our search: we recently moved from, it's hard to even quote, from 800 QPS to 4,200 QPS with the same quality, just by pure optimizations from an auto-research loop that kept running and changing code in our index server, on the same number of machines, just increasing the throughput. We managed to improve the quality of gisting in our machine learning processes; gisting is the prompt compression technique that allows for lower latency and actually slightly higher quality. So literally all different walks of life, and it doesn't have to be AI related. We had a reduction in storage, because the agents would go and find datasets that are clearly derivative, so you don't need to store things twice. We found, somewhat embarrassingly, that one of the largest tables was hashing random IDs into another random ID, literally translating one random ID into another.
A
It has access to the code as well? So you can check, like, what the hell is it doing?
B
So it can be run at two levels. At the superficial level, it can just use existing components and reshuffle them: you can grab XGBoost, you can grab some PyTorch module, you can grab some graph and other tools, and combine them. At a deeper level, since Tangle is all CLI-based underneath, every component is really a wrapped CLI call and a YAML file, it can analyze code and create new components and keep on iterating as well. So you can both have quick modifications of existing pipelines, with components that are already there, pre-baked, or you can create new components and keep iterating on those. Auto-research is probably the thing I've been most excited about in the last two months, and we see it taking off like wildfire: every day, every minute, I'd get a Slack message from somebody saying, oh, look how much better I made it. And it's all through auto-research.
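A bare-bones sketch of the loop in the spirit of what Tangent is described as doing; `propose` would be an LLM call editing a component's CLI/YAML spec and `run_and_score` a full pipeline run. Both are assumptions here, not Shopify's API:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Candidate:
    spec: str     # e.g. the pipeline's component definitions (CLI call + YAML)
    summary: str  # what was changed, fed back to the agent as context

def auto_research(
    start: Candidate,
    run_and_score: Callable[[Candidate], float],  # run the pipeline, return metric
    propose: Callable[[Candidate, List[Tuple[str, float]]], Candidate],  # LLM edit
    budget: int,
) -> Tuple[Candidate, float]:
    best, best_score = start, run_and_score(start)
    history: List[Tuple[str, float]] = [(start.summary, best_score)]
    for _ in range(budget):
        cand = propose(best, history)           # agent proposes one modification
        score = run_and_score(cand)             # expensive: a real experiment run
        history.append((cand.summary, score))   # results inform the next proposal
        if score > best_score:                  # hill-climb on the goal metric
            best, best_score = cand, score
    return best, best_score
```

Most proposals lose; the win is that the machines absorb the failed runs for the price of electricity.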
A
Is this democratized in some way, in the sense that, is it your ML engineers and researchers doing this, or do your regular PMs and software engineers also have the ability to use Tangent?
B
This is an awesome question. Tangle in general, and Tangent in particular, are extremely democratizing. They are the main tools for...
A
Because you don't need to know the details, exactly.
B
Initially they were used by ML and AI engineers. But then, literally, as you said, PMs... the highest user right now is one of the PMs in our org, Sartage. He was number one by usage of this, because he's just energetic and knowledgeable. And now it unlocks a lot of capability, where you don't have to change code manually.
A
Because it kind of cuts out the ML engineer from the process. The PMs have the domain knowledge and the ability to think, from first principles, about: okay, what results do I want? And they even have access to the data that needs to go in. So in some ways this is the magic black box that we've always wanted for training and for, I guess, hill climbing or whatever.
B
It's basically no-code for your AI development situation right now. You don't have to know exactly how the algorithms work. You can just bring your domain knowledge and expertise and product knowledge and iterate within Tangent until you've gotten the results that you need.
A
In my previous roles, every time someone pitched AutoML, you know, I was always like, this is not going to work; it's always going to be a flop. Somehow it's working now. I mean, presumably the answer is that now we have LLMs and they're good enough, right? It's an emergent property that we can do all the research. But it doesn't feel that satisfying. How come we didn't do this before? We just did parameter search, and, I don't know, maybe that's it.
B
Yeah, Bayesian optimization and hyperparameter optimization was the one facet of AutoML that was used very actively, which, incidentally, is also built into Tangle. But I know Patrice Simard very well, and he was such a proponent of AutoML; he literally spent his career trying to democratize it. Without LLMs, it just turned out to be very hard: you would have flexibility within a certain narrow domain, but it was hard to scale wider. And now, with LLMs, suddenly it's like a magic wand, and suddenly it's everywhere.
A
Yeah, I think it's multiple things. I'm just going to bring up the chart again. LLMs can do the monitoring very well; it's potentially very unbounded, super unstructured; they can do the analysis. Basically, it's much more intelligence poured into every single step. There's maybe nothing structurally changed about AutoML; this is just more intelligent and more unstructured.
B
Exactly.
A
Any flaws that you've run into? Everyone is drinking the Kool-Aid: oh my God, time savings, performance improvements. What issues have come up?
B
It is really cool, but it's not a solution to all the world's problems, for sure. The limitations... and this is where we get into a bit of subjective territory. I can only share what I have seen so far, and I'm sure the situation is changing, and maybe after I say this, many people will reach out and say, hey, what about this, you didn't know that; and they'll probably be right. But what I've seen is that auto-research is very good at doing kind of obvious things that you don't have the bandwidth to do, or you didn't notice, or maybe you're not aware of some standard practices. It is not good at doing something completely out of distribution, something where you have to think for multiple days; none of that. I set up an experiment once on my hobby project and let it run for what ended up being several weeks; it's full production kind of scale, so slow runs. In the end it performed over 400 experiments, and only one was successful. I'm like, okay, that's good. But...
A
But it saved time.
B
Yeah, it saved time. If I were doing 400 experiments myself, my batting average, as I said, would have been much higher, I'm sure. But first of all, it would take me like three years to do 400 experiments, and I didn't have to do them; the machines did that for just the price of electricity. And I got one improvement. Honestly, when I was starting that experiment, my thinking was to go and show: hey, Andrej, maybe you just don't know how to optimize. And I was super smart, because my problem had been optimized for many years and was fully improved, and I didn't expect auto-research to find anything at all. Yet it did. So instead of making fun of Andrej, I ended up a big supporter. Yeah, that's exactly the tweet.
A
You and Toby really go back and forth online a lot, which is really funny. Think of it as an eval for the optimality of the code it's running on. It almost reminds me of a Kolmogorov complexity thing: there's some optimal thing that you're trying to reduce down to, I guess. So you should congratulate yourself that you had 99% optimality.
B
Exactly. I think Andrej really deserves a lot of credit for popularizing this approach. It is, I think, incredibly powerful and cool, and even him just mentioning it led to a lot of gains in a lot of places in the industry. So we should be thankful.
A
Yeah, I think he also has... I don't know what it is. It is a simple, self-contained project that people can take and apply to other things, which is one thing. But also just the name: somehow no one else managed to call their thing auto-research, and naming things is very important. I think that is mostly our coverage of Tangle and Tangent. Obviously there's a lot of ML infra at Shopify that people can dive into. We're about to go into SimGym, but before I do that, any other broader comments around this whole effort? Where is it leading?
B
As a segue to SimGym: all those things start composing strongly, and you can see a huge unlock. You can look at each one of the tools and see, oh, they're extremely useful. Tangle is useful by itself. Auto-research is useful by itself. SimGym is useful by itself. If you combine all three, you create a synergistic effect. I think that's why we wanted to cover them today: this is something that, if you go back even five years, would have been unthinkable. Replicating it would be either incredibly costly or impossible; probably thousands of people required.
A
Well, we have serverless humans... serverless intelligence, right? So yes, you do have thousands of intelligences, just not humans, and that's close enough, right? Even if they're not AGI, they're close enough to do the task that you need them to do, and there's plenty for a lot of routine knowledge work. Okay, let's get into SimGym. This is one of those things I was surprised to see, actually; it's apparently one of your most popular launches. I think Sim AI, and Joon Sung Park, who did the Smallville thing: there's a very small cottage industry of people trying to do the simulated-customer thing. A lot of people maybe don't super trust this yet, because they're like, well, obviously they would just do what you prompt them to do, right? But maybe just tell us about the sort of inspiration or origin story.
B
That's exactly the thing I wanted to cover, actually. Because if you don't have the historical data, all you can do is prompt agents in a vacuum, and they will do exactly what you prompt them to do. In fact, when I first proposed it, and this is a bit of my brainchild initially, if I can boast, Toby said: but wouldn't they just repeat what you tell them? And I'm like: yes, except Shopify has decades of history of how people made changes and what that resulted in, in terms of sales. Now, it's noisy data. There's usually a small website, and things are never in isolation; it's almost never a clean A/B experiment. Basically, at different times you run two different things. But if you aggregate everything together and you apply denoising and a collaborative-filtering-like approach, you can extract a very clear signal, and then you can optimize your agents. And that's why it took so long. It took almost a year of that optimization, of just us sitting and fiddling. We had this internal goal of hitting 0.7 correlation with add-to-cart events, for example: if we run a real A/B test experiment, the simulation should replicate the same sort of success that humans had, or lack thereof. And it took forever. And I don't think it's easily replicable, because who else would have that data? You have to have those historical decades' worth of data. The other thing you need is infrastructure and scale, because again, to get stat-sig results you need to run a lot of simulations, a lot of agents, and those are expensive: you're taking actions in the browser, because you want real friction. You want to be able to get the image of what humans will see, because you want to detect effects like: hey, if I make my images larger, will I have more sales or fewer sales? People's intuition here, by the way, is: I increase my images, I'll have more sales, because they look nicer. Designers all love sparse layouts and big images. Usually your sales tank, right? But from the HTML, all the characters look the same; only the size tag looks different. So you have to take visual information, you have to run this in a simulated browser environment on a big farm, and of course you have to have a very, very expensive, good multimodal model. All this is what's taken so long. And to share my personal fail a little bit there, Sean: we always had this large-company bias. Whenever we do something, we're like, hey, we would run an experiment, right? We make a change, we run an experiment, and then see which one's better. Oh, this one's worse, and most of them are worse, so you discard it and keep iterating, hill climbing. And we're like, oh, smaller merchants cannot get stat-sig results; they cannot really run experiments, simply because in a week there wouldn't be enough data for them. What we didn't realize, thinking from this perspective, is that most people don't have an A and a B. They just have one thing, and they need suggestions for what A and B should be. So we first built this as: we run a simulation on two separate variants and say, hey, which one is better? We then morphed it, and very recently just released it.
When you have just your site, your theme, we run over it and we say: hey, here's what the predicted values of conversions are, and here's how we think you should modify it to increase your conversions. And then, circling back to what you started with: the proof is in the pudding. If we were not correlating with reality, people would not be using it, and thankfully we see, literally every day, more usage than the previous day. So right now my problem is how to pay for it all. Our major focus is how to optimize the LLMs, do distillation, and run the headless and headful browsers more cheaply, so that we can accommodate the increase in traffic.
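One way to picture the 0.7 target: correlate simulated lifts against measured lifts across a set of historical experiments. A hypothetical evaluation harness (the `Experiment` fields and the `simulated_lift` stub are illustrative, not Shopify's code):

```python
from dataclasses import dataclass
from statistics import correlation  # Pearson's r, Python 3.10+

@dataclass
class Experiment:
    variant_a: str        # e.g. a rendered storefront snapshot
    variant_b: str
    measured_lift: float  # real add-to-cart lift from the historical A/B data

def simulated_lift(variant_a: str, variant_b: str, n_agents: int) -> float:
    """Run n_agents through both variants in headless browsers and return
    add_to_cart_rate(b) - add_to_cart_rate(a). Stub for the real agent farm."""
    raise NotImplementedError

def eval_simulator(experiments: list[Experiment], n_agents: int = 1000) -> float:
    sim = [simulated_lift(e.variant_a, e.variant_b, n_agents) for e in experiments]
    real = [e.measured_lift for e in experiments]
    return correlation(sim, real)  # internal goal described above: r >= 0.7
```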
A
Yeah, I understand that you published a lot of technical detail at GTC, so I was just going to bring it up a little bit. Was this in conjunction with some kind of GTC presentation or something like that?
B
Right. Well, yeah, we did it in several places, but yeah, we had the engineering blog as well.
A
Yeah. So you're running GPT-OSS.
B
That's an older version; now we run a multimodal model. But yeah, we still run GPT-OSS as well.
A
And then you have the VMs, and you also have Browserbase. I really liked this one, where you said it violates almost every assumption that standard LLM serving is designed for, and then you had basically orders-of-magnitude differences between everything.
B
Exactly. Which was, you know, a bit of a challenge to implement, since it violates all the assumptions. For example, Multi-Instance GPU, MIG, doesn't work as well, but we needed to get MIG to work, because otherwise it's way too expensive. So we had to deal with lots of infrastructure and work with Fireworks and CentML to help with the optimizations, and Browserbase, as you mentioned. Yeah, it takes a village.
A
Okay, so there's a lot of, I guess, experimentation in the infrastructure so far, and you've published more or less what you have here. I guess I'm less familiar with CentML; I don't do that much work in this part of the stack. Why was it the preferred inference platform?
B
There used to be really three top companies, at least that I was aware of, that did LLM optimization: Together, Fireworks, and CentML, not necessarily in that order. CentML recently got acquired by Nvidia. What they do is: if you have a model and you want to optimize it for a specific profile of usage, they go and do it. And we work with those companies. This was work particularly with CentML and Nvidia to get the best possible results out of it. And sometimes you have to retune, depending: sometimes you want maximum throughput, sometimes you want minimal latency, sometimes you want the cheapest, or some combination. So yeah, these are people who come and help you with that.
A
I see. Yeah, yeah, I'm familiar with these people for the LLM autoregressive stack. But the other interesting category of these optimizers is the diffusion people, where fal and, recently, Pruna have come up a lot as well, which I think is really underappreciated, at least by myself, because I thought, oh, all the workload would be LLMs. But actually there's a lot of diffusion as well.
B
Exactly.
A
There's a lot here, so it's hard to cover, but I do think people underappreciate the importance of customer simulation. This is something that I'm candidly still coming to terms with. Your team also prepared this really nice diagram. I assume this is AI generated. Yeah, maybe it's not.
B
It looks Gemini-ish, yeah. Obviously I don't know how they generated it; it looks like it's Google. But the interesting part, Sean, that we haven't covered but I wanted to mention, is that if your store had previous customers, rather than being a new store or a new merchant just launching things, it helps tremendously with the correlation of the forecast. We take your previous customers' behavior, and we create agents that replicate the specific distribution of customers that you get, and then we apply those to your changes, and that raises the correlation with the add-to-cart events, or with conversion, or whatever it may be, quite dramatically. So replicating humans in general seems like an interesting, cool challenge.
A
As a shareholder, I think... if people are Shopify shareholders, they should really deeply understand this, because this is basically the moat. The more you use Shopify, the more it will just automatically improve, right? You're doing the job for them.
B
Yeah, that's what we started with. Otherwise, if you're just a startup... I wouldn't do it if it was my startup, because without the data, as you said, it's exactly the case that whatever you say in the prompt is what the agents will be doing.
A
The statistician in me wants to really satisfy the statistical intuition. The word that comes to mind is ergodicity. So let's say a customer takes this path, a customer takes this path, a customer takes this path, right? In my mind, the way I explain it is: okay, here's the 95th percentile, here's the 5th percentile, and here's the median. But what SimGym is potentially doing is modeling the in-between journeys as well, which may depend on the previous states. This may be a very RL type of conclusion, where basically, if you only did naive A/B testing, you would only have summary statistics at a certain point, and you would only judge based on those overall summary statistics. But here you can actually model trajectories. Does that make sense?
B
That makes total sense. Well, it makes even more sense than maybe you even realize, because...
A
Okay, please.
B
Yeah, so internally we have this system; we talked about it briefly once at NeurIPS. We have a huge HSTU-based system that models whole companies and their possible paths, like what you are showing. At any point in time, you can either model the user's behavior, or you can also think about the whole merchant, the company, as the entity that acts in the world; you can model that as well. And then you can do counterfactuals in your graph, like in your blue graph. Imagine that in the center there, somewhere in the middle, you have an intervention: I gave that person a coupon, or, I don't know, I sent a personal thank-you card, or gave a discount somewhere. And then you can do forward rollouts from that counterfactual: what would have happened with that intervention, or without the intervention? And you can even change where in time that intervention happens, right, along this journey. So we do this at Shopify scale for our merchants. And then, if we notice something they could be fixing, like there's a strong counterfactual, we have Shopify Pulse: they basically get a notification like, hey, we think something is wrong with your, I don't know, Canadian sales, it looks like it's misconfigured, here's what you need to do. Or: we think you should set up this campaign with these parameters. And we do that at the buyer level too, to literally offer discounts or cash back or things to buyers. So I'm getting very excited; this is my sort of area of interest, I guess, and hobby. But being able to model something as complex as human beings or companies, and to model counterfactuals on it, where you can have interventions in the future and optimize when to make the intervention and what kind of intervention to make: it's such an unlock that was previously completely impossible. It was always dreamed of, but how would you even simulate it without LLMs or HSTUs? Very, very exciting times.
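A sketch of the counterfactual-rollout idea under loud assumptions: `model.rollout` stands in for an HSTU-style sequence model that continues an event history, and the event and outcome fields are hypothetical:

```python
from statistics import mean

def uplift(model, history: list, intervention, horizon: int, n: int = 200) -> float:
    """Estimated effect of inserting `intervention` (say, a coupon event) at
    the current point of `history`, via Monte Carlo forward rollouts."""
    with_iv = [model.rollout(history + [intervention], horizon).total_sales
               for _ in range(n)]
    without = [model.rollout(history, horizon).total_sales
               for _ in range(n)]
    return mean(with_iv) - mean(without)

def best_time_to_intervene(model, history: list, intervention, horizon: int) -> int:
    """Sweep *when* the intervention happens and pick the best point."""
    return max(range(len(history) + 1),
               key=lambda t: uplift(model, history[:t], intervention, horizon))
```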
A
I just wanted to illustrate this; I'm not the best illustrator, but I am a conceptual statistics guy. And you cannot just do this with an A/B test: this is a dimension that an A/B test doesn't have, because it doesn't have the change-over-time, stochastic nature, and it doesn't have the context, all the context up to this point. Okay, cool. That's SimGym. You're going to burn a lot of tokens on this thing, but you're one of the only at-scale platforms in the world that can do this across a huge variety of workloads, right? I'm even curious on a human-research level: does retail behave differently from, say, clothing sales? Does that behave differently from electronics sales? I don't know what else you guys see; do the Kardashian shoppers differ from people who buy, I don't know, cars and whatever?
B
Very different: different sensitivities, different modes of shopping, different levels of what's important. So, totally. You can do aggregations at the store level; you can do aggregations at a category level. For the statisticians among us, I couldn't believe this, but recently we were looking at it and we had to bring back CRPs, Chinese restaurant processes. It's a way of aggregating and naturally growing clustering, specifically to answer questions like the one you were just posing, about whether buyers behave differently across categories. And I'm like, I haven't seen a CRP since 2001.
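For readers who haven't met it: the Chinese restaurant process assigns each new customer to an existing cluster with probability proportional to that cluster's size, or to a brand-new cluster with probability proportional to a concentration parameter alpha, so the number of clusters grows naturally with the data. A minimal sampler:

```python
import random

def crp_assignments(n_customers: int, alpha: float = 1.0) -> list[int]:
    """Sample cluster assignments from a Chinese restaurant process."""
    counts: list[int] = []   # counts[k] = customers already in cluster k
    seats: list[int] = []    # assigned cluster index for each customer
    for i in range(n_customers):
        # Customer i joins cluster k with prob counts[k] / (i + alpha),
        # or opens a new cluster with prob alpha / (i + alpha).
        r = random.uniform(0, i + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[k] += 1
                seats.append(k)
                break
        else:
            counts.append(1)
            seats.append(len(counts) - 1)
    return seats
```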
A
So what is it? No, I haven't seen this. This is not in my training.
B
It was actually a very popular theory, popular in NeurIPS and ICML circles in the early 2000s. Kind of nice. And now it has practical applications that we're resurrecting.
A
Yeah. Amazing. I can see how this is a fun job for you, where you get to apply all these things. Super cool. So, okay, anyone who knows what CRPs are and has always wanted to use them at work should definitely join Shopify. Okay. We have a lot, but I'm being mindful of the time, and I did want to cover some other things. I'll give you a choice: UCP or Liquid?
B
Liquid. UCP is very important for us, but on UCP we have structured discussions you can read about, and we have blog posts, and we have a big release this week, in fact, with our catalog.
A
Okay, yeah, but I mean, we can discuss the release briefly, because we'll publish this after it's already announced. So, whatever: there's a catalog that you guys are doing.
B
Yeah, so we are bringing the capabilities of the whole Shopify catalog into UCP. Basically, you can search for products, you can do lookups by a specific ID, you can do bulk lookups when you need to bring in multiple products. You don't need to know in advance what you are trying to show, sell, or check out; you can now have this decided at runtime. This is a big area of investment for us, for both non-personalized and personalized searches: trying to provide basically a window into the whole universe of products being sold everywhere in the world, and Shopify is, not exactly, but almost a superset of anything being sold. Now we are bringing that into UCP. And identity linking is another big thing for us, so that you can use Google or whatever identity you have, minimizing friction. So yeah, big release for us. But Liquid AI, of course, we never talk about, and it might be more aligned with what we discussed previously in this chat. Sure.
A
The main thing that everyone understands about Liquid is that it was inspired by a worm, and I still don't know why; I'm curious about your explanation. I think you can make things very approachable. And also, what is the potential level of efficiency that you get out of Liquid?
B
We're all familiar with transformer architectures, and for the longest time there was a competing architecture called state space models, SSMs. Chris Ré is one of the pioneers, and there were lots of startups trying to make those a reality. They have significant benefits, the main ones being much faster, with a lower footprint, and not quadratic in length, sort of linear in your context length. But state space models never quite made it. They're used, they have certain niches where they thrive, their hybrid architectures are useful, but they never quite made it. And liquid neural networks, you can think of as the next step, sort of state space models squared. It's a non-transformer architecture that's more complicated than state space and really difficult to code, if I'm being honest. But it's very efficient: it's sub-quadratic in the length of your context, and it's a very compact way to represent things. And that's the Liquid AI company; their goal is to productize it. Very often you have this need where you want long context and a small model with low latency. In general it's basically on par with transformers, and if you do hybrids with transformers, it's even better. That's why we at Shopify, when we tried multiple models, and we constantly try multiple models from multiple companies, found that for small, particularly low-latency applications, when you need low latency and/or longer context lengths, Liquid was the best. We still use the whole zoo, and we obviously always test and use every open source model, and sometimes it feels like even every private model. But Liquid's been taking quite a bit of, at least, internal Shopify share. And the reason I'm excited is because it's the only non-transformer architecture that I've found to be genuinely competitive. We use it for search, for long-context Pulse distilling, and others. That's the overview; I don't know how approachable that was. Sorry, maybe it's still too obtuse.
A
I mean, I think they haven't been that open about their implementation details. I haven't read a formal paper on the implementation, but I did get the relationship between the SSMs and the others. This was one of the charts showing the relationship between full attention and something that's more like an RNN type, in terms of their efficiency. And then the other chart was this old one that compares it against some of the other models; it doesn't exactly have the correct Y axis, but close enough that you can see it's basically a step-change difference in efficiency. I think the surprise to me was that you guys are actively using it already, internally, inside of Shopify. And I'm curious, what are the constraints that you're optimizing for? When you say smaller, is it like the 1B size? What kind of latency constraint are you optimizing for? What kind of context-length considerations? For example, in the audio kind of use cases, SSMs effectively have unbounded context length, because they just have to operate on a sliding window of the most recent stuff. I'm just kind of curious: where do you see the potential here?
B
Yeah, SSMs, because the state embeds all the previous information needed, or that's the assumption, effectively have infinite context length. The problem with them is that the expressiveness is not there. The Liquid models are effectively souped-up SSMs: much more expressive, and, again, more complicated to code. There's a paper on it you can look at: a differential equation rolled out and then computed, really, as a convolution. It's a bit involved. The thing where we use it is, specifically, either where we need super low latency; that was a very fun project with CentML and Liquid AI themselves. We run it at 30 milliseconds: a tiny model, like 300 million parameters, but we run it in 30 milliseconds end to end for search. When you type a query, we produce all the possible things you could mean by that query, not only synonyms but full query understanding, the whole tree of what you might need, including your personalization, because you might have made previous queries, and we lower it all down into the search server. The requirements on latency are obviously very strict, and we're able to run it under 30 milliseconds; Qwen doesn't run at that. And even with Liquid, we had to work a lot with Nvidia, because almost everything in CUDA, in the current stack, is not designed for low latency: small things that don't matter with large models start mattering a lot, and we had to optimize them. Then there's the different end of the spectrum, where it's maximum throughput, for things like offline categorization. When a new product appears, we need to do analysis: we need to assign where it is in the taxonomy, we need to extract and normalize attributes, we need to do clustering; oh, it's the same thing as that other merchant is selling, right? That is an almost unbounded amount of energy you need to spend, because it's a quadratic kind of problem and we have billions and billions of products. So you don't care about latency as much; it's kind of an overnight batch job, but you want maximum throughput. And in those cases you also sometimes need long context, like for Sidekick Pulse. These are models in maybe the 7 to 8 billion parameter range, where we take a large model, something huge, the largest we can find, and distill it into Liquid for specific tasks, such as our catalog formulation or Pulse, and then we run it at a very large scale in batch jobs. In that situation it very often beats Qwen. Kimi is more on the reasoning side, so Qwen, I would say, is probably their major alternative. That's when we use it. It's not a panacea; I wouldn't say it's a frontier model, in the sense that it's not going to suddenly compete with GPT 5.4. But it is a phenomenal target for distillation, which right now is becoming more and more important with the explosion of token usage.
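For reference, the "differential equation rolled out and computed as a convolution" is the standard state space trick from the published SSM literature (the S4 line of work); whether Liquid's proprietary variant matches it exactly is not public:

$$\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t)$$

Discretizing with step $\Delta$ gives a linear recurrence, which unrolls into a convolution with a precomputable kernel:

$$x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = C\,x_k \;\Longrightarrow\; y_k = \sum_{j=0}^{k} C\bar{A}^{\,j}\bar{B}\,u_{k-j}, \qquad \bar{K} = \bigl(C\bar{B},\, C\bar{A}\bar{B},\, \dots,\, C\bar{A}^{L-1}\bar{B}\bigr),\quad y = \bar{K} * u$$

The recurrent form is what gives a constant-size state (the "infinite context" property mentioned above), and the convolutional form is what makes training parallelizable.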
A
Is that a for-now thing, or do you think, if you give Liquid $100 billion, they will...? Is it just more scale, or what is limiting it? What prevents it from running into the same issues that SSMs had?
B
Their scale is already much larger than the largest SSM I'm aware of. SSMs were just not expressive enough, in my opinion. Again, I'm sure I'll get a lot of pushback, and probably deservedly so, but in my opinion SSMs are not expressive enough, and Liquid models are. I think, especially in their hybrid form, combined with a transformer in Mamba fashion, they're probably the best architecture I'm aware of, period. But of course, Liquid AI is not at the scale of Anthropic or Google or OpenAI in terms of compute. I think if they had a similar level of compute, they would be very competitive, and maybe even beat the largest models, at least from what I've seen. They don't have that level of investment, but they still have decent investment, and for this scenario of smaller models and distilling into them, they're definitely second to none. We are very omnivorous and purely merit-based, so the moment they stop being competitive, we will switch to something else, and we constantly test. But so far, if you look at the progression, if I draw a graph of our workloads on Liquid versus our workloads on, I would say, Qwen, which is another awesome model and probably the other kind of standard within Shopify, Liquid has definitely been taking share.
A
I think that's very promising, and probably the best explanation I've heard directly from someone involved with Liquid. I do have Maxime Labonne coming to my conference in London this week, so we'll hear more from him, because there was this Liquid investor day or something like a year or a year and a half ago, and I think there just wasn't that much technical detail that spoke to my crowd of potential customers and users, right? Which is fine; maybe we still need to wait for more results to come out. But I think it would be news to a lot of people that you guys are actually actively already using it for high-frequency use cases. I also wanted to highlight Sidekick Pulse, which we didn't cover and probably don't have time to cover, but it's something that you also launched recently: basically RecSys. The other RecSys trend I've been covering a lot, from the YouTube side, and even xAI's RecSys, has been LLM-based RecSys, which I think you are also effectively using Liquid models for. They are just throwing transformers at the problem, and maybe this is the hybrid architecture shift that will happen in order to accommodate the kind of long context and high efficiency that you need. I don't really have a strong opinion there, apart from highlighting that the work the LLM-based RecSys community is doing is also very interesting.
B
Yeah, again, the thing to get you excited is that it's not just LLMs looking at things; it's also the HSTU model doing that counterfactual analysis, where we model the whole enterprise as an entity and its actions, and then see what will happen.
A
Overall, I think this all presents an enormous... I think there was not that deep of an AI story to Shopify when it started; it was just a WordPress plugin, right? But now you are the storefront e-commerce guardians to so many people, and you're really applying all the AI methods and the state-of-the-art stuff. So our conversation today has really opened my eyes a lot. Thank you for doing this; this is a really amazing overview of what you're doing.
B
Thank you for saying that, Shannon. Thank you for having me. Of course. It's always a pleasure to talk to people who deeply, technically know what they're talking about.
A
Yeah. I mean, very few people are as technical as you, but at least I can somewhat vaguely follow along. So, okay, here's a hiring call: any particular roles you're looking for, where you're like, okay, if you know how to solve this problem, reach out?
B
Yeah, the things I would definitely call out: if you're an ML person or a data science person, we have a huge need for more people munching data, so to speak. Or, surprisingly, if you're a distributed-database person. We think there is a way to use LLMs to reimagine how we do distributed databases, and we're working a lot with Yugabyte there. So if you have interest in those areas, Shopify might be the best place in the world for you. It's a pretty good place for other disciplines as well.
A
Cool. I think that was all the questions I had. I have one sort of bonus thing, if you want to indulge in some Bing history. What are your takeaways, I guess, or any fun anecdotes about Sydney?
B
Any fun anecdotes about Sydney?
A
Well, yeah, it was very interesting. I think it woke people up to this personality that emerged.
B
The funny thing, I mean, the most interesting anecdote, is that Sydney was first shipped in India, and it was not noticed for a long time. And the first implementation of Sydney didn't even have an OpenAI model under it. It was the Megatron-Turing model, the Microsoft and Nvidia collaboration. Yeah, exactly, that's the one. People thought it was a prank, because not many people were familiar with LLMs at that point, and they didn't think it could be automatic. They were even complaining: oh, this chatbot is gaslighting me. What almost everybody doesn't fully realize is that it wasn't by accident that Sydney was Sydney. We spent a lot of effort on personality shaping. It was a bit of my Yandex legacy: previously we did the Alice digital assistant, where we learned the importance of personality shaping. So here we did a lot of personality shaping; it was not a fully emergent scenario. It was also a little bit edgy. What we learned in those experiments is you want to be polite, but you want to be a little bit on edge, and that draws people in. Ever since those days, I haven't seen anybody trying exactly that mode. I think we will see more of this at some point. But yeah, lots of good memories, you know. And by the way, the very first Sydney dev lead, Andrew McNamara, is working at Shopify and is the head of Sidekick and Pulse. Lots of these are actually...
A
Yeah.
B
In his purview.
A
Oh, okay. That's another fun fact. You're assembling the team again. Yeah, it's cool. I think a lot of people woke up to the idea of AI personality for the first time there. And I think now, with maybe OpenClaw explicitly prompting a fun personality, that is a real selling point for people, right? And then I guess the only other time AI personality really emerged into public consciousness was Golden Gate Claude. But yeah, hopefully someday we'll get Shopify Sydney.
B
Well, we have Sidekick. It's a little bit of a different thing.
A
Yeah, Sidekick was your original big AI launch. Yeah. Cool. Amazing. Thank you so much. You guys do amazing work. Honestly, if I were a Shopify customer or a Shopify investor, hearing all the work you're doing on the technical side would make me feel more confident in, like: okay, just choose Shopify, right? You are never going to do this in house, which is obviously what you want. But yeah, that's what an ideal platform is: you're doing all the things that no individual could do at their scale, but you can at yours. Very exciting problems.
B
Exactly, exactly. And creating a network effect. Hard to disagree: if you're not using Shopify, you should.
A
Yeah. Amazing. Okay, well, that's it. Thank you so much.
This episode dives deep into Shopify’s transformation into an "AI-first" engineering organization, featuring CTO Mikhail Parakhin. The conversation explores Shopify's internal and external deployment of large-scale AI tools, the 2025 "phase transition" in usage, and exclusive details about in-house systems like Tangle, Tangent (auto-research), and SimGym (simulated customer behavior). The discussion is highly technical, aimed at AI engineers and infrastructure leaders, and covers the evolving challenges and approaches to AI-powered productivity, internal infra, and product innovation at Shopify.
[02:29–06:54] Rapid Rise in Internal AI Tool Adoption
"It's hard not to do your job now without interacting deeply, at least with one tool... December was the phase transition when suddenly models gotten good enough that everything took off." — Mikhail Parakhin [03:20]
Token Budgets:
[04:47–06:35] Employees have an effectively unlimited Opus 4.6 token budget and are discouraged from using anything less powerful.
Quality vs. Quantity: Top percentile users' consumption is skewing upward — concern that a few might consume a majority of resources.
"If this rate of separation continues... there will be one person consuming all the tokens. Kind of strange." — Parakhin [06:43]
[08:03–10:25] Quantity vs. Quality in Token Consumption
"You can consume tokens and in fact the anti-pattern is running multiple agents... that's almost useless compared to just fewer agents and burns tokens very efficiently." — Parakhin [09:01]
"Good model writes code on average with fewer bugs than average human. But since they write so much more of it, more of it will make it into production." — Parakhin [10:10]
[12:33–17:10] The New Code, Review & Deployment Realities
"CI/CD interaction with the code repository right now is clearly the main issue and the bottleneck for us... everybody's pipelines start creaking, times are increasing, bugs slipping by." — Parakhin [15:23]
"Maybe... microservices will make a comeback because then you can ship things independently.... I'm a lifelong opponent of microservices... but maybe." — Parakhin [16:27]
[18:24–25:52] Data Science Collaboration & Reproducibility Platform
"It's full versioning... you will never have to do digital archaeology." — Parakhin [23:40]
[26:14–35:56] Automated Experimentation Agent
"If you're not using autoresearch-like approach in whatever you do, you're missing out. We saw it at Shopify taking off like wildfire." — Parakhin [26:50]
"...highest user right now is one of PMs... It unlocks capability where you don't have to change code manually..." — Parakhin [30:07]
[37:20–50:59] Simulated Customer Behavior for Merchants
Unique Value: Trained on decades of Shopify merchant/customer data; simulates not just "prompt-following" agents but real-world behaviors with strong correlation to actual conversion events.
Multi-Modal Simulation: Runs fully rendered browser environments, uses best-in-class (multimodal) models.
Overcoming Small Merchant Data Gaps: Provides actionable recommendations even without A/B testable traffic.
"If we are not correlating with reality, people will not be using it. And thankfully we see literally every day more usage than the previous day." — Parakhin [42:47]
Advanced Counterfactual Modeling: SimGym models customer/company behavior as rollouts, can encode interventions, and simulate long-term customer journey effects.
"Being able to model something complex as human beings or companies and model counterfactuals on it, where you can have interventions in the future and optimize, it's such an unlock that previously was completely impossible." — Parakhin [50:39]
[55:23–63:00] The Non-Transformer Path
Why Liquid (vs Transformers or SSMs):
"Liquid neural networks... is very efficient. It's sub-quadratic in length of your context. It's the only non transformer architecture that I found being Genuinely competitive." — Parakhin [57:36]
Production Use Cases: Real-world Shopify workloads for long context and low-latency search, offline categorization, etc.
Hybrid Models Likely Future: Liquid (and hybrids with Transformer) seen as most promising for distillation and efficient serving.
"If they had similar level of compute, they would be very competitive and maybe even beat the largest models, at least from what I've seen." — Parakhin [63:14]
[66:39–67:44] Strategic Vision
"...really applying all the AI methods and the state of the art stuff... there was not that deep of an AI story to Shopify when it started. But now you are the storefront e commerce guardians to so many." — Host [66:39]
On AI Tool Explosion:
"It's hard not to do your job now without interacting deeply, at least with one tool." — Parakhin [03:20]
On Review Tooling:
"You want to run the largest models. That means Codecs or Cloud Code is not going to cut it." — Parakhin [11:23]
On Democratization:
"PMs are like the highest user right now... it kind of cuts out the ML engineer from the process." — Host & Parakhin [30:38]
On the Magic of SimGym:
"The proof is in the pudding. If we are not correlating with reality, people will not be using it." — Parakhin [42:47]
| Timestamp | Topic Segment |
|-------------|--------------|
| 02:29–06:54 | Internal AI tool adoption & phase transition |
| 08:03–10:25 | Metrics for AI engineer productivity & token consumption |
| 12:33–17:10 | PR growth, CI/CD bottlenecks in agentic world |
| 18:24–25:52 | Tangle system: ML reproducibility, sharing, platform effects |
| 26:14–35:56 | Tangent: Auto-research, agentic experimentation |
| 37:20–50:59 | SimGym: Simulated customer behavior, counterfactuals at scale |
| 55:23–63:00 | Liquid Neural Networks: Shopify’s infra/serving path |
| 66:39–67:44 | Shopify's transformation into an AI-native platform |
Shopify’s AI transformation is both deep and practical.
For further technical details, visit the Shopify Engineering Blog and search for the Tangle, Tangent, and SimGym posts.