B
Welcome to the Analytics Power Hour. Analytics topics covered conversationally and sometimes with explicit language.
C
Hi, everyone. Welcome to the Analytics Power Hour. This is episode 290. I'm Tim Wilson and I'm joined for this episode by Val Kroll. How's it going, Val?
D
Fantastic. Excited for today?
C
Outstanding. Unfortunately, we were supposed to also be joined by Michael Helbling for this show, but he's gone all on brand for the winter and gotten the flu. Luckily, as we're into our 11th year of doing this show now, we've learned a thing or two about rolling with the punches. And as it turns out, learning is the topic for today's show. I mean, it's implicit in all forms of working with data. We're looking at analysis or research or experimentation results and hoping, just hoping, that we come out of the experience with a deeper knowledge of something. I mean, hopefully it's something useful, more knowledge than we had before. It's a simple idea. Sometimes, though, it's a little harder to execute in practice. That's why we perked up when we came across an article from some folks at Spotify called Beyond Winning: Spotify's Experiments with Learning Framework (EwL). We're excited to welcome one of the co-authors of that piece to today's show. Martin Schulzberg is a product manager and staff data scientist at Spotify. He has a deep background in experimentation and statistics, including actually teaching advanced statistics in a prior role for a number of years. So who better to chat with about learning? Welcome to the show, Martin.
A
Thank you so much. Excited to be here.
C
All right, we are borderline, like, giddy about the topic, as we were diving into our excitement before we hit the record button.
D
We definitely fought over who got to be on this one.
A
Yeah.
C
So Martin, the article that I referenced in the opening, which we're definitely going to link to in the show notes, is a great read. You and your co-authors make the distinction between a win rate and a learning rate for experimentation. And that's kind of the premise of the article. Is this win rate or this learning rate a proposed metric, or a metric that's actually in use? And that seems like a good place to start. So maybe can you explain what you were seeing as a drawback to kind of too much focus on win rate as a metric for experimentation programs?
A
Yeah. Yes. So I think I need to take a little step back. I think it started when we rolled experimentation out at Spotify properly, like at scale, in 2019, 2020. We quite quickly realized that one of the biggest wins that we made over and over again was to detect bad things early and be able to avoid them. So using it as a sort of dodge-the-bullet type of mechanism. And we have used it like that since. It's one of the biggest reasons why we run so many experiments. We want to avoid shipping bad things that happen unintentionally, side effects and stuff like that. And at the same time, I've seen over the years a lot of blog posts and papers published about win rates from other companies. Win rates as in the rate of experiments where you find a variant that is better than the previous variant and you ship it, so a clear winner. So I just felt that all of the other types of wins that you can make, besides finding something that was better than the current version, were sort of under-celebrated. And I also think that win rate doesn't really reflect how most companies, at least the companies I'm familiar with, are actually using experimentation. They're using experimentation partly to optimize things, so to find winners, to continuously improve something and optimize it. But that's only one part of that puzzle. The other part, using it as a mechanism for safety, as a safety net, is something that wasn't, I think, talked about enough. And so that's sort of where this sprung from.
D
I love that. And the one thing, though, that I would love for you to talk a little bit more about is that even if an organization was like, yes, in spirit, I completely agree with that premise, Martin, it seems like using a metric like learning rate seems squishy. Like, win rate is objective. We can tally that in a column and calculate that percentage. But can you talk a little bit about how you thought about the criteria for determining how you say, yes, we learned something from this experiment, or how it's defined?
A
Yeah. And so, yeah, I want to firstly call out that this was a team effort. There were a lot of people involved. It was driven by the central experimentation team at Spotify, but there were also a lot of other data scientists, who are actually doing product work, that were involved in this discussion. We had a lot of really good discussions, actually, about what learning means and when you actually get value from an experiment. And so I just want to call that out. And I think we see it as there are essentially three ways that you can learn from an A/B test. One is that you find an obvious winner. So what other people have referred to as win rate. So you find a version that is better than the current version. The other one is that you find that the new version was worse somehow, that you detect something bad, that you detect a regression. Often that can be not only that users didn't like it, but more that maybe something went wrong with some integration somewhere, so you get latencies increasing or crashes increasing. And so those are quite obvious wins. So finding better stuff and avoiding worse stuff. And then there is the middle one, which is more nuanced, which is when you run a well-planned experiment and you find nothing. So a neutral experiment, which is, I guess, vague, but what we count there as a win is an experiment that actually had a sample size calculation beforehand, that did a proper power analysis and said, hey, I want to have a certain power of finding an effect if it exists. And then they ran that experiment according to that plan and they found nothing. We also view that as a learning, because at that point they can actually, with the certainty that they hoped for, say, no, there was no effect from this change. So the neutrality in that case is informative, because you can say, hey, maybe this is not worth pursuing, because we actually ran a proper experiment.
If there was an effect of the size that we were interested in or that we hypothesized, we would have found it. So there are those three cases, and obviously that middle one, the neutral one, is a little bit more complicated. It's more complicated to implement or to instrument, because you need to know what sample size calculations were run and if the experiment actually met the planned sample size and all of those things. Fortunately for us, in our tool it's fairly easy to do. But yeah, it takes some thinking to get that right.
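To make the three learning outcomes described above concrete, here is a minimal sketch in Python. The class, field names, and outcome labels are hypothetical and chosen for illustration; this is not Spotify's implementation, and it assumes the significance decisions and the planned sample size have already been computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class ExperimentSummary:
    """Hypothetical summary of one finished A/B test."""
    significant_improvement: bool  # new variant beat the current one
    significant_regression: bool   # new variant harmed something (e.g. crashes, latency)
    planned_sample_size: int       # from the pre-experiment power analysis; 0 if none was run
    observed_sample_size: int      # users actually reached


def classify(summary: ExperimentSummary) -> str:
    """Map a finished experiment to one of the outcomes discussed above."""
    if summary.significant_regression:
        return "regression_avoided"
    if summary.significant_improvement:
        return "win"
    # Neutral result: only informative if it was planned and ran to plan.
    powered = 0 < summary.planned_sample_size <= summary.observed_sample_size
    return "neutral_powered" if powered else "neutral_unpowered"


def counts_as_learning(summary: ExperimentSummary) -> bool:
    """Wins, avoided regressions, and powered neutrals all count as learning."""
    return classify(summary) != "neutral_unpowered"
```

In this sketch a neutral experiment only counts when a power analysis was done and the planned sample size was reached, which mirrors the condition described in the conversation.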
D
I'm literally writing notes because there's so many things I want to dig into. But before going to the 5,000 foot view, I guess I'm just curious about the culture change internally. With so many people with access to run experiments and this appetite for experiments, what was it like to get them to shift away from the win rate to this other new kind of metric that you rolled out? I'm just very curious what that experience was like, if there was resistance, if there was excitement, or if some people were really questioning it.
A
There's always people questioning everything at Spotify, which is one of the things that I love about Spotify. So that's a fair point. But yeah, no, I think because of the fact that we so early realized that experimentation was such a powerful tool to avoid mistakes and to detect bad things early, the sort of common definition of learning was already incorporating that aspect of experimentation. I think a lot of people have sort of, over the years, learned to... I should not use the word learned. Come to appreciate, yeah, exactly, come to appreciate that avoiding something bad is a great learning and something that is super valuable for product development. So I think that part was not so controversial when we developed this metric. I think the neutral one is trickier. And there's also much more room there for discussions about what should count. Should you be super strict about it, that it should be exactly powered? Should you allow some wiggle room? There's a lot of things that you can discuss there. And so I think we were eager to get a very clear and explicit definition out. And we were also eager, actually, to write about it externally, because we were hoping that other companies would join in, and I guess this podcast is a good example of that too, so that we could have this discussion. I'm really curious how other people think about this. So I'm not convinced that our definition of learning is the ultimate one or the final one or anything, but I think it's a good first step away from the more naive only-wins-count definition.
C
So on that neutral one, the raging cynic in me would be, well, gee, if people realize it, a way to game the metric would be to run really inconsequential small tests. At the same time, the analyst in me thinks that, yeah, that happens with analytics a lot, that you're kind of digging in and trying to find something. You're like, well, somebody thought there would be some relationship here, and we're just not seeing it. And that can be equally unsatisfying for the analyst. So how do you think about neutral meaning we were trying something that did have a legitimate chance of being meaningful? And maybe this kind of bridges to another article that you wrote. How do you say neutral but not have neutral become a crutch for, yeah, we're essentially doing A/A tests and giving ourselves two thumbs up on the learning rate?
A
Yeah, it's a great question. I think we've been thinking a lot about what a healthy distribution should look like. A healthy distribution of different types of wins, and also the proportion of neutral experiments. And I think that's actually a super interesting topic. I think depending on what kind of strategy you have here from a product side, you can want to have different distributions. So, for example, let's wait with the neutral one, because it's maybe the trickier one. But if we think about how many experiments you should find regressions in, that you dodge, versus the win rate, how should that distribution look? Well, that will depend on a lot of things. But if you're a company that has everything to win and little to lose, then maybe you can afford to have a high rate of just trying stuff, because whenever you find a win, it's going to be quite big, because you're still in early stages. Whereas if you're a product that is already very mature, then maybe you have other goals for those things. So it feels like it's a super interesting discussion to have. And that's one of the discussions we're having now with teams at Spotify and other people that are using our experimentation tooling: what should we do with this information, and what's good and what's bad? And I think what's good is different even for different parts of the organization within Spotify, depending on how they're looking at it. But for sure, we wouldn't look at the learning rate only. So we would say we want the learning rate to be reasonable. But then we of course should probably aspire to having a high win rate. There's nothing bad in that in itself. But at least if we have a high learning rate, we know that we're not wasting our experimentation efforts. We know that the experiments we're running we're actually learning from. If we're running a ton of experiments that are not powered and neutral, then we will never be able to say these things didn't have an effect.
We can't separate between whether these things didn't have an effect or whether we just didn't run a good enough experiment to detect it. On the one hand, you look at the learning rate and say, hey, we want to utilize our experimentation bandwidth really well. So we want to have a high learning rate at all times. But then each quarter, you can look at this metric and the distribution of these outcomes and say, hey, you know what? We're dodging a lot of bullets, but we're almost never finding something good. Should we rethink our strategy? Or even more, if we're finding a ton of neutral results and we see more and more neutral results in some part of the organization, maybe we're hitting diminishing returns and we should try something different. Maybe we have found some kind of local optimum or something like that. So I think it can be quite a strategic instrument if you have the distribution of all of these outcomes as part of the learning metric.
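As a rough illustration of looking at the learning rate together with the distribution of outcomes, the sketch below tallies a list of outcome labels. The labels and the function name are hypothetical, and how to interpret the resulting distribution is left to each team, as discussed above.

```python
from collections import Counter

# Hypothetical labels; only the last one does not count as learning.
LEARNING_OUTCOMES = {"win", "regression_avoided", "neutral_powered"}


def outcome_rates(outcomes):
    """Return the win rate, learning rate, and full outcome distribution."""
    if not outcomes:
        raise ValueError("need at least one finished experiment")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        "win_rate": counts["win"] / total,
        "learning_rate": sum(counts[label] for label in LEARNING_OUTCOMES) / total,
        "distribution": {label: n / total for label, n in counts.items()},
    }


# Example quarter: few wins, many dodged regressions, many unpowered neutrals.
quarter = (["win"] * 1 + ["regression_avoided"] * 2
           + ["neutral_powered"] * 3 + ["neutral_unpowered"] * 4)
rates = outcome_rates(quarter)
```

Here the win rate alone (1 in 10) would hide that 6 in 10 experiments still produced a learning, and that 4 in 10 were neutral without enough power to be informative.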
E
You know what's worse than writing SQL? Probably writing that same SQL for the third time because you forgot where you saved it.
C
Or explaining to an LLM for the 10th time that your GA4 medium field is a mess because three different interns had three different naming conventions.
E
Yeah, like organic organicsocial or. I mean, it's like a crime scene of good intentions.
C
Which is why Ask Y's skills feature really helps: record that data cleaning nightmare once as a skill, then reuse it across different data sets.
E
Portable expertise and their jam memory system remembers context like the July data is doubled. Or use the prod table not staging.
C
Exactly. It's context focused, not just code focused. Plus, your data never touches the LLM. The semantic layer generates code that runs locally, where your data presumably won't judge you.
E
For that medium field situation, we can hope. Go to Ask Y AI. That's Ask Y AI. Use code APH to jump the wait list and stop paying the context switching tax.
D
That's making me think, as you were talking about that, that even within an organization, like you're saying, like companies who have everything to gain or, you know, I think it's everything to nothing to lose. I forget. Exactly.
A
Like, I never get that right.
D
Well, apparently I can't either. But even within Spotify, like thinking about the different product teams that if, if it's a group that's, you know, working on the cancellation flow and thinking about retention, they're probably having, you know, very different distribution of those outcomes as their goals or targets versus playlist creation, which is like such an established user pattern. Is that how you customize some of those conversations from the center of Excellence perspective to kind of consult with those teams?
A
Yeah, I'd say so. But I'd also add that there's a lot of centers of excellence when it comes to experimentation at Spotify. Fortunately, we have many parts of the organization that have super strong experimentation organizations or champion groups or nerds, as I like to think about them. I mean, yeah, look who's talking. But anyway, that discussion happens locally in a lot of places, and a lot of people are having those discussions. Sometimes we get questions about how to think about things. And also, there's one outcome here that we didn't talk about, which is when you get an invalid experiment, where something is wrong with the setup of the experiment. That's the final sort of outcome in this learning framework. So you didn't learn because something went wrong. For example, something is wrong with the integration. Maybe you got imbalanced treatment and control group assignments for some reason, or you don't get all of the data that you should get, or something like that. And that's of course the outcome that is the least fun one, so to speak. It's just like, yeah, we couldn't get this integration to work well enough. So we have used that one and worked really hard on getting it as close to zero as possible. We want it to be possible for anyone to run a really high quality experiment. With Spotify running experiments on so many different devices and apps and combinations of those, it's really tricky to always nail those things, but it's obviously an important signal. So whenever that one is high, that's something that teams come to us with and say, like, hey, we don't get our integration to work as well as we want to. How can we improve these things? And also, when it comes to the neutral aspect, the quality of the sample size calculator starts mattering a lot.
So whenever someone sets up an experiment and we try to predict what sample size they need, it's a prediction, right? We're looking at historical data saying like, yeah, well, given how historical data has moved, the variation in that data and the means and the treatment effects that you say you're interested in finding, we think that you need to run your experiment for this long to reach this many users. And that's a prediction that takes a lot of things into account, but it can always be improved, probably. So that's also a conversation that we sometimes have when people are like, in our use case, the sample size calculator is not good enough.
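The kind of prediction described above can be sketched with the textbook normal-approximation formula for comparing two means. This is only an illustration under stated assumptions (equal group sizes, equal variance in both groups, a two-sided test), not the calculator in Spotify's tooling; the function and parameter names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size_per_group(sigma, mde, alpha=0.05, power=0.8):
    """Users needed per group to detect an absolute difference `mde`.

    sigma: standard deviation of the metric, estimated from historical data
    mde:   minimum detectable effect (absolute difference in means)
    Assumes equal variance in both groups and a two-sided test.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    return ceil(2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / mde ** 2)


n = required_sample_size_per_group(sigma=1.0, mde=0.1)  # 1570 users per group
```

Dividing the required users by expected daily traffic gives the "run it for this long" part of the prediction, and any error in the historical variance estimate feeds straight into that number.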
C
But that's one where you would come back like, what is the scenario where you run it? They've got an MDE, they've got the estimates, you've got the sample size calculator, it says run this, and it comes back. I'm trying to understand the distinction between, actually, we probably just didn't run this long enough, versus, well, for what we ran and what parameters we put in, it's a neutral result. And I may be missing it. Is there a distinction?
A
Yeah, I can speak a little bit to it. So in practice, when we do the sample size calculation... I don't know how technical and nerdy I'm allowed to get here, but given the name of this podcast, I'm going to go deep. Yeah, no, but, so...
C
We don't want to hit, like, if Matt Gershoff would have to think about it for a minute, that's a little bit too tight.
A
No chance. No chance. This is bread and butter for him. Promise. No. So we never know the variance of the treatment group before we run the experiment. We can always just think, like, maybe it will be a homogeneous treatment effect, or we could, I suppose, speculate about how the treatment will affect the variance, but it's always gnarly. It's difficult to do. So what we always do in practice, I think everyone essentially, is say, let's presume that the treatment effect is homogeneous. In practice, of course, when we start running the experiment, maybe the treatment affects only part of the treatment group, which will then disperse the distribution. If we have a beautiful distribution to start with, but some people get a large treatment effect, it will make the variation of that distribution larger. So the variance in that group will be larger. So the required sample size will go up. And so in Confidence, our experimentation tool, we do both. We have the pre-experimentation sample size calculator, which uses historical data to make this prediction. And then during the experiment we're also collecting the data from the experiment and running the sample size calculation continuously. We actually wrote a paper about that. I think there is a blog post about that too, if someone wants to nerd out on that. It's actually valid to look at the power during the experiment. It's a peeking that is non-problematic. You can look at that. Anyway, so you have those, and you might have a big discrepancy then. So when you start the experiment you might think that, hey, I can run this for two weeks, I will reach my whatever 10,000 users that I need. But then when you run it for a week, you realize that, no chance, I will reach much less or I will need much more. Maybe more likely, I thought I needed 10,000, maybe I needed 40,000. And that's just not possible given the traffic that I have on this page.
And in that case it might then be a conversation about, hey, how can we make this better? And so one way that we do it in practice is that we say, okay, maybe instead of us trying to predict it, you can point to a similar experiment, if you know you have a similar experiment where you're changing the same kind of thing. But yeah, it's a tricky thing. It's a truly difficult problem to make good sample size estimations.
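The point about heterogeneous treatment effects inflating the required sample size can be illustrated with a small calculation. The sketch assumes a simple hypothetical model: a share of the treatment group receives the whole effect, the rest receives none, and who is affected is independent of their baseline value. Under that model the treatment variance is the control variance plus tau^2 * p * (1 - p), where tau is the per-affected-user effect and p the affected share. Function names are made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def treatment_variance(var_control, avg_effect, share_affected):
    """Variance in the treatment group when only a share of users is affected.

    Assumes affected users each get avg_effect / share_affected, the rest get
    nothing, and being affected is independent of the baseline metric value.
    """
    tau = avg_effect / share_affected
    return var_control + tau ** 2 * share_affected * (1 - share_affected)


def n_per_group(var_control, var_treatment, mde, alpha=0.05, power=0.8):
    """Normal-approximation sample size allowing unequal group variances."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(z ** 2 * (var_control + var_treatment) / mde ** 2)


# Same average effect of 0.1, spread over everyone vs. concentrated in 10% of users.
n_homogeneous = n_per_group(1.0, treatment_variance(1.0, 0.1, 1.0), mde=0.1)
n_concentrated = n_per_group(1.0, treatment_variance(1.0, 0.1, 0.1), mde=0.1)
```

With these made-up numbers the pre-experiment estimate (homogeneous effect) is smaller than what the experiment actually needs once the effect turns out to be concentrated, which is the kind of discrepancy a continuously updated calculation would surface.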
D
And one thing that I found interesting, because there's definitely two different camps here, and correct me if I've interpreted this incorrectly, is that you do allow for multiple success metrics in this, which I know makes that a little bit more complicated. And I think it also talked about adequately powering guardrail metrics, deterioration metrics, quality metrics, which not a lot of organizations do or have the capability to do. That was like, oh, we're definitely gonna have to talk a little bit about that. But how do you handle the multiple success metrics, especially if you're looking at things further into the funnel that have a lower incidence? Like, how do you kind of think about that layer?
A
Yeah, I mean, this is a rich topic. We have a framework for this that we have developed over the years. And there's also a paper that is, I think, about to be published. It's on arXiv, at least, where we go through exactly all of the details of how we're handling the multi-metric setup, what we call the decision framework, statistically. But I can give the short version of it. So essentially what we're saying is that we have an explicit decision rule for the multi-metric setup. So we have success metrics and guardrail metrics. Right? So success metrics are metrics that you want to improve, and guardrail metrics are metrics that you don't want to harm. And so, for example, at Spotify, maybe we want to improve the music consumption, but we don't want to harm the podcast consumption. We don't want to do it at the expense of podcasts, for example. So if you're making a new music recommendation algorithm, you don't want to harm any other consumption. And so the decision rule is essentially that at least one of the success metrics should have improved and none of the guardrail metrics should have been harmed. There are a lot of nuances here, because for the guardrail metrics we're using so-called non-inferiority tests, which makes everything much more complicated to talk about. But leaving that aside, it means that when we're talking about power and false positive rate, we're talking about the false positive rate and the power for that decision rule. So we're saying we want the decision that we would make based on this rule, so at least one of the success metrics is significantly better and none of the guardrail metrics are worse, to have the intended false positive rate, and we want to have the power to detect, given the sample size. So we have to make the adjustments for multiple testing corrections accordingly, and then we have to make the power and sample size calculations accordingly. And there's some things to fiddle with there.
But in principle, since the guardrail metrics all have to be not harmed, they are not giving you additional chances of succeeding. So you don't have to correct for them in the same sense. But at the same time you have to power them simultaneously, because all of them have to show simultaneously that they weren't harmed, if you're using non-inferiority tests. I'm deliberately avoiding going much into non-inferiority tests because it's such a tongue twister to talk about. But if you're interested in...
C
You still said it eight times.
B
Good.
A
Yeah, yeah, no, but that was. No, but it's tricky. So, yeah, so that's how we do those things. So yeah, it's a bit messy, but.
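The decision rule described above (at least one success metric improved, no guardrail metric harmed) can be written down in a few lines. This is a sketch only: the Bonferroni split of alpha across success metrics is one simple choice made here for illustration, not necessarily the correction used in Spotify's framework, and the guardrail inputs are assumed to be the already-computed results of non-inferiority tests. Metric names are made up.

```python
def ship_decision(success_pvalues, guardrail_passed, alpha=0.05):
    """Apply the rule: any success metric significant AND all guardrails pass.

    success_pvalues:  {metric: p-value of a one-sided superiority test}
    guardrail_passed: {metric: True if its non-inferiority test concluded
                       the metric was not harmed beyond the chosen margin}
    """
    if not success_pvalues:
        raise ValueError("need at least one success metric")
    # Each extra success metric is an extra chance to "win", so alpha is split
    # across them (Bonferroni, an assumption of this sketch).
    per_metric_alpha = alpha / len(success_pvalues)
    any_success = any(p < per_metric_alpha for p in success_pvalues.values())
    # Guardrails give no extra chances to win, so they are not corrected for
    # here; they all have to pass at the same time.
    no_harm = all(guardrail_passed.values())
    return any_success and no_harm


decision = ship_decision(
    {"music_minutes": 0.01, "playlist_adds": 0.40},
    {"podcast_minutes": True},
)
```

Because every guardrail has to pass simultaneously, adding guardrails makes the rule harder to satisfy, which is why power has to be planned for the rule as a whole rather than metric by metric; the sketch does not attempt that part.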
D
So back to the culture side of this. How do you coach product teams to not just pick 50 success metrics because they are so excited about this new feature. Like, you know, it came from up high and we really want this to. We want to find some success. And obviously like there's a statistical part of it, like the correction, but like culturally, how do you guide that conversation away from. No, we shouldn't have. It shouldn't be like a pick list of up to 75 metrics to find something that went quote unquote up.
A
Yeah, yeah, yeah. No, I mean, this is a conversation that we have. I think at Spotify it has settled. But this is a conversation that we have from time to time, and I think it's a sort of healthy discussion to have. I think this is more tricky than it might seem. I want to give the answer that, like, no, but of course you should just have a discussion and decide on the metrics. And I'll come back to that, because that's ultimately what we do a lot at Spotify. But there is more to it, right? There's also the fact that we're making a lot of changes and we are truly interested in any kind of effect that they have. Right. I mean, it's a true statement that if this change that I made actually affected a metric that I didn't think about, like some weird metric, weird from my perspective, if that was truly the case, I would want to know. So from one perspective, I can really understand this. Like, I want to look at all of the metrics and just see which ones I affected. But then on the other hand, you get this obviously super hard problem of curse-of-dimensionality type issues here, where you're just looking at too much. So you're either just going to find noise, or you have to control for that and then you're going to have very low power to find things instead. But I think there is merit to the type of experiments where you're just like, I just want to see what happens when I do this. Of course I care what it is that happens, but I am ultimately interested in all things. But in practice, of course, this is hard. So again, at Spotify, it's not like the central experimentation team, which I'm part of, building the tooling, is dictating these things. It's rather the other way around.
I like to think about it as that we are sort of cultivating what the teams that are doing experimentation are thinking about this. So we have a lot of discussions with them. So the way it works at Spotify is that we don't decide the defaults and how things should work in the platform. It's rather that we talk to all of the product teams that are experimenting, the 300 teams, in various forums, and then we collect what they're saying and we're refining it and then we're putting that into the tool. So when it comes to how many metrics you should have, there's not one answer. At Spotify, it's different in different parts of the organization, but in most of these parts there have been very explicit conversations where people have talked about, like, hey, how should we trade off here? Actually getting super high precision in the things that we know we're interested in, versus getting interesting insights in stuff that we, you know, could be interested in. And this is sort of traded off in various parts of the organization and various projects, depending on what stage those projects are in. If it's a very new product, then you often see experiments with many more metrics, because you're just interested in understanding what happens when we ship something like this. What kind of behavioral changes does this cause? Whereas when we're optimizing something, then we're like, okay, we know pretty well what we need to measure here to do this and to optimize this in a healthy way.
C
So Spotify, massive user base. A lot of the ability to design, to try to cover a lot and still be sufficiently powered, seems doable. I'm thinking of a client we had that was kind of in that same boat. It still feels like a risk, the slippery-slope fishing expedition of, let me tell myself a story that I just want to see if it impacted anything. And the understanding required is that if you go on a fishing expedition, I think, if I understand correctly, your false positive rate can go way up, because you detect noise as a signal, and then when you detect it you get really excited. Nobody can rationalize why this metric changed. It turns out it was noise. Now we've wound up doing, like, negative learning. We've learned something incorrect, potentially. Unless you have the discipline to say, if we're going to chase that, we need to come up with a theory and we need to have the rigor to validate that theory before we accept it as fact. That just feels, coming from an analytics side, like a similar sort of thing. If I just point the machine at all the data and it finds anomalies or finds patterns, there's a very good chance that it's detecting noise that just happened to hit at a point where it can show some statistical merit. Some part of me is just terrified. While I love getting comfortable with, we looked for X, we did not find X, that is still a learning, and let's work with our business partners to acknowledge that's a learning and not have them just chasing everything, that also feels like a challenge, you know?
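The worry about false positives on a fishing expedition can be quantified with a standard calculation. Assuming, for simplicity, that the metrics are independent and the change truly affects none of them, the chance that at least one looks significant is 1 - (1 - alpha)^m. The function names below are illustrative.

```python
def familywise_error_rate(alpha, n_metrics):
    """Chance of at least one false positive across independent null metrics."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni_alpha(alpha, n_metrics):
    """Per-metric threshold that keeps the overall rate at or below alpha."""
    return alpha / n_metrics


one_metric = familywise_error_rate(0.05, 1)       # about 5%
fifty_metrics = familywise_error_rate(0.05, 50)   # above 90%
strict_threshold = bonferroni_alpha(0.05, 50)     # 0.001 per metric
```

Real metrics are usually correlated, so the exact number will differ, but the direction holds: with dozens of metrics something will almost always go "up" by chance, and correcting for it lowers the per-metric threshold and therefore the power.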
A
Yeah, no, no, I mean, I agree with everything you say, and I have the same uncomfortable feeling in my body when I think about this, like, let's look at all of the metrics, from a statistics perspective. But I also think it's a cop-out, not projecting on you now, Tim, but for myself, to say, you know, we can only look at the metrics that we decided on before, we found nothing, let's move on. Because it's also obviously true to me somehow that there is more to learn, even though I can't come up with, like, yeah, and this is how you should do it and this is how it won't lead to these incorrect learnings that you mentioned. It's a hard argument to make when someone says, like, yeah, but I looked at some other metrics and I learned something. And then you're like, maybe you did, maybe you didn't. And I can think about ways that you could do this. You could do sample splitting and stuff. You could take one part of the sample and look for groups, and then you could validate those findings in another part of the sample, and stuff like that, to make it much more plausible. Again, you would then have the issue of having lower power to actually find things, or low precision at least. But I don't know, I just don't want to be too much of a purist.
C
Yeah.
A
Or like a grumpy statistician kind of person. But I do, I mean, I agree. I have the same feeling, and I haven't seen anyone do it well. What I've seen is that people have used the argument of saying, like, yeah, it must be possible to learn more, and then just throw all of the metrics at it. And I think that's just as bad as not doing it. So I don't have an answer to it, but yeah, maybe someone smart listens and then they can call me.
D
Yeah, let us know in the comments. I'm distracted.
C
And I don't know that this is the answer, but it does feel like, well, if you throw that at it and you find something, figuring out how to have the step, which is probably a combination of a data scientist or a statistician with the product manager, to say we need to come up with a plausible theory as to what's causing that surprising thing, and we need to have somebody with their bullshit meter turned on. Because, I mean, I've certainly watched people find things and they come up with a theory. They're like, well, this is clearly happening because obviously, like, left-handed people, when they're in the southern hemisphere, it makes sense that they would prefer the color blue, you know, and it's a theory that fits the data, but it's not a theory that holds up to human scrutiny.
A
I think one, one thing that I'm excited about is replication. I think one way to, if you have a streamlined enough way to run experiments and you have your velocity of throughput for experimentation high, then one true possibility here is to replicate. To just say, okay, I looked broad and deep here and I found something. I believe in it. I think I've made my people in the southern hemisphere argument, but I believe it. And then I would. For anyone who would say I believe in it to the extent that I will now launch A new experiment, take 10% other people or a new random sample and run it again with only that metric or only the new metrics that I care about. And if I can repeat it, then I will ship it, then I would be like, yeah, go for it.
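[Editor's illustration] A back-of-the-envelope way to see why replication helps, under assumptions that are not from the conversation: no metric has a true effect, the metrics are independent, and each is tested at the same significance level. The function names are hypothetical.

```python
def familywise_false_positive_rate(n_metrics, alpha=0.05):
    """Chance that at least one of n independent null metrics looks significant."""
    return 1 - (1 - alpha) ** n_metrics


def false_positive_rate_after_replication(n_metrics, alpha=0.05):
    """Chance of a false find in the broad look AND a false confirmation
    when one chosen metric is re-tested in a fresh, independent experiment."""
    return familywise_false_positive_rate(n_metrics, alpha) * alpha


if __name__ == "__main__":
    # Looking "broad and deep" across 20 metrics, then replicating one of them.
    print(round(familywise_false_positive_rate(20), 3))
    print(round(false_positive_rate_after_replication(20), 3))
```

With 20 metrics at a 5% level, the broad look alone produces a spurious finding about 64% of the time, while requiring a successful replication brings that down to roughly 3%. Real metrics are usually correlated, so the exact numbers would differ.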
C
Or potentially, if the theory is, well, it was this kind of incidental thing that happened to be part of it, but it wasn't the core focus. Let me run an experiment where I've. Yeah, I've doubled down on that to say this should now. I should now really detect a strong signal because it's backing that up.
A
That sort of touches a little bit on the other blog post that you mentioned that has to do with what the intent with an experiment is. We haven't really talked about it yet.
C
Yeah, boy, I got giddy on that one too.
A
Yeah. Should you want me to give the TLDR on that one too?
C
Yes, please do.
A
Yeah. So the idea with that one has sort of come from a lot of the conversations that we've had with people running experiments, talking about the learning framework. And then people are like, hey, we have a lot of neutral experiments here. We run high quality experiments, but we don't find things. And so one thing that I've identified from working with teams at Spotify, but also externally at other companies, is that people often start to optimize the idea in their head before they've tested whether the idea is something that will affect the users at all. And so what I mean by that is that when people identify something that they think is important for our users, let's use a stupid example like a button color or something, we think it's important. And then, instead of saying, okay, we should first answer the question, is it important or not, do users care or not, they immediately start thinking about which color is the best. And so they jump from we have no idea if people care about this to having the conversation about which color is the best. So sort of presuming that people care at all which color this has, besides having a high enough contrast so you can see it. And so this blog post was me just trying to formulate that distinction between identifying if an aspect of your user experience is something that you can optimize, if it has an effect on users in any way, if people care about it, on the one hand, and optimizing that once you have identified that it's something that people care about, on the other hand. So sort of identifying something versus optimizing something.
And so I think that this thing that we talked about now is a little bit about, you know, maybe if you run an experiment where you thought that some aspect was important, or you tried to optimize it, and then you find something new, some metric that you didn't anticipate to move, that might cause the idea in your head to be like, hey, maybe there is a mechanism here that people care about. Maybe people actually care about how many items we show on this screen. I was thinking about the ranking, but as a side effect of that, we showed more things. So we saw that, I don't know, lower down the list, clicks increased or something like that. And maybe that's an indication that this is a mechanism that people care about. And so I think going between the states of identifying something to optimize and optimizing the thing you have identified, and doing that explicitly and deliberately, is something that a lot of product teams would benefit from. It's easy to fall into the trap of trying to do both at once.
D
I think totally.
C
Is it cousin to the optimizing? I mean, the framing of say, which is kind of a. I think it might even be in the article, like the case for taking a bigger swing. Take the big swing first. Make sure that connects. Even if it's a, you know, in a. In a while it's like, yes, this, there's something here. Now we can tune it. And that. I think of it from a, I mean from a marketing analytics perspective where companies will say, let's just, let's just try it out and see what happens. It's kind of a death knell. It's going to be an under investment in a new channel or a new tactic where logically it's going to be really hard to detect a signal because it winds up getting kind of tempered down to a pretty subtle change. And the logic is, well, if this thing actually matters, then we can make a nominal investment and we'll see this outsized lift. As opposed to say, saying, does this matter at all? Double down on it for some period of time, go hard, see if you actually see something and then say, okay, we definitely need to be in this channel or using this tactic or doing this to the user experience. Now we need to sort of figure out did, did we actually spend twice as much as we needed to? We can get the same, like, where are the diminishing returns? Like that. It does feel like culturally it's a Tough. Human nature is risk averse. And so saying try something and know that you'll find that it didn't. It is okay to find that it didn't work. A big swing with a neutral result feels like it has a lot more merit than a little small tap with a neutral result. You're like, yeah, that's the fun in that.
A
No, that's precisely it. So what actually provoked me to write it was discussions about the neutral outcome in the learning framework, where people are like, yeah, but neutral is not fun. I don't care if it was powered or not. I don't want neutral. And that got me thinking, well, if you don't like the neutral result, it means that the question you posed wasn't interesting enough. Because if I'm convinced as a product person that people care about this thing in our app, that if I change this, people are going to care, and then I make a drastic change and nobody cares, I run the experiment, I have high precision in my estimates and nobody cares, if that's not the learning to be excited about, I don't know what is. To be honest, that really shows that I'm 100% off with my understanding of what people care about, which is truly strong learning. But if, on the other hand, this change that I made was like, yeah, I really think our users care about this aspect, and I made a minuscule change to it and I didn't find anything, then I might think for a long time about if this was the right change that I made or not. You just get stuck in weird things. But one way that I have sort of solved that, because I agree that people are risk averse, is to run both. If you run an A/B test, people tend to want to be like, but I think I know what users like. I want to go for the identify and optimize at the same time version of this thing, where I try to choose the right value for my customers or my users. But I also say, just also add, if you haven't actually identified that this is something that people care about or that matters for your business, the more sort of provocative version. I call it maximum viable product, I think, because of course this has to be reasonable.
I mean, if you make some button larger than the screen, then of course you're going to see some change. So it has to be within the Sort of limits from what is still a usable function. But that is still extreme, right? So the maximum change that you think is like, but this is still like, this is not comedic. How do you say it? Comedic?
C
You're saying, doing that within kind of a multivariate, say we've got our control, we've got what the optimized and identified at the same time version and then we have an identify only version and it's okay if that identify version detects the biggest effect. You can say, yeah, that was kind of hedging to make sure that, that we got something out of it. And if that one that was identified and optimized simultaneously didn't, then we're probably still on a good track. It just turns out we're not so omniscient that we can come up with the perfect variant in one shot.
A
I think it's smart also from a, I mean a lot of companies, at least Spotify and other companies that I work with, they're all struggling with like having big enough sample size, right? Because they are both because they have limited traffic, but also because they're interested in small effects, generally speaking. But the nice thing about making a very drastic change is that it should have a large effect, right? If you're making this maximum viable change, then that should cause a large effect. So you should be able to say like, yeah, but like now I pull this lever as hard as it's possible to pull. So you know, this should cause maybe 5% change like whether it's good or bad. And so you can maybe run smaller experiments. I think like if you're in a situation where it's hard for you to know what you should like, you have a hard time finding bandwidth essentially for optimizing things, then I think it's a smart idea to do these more drastic changes to identify what you should then spend larger experiments on optimizing. Because the truth is that when you start optimizing, even if it's a nice convex surface for this thing, button size or something, then the higher you, the closer you come to the optimum there, the larger samples you're going to need to be able to identify those steps.
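[Editor's illustration] The sample-size point above can be made concrete with the standard normal-approximation formula for comparing two means, n per group = 2 (z_{1-alpha/2} + z_{power})^2 sigma^2 / delta^2. The sketch below is an assumption-laden illustration, not a description of Spotify's tooling: the function name, the 80% power default and the example effect sizes are all hypothetical.

```python
import math
from statistics import NormalDist


def users_per_group(effect, std_dev=1.0, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect a difference in means
    of size `effect` with a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * std_dev ** 2 / effect ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # A drastic change (large effect) versus a fine-tuning step (small effect),
    # both expressed in standard deviations of the metric.
    for effect in (0.05, 0.01):
        print(effect, users_per_group(effect))
```

Because the required sample grows with one over the effect squared, an effect five times smaller needs about 25 times as many users. That is the reason a drastic change can be tested in a small experiment, while each later optimization step near the optimum needs a much larger one.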
D
So yeah, it seems like. So the framing that I really liked in this article is like the building the right thing versus building the thing, right? And it feels like the stakes couldn't be higher in everything you guys are just talking about in a product context because it's not just about changing a button color in a lot of cases, right? This isn't about UX it's about adding additional features or different capabilities and you're hoping to impact things like customer lifetime value, not just did they get to the next screen. Right. So it's not, it's not just like checkout flows. Right. I think I was thinking about this. I've actually spent more time than the average human should thinking about the changes that have been happening lately inside of my United app. So I'm a United loyal, I fly United. And the app has been changing a ton lately. And we went from. There was one place where I could change my seat to every single screen within this app. I can change my. Which I do appreciate. I'm definitely someone who loves feeling a lot of control over changing my seat. But I'm like, what were the conversations that happened internally that said, you know what, the user needs to be able to change their seat while they're checking in their bag, while they're checking to see, you know, what gate their flight is at. And so anyways, just to bring this back to an actual question, building the thing right, and like, maybe the feature is great, the new functionality that you're adding, but maybe you've, you have gone about it the wrong way, which has impacted the ability for someone to understand what exactly this is capable of. Maybe it was like a micro copy issue or maybe it was in the wrong place in the flow, which feels more like optimization. So even though this framing and like, big swing versus small change, like that sounds really objective if you put them side by side, like, that's clear. But how do you guide like. 
And I'm especially interested because now you are in a product role to get a little meta about it, like, how do you think about what is. You know, when would you ever recycle a concept in a different context? Because it does feel like the optimization killed your ability to understand if it was viable.
A
Yeah, yeah. I mean, the truth is here that this is difficult. Right. I mean, I think this, especially starting with that building the thing right versus building the right thing. Like some things you have to do quite a lot of building to even check if it's the right thing. Right. I mean, if you're building a new feature, there might be a lot of things that you have to get in place to even see if it's something that people cares about. And then once you've seen that they care about it, maybe they don't like it. And that's because you haven't built it right yet. This is a very stylized blog post, of course, but the truth is much more Muddy in practice. I think that. I've always said that, too. I think that one of the things that have been discussed a lot at Spotify and other places is like, okay, but with experimentation, where is the. Is there room for the product, like intuition and making bets on things and stuff like that? And I've always liked to say that these are completely. They're augmenting each other, they're helping each other. You can have this strong intuition still, and you can make these bets. What experimentation helps you with is actually validating that your bet was good and helping you change your direction if it wasn't good. And so what I'm trying to say is that of course, sometimes, and maybe not even rarely, when we're building experimentation tooling, we have to build for quite some time before we can say. Before we can answer either of these questions. And it's hard to disentangle them even. So let's say that we build a completely new feature for experimentation, then some new methodology or something. It's hard to even have. What's the dimension? Along here I can test if this is a lever worth pulling. Like, that's maybe a question more for market research or user research or those kinds of things. Yeah, so, yeah, so that's the truth. I think it's just a lot of, like. 
I think the teams that I'm writing this blog post for, that I'm thinking about, are the teams that sort of have a product already and they've been owning it for a while, and they feel a bit stuck in terms of, like, they're not getting the sort of return on investment that they would like from their experiments. They see that they have a lot of neutral results and they're wondering if they should run much longer experiments or what they should do about it. But, yeah, I don't know. That felt partly like a cop out from your question there.
D
No, no, it's good. There's no clear conclusion.
C
Come on, Val. He basically said that it's like intuition with experimentation combined. It's kind of like you need to combine the facts and the feelings.
D
I knew exactly where you were going with that when you said come together. Cheesy. So, okay, so before I lose the thread, because I.
C
This is your last question, by the way, because we're.
A
We're.
D
Don't do that to me. No, no, no, no. I've got like three more, but I'll go fast. I'll go. We'll go fire round. Okay, so you're talking about. No one really likes the neutral results. Talking about Some intuition with product, I'm going to talk about those outcomes. So obviously if there's a win positive outcome, it ships. If it hurt the experience, it doesn't ship. If there was an issue with the test setup, you hit an S. Whatever. Doesn't ship neutral. I want to talk about that. So are there scenarios where the product intuition says even though this was neutral, it makes sense for where the roadmap is going or some decisions we're making from branding, like maybe we're. This is building towards a bigger bet in the larger ecosystem to make things, you know, easier to share, more social. Like how do you think about the ship or no ship kind of action as it relates to those neutral results?
A
Yeah, it's a great question. So my general recommendation there is that as long as you've decided before you run the experiment that you're going to ship if it's neutral, I'm all good with it. I think this is just like. I think that there's a ton of situations where it makes sense to ship something if it didn't change anything, especially if you're building infrastructural type changes or if you're building towards something. Right. I mean, we're building a lot of. At Spotify, building out AI features as everyone else, I suppose, but there's a lot of changes that we're making to our infrastructure just to be able to support features that we're planning to build. And when we're making those changes, the idea is that we're hoping that nothing will change. Maybe we're doing stuff to make things faster or something like that. But that's like, it's a bonus if it's. If it changes anything at any point, we just want to avoid. So there's a lot of changes that we are expecting won't make any difference. So what we do then is that we essentially run what we call rollouts, where we only have guardrail metrics, actually. So we say as long as we can prove that we didn't harm these metrics, we're going to ship it. So then by using the rollout, you're sort of declaring your intent from the beginning that, like, hey, we're planning to ship this as long as it's not bad. Which can sort of be a quite nice way to say to just make it explicit that, like, that's completely fine. But then again, I think that just want to add a small caveat here that they also. I've heard a lot of product people at Spotify and other places talk about, you know, that like, even maybe if a metric doesn't look great or if it's neutral and stuff like that. There's this, I think, almost human fallacy to say, like, this is strategically important, let's ship it anyway. 
And so I think even though that's true, and I think that's why it's sort of an easy fallacy to fall into, or like, it's an easy trap, that can be true, but I think everyone should think about, like, how large proportions of the things we ship should be shipped from the argument. This is strategically important. Like, pretty small proportion is my.
D
You only get three per year, Everyone gets three a year.
A
Yeah, something like that. Right. I would love if I could give people a budget for those kinds of things. So it's just. Yeah, no, but it's just. So I think it's like it's all about just being very, you know, trying to really avoid the pitfalls of changing the objective when you see the results. But, yeah, so, I mean, we do that all the time. At Spotify, we're shipping a ton of things that are neutral and a lot of them are shipped with rollouts where we just explicitly say, you know, we are planning to ship this thing for some reason, like, it might be business strategically, or like we have to improve our backend to scale for more traffic or whatever it might be, we're going to ship it. So we just want to know that we're not harming things.
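[Editor's illustration] A rollout that only has guardrail metrics, as described above, is naturally framed as a non-inferiority check: ship unless a guardrail could plausibly have been harmed by more than a tolerated margin. The sketch below assumes that framing. The class, the function names, the margins and the one-sided normal bound are hypothetical illustration choices, not a description of how Spotify's rollouts are implemented.

```python
from dataclasses import dataclass
from statistics import NormalDist


@dataclass
class GuardrailResult:
    name: str
    diff: float       # treatment mean minus control mean (higher is better)
    std_error: float  # standard error of that difference
    margin: float     # largest tolerable decrease, as a positive number


def is_non_inferior(result, alpha=0.05):
    """True if the one-sided lower confidence bound stays above -margin."""
    z = NormalDist().inv_cdf(1 - alpha)
    lower_bound = result.diff - z * result.std_error
    return lower_bound > -result.margin


def rollout_decision(results, alpha=0.05):
    """Ship only if every guardrail metric passes the non-inferiority check."""
    failed = [r.name for r in results if not is_non_inferior(r, alpha)]
    return ("ship" if not failed else "hold", failed)
```

The important design choice is the one made in the conversation: the margins and the "ship if not worse" rule are fixed before the rollout starts, so a neutral result is a planned outcome and not a reinterpretation after seeing the data.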
D
I like that. Okay, so, Tim, I'm sorry, I have to sneak in. So what you're talking about here is a very nuanced. It feels like a nuanced analytical discussion. Like, should this be a rollout or how should this be exactly validated? How do you think about the education? Because you're not talking about an audience of 400 people who are deeply steeped in the analytics or like the rationale for why you'd make some of those choices. How do you think about the education piece to these different product teams?
A
Yeah, I mean, it's super important. So, like, I've spent, I wouldn't say the majority, but a very big portion of my time at Spotify building educational material and mechanisms for this. I think we have two strategies for this. The first one is to keep the tool as simple as we possibly can, so have as few options as possible. So we're talking about a lot of nuanced stuff here, but we have also removed a lot of stuff from our platform and simplified a lot of stuff and removed a lot of options. So we made it quite opinionated, to minimize the things that people actually have to understand and know. So that's one side. On the other side, we have very explicitly and deliberately built educational material and tooling for experimentation for many years. So with Confidence, we have this whole boot camp of self-serve courses. We've also given a bunch of courses. We have something called Quick Starts, which is a very basic tutorial: this is how you run an experiment, this is how you run a rollout, and those kinds of things. So I know it's a super important thing, but I think it has to come from two sides here. You have to try to make the thing that people should learn as simple as possible, because people don't have time. People have a lot of other things that they need to be good at and learn and understand. And then you have to create the material so that they can learn those things that they have to learn. So that's our solution to that. And I mean, we thought a lot about that. Everyone that joins Spotify is onboarded to experimentation immediately, and they go through what's called golden paths at Spotify, which is like onboarding to certain things. And so if you're a mobile developer, then you learn how to work with our feature flags in mobile and you run an A/A test as part of your mobile engineer onboarding, for example.
So there's like we have infiltrated the whole organization with experimentation, onboarding and materials and that has helped.
C
Wow. Well, and Val, I'm going to have to put on some duct tape and we're going to have to move to wrap. You're like, but I just have seven more. Val in the role of Moe Kiss on this episode.
D
Yeah, right.
A
No, I have zero stress at least, so. So don't worry about me.
C
Well, I mean, this was a great discussion. I love sort of the thinking about what are we doing, why are we doing it, and how can tooling and education and culture and framing all sort of come together. So thanks for coming on for this discussion. But before we leave, the last thing we do on the show is go around the horn and share a last call, something that might be of interest to our listeners. And Martin, you're our guest. Do you have a last call you'd like to share?
A
Yes. So one thing that I'm completely into, I have been actually for many years, but now renewed, is the YouTube channel 3Blue1Brown.
C
Yeah, Julia hits on that.
A
Yeah, yeah, it's yeah, if I'm not the first one, that just makes me happy because this needs to be like, it's the best. Like, it's. It's the. The thing that I'm particularly thinking about now is the videos on transformers and LLMs. So where he's. So this YouTube channel is essentially a channel that visualizes a bunch of math. So it's. That sounds maybe not fun, but it is so insanely good. I think they have a long series on linear algebra that I think if I would have actually seen it when I was taking linear algebra would have helped me a lot. But they also have a bunch of super, super nice things on LLMs and transformers, which I think is like, if you are like most people hearing that word many times and you have like, yeah, it's some kind of neural net, maybe I have used the neural net once or twice, but you have no idea really how it works. Those videos are. Are so very, very good. So I recommend them highly.
C
We have reached out to have. We had an exchange trying to get him to come on the show. I think it might have been around to talk about neural networks. He was in the process of moving, so he's on our list to try to get him on.
D
Good reminder.
C
That's a good one. Good reminder to go back, because I've sampled some of those and I'm like, this is so clear. And how does a human being have the time to produce something like this?
A
Yeah, I mean, Grant Sanderson, who has that channel, I mean, he seems to be like one of the true geniuses alive. Just side note here is like he's doing these super nice animations of math and he just built that library himself. The library he built. It's just.
C
Come on, we're going to do. We're going to use this call out when this comes out to reach out to him again and say, hey, come chat with us.
A
I would listen, 100%. Awesome.
C
Val, what about you? Do you have a last call?
D
I do and it's actually related to today's episode. This is a Medium article published on UX Collective article written by James Skinner. It's called escaping the AI. Why MVPs should be delightful. And there's a lot in here. One of the cases he makes is that like using AI is just like regurgitating. Like we're not going to get to that delight level if we're just, you know, using AI to help, you know, develop those different. Net new kind of versions that are being tested within a product context. But he talks about the MLP instead of MVP. I'm kind of obsessed with MVPs, Martin, I should tell you, just understanding different people's perspective. But the MLP is the minimum lovable product. And he also referenced one, the minimum viable whatever, because there's so many acronyms related to this, with people trying to figure out exactly what that level of fidelity should be, what type of investment you should make before you experiment. But he does talk about experimentation at the end, which I do love, but there's a lot of really good examples and I love reading from that design product perspective. So. But it's a, it's a good read, about 10 minute read, so it's a good one. And Tim, how about you? Do you have a last call for today?
C
I've got a smidge of housekeeping and a last call. So we are now into month number two of 2026, which means we're heading into conference season. Actually, I am sitting in Budapest, Hungary, as you are listening to this, if you're listening to it when it came out. But a couple of Analytics Power Hour conference appearances are coming up. In Nashville, if you're in the States, there's the DataTune conference that Val and I will both be attending on March 6 and 7. And some critical mass of the Analytics Power Hour crew will be recording a show with a live audience at the Marketing Analytics Summit in Santa Barbara, California, on April 28th and 29th. So those are PSAs more than last calls. My last call would be from friend of the show and past guest Katie Bauer. On her Wrong But Useful Substack, she wrote a post called The Next Data Bottleneck, which I thought was a unique and really thought-provoking take on the whole drive towards conversational analytics. And not the will it or won't it, or the technical challenges of it, but when looking at what people are asking for and why, they actually seem to be kind of mundane requests, just simple data fetching requests, not these super nuanced things. So she has a lot of musings that can be a little unsettling for the analyst. But then she actually wraps by making the case that really it goes back to good analysts thinking about the business deeply. So it's a worthwhile read. So that was a threefer. But I labeled two of them as being housekeeping last calls.
D
Can I ask one more question then?
A
Yeah. That's how you get airtime in this show.
C
I'm drunk on power as Michael is drunk on Theraflu, Tamiflu... I don't know what flu medications are. Yeah, by the time this comes out he will be back to good health and he will vow to never get sick again and cede the mic to me. So this was great. Thanks again, Martin, for coming on. This was a really fun discussion.
A
My pleasure. It was really nice. Thank you so much for having me.
C
Awesome. Everybody, get your Spotify subscription up to speed. What's driving Spotify's next round of growth is the Confidence podcast appearance. Quarterly call coming.
A
So please There you go.
C
Perfect. If you are listening and you've enjoyed this show or other shows, we would always love a rating and review. You know, I'll call a little audible and read out this one from Apple Podcasts that jesst 5272018 left. It was titled Smart and Funny, and it was: Love the insights and laughs I get from this podcast. You all have a high bar for analysts and the value they can add, which I so appreciate. And you share all of that perspective via hilarious and authentic banter. Keep it up. Wait, let me check. That is our podcast. Yep, that is this one. So that was kind of nice. So we'll always love to get ratings and reviews. Theoretically that is how we expand the reach of the show. That and recording video and putting them on YouTube. So we'll just double down on the ratings and reviews. If you're a fan of the show and would like to have a sticker for your laptop or water bottle or whatever, you can go to analyticshour.io and request a sticker. We'll ship one over. If you have something to say, a thought for a topic, criticism, your own little witticism that you'd like to share, you can reach out to any of us or the show as a whole on LinkedIn. You can catch us on the Measure Slack, or you can just send an email to contact@analyticshour.io. So with that, for Val and for Michael in absentia from his sickbed, I'm Tim Wilson, and no matter what your reason, whether you're identifying or you're optimizing or you're being just aggressively neutral in your findings, you should always keep analyzing.
B
Thanks for listening. Let's keep the conversation going with your comments, suggestions and questions on Twitter @AnalyticsHour, on the web at analyticshour.io, our LinkedIn group and the Measure Chat Slack group. Music for the podcast by Josh Crowhurst.
C
Those smart guys wanted to fit in, so they made up a term called analytics. Analytics don't work.
B
Do the analytics say, go for it, no matter who's going for it. So if you and I were on the field, the analytics say go for it. It's the stupidest, laziest, lamest thing I've ever heard. For reasoning in competition.
C
Yeah, I. We've sent. Australia is the one. That's the real Australia. Singapore will take weeks. Singapore. Singapore 1 made it all the way to Singapore.
D
Came.
C
Came back to Ohio. Never came to me. Turned around and went back to Singapore. So it was like the box was, like, smashed.
D
The gift wasn't ruined, but the box was in shambles.
C
There is now more packing material. I did change after seeing that.
D
It's a process update.
A
I guess I should save all my comments about it for the actual recording. Yeah, yeah.
D
We'll get into it for sure. I'm very excited.
A
It wasn't that terrible. The distortion wasn't that terrible.
D
So every time you do that while we actually record. Because you'll definitely be doing that multiple times. Just kidding.
C
Yeah, yeah.
D
Blasphemy.
C
Okay.
A
Part of your signum yelling at your.
D
Your guests.
C
All right, let's try it again.
D
Rock Flag and focus on those learnings.
Release Date: February 3, 2026
Hosts: Tim Wilson, Val Kroll
Guest: Martin Schulzberg (Product Manager & Staff Data Scientist, Spotify)
This episode centers on how organizations, particularly Spotify, move beyond traditional "win rate" metrics in experimentation to focus on a richer, more informative "learning rate." The discussion features Martin Schulzberg, co-author of Spotify’s "Beyond Win Rate: Experiments with Learning Framework," delving into how to measure real learning in digital analytics, how culture and tooling shape this, and the nuances of experimentation at scale.
"I felt it was sort of under celebrated ... all of the other types of wins that you can make besides finding something better." – Martin (04:22)
"Neutrality ... is informative because you can say, hey, maybe this is not worth pursuing because we actually ran a proper experiment..." – Martin (06:32)
"If we're finding a ton of neutral results … maybe we're hitting diminishing returns and we should try something different." – Martin (12:25)
"...at least one of the success metrics should have improved and none of the guardrail metrics should have been harmed." – Martin (23:01)
"...we are sort of cultivating what the teams ... are thinking about this." – Martin (27:00)
"If you go on a fishing expedition, your false positive rate can go way up because you detect noise as a signal." – Tim (29:10)
"One true possibility here is to replicate ... if I can repeat it, then I will ship it." – Martin (34:09)
"If you don't like the neutral result, it means that the question you posed wasn't interesting enough." – Martin (39:44)
"...as long as you've decided before you run the experiment that you're going to ship if it's neutral, I'm all good with it." – Martin (49:50)
"...infiltrated the whole organization with experimentation onboarding and materials and that has helped." – Martin (54:40)
On moving past win rate:
"...all of the other types of wins that you can make besides finding something that was better than the current version ... doesn't really reflect how most companies are actually using experimentation." – Martin (03:45)
On being culturally open to questioning metrics:
"There's always people questioning everything at Spotify, which is one of the things that I love..." – Martin (08:01)
On replication and exploration:
"If you have a streamlined...way to run experiments ... one true possibility here is to replicate." – Martin (34:16)
On the importance of big swings:
"A big swing with a neutral result feels like it has a lot more merit than a little small tap with a neutral result." – Tim (39:37)
On managing neutrality and shipping:
"We are planning to ship this as long as it's not bad...by using the rollout, you're declaring your intent..." – Martin (51:15)
This episode offers a nuanced, inside look at how Spotify and leading analytics practitioners are raising the bar for what it means to “learn” from experimentation. Rather than being tied to simple win/loss metrics, teams are encouraged to think strategically about all outcomes, the value of neutrality, and the absolute requirement for rigor, intent, and thoughtful culture at every stage.
For those involved in product analytics, experimentation, or building data-driven cultures, this episode is full of practical insights, and a reminder that learning is always at the heart of progress.
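Two of the ideas quoted in this summary lend themselves to a short sketch: the shipping criterion Martin describes at 23:01 (at least one success metric improved, no guardrail metric harmed), and the arithmetic behind Tim's fishing-expedition warning at 29:10 and Martin's replication suggestion at 34:09. What follows is a minimal illustration, not Spotify's implementation. The `MetricResult` type, its field values, and the function names are hypothetical, each metric is assumed to have already been classified by a statistical test, and the probability helpers assume independent tests with no true effect.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricResult:
    name: str     # hypothetical metric name
    role: str     # "success" or "guardrail" (labels assumed for this sketch)
    outcome: str  # "improved", "neutral", or "harmed", decided by a prior test


def meets_ship_criterion(results: List[MetricResult]) -> bool:
    """At least one success metric improved and no guardrail metric harmed."""
    any_success_improved = any(
        r.role == "success" and r.outcome == "improved" for r in results
    )
    no_guardrail_harmed = all(
        r.outcome != "harmed" for r in results if r.role == "guardrail"
    )
    return any_success_improved and no_guardrail_harmed


def chance_of_any_false_positive(alpha: float, comparisons: int) -> float:
    """Probability that at least one of several independent null tests is significant."""
    return 1.0 - (1.0 - alpha) ** comparisons


def chance_false_positive_replicates(alpha: float, runs: int = 2) -> float:
    """Probability that a null effect is significant in every one of `runs` independent runs."""
    return alpha ** runs
```

Under these assumptions, looking at 20 independent comparisons at a 5% significance level gives roughly a 64% chance of at least one spurious "win", while requiring the same result in two independent runs cuts the chance of a false positive to about 0.25%.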