
How is an outlier in the data like obscenity? A case could be made that they're both the sort of thing where we know it when we see it, but that can be awfully tricky to perfectly define and detect. Visualize many data sets, and some of the data points are obvious outliers,...
Tim Wilson
Welcome to the Analytics Power Hour. Analytics topics covered conversationally and sometimes with explicit language. Hi everyone. Welcome to the Analytics Power Hour. This is episode number 269. I'm Tim Wilson from Facts and Feelings and, confession time, I am mad about MAD. And by MAD, the latter one in that statement, I of course mean the acronym, which stands for median absolute deviation. I came across that technique, I think, eight or nine years ago. I actually needed to go look in GitHub to figure out my timestamp on that. It was when I needed to be able to detect likely outliers in hundreds of very small data sets, like just 10 or so data points each. It's a long story. It's not important. But median absolute deviation is one of many, many techniques for detecting outliers, and outliers are the subject of today's episode. With that in mind, we're operating very much within one standard deviation of our mean host count, in that I'm joined by two co-hosts for this episode. Julie Hoyer from Further, do you have a secret favorite outlier detection technique?
Julie Hoyer
No, I don't. Maybe I will after today.
Tim Wilson
Right down the middle. Okay, I. Yeah, it's unhealthy. My, My MAD affection, but we'll, we'll get into that.
Val Kroll
But very reasonable. It's an excellent choice.
Brett Kennedy
Yes, validation.
Tim Wilson
Okay, now I'm validated. And we're done.
Val Kroll
That's the whole show's Valkyrie.
Tim Wilson
And Val Kroll from Facts and Feelings. Do you try to just operate pretty close to the mean as a matter of course?
Brett Kennedy
Oh yeah, you know me.
Tim Wilson
Nope. There was definitely not a conversation about how cold water needs to be in order for it to be drinkable right before we started this recording. So yeah, we're likely going to be chatting about some different outlier detection techniques. Shout out to our listener survey last year that asked that we dip our toes into specific techniques and methods here and there. But we're also going to get a little more philosophical, perhaps, about when and where and how outlier and anomaly detection are useful. For that discussion, we needed a guest. The perfect guest would be someone who devoted so much time and thought to the subject that they'd written an entire book about it. But that'd be crazy. I mean, such a person would surely be an outlier themselves. Oh, wait, wait. What? What's that? Oh, our producer is telling me we found such an outlier. That was the voice you've already heard. Brett Kennedy is a freelance data scientist and he's the author of a book, Outlier Detection in Python, which is 17 chapters of outlier goodness. And I'm not just saying that because median absolute deviation, MAD, gets introduced and described in chapter two. Today, Brett is our guest, so welcome to the show.
Val Kroll
Well, thank you very much.
Tim Wilson
All right, so maybe a good place to start is with how you came to write an entire book about outliers. I think I had a good sense that there are multiple different techniques depending on the nature of the underlying data and kind of what the question at hand is. But what sparked you to think, hey, there's a whole book's worth of material here and I should be the one to write it?
Val Kroll
Yeah, well, yeah, there definitely is a whole book's worth of material there. There's. I mean, there are a lot of techniques that are used, can be used to identify outliers, and they all have different nuances to them. There's a lot of kind of gotchas with outlier detection. How it started in my end is I was working for a company and we were building software that was performing outlier detection. The software we were building was to be used by financial auditors. So the idea is someone who's performing a financial audit on their clients. They would go through their client's data, which include their bookkeeping records, as well as we had code in there where they can check text documents, contracts and invoices, meeting minutes and the like. And we introduced outlier detection for those as well. But we started with the bookkeeping data. And so what often happens when auditors perform a financial audit is they'll get a set of transactions in the bookkeeping records or sales purchases, payroll records and the like. And there will often be millions or hundreds of millions or billions of these records. So it's not really feasible to manually check them. So what can be done in that case is just to randomly check, spot check them. Or you can use outlier detection to try and find the ones that are the most unusual in there, with the idea that those are at least among the most relevant records to check. Auditors would also want to check the ones that have large financial values and those other kind of criteria as well. But the ones that are the most unusual can be the most likely to suggest either errors or fraud or working outside of the normal controls and situations like that. So when we were writing the software, it's kind of a different experience when you're writing software that's going to be given to other users where they can be encountering just any sort of data, who knows what. 
And we won't be there to be able to update the software as it's needed or work them through the issues. So we had to make the software very robust going out the door, so that it can handle basically any situation and can provide meaningful explanations of what's being found. So we were doing that. And at the same time we were also working with regulators around the world. The regulators are organizations, most countries have them, that regulate how financial auditors work, how they perform financial audits. And they were interested in looking at this kind of new wave of technology that companies like ours were creating, and things like outlier detection. And they were raising a lot of important questions, a couple in particular. One was: when your software flags an outlier for an auditor, how's the auditor going to know why it's unusual? It'll just say, look, this sales transaction is statistically unusual. Okay, why? And also a point they were making was: if the software flags a bunch of transactions as unusual, with a little bit of effort you can go through those and say, okay, these are or are not statistically unusual, these are or are not interesting. They do suggest fraud or errors or circumventing controls or something. But how can we check the converse? How can we check all the 99 point something percent of the transactions that weren't flagged? How do we know there weren't transactions in those that were even more unusual or more important to flag? And those were a couple of really good questions. And I think that really got myself and the team, you know, thinking deeply about interpretability and testability in outlier detection. And we got into some areas where I think, you know, there really weren't other people doing much.
I mean, you know, I went through a lot of academic papers, I think like 200 or something, just to try and figure out what the state of the art was. Now, I should say, the outlier detection literature is kind of a lot easier to read than a lot of other areas of computer science.
Brett Kennedy
Well, that's interesting.
Val Kroll
Yeah, that's, that's, that was my experience anyways. Now maybe that was an artifact of, you know, once you've read the first hundred or something.
Brett Kennedy
Hundred are a breeze.
Val Kroll
Yeah, they don't seem as bad. I mean, they're still, they're a little shorter and a little more manageable than, say, deep learning or some other fields, which aren't, you know, often manageable. But yeah, so I went through a lot of the literature and kind of came up with a good sense of where the state of the art was, but also we were doing some kind of original research in terms of making outlier detection interpretable, explainable, and doing work around testing these things, which is actually a difficult problem. I mean, it's solvable, but it is a difficult problem.
Josh Crowhurst
Picture this. You're stuck in the data. Slowly you're wrestling with broken data pipelines, manual fixes. Suddenly, streaking across the sky, faster than a streaming table, more powerful than a SQL database, able to move massive data volumes in a single bound. It's not a bird, it's not a plane. It's Fivetran. I need a hero for data integration. Fivetran, with over 700 pre-built, fully managed connectors, seamlessly syncs your data from every source to any major destination. No heroics required on your part. That means no more data pipeline downtime, no more frantic calls to your engineers, no more waiting weeks to access critical insights. And it's secure, it's reliable, it's incredibly easy to deploy. Fivetran is the tool you need for the sensitive and mission-critical data your business depends on. So get ready to fly past those data bottlenecks and go learn more at fivetran.com/APH. Unleash your data superpowers. Again, fivetran.com/APH. Check it out.
Tim Wilson
So it's, I mean, to me kind of the nature of, and I had a question when I was kind of starting to scan through parts of the book as to what the difference is between an outlier and an anomaly. And the book you said, I think that you, you kind of use them interchangeably. Which I was like, cool, this isn't, this is an easier, easier thing to read than I would have thought if there was some. So maybe that backs up your point. But it, it seems like the concept of an outlier is one of those where, you know, if you, if you know data deeply and you look at it, you'd say, oh, well, that's clearly an outlier. Like it's one of those, it's, it's this kind of nebulous idea. And the question is, well, and the brain may be with, with certain types of data, be able to kind of look at it or sense it or say, well that's, that's that temperature reading of, you know, the peak temperature was 150 degrees centigrade or Fahrenheit, either one, you know, clearly is an outlier. You know, that's just logic. But is that like a fair way as you work through different techniques? It's kind of like what would have a Human logical interpretation of, well, that's out of the norm in something we'd call an outlier. But it's not like it's a definitionally precise line. I'm like, it kind of can't be, right.
Val Kroll
No. And it's not. No. That's one of the challenges of outlier detection: there's really no definitive definition of an outlier. And one of the consequences of that is there are a lot of different ways to try and find outliers. Like you mentioned, median absolute deviation, that's one. I go through a couple dozen others. And there's a reason there are so many outlier detection techniques. It's just that there is no single definition. If there were, we would just have one outlier detection algorithm and that would be it. It would just say for sure these things are statistically unusual and these things are not. And what makes it difficult as well is there's kind of a difference between what's statistically unusual and what's useful, what's interesting. So, like with the example I just gave, going through financial data, you can find things that are statistically unusual, but there's nothing noteworthy about them. They just don't occur that often. I think in the book I gave the example of annual payments. Companies will have these payments to make once a year. So they're unusual because there's only one per year, but there's nothing otherwise interesting about them.
Brett Kennedy
Can I actually just call out something about that? Because I actually love that. That was one of the things that I pulled out that it's about when you're looking at those transactions, that some of those annual payments, because they are once a year, that they're not problematic, but they're understood by the analyst. And so I thought that that was really interesting because when I think of like outliers and all this detection, I think of a lot of things that are happening not always with human touch or human touch considered first, like the analyst role in all of that. And so it just really struck me as like a. An interesting concept that it's. It's that there's processes and, you know, different methodologies, but that it really does take the interpretation to really kind of even say whether or not it is an outlier, because there's no definition. And I was like, that is. I had never thought about that before. I thought that was super interesting.
Val Kroll
There's still a very important role for an analyst in all this. And financial data is a little more straightforward than some. If you get into a lot of scientific data, and we get into other domains too, like image data or video, audio and so on, it can really, really require an analyst to take a look at that to say, okay, this is something we're interested in pursuing further, or this is not. Now, depending on the domain, somewhere that might be even a little bit more straightforward than financial data would be where you have sensors to monitor industrial processes. I mean, that's another application of outlier detection. And if you have a certain assembly line or something like this, it could be well understood what the normal behavior of the system is. And really anything outside of that is probably worth looking at. But even there, you can have certain things that are statistically unusual that are just known to not be of interest, but they can reoccur. So you might need an analyst to look at that and say, okay, if this specific combination of sensor readings over this time window occurs, this is known to be of interest or this is known to not be of interest. So the analyst may only have to look at it once. And going forward, the system may be able to say, okay, this is, you know, of high importance or, you know, mid importance, so we'll send out alerts, but we won't shut down the system. Or it's known to be a non-issue. Then it can do that.
Brett Kennedy
When you were rereading the credit card fraud detection, it reminded me one time when my fraud alert went off on my credit card because it was a purchase made in the suburbs.
Val Kroll
Okay.
Brett Kennedy
The credit card company was like, that's not right. She never goes out to the suburbs. And I remember being like, no, that was me. It was an accident, though. So be sure to flag this again in the future. Something has gone wrong.
Val Kroll
Yeah, yeah. I think a lot of us can remember, too, the fraud detection in credit cards is now, it's not perfect, it's a very hard problem, but it works so much better than it did, say, 20 or 30 years ago. It used to be like, you know, something like that, even for people that did go to the suburbs routinely, it would still shut them down, shut down the card, even if it had seen it before.
Tim Wilson
Yeah.
Val Kroll
So it's a little bit smarter. They're a lot smarter now. But yeah, it's a very, very difficult problem because, you know, there's a lot of kind of subtle ways people can, you know, if they have a stolen card or something can go about committing credit card fraud.
Julie Hoyer
Yeah. How so? Going back a Little bit to the methodologies because you said, you know, you were able to write about a lot of them. You know, it's not just like a handful, like, oh, three or five, like there's over 10. It sounds like at least of methods. So when you, in past scenarios or in general, when you're trying to decide which method to use, what are you using to help you make that choice? Are you looking at attributes of the data? Does it have more to do with like the business context of the data? Or is it even. I'm thinking, like, is it the, the use of the actual outlier once it's detected? So are. Is an outlier bad and you want to remove it? Is it good and you want to understand why it happened? Is it, you know, or it's like, is there a big issue with like a false positive of an outlier? Like, how sensitive is the scenario to that? Like, I'm just wondering, like, what goes into helping you consider and choose your method.
Val Kroll
Right, right. Yeah, well, just to take the very end, just quickly: yeah, there's definitely a concept of false positives and false negatives, you know, type 1 and type 2 errors, with outlier detection. You can over-flag things or under-flag them. And interpretability is very, very important. It's usually much more important than in other areas of machine learning. Like if you're creating a predictive model or, say, a generative model, where you want it to, you know, produce a picture or produce a sound clip. I mean, if you're debugging it because you're the person developing it, you would need to know why it did what it did. But if you're a user, you don't really need to know, you know, why it generated what it did. Or if it's making a prediction about something, in some contexts you do and some you don't, but you often don't need to know exactly why. But it's very, very common in outlier detection to need to know why it flagged something, why it gave something a higher score and something else a lower score. And that can really come into your choice of which detectors you use, partially because, unfortunately, very few of them are interpretable. They tend to be black boxes. And, you know, you can imagine, if you're using outlier detection for investigating fraud or security, or you're looking for criminal activity or something like this, or even if it's just that an outlier detector has, you know, flagged something and says, oh, your industrial system is behaving unusually, you should shut it down. Well, you need to know why so you can investigate that quickly. And if there's a security threat or a safety threat or something like that, you want to be able to investigate it quickly and efficiently. And unfortunately most outlier detectors don't let you do that. So that can certainly affect your choice.
Some are a lot faster than others, or slower. And some do detect different types of outliers. Median absolute deviation, for example, is an excellent detector, but it's intended to detect extreme values in numeric data, which means it won't detect outliers in categorical data or date or text data. It won't find internal outliers, which means values that are kind of in the middle but unusual. They're not extremely large or extremely small.
Tim Wilson
Well, so I mean I think part of, I think back to early in my analyst career once I was trying to get the team to not look at every. And this was with digital marketing data. So it was time series data and I didn't want like it was never going to move in a perfect line. And I was struggling with getting told every single week or every month, yes, the number was going to go up or down and trying to figure out had it gone up or down enough.
Val Kroll
An unusual amount.
Tim Wilson
Yeah, an unusual amount. And I, I do remember at the time like it's like I was trying to like rediscover or discover statistics from the beginning. Like it was a well intentioned but horribly executed exercise where I took, I would take the data and I would calculate the mean and then I'd do like little plus or minus one. I think I did like a best fit line, like a regression. And then I put plus or minus on either side of that line. A standard deviation. Horrible because the data was always trending. But I was. So it was time series non stationary data and I was calculating the mean which was kind of dumb because if.
Val Kroll
It's moving, it's not a constant mean. Yeah, yeah.
Tim Wilson
So I mean that was one where. Yeah, yeah, I mean it was definitely trying to be.
Brett Kennedy
It served the, I mean it's funny.
Tim Wilson
It served the purpose in that it gave me the ability to put like a band around it. It was, it was horrible. It just gave something. And people sort of believe that oh, it needs to go outside this range. But I think that's when I discovered like I was trying to use Z score with like 10 data points and it's like, well, well, it's going to be like that wasn't. That's where I wound up finding mad because that was one that it, it's generally a little bit better when you say I have a small. Yeah, so I guess.
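The small-sample point can be made concrete. What follows is a minimal sketch, not anything taken from the book: the ten readings are made up, the 3 and 3.5 cutoffs are conventional but arbitrary choices, and the 1.4826 factor is the usual constant that puts MAD on the same scale as a standard deviation for roughly normal data. With only ten points and the sample standard deviation, no z-score can exceed 9/sqrt(10), about 2.85, so a cutoff of 3 can never fire, while the median-based score still isolates the wild value.

```python
import numpy as np

def z_scores(values):
    """Classic z-score: distance from the mean in sample standard deviations."""
    x = np.asarray(values, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def mad_scores(values):
    """Robust score: distance from the median in scaled MAD units.

    Note: if more than half the values are identical the MAD is 0 and
    this division breaks down; real code would need to handle that.
    """
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    return (x - median) / (1.4826 * mad)

# Ten made-up readings; the last one is clearly off.
readings = [102, 98, 101, 99, 100, 103, 97, 100, 101, 500]

z = z_scores(readings)
m = mad_scores(readings)

z_flags = np.abs(z) > 3      # conventional z-score cutoff (assumption)
mad_flags = np.abs(m) > 3.5  # commonly used MAD cutoff (assumption)

print("largest |z|:", np.abs(z).max())
print("flagged by z-score:", np.flatnonzero(z_flags))
print("flagged by MAD:", np.flatnonzero(mad_flags))
```

The outlier inflates both the mean and the standard deviation it is being measured against, which is why the z-score misses it; the median and MAD barely move.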
Val Kroll
Yeah, yeah. So yeah, no, that, I mean that's, I think a good piece of Julie's question as well is some detectors are just better suited when you have large amounts, very, very large amounts of data. They're just very efficient and some are not and some handle smaller data sets as well. And, and I think a lot of.
Tim Wilson
Like, talking about all the patterns, when you do talk about time series, you have data moving along and then there's like a step function or a rupture and it changes. You call out that that's different from a spike or an extreme dip. And it's different if the data is kind of trending. So it seems like you need to understand the nature of the data that you're looking for outliers in, and that narrows down the techniques that might be appropriate. Is that right?
Val Kroll
Yeah, yeah. Within time series data, I mean, I guess part of the answer to your question, Julie, is it depends on the type of data, like you were suggesting. So if you're dealing with image data or video or audio data, there are certain tools that are appropriate. Basically, if you're dealing with that type of data, it's got to be a deep learning based technique. That's likely the only thing that's going to work. And with text you might want to use deep learning or a base method or a simpler method. With tabular data there are certain techniques, and with time series data there are certain techniques, and it comes down a bit to some of these issues like performance and, you know, what's going to tend to balance your false positives and false negatives well. But yeah, it's also what type of outliers you're interested in, because maybe if it spikes a lot, that's not interesting. It just, you know, does that once in a while. And yes, it's statistically unusual, but it's not something you would investigate. But maybe in other scenarios, you know, if you get a spike, you don't want to look at that, but if you get a whole bunch of spikes within a short time period, you might want to look at that. Or sometimes the opposite could be interesting: if it's unusually flat over a period of time, or an unusually smooth straight line, which might suggest, it depends on the situation, just a sensor failure. Like, it's not particularly sensitive, so it's just giving a flat line. Like a flat. Yep.
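The "unusually flat" case is one that extreme-value checks cannot see, because a stuck sensor reports perfectly ordinary values. A minimal sketch of one way to look for it, assuming pandas is available: the function name `flat_segments`, the 12-reading window, the tolerance, and the simulated signal are all hypothetical choices for illustration. It flags any point where the trailing window's range (max minus min) is essentially zero.

```python
import numpy as np
import pandas as pd

def flat_segments(series, window=12, tolerance=1e-6):
    """Return a boolean Series that is True where the trailing `window`
    readings span a range smaller than `tolerance` (suspiciously flat)."""
    rolling = series.rolling(window)
    value_range = rolling.max() - rolling.min()
    # The first window-1 entries are NaN, and NaN < tolerance is False.
    return value_range < tolerance

# Simulated sensor: noisy readings around 20, then stuck for 20 readings.
rng = np.random.default_rng(0)
signal = pd.Series(20 + rng.normal(0, 0.5, size=100))
signal.iloc[60:80] = signal.iloc[59]  # simulated stuck sensor

flat = flat_segments(signal)
print("flat readings at positions:", list(flat[flat].index))
```

A flag only appears once a full window sits inside the flat stretch, so the window length sets how long a sensor must be stuck before anyone hears about it.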
Julie Hoyer
I was gonna say. Is that. Is that an example of what you were saying of an outlier hiding in the middle? Like, it wasn't an extreme high or extreme low. But contextually, if we're monitoring a sensor, it would be unusual that it was flat and in the middle for a while. Is that a good example of that one? Because I was like, whoa, an outlier in the middle?
Val Kroll
Like, yeah, they're everywhere. Yeah. Yes. That's one of the things about outlier detection: depending how you look at your data, and again, that's why there are so many outlier detectors, if you just look at your data one way, you'll miss other ways in which it's unusual. What I was getting at before with internal outliers, I'll give maybe a simpler example of that. Say you just have a list of sales records for a company, and, you know, generally most of their sales are around $10. Or, say, you sell some items that are $10, and you also sell items that are $100. So a sale for $60 is kind of in the middle. It's kind of weird. So it'd be an internal outlier. So, yeah, where you have kind of bimodal or multimodal distributions, you can get values in just a numeric series that are unusual, even though they're not extremely large or extremely small. And you get something that's kind of a similar issue with time series. Say you have a sensor that reads sound volume, and maybe the volume tends to be either very low, like zero, meaning machinery is off, or it's high, meaning machinery is on. If it gives you a reading kind of halfway in the middle, well, if that occurs rarely, then that would be unusual. But it's not an extreme value. It's the opposite. It's an internal outlier, it's called. Yeah. Okay.
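The $10 / $100 / $60 example can be sketched in a few lines. This is an illustration under assumptions, not the book's code: the sales figures are invented, `k=3` is an arbitrary choice, and scoring by distance to the k-th nearest neighbour is just one simple way to measure how sparse the region around a value is. A MAD-based score is included for contrast.

```python
import numpy as np

def mad_scores(values):
    """Distance from the median in scaled MAD units (extreme-value detector)."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    return (x - median) / (1.4826 * mad)

def knn_distance_scores(values, k=3):
    """Score each value by its distance to its k-th nearest neighbour.

    Large scores mean the value sits in a sparse region, regardless of
    whether it is at the edge or in the middle of the overall range.
    """
    x = np.asarray(values, dtype=float)
    # Pairwise absolute distances; after sorting each row, column 0 is
    # the distance from a point to itself (0), column k is the k-th neighbour.
    distances = np.sort(np.abs(x[:, None] - x[None, :]), axis=1)
    return distances[:, k]

# Made-up sales: a cluster near $10, a cluster near $100, and one $60 sale.
sales = [9, 10, 10, 11, 10, 9, 11, 100, 99, 101, 100, 102, 98, 60]

scores = knn_distance_scores(sales)
most_unusual = sales[int(np.argmax(scores))]

print("MAD flags anything:", bool((np.abs(mad_scores(sales)) > 3.5).any()))
print("most isolated value by k-NN distance:", most_unusual)
```

On this data the median lands between the two clusters, so the $60 sale is actually the value closest to it and gets the lowest MAD score of all, while its nearest neighbours are about $40 away.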
Brett Kennedy
So we've been talking about a lot of these outliers that are indicators of something going wrong. But I also want to talk because this is something that it makes sense, but again, seeing it framed the way you Put it in the book was helpful about that. It's not always necessarily looking for problems like I think about. Financial markets are looking for dislocations so that, you know, people can move money and take advantage of that. Right. Or like, there is an audience that's being underserved if it's a marketing example. And so now we can think about offerings for them. So I guess just curious about, like, you know, the balance of those concepts, ideas, and how they potentially work differently with the type of approach that you think about for identifying them and the processes that lead up to it and then happen because of it.
Val Kroll
Yeah, yeah, yeah. I mean, that's one of the key points about outlier detection: anything that's unusual is not necessarily a problem, depending on the context. I mean, again, if you look at an industrial system, then probably anything that's unusual, odds are it is a problem, because the normal functioning is also the optimal functioning. Machinery can't spontaneously behave better than normal. But in a financial market or, yeah, marketing, you can get that. And in science, I mean, a lot of what's done is just looking for the anomalies, because they're what's scientifically interesting. So if you have a collection of records about instances of certain animal species, or plant species or something like that, if you find instances that are anomalous and it's correct, it's not a data artifact, then that kind of can expand your sense of what's possible, or maybe trigger some research trying to figure out, well, how did that come about? It's used a lot in astronomy, for example, as well, because, you know, that's one of the fields of science where we're just accumulating monstrous quantities of data. And it's just well beyond possible for a person to go through all the images every day. And even if they could, there's just so much in some of the images coming back from astronomy. So if we're looking for a kind of new phenomenon, or a phenomenon that maybe we have seen before but only rarely, then just running outlier detection on the images coming back from the telescopes can be a very effective way to find something ideally new, but, yeah, at least rare.
Julie Hoyer
Do you have a favorite coming back a little bit to the time series thing? Because I think that's what we work with, at least.
Val Kroll
Okay. Yep.
Julie Hoyer
For. For time series data, which is mostly numerical, like you said. What are some of your favorite methods, maybe based on, like, interpretability? I just feel like when we have run into outlier detection that's maybe like built into certain analytics tools that we work with, there is no interpretability. So you keep talking about like, oh, based on interpretability. So I'm interested.
Val Kroll
Yeah.
Julie Hoyer
Which ones out there help us do that? Which ones have the best. What are your favorites?
Tim Wilson
Can I, can I.
Val Kroll
Sure.
Tim Wilson
Can I throw in? Just because Julie and I both have worked with Bayesian structural time series, which is a black box to me, the magic of exactly how all that works. But conceptually, having a forecasting engine of some sort that then says, based on the data up to this point, we would expect the future data points to be X or Y, and if they're outside of that range, that, I actually feel like, is interpretable. Not that somebody needs to understand that this is how Holt-Winters forecasting or something worked. But I'll just throw that in as background. Actually, I think Val's sort of seen us mess around with that too. But forecasting is one mechanism of outlier detection. What else is there?
Val Kroll
Oh, well, you know, just as, Julie, you were asking that, in my mind I was thinking of the forecasting-based ways. It's probably, generally, my favorite. Now, some of the other methods are simpler than forecasting-based ones and therefore might be a little bit more interpretable. But I think I agree with you, Tim, that if you make a forecast and you say we predicted on Thursday that we would have $2,000 in sales and we actually had $16,000 or so, that's well outside of what was predicted, and you can see that's clearly an outlier. It's not completely interpretable necessarily because, well, why did it predict 2,000? Now, it could be that if you look at your history and you see, well, here's the general trend and here are the seasonal patterns and the other information it took into consideration, it could be that, you know, a prediction of 2,000 is pretty straightforward. So you can see why. I think in general, time series data lends itself to interpretability a little better than a lot of other types of data, just because it lends itself to plotting better than other types of data. But with looking for outliers in time series data, the simplest method is just looking for extreme values. You're basically just ignoring the sequential nature of the data. You're just ignoring the time component entirely and just looking for very, very large or very, very small values. That's the simplest. And if you plot that, it's, you know, really easy to see why something is extreme. Like if you get a.
Tim Wilson
But if you don't plot it, and the data was, like, trending upward, then it would tell you that, oh, your outliers were at the start of the period and the end of the period because they were the.
Val Kroll
Yeah.
Tim Wilson
The lowest and the highest. Right. And that's.
Val Kroll
Yeah. If you can. Yeah, you do risk that. Yeah, yeah, yeah. And that, that method can be a little bit overly simplistic. That's why I am kind of leaning, I do kind of like yourself, lean towards the forecast method for outlier detection.
Tim Wilson
But, but is, is that one where like and I, we had a recent episode about linear regression. We did not talk about like arima, but. Okay, is a, is ARIMA like the, like the base level sort of forecasting that that's like. Or does ARIMA fit? I'm trying to bucket them in my head.
Val Kroll
Yeah, I think, I mean, when you really talk about outlier detection with time series data using a forecasting method, you can use any forecasting method. So ARIMA methods would be common, or exponential smoothing, but you can use recurrent neural nets or really anything that's able to project into the future. Even standard machine learning techniques like, you know, random forests and XGBoost and CatBoost and so on can be used for time series prediction. They have some limitations in that, you know, if we have the situation you just described, where you have a strong upward trend and the future values are going to be generally a little bit larger than anything that's occurred in the past, those types of models can struggle to make good predictions in that situation. But if you have a time series that's fairly stable over time, if there's not an upward or downward trend, they can work well. And in that situation, just looking for extreme values can be one way to find a certain type of outlier. Now, that will miss other outliers, like more contextual outliers, where you have a spike or something. Say you're looking at daily sales figures or something like that. You might have a value that's normal, but not for that time of year or that day of the week or something like that. Or it's just different than the values that have occurred recently. So it's still an outlier in that sense, but it's not an outlier in the sense of being unusual over the last year or something. Yeah.
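The contextual case described here, a value that is normal overall but not for that day of the week, is where a forecast earns its keep. A minimal sketch, with several assumptions: the forecast is a deliberately simple seasonal-naive one (the mean of the same weekday over the previous four weeks) standing in for ARIMA, exponential smoothing, or anything else; the sales numbers are synthetic; the function names and the 3.5 cutoff are hypothetical choices. Residuals (actual minus forecast) are scored with a MAD-based robust score.

```python
import numpy as np

def seasonal_naive_residuals(values, season=7, history=4):
    """Forecast each point as the mean of the same position in the previous
    `history` seasons; return actual minus forecast (NaN until enough history)."""
    x = np.asarray(values, dtype=float)
    residuals = np.full(len(x), np.nan)
    for t in range(season * history, len(x)):
        past = x[t - season * history:t:season]  # t-28, t-21, t-14, t-7
        residuals[t] = x[t] - past.mean()
    return residuals

def robust_scores(residuals):
    """Scale residuals by their median and MAD, ignoring the leading NaNs."""
    center = np.nanmedian(residuals)
    mad = np.nanmedian(np.abs(residuals - center))
    return (residuals - center) / (1.4826 * mad)

# Eight weeks of synthetic daily sales (Mon..Sun) with a small wiggle.
weekday_pattern = np.array([2000, 2100, 2050, 2200, 2500, 6000, 5500], dtype=float)
days = np.arange(56)
sales = np.tile(weekday_pattern, 8) + 40 * np.sin(1.3 * days)

# Inject a contextual outlier: a Thursday with a typical Sunday-sized total.
sales[52] = 5500.0

residuals = seasonal_naive_residuals(sales)
scores = robust_scores(residuals)
flagged = np.flatnonzero(np.abs(scores) > 3.5)

print("is the injected value the series maximum?", bool(sales[52] == sales.max()))
print("days flagged by forecast residuals:", flagged)
```

The injected value is below the ordinary Saturday totals, so a pure extreme-value check has no reason to single it out, yet it is far from what the forecast expected for a Thursday. The anomaly is placed in the final week on purpose: had it occurred earlier, it would have fed into later forecasts for the same weekday and distorted them.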
Joy Hoyer
And what's interesting too, the technique that Tim brought up, I was just thinking like the context behind that or what makes it interpretable is that you're choosing your pre and post period based on like what's changing in the environment. So you have that knowledge going in, assuming that. Right. Everything else is held constant enough over the period that you're isolating the mechanism that would cause an outlier. You're hoping, you know, in whichever positive direction and improvement. So it's interesting those, like the interpretability comes from that again, context of the situation and you choosing these parameters to then apply it.
Tim Wilson
But what if you were monitoring something? Why wouldn't you say, I'm going to keep rerunning it, and it's always going to use the data up through yesterday or through a week ago? I don't have a planned intervention. I guess I have to have an assumption that any outliers that occurred before last Saturday aren't going to muck the model up so much, which they can.
Val Kroll
Yeah.
Tim Wilson
So that's, that's the risk. That's the issue.
Val Kroll
Yeah, there is a risk. Yeah. I mean, that's actually one purpose for doing outlier detection on time series data: even if you're not interested in delving into those outliers, you can just remove them, because outliers in your time series will affect your sense of what's normal, which can preclude your ability to really find outliers.
Tim Wilson
If you remove too many outliers and then use that to build your forecast, then.
Val Kroll
Yeah, yeah, it's a balance you're going to keep running.
Tim Wilson
Oh, yeah. Wow.
Val Kroll
Yeah. But you can imagine, say you train your model on one day's worth of data in order to see if there are any outliers the next day. But the day you train it on is Christmas or Black Friday or something. That's an anomalous day. Or it's just a day when your computer system went down for a bit. So it's a really good point you make, because this is one of the tricks of outlier detection: what do you use to define normal? And sometimes what's normal can be different than what's desirable. In an industrial system, often they're fairly close to the same thing. But if you're looking at marketing data or something like this, yeah, there can be a distinction. And with outlier detection, it's always: compared to what? Say you have a website you're monitoring and you're making sales on the website, maybe getting a lot of sales that day, and you're asking, was that anomalous? Well, compared to what? You can look at the last day or the last week or the last month or the last year, and all these can be kind of useful to compare to, but also misleading to compare to. Right. So I think my tendency, for what it's worth, is to have multiple frames of reference when you're looking at outliers. So say, okay, the sales we've had the last hour, that's unusual compared to yesterday, but it's normal compared to, say it's Tuesday, last Tuesday. Or it's unusual compared to this day of the year, last year. And then you have a little bit more context. It gives you kind of a sense. If it's unusual in all those senses, then you say, okay, this is more unusual.
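The "multiple frames of reference" idea can be sketched as scoring the same value against several reference samples and reporting which of them find it unusual. This is only an illustration under stated assumptions: the frame names, the numbers, and the 3.5 cutoff are hypothetical, and the median/MAD score is one possible choice of scoring rule.

```python
from statistics import median

def robust_score(value, reference):
    """Modified z-score of value relative to a reference sample (median/MAD)."""
    med = median(reference)
    mad = median(abs(x - med) for x in reference)
    if mad == 0:
        return 0.0 if value == med else float("inf")
    return 0.6745 * (value - med) / mad

def frames_flagging(value, frames, threshold=3.5):
    """Names of the reference frames in which value looks unusual."""
    return [name for name, ref in frames.items()
            if abs(robust_score(value, ref)) > threshold]

# Hypothetical hourly sales counts, for illustration only.
frames = {
    "yesterday": [40, 42, 38, 41, 39, 43, 40],
    "previous_tuesdays": [88, 95, 91, 90, 93, 89, 92],
    "same_week_last_year": [60, 85, 70, 95, 65, 90, 75],
}
current_hour_sales = 92
flagged_frames = frames_flagging(current_hour_sales, frames)
print(f"{len(flagged_frames)} of {len(frames)} frames flag the value: {flagged_frames}")
```

Here the value is unusual compared to yesterday but ordinary compared to previous Tuesdays and to the same week last year. The number of frames that agree could serve as a rough measure of how unusual the value really is.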
Tim Wilson
It does. And this kind of goes back to where you were starting with trying to build sort of a generally repurposable outlier detection into kind of the financial world. In the digital marketing world, every platform has at least once, if not twice, come out with a hey, here's a way to just like, you know, intelligent alerts, like, which is basically like, we're going to do anomaly detection at scale on the behavior on your website.
Val Kroll
Sure.
Tim Wilson
And those have comically failed, like right as the product comes out, because they find things like, oh, sales have tanked in this part of California, and it's right when there are wildfires in California, where everything shut down.
Val Kroll
Yeah.
Tim Wilson
Or they generally find things that really are kind of legitimate noise, because it's hard to figure out that threshold. I mean, Adobe Analytics had this where you could set your threshold for what you wanted to monitor. The technique was basically something where you'd set a 95% or 99% threshold. And the idea was this was gonna just kind of magically tell the analyst where to go look for stuff. But there was always the calibration problem: either it's identifying outliers every day, in which case you're overwhelmed and can't track them down, or it's so rarely finding outliers that there's some suspicion that stuff is being missed. As you're describing it, it seems like it's so much more about, you have to kind of think through the, call it the process. It doesn't have to be manufacturing or sales. I mean, it can be, what are these things that you're looking at? What is the nature of that underlying data? What do you truly care about and what can you impact? And then what is the right technique and what's the right way to do it? I don't know. Maybe I'm mounting my rant. There were just times when each one of those rolled out, they'd say the analyst's job just got easier. Which I actually think with AI, there's some thinking of that too, that you point the AI at the data and it will give you insights. And it's like, well, it's gonna find outliers, but without the context or the...
Val Kroll
Intelligently. Yeah, I think what you're just saying is kind of bringing me back to your first question, which is why did I want to write the book on this? And partly because that is such a difficult problem. I mean, I don't think it's unsolvable, but it is a very difficult problem. And we were kind of in that situation at the time too. If you're making software for your own company and you're writing it and you're using it, it's a simpler process, because you can just constantly tweak it and tweak it until eventually it gets better over time. When you're making software and having to send it out, you kind of have to go through all that process ahead of time. You've got to do a lot of thinking about what can go wrong here. And it's exactly what you're saying: a lot of overreporting, and a lot of underreporting, or where it's not underreporting but you don't really know that, because it's hard to determine. Yeah, these are tricky problems. And what I was finding too, just going through the literature on outlier detection, is that almost all of it was just talking about how outlier detectors work. And there's a lot of discussion, I guess, as well, around how to combine multiple outlier detectors, how to set thresholds on the scores that they're producing, and things like that. So these sort of technical issues are very, very important, and I cover those in the book too. But what was always being left out of the discussion is just what you're talking about now. How do you actually maintain an outlier detection system that gives you the outliers you want without giving you a whole ton of other ones? And I think, unfortunately, realistically, when you first set up a system, it's probably going to...
It's just going to flag anything that's statistically unusual for whatever reason. You need a way to kind of tune it over time. And there are ways to do that. And some of these things aren't necessarily bad. I mean, when there were wildfires in California, probably certain things were down and they were legitimately outliers. It'd be kind of weird if it didn't flag that. But the point you're making is it just generates all kinds of noise, right? It just gives you things that are unusual in ways that are not pertinent to you. And there is a little bit of process involved. Well, a lot of it comes down to categorizing the outliers that it's producing. So you say, okay, if it finds this type of thing, I know I'm interested, and if it finds this sort of thing, I know I'm not interested. And if it finds anything else, then I haven't seen that before, and I want to investigate and determine that. And then over time, you can kind of tune the system to find what you want more than what you don't want. And part of that comes down to a question you asked before, which is just choosing the detectors you use. You may, as you get going, realize, well, I'm not actually that interested in extreme values. Some contexts you are, but maybe in some contexts you're not. Or maybe you specifically are interested in extreme values, but it's flagging a lot of things that are unusual combinations of values, which you're not interested in. So you might want to downplay those outliers: unless they find something super extreme, don't bother me. Just tell me about the... Or maybe you're not interested in very small values, only very large values. So you can do some filtering on the output of the outlier detection. But a lot of it also comes down to tuning it, tweaking it to put the emphasis on the sort of things that you are most interested in.
And a lot of what happens too with projects is that, over time, people change their mind, partly because if something was interesting, it legitimately was interesting, but just no longer is. At this point, it's understood. And so you say, okay, I'm glad it flagged it for a while, but not in the future. Or it starts flagging things that, geez, I never thought of that. It's the unknown unknowns that can really be interesting and important. So you don't want to dampen down what it's producing too much, because sometimes those are the most important ones, even though they can be outliers among the outliers that are flagged. So, levels.
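The categorizing and filtering described above could look something like the following sketch. Everything in it is a hypothetical illustration: the `Flag` record, the category names, and the score cutoff are assumptions about what a detector might emit, not a real system's output format.

```python
from dataclasses import dataclass

@dataclass
class Flag:
    record_id: int
    kind: str     # hypothetical detector label, e.g. "extreme_high"
    score: float  # hypothetical outlier score, higher means more unusual

# Tuning knobs that would be adjusted over time as flags are reviewed.
INTERESTING = {"extreme_high"}            # always surface these
IGNORED = {"extreme_low"}                 # known, understood, not pertinent
DOWNWEIGHTED = {"rare_combination": 8.0}  # surface only at or above this score

def triage(flags):
    """Sort flagged records into alert, suppress, and investigate buckets."""
    buckets = {"alert": [], "suppress": [], "investigate": []}
    for f in flags:
        if f.kind in INTERESTING:
            buckets["alert"].append(f.record_id)
        elif f.kind in IGNORED:
            buckets["suppress"].append(f.record_id)
        elif f.kind in DOWNWEIGHTED:
            key = "alert" if f.score >= DOWNWEIGHTED[f.kind] else "suppress"
            buckets[key].append(f.record_id)
        else:
            # Never seen before: keep it visible rather than dampening it.
            buckets["investigate"].append(f.record_id)
    return buckets

sample_flags = [
    Flag(1, "extreme_high", 4.0),
    Flag(2, "extreme_low", 5.0),
    Flag(3, "rare_combination", 3.0),
    Flag(4, "rare_combination", 9.5),
    Flag(5, "new_pattern", 6.0),
]
triaged = triage(sample_flags)
print(triaged)
```

Routing unrecognized kinds to "investigate" rather than "suppress" is a deliberate choice in this sketch, so that the unknown unknowns stay visible while the known noise is filtered out.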
Tim Wilson
How much of an outlier?
Val Kroll
Yeah, yeah, yeah. You can actually, and I'm not being facetious, you can actually run outlier detection on the results of your outlier detection. And that can be quite legit because say you have a case where you're just analyzing millions of documents or millions of table rows or something like that. If, say you have a billion transactions that come through because you're a credit card company or something, even if you're only flagging a hundredth of a percent of the transactions, that's still a lot, which means it's still beyond the, probably the ability of a person to go through them. So you really, I mean, you probably at first need someone to go through just to see what you're, what you're dealing with. But yeah, over time you want to kind of reduce that workload. So, you know, if it's just kind of flagging the same thing over and over and over again, because it truly is statistically unusual. It only occurs once every, you know, 10 billion rows. But you know, by the time you've gone through a trillion rows, it's like over and over and over.
Brett Kennedy
So Moe, one of our other co-hosts, isn't here, but I have a scenario. Honestly, preparing for the show unearthed so many memories for me. I used to be in market research and we used to do what was called a conjoint study or conjoint analysis. And basically it was understanding which features and attributes and combinations of those things were most attractive, at what price point. So there was a study that was commissioned by US Cellular, RIP. They didn't go out of business because of the study, I swear. But we did this study for them to understand what features they should consider, and in what combination, for the next smartphone, and what price point they should potentially sell them at. And we discovered through a couple of those analyses that there was this group of people who were not sensitive, that there was just no combination that would make this an attractive offering. Turns out, through some secondary analysis, that they were Apple loyal. And this is like 10 years ago, before people were as divided as they are today.
Val Kroll
Okay.
Brett Kennedy
But we decided that we were going to remove those, air quotes, outliers. So I don't know if this was (a) the right thing to do or (b) the right label for it. So I'm just interested in your reaction. We removed them, and in future times we ran the study, we would ask, do you currently have an iPhone and how much do you like it? And if they did and they liked it, we would just remove them. We didn't even... Thank you for participating. Here's your $10 Starbucks gift card.
Val Kroll
Sure.
Brett Kennedy
No longer interested in your. Because they didn't behave like everyone else who. Who was sensitive to the addition or subtraction of features at different price points. But I'm just curious about your take. Like, first of all, were those actually outliers? I honestly don't know how you would define them. And like, just the application of that in the context of that business purpose.
Val Kroll
Yeah, I mean, they might be outliers or not, just depending on. Again, there's no real great definition of that. But if they're rare in your data, then now are they rare in society or where they have to. And if you go back to the, you know, the 1970s or something. Yeah, that was probably pretty rare.
Tim Wilson
Val has never owned an Apple product in her life, so she considers Apple an outlier. She's like, yeah, freaks.
Val Kroll
And we're like, okay.
Tim Wilson
No, it's a pretty sizable part of the population.
Val Kroll
Well, that's the thing. There's kind of a statistical sense of normal and there's the Platonic sense of normal. So you can argue that no matter how statistically normal they become, they're still... you could argue that, I guess. Yeah. When you're doing that kind of research... well, another example that comes up in my line of work is with creating predictive models, which is the same sort of idea. You train your model on a certain type of data, and then once you put it into production, if it's encountering data that's unlike the data that it was trained on, it's not going to know how to behave, and it's just going to basically be behaving kind of half randomly. And so it could be that in a situation like you're describing, you do market research on a certain cohort of the population that has certain properties. And then if you try creating a marketing campaign for people who are just different than the people that you looked at, their ages are unusual, or the country they live in, or the language they speak, in any way that might be relevant, if it's different than the group that you did your market research on, it's going to be hard to extrapolate your findings. Extrapolating your findings is going to be a little bit suspect. Yeah. So it can be a useful exercise to do.
Brett Kennedy
Were we wrong when we called them outliers?
Val Kroll
Well.
Brett Kennedy
Again, context, Tim, this was like 15 years ago. I was a baby analyst. So.
Val Kroll
Okay, well, I'll go with a statistical definition and say that depends on how many of them there were. It sounds like there weren't too many. So I'm going to say, if that's true, then yes, they were outliers in that sense. But one of the things that's important too is that everybody is statistically unusual in some sense. If you look at us in enough different ways, we're all odd. Have you ever heard of what's called the myth of the average? Yeah. Okay, so did you hear about it through the U.S. Air Force studies? They were looking at fitting cockpits to their pilots. And so they looked at the average leg length, the average arm length, the average torso width. And I forget how many dimensions they measured everybody in. But what they found was their pilots were all generally average in most of these dimensions, but there was no single pilot in the entire U.S. Air Force that was normal in every single dimension. And that's just looking at our physical dimensions. If you start introducing hundreds of ways to look at people, like, do they like Apple versus Android or whatever...
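The arithmetic behind the "myth of the average" can be sketched with a small simulation. This is an illustration under loud assumptions: the dimensions are treated as independent standard normal variables (real body measurements are correlated), "average" is taken to mean falling in the middle 30% of a dimension, and the population size of 4,000 is arbitrary. None of these figures come from the episode.

```python
import random
from statistics import NormalDist

def expected_all_average(n_people, n_dims, share):
    """Expected count of people in the middle `share` of every dimension,
    assuming the dimensions are independent."""
    return n_people * share ** n_dims

def simulate_all_average(n_people, n_dims, share, seed=0):
    """Simulate independent standard normal dimensions and count people
    who fall in the middle `share` of all of them."""
    rng = random.Random(seed)
    # Half-width of the central interval that holds `share` of a standard normal.
    half_width = NormalDist().inv_cdf(0.5 + share / 2)
    count = 0
    for _ in range(n_people):
        if all(abs(rng.gauss(0, 1)) <= half_width for _ in range(n_dims)):
            count += 1
    return count

for dims in (1, 3, 10):
    expected = expected_all_average(4000, dims, 0.3)
    observed = simulate_all_average(4000, dims, 0.3)
    print(f"{dims:>2} dimensions: expected {expected:8.2f}, simulated {observed}")
```

Under these assumptions the share of people who are average on everything shrinks geometrically, as 0.3 raised to the number of dimensions, so with ten dimensions the expected count in a population of 4,000 is well below one person.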
Tim Wilson
Yeah.
Brett Kennedy
The group that shall not be named.
Val Kroll
Yeah. There could be people that are just die-hard fans of something. I'm sure there are still BlackBerry fans out there. Actually, I'll bet 100% there still are, because, one, there's nothing wrong with it, and two, very few people buy them. So yeah, the people you're looking at in market research can always be defined as unusual with respect to the product you're trying to sell, in some sense.
Tim Wilson
Yeah, I mean look at car manufacturers, right? The whole invisible women, the, you know, all women were outliers removed from the data set for the crash test.
Val Kroll
Oh yes, right.
Tim Wilson
Like that was.
Val Kroll
Yes.
Tim Wilson
On that front I think like. Well, that was probably, you know, treating them as outliers to be removed might not have been the way to go.
Val Kroll
Yes. Yeah.
Tim Wilson
Well, on that note, we didn't get into z-score or modified z-score or so many other techniques that maybe would be a little tough to get too deep into. So before we move into last calls, I will plug the Outlier Detection in Python book again. It is very readable, and has the code examples and the explanations and visualizations of the data sets. So if you're interested in more on this topic, I cannot recommend that book highly enough. But before we go, we like to do a last call and go around and have everyone share a link or a thought or a post or a book or a movie or an outlier, something that stuck out to them as being interesting. And Brett, you're our guest. Would you like to go first?
Val Kroll
Sure, yeah. I mean, one of my interests as well as outlier detection is interpretability. So I think I talked a bit about how interpretability is important for outlier detection. There's not nearly enough research in that field, but yeah, I'm interested in interpretability in general and explainability as well. And one of the techniques for that is SHAP, which relates to feature importances. Oh, so one of the...
Tim Wilson
Julie just perked up.
Val Kroll
What?
Joy Hoyer
Yep, we just talked about this at work. It was very timely that you said that.
Val Kroll
Oh, okay. Yeah, I've been interested in SHAP values for many years, and permutation importances as well. I do read a lot of articles on Medium, and one that caught my eye a little while ago is by a writer named, I think, Samuele Mazzanti. I believe it's called "Your Features Are Important? It Doesn't Mean They Are Good." I thought it was a really nice article that explained how your SHAP values tell you what features your model is using, but they're not telling you what features it should be using. So the features you're using are contributing to your model predictions, but some of them are also contributing to your model error. There's actually a set of articles on Medium talking about that sort of thing, and talking about how you can use SHAP for feature selection and combine it with permutation importance, just to get a better sense of what your model is doing and why.
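The distinction between contributing to predictions and contributing to error can be illustrated with a toy linear model. This sketch does not use the SHAP library and is not taken from the article. It relies on the fact that, for a linear model with independent features, the SHAP value of a feature is its weight times the feature's deviation from its mean. The data, the weights, and the function names are hypothetical, and the model is deliberately made to lean on a pure-noise feature.

```python
import random
from statistics import mean

def make_data(n=500, seed=42):
    """Synthetic data: the target depends on x0 only, and x1 is pure noise."""
    rng = random.Random(seed)
    x0 = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    y = [2.0 * a + rng.gauss(0, 0.5) for a in x0]
    return [x0, x1], y

def linear_shap(column, weight):
    """SHAP values for one feature of a linear model with independent features."""
    mu = mean(column)
    return [weight * (v - mu) for v in column]

def contributions(columns, y, weights):
    """For each feature, return (prediction contribution, error contribution).

    Prediction contribution is the mean absolute SHAP value.
    Error contribution is the mean absolute error with the feature minus the
    mean absolute error after removing that feature's SHAP values, so a
    positive number means the feature makes the predictions worse.
    """
    preds = [sum(w * col[i] for w, col in zip(weights, columns))
             for i in range(len(y))]
    base_error = mean(abs(t - p) for t, p in zip(y, preds))
    results = []
    for col, w in zip(columns, weights):
        shap = linear_shap(col, w)
        pred_contrib = mean(abs(s) for s in shap)
        error_without = mean(abs(t - (p - s)) for t, p, s in zip(y, preds, shap))
        results.append((pred_contrib, base_error - error_without))
    return results

columns, y = make_data()
# Hypothetical fitted weights: the model wrongly puts weight 1.5 on the noise feature.
results = contributions(columns, y, weights=[2.0, 1.5])
for name, (pred_c, err_c) in zip(["x0", "x1"], results):
    print(f"{name}: prediction contribution {pred_c:.2f}, error contribution {err_c:+.2f}")
```

Both features show a sizable prediction contribution, so both would look important on a chart of mean absolute SHAP values. Only the error contribution reveals that the noise feature is making the predictions worse.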
Tim Wilson
Nice. Julie, can you share what the nature of the discussion you were having about SHAP values was?
Joy Hoyer
Yeah. One of my coworkers actually got asked by a client, their data science team ran a model and they got all this output with a bunch of SHAP charts. And it comes from Shapley, right?
Val Kroll
Yep.
Joy Hoyer
And Tim, I thought you and I and Ben had a whole conversation about Shapley charts, and that's what this SHAP was. He had a ton of the charts and it was kind of crazy. He was then asked to make the model output in this 25-page technical readout more interpretable for the data scientists to take to their stakeholders. So we were kind of talking about, well, what matters here? And we were getting down into it per location, because the model was trying to predict locations for customers. And it was interesting because you had a SHAP chart for each outcome of the model. So then we got into the discussion of, all these features look different per location outcome. But I think what's most important is, you just want to know, of these features available, do you trust your model? Is it accurate enough to then use it to predict these outcomes, so you can go market based on these predictions? So it was kind of like, let's take a step back. But we got really deep into the SHAP charts before we got there.
Val Kroll
You can go extremely deep into. There's a lot of nuances with. With them.
Joy Hoyer
Yeah.
Tim Wilson
Wow.
Joy Hoyer
And the big question that you said, like, do you have the right features listed? Like, you can have a bunch of crap features.
Val Kroll
Yeah.
Joy Hoyer
And have a SHAP graph for it. But it's like, it doesn't really matter if they're crappy features. But it won't tell you that.
Val Kroll
No, it won't tell you that. No, no, it won't. But yeah, no, that's the thing. I think they're very, very useful charts, but I think sometimes they're made to be somebody that they're not.
Joy Hoyer
Definitely.
Tim Wilson
Wow. Well, Julie, do you have a last call to top your.
Joy Hoyer
I do, but I'm glad I got to comment on SHAP because my last call is not anything close to that.
Tim Wilson
Start with an SH too?
Joy Hoyer
No, actually a CH. So my last call is just something that has brought me joy, and it has an AI feature in it that now I'm thinking maybe uses outlier detection. So this app is Chatbooks. It's just an easy way, as a busy mom, to keep some memory books printed and coming monthly. And it has a feature that I really like using, which is, select my best 30 photos from the previous month to get my book started, and then I can edit from there. And I always wonder, how does it choose? Some months it's spot on. Some months it's, you know, giving me not screenshots, but random photos that I'm like, I don't need this printed.
Val Kroll
So that'd be a good question. You'd want. Would they be the most typical for the month or would they be the most atypical?
Joy Hoyer
Right. And that's what I'm wondering. I'm like, is it. It's got to be based off the mix of photos I had for the month. Right. And maybe some months it's very clear that I was like, at an event and it's choosing, like, the clearest photo from, like a run of similar photos or something.
Val Kroll
Yeah. If I wrote it, that's what I'd be tempted to do. But yeah, is that really the best? I ye.
Joy Hoyer
So the technology part of it had me wondering because I'm just interested, like, how are they doing it? And seeing as a user, like, do I find this feature actually useful or not? But in general, I've just really enjoyed that app and getting the photo books and it's fun.
Val Kroll
Good.
Tim Wilson
Nice.
Brett Kennedy
I like it.
Tim Wilson
Val, what's your last call?
Brett Kennedy
We might have to edit this because there's a word that I can't say in the title and the subtitle. Okay, ready? And I know the word, I just can't pronounce it. Our human habit of anthropomorphizing everything. Yeah, anthropomorphizing. I can't.
Val Kroll
Oh, no.
Brett Kennedy
I did that work.
Val Kroll
Okay.
Brett Kennedy
I practiced this five minutes straight before I had Google say it out loud to me before I joined this recording. And my. My mouth can't make the motion.
Val Kroll
Anyways, not an issue. I 100% know what you mean.
Brett Kennedy
Okay, so we all know what I'm talking about. So this is also a Medium article published by Doc, written by Daily Wilhelm, and I thought it would be interesting because we all know about the, you know, should we be nice to generative AI, or the interfaceless AI, saying thank you or please. But as I was reading this, I saw so many more examples that I hadn't even thought of, which just makes it really interesting reading. So even saying the AI listens or learns is doing that, versus saying the AI is designed to, or it's built to. Instead of saying that it sees this or it looks for this, it's actually detecting or finding patterns. And I'm like, it runs so deep. But anyways, it talks about some of the problems with doing that, that it actually changes what people expect of the output in terms of accuracy or the role that it can play. So anyways, it's just kind of a thought piece. But I really enjoyed that. And so hopefully you all can practice saying that word with me. Anthropomorphizing.
Tim Wilson
Does it talk about the different ones that they'll show? They'll say, thinking like, while they're like, that's the waiting is a. Yeah, yeah.
Val Kroll
I think they encourage us thinking of them in kind of human terms, just because of how they communicate with us. They do say, like, I'm thinking about this, or I'm looking at that, or yeah.
Brett Kennedy
And I'm someone who anthropomorphizes everything in. Like, I've done this for years. Like, I remember trying to explain to my roommate why our DVDs needed to be in alphabetical order in the case. And she's like, why? And I was like, because that's how they like to be. And she's like, you need some sort of exotic pharmaceuticals to chill out. Because DVDs can be in every order you want anyways. But that's an interesting read. All right, Tim, how about you?
Tim Wilson
Well, I guess I'm going to go with a streak of Medium posts, because this one has been out for a while. It's a post called "Optimizing at the Edge: Using Regression Discontinuity Designs to Power Decision-Making." It's coming out of Instacart, kind of the tech team at Instacart publishing how they do it. But it's a good explanation. Regression discontinuity is one of those things that, well, Joe Sutherland would reference as being a technique, but I hadn't found something that gave a good simple explanation of, what is it? How does it work? So it just kind of falls in that whole little pot of, oh, here's another technique, in the hands of a more skilled individual. I mean, for me, it'd be running with scissors to actually try to do it. But it came through my feed and I read it and kind of got into it. I was like, oh, that's...
Val Kroll
That's cool.
Tim Wilson
For a brief moment, I understand what that's actually doing.
Joy Hoyer
So those are always good moments when you're like, oh, I think I get it. And then, you know, 10 minutes later you're like, I don't know if it's.
Tim Wilson
Well, then I made a note of it.
Brett Kennedy
Close that tab yet?
Tim Wilson
I'm gonna use this for a last call in a few weeks. And now as I'm using it for a last call, I'm like, yeah, I hope nobody asked me to totally explain. Yeah, gotta read my own last call again. Damn it. All right, well, this has been. This has been a fun chat and super informative like I think we have. We've definitely proven there's. This is a deep. A broad and deep topic. So, Brett, thanks so much for coming on and joining us for this.
Brett Kennedy
Awesome.
Val Kroll
Yeah, thank you very much for having me. That was. It was very nice. Cool.
Tim Wilson
No show would be complete without thanking our producer, Josh Crowhurst. I should probably also call out that starting a couple episodes ago, we sort of changed our production methods slightly. I'd be kind of curious if you've noticed, because it's been a couple episodes. But I actually do want to say, behind the scenes, behind Josh, was a guy named Darren Young, who for years had been doing kind of the final, final step of production. We contracted him to do that, and he did it patiently for a long time, and we threw him curve balls as well.
Val Kroll
So.
Tim Wilson
So I just want to call out that Darren was way behind the scenes and did not get acknowledged. I feel like we should acknowledge him at least once. Josh is running with how we're figuring out the podcast without Darren. So thanks to Josh Crowhurst, always. Thanks to Darren, at least the one time publicly acknowledged. If you enjoy the show, please leave us a review on whatever platform you listen to us on. We do really appreciate having ratings and reviews. Or, you know, just share it with a colleague, throw a note out on LinkedIn or tag someone or send them an email or a text or make a TikTok about it. That would be weird. But pass along the show's existence if you enjoy it. If you want to reach out to us, you can contact us through LinkedIn. You can reach out to any of us on the Measure Slack. You could just send an email to contact@analyticshour.io. So with that, for my co-hosts on this episode, Julie and Val, only one of whom is an outlier in everything she does. And she has shared a couple. Which one?
Val Kroll
Which one?
Tim Wilson
What do you think? So I know they join me in saying thanks for listening and keep analyzing. Thanks for listening. Let's keep the conversation going with your comments, suggestions and questions on Twitter @analyticshour, on the web at analyticshour.io, our LinkedIn group and the Measure Chat Slack group. Music for the podcast by Josh Crowhurst. So smart guys wanted to fit in.
Val Kroll
So they made up a term called analytics. Analytics don't work.
Tim Wilson
Do the analytics say, go for it.
Val Kroll
No matter who's going for it.
Brett Kennedy
So if you and I were on.
Tim Wilson
The field, the analytics say, go for it.
Val Kroll
It's the stupidest, laziest, lamest thing I've.
Tim Wilson
Ever heard for reasoning in competition. I don't know, Brett, if you signed your book for people, but I've had a couple of. I've had a couple of book signs.
Brett Kennedy
Julie, did you hear this story yet?
Tim Wilson
It's not that funny.
Joy Hoyer
Obviously it is.
Tim Wilson
I did a book signing at the Columbus Data Analytics Wednesday meetup. And, I mean, I Googled, like, what do you freaking write? I can't come up with spontaneous stuff. So I went down the history of book signing and kind of landed on, whether it's "for" versus "to", whether they bought it or you're giving it to them, or whatever, a dash. And then it seems like "All the best."
Val Kroll
Period.
Tim Wilson
Sign your name, which. So I was sitting there signing all the best. Sign my name. All the best. Somewhere in the middle. Best. It's not that funny.
Joy Hoyer
I gotta hear the punchline.
Tim Wilson
I signed one of them. Best of luck.
Val Kroll
No, good luck.
Tim Wilson
Good luck. Good luck. Okay, Val remembers it better than I do.
Val Kroll
Oh, my God.
Tim Wilson
This is good luck. And it was just a random person, so I don't know who it was. I don't know how they interpreted.
Brett Kennedy
Hopefully they didn't compare to the other 35 people.
Joy Hoyer
You need this book. Good luck to you.
Val Kroll
Good luck.
Brett Kennedy
What do you know that I don't know?
Val Kroll
Dumbass. Oh, my God.
Joy Hoyer
Rock flag and unusual doesn't mean useful.
Podcast Summary: The Analytics Power Hour – Episode #269: The Ins and Outs of Outliers with Brett Kennedy
Release Date: April 15, 2025
In Episode #269 of The Analytics Power Hour, hosts Tim Wilson, Val Kroll, and Joy Hoyer delve deep into the world of outlier detection with guest Brett Kennedy, a seasoned freelance data scientist and author of Outlier Detection in Python. The conversation explores various outlier detection techniques, their applications across different domains, the importance of interpretability, and the challenges analysts face in effectively identifying and utilizing outliers.
Tim Wilson kicks off the discussion by expressing his enthusiasm for the Median Absolute Deviation (MAD) technique, which he introduces at the 00:05 mark. MAD is highlighted as one of several methods for detecting outliers, especially useful in small datasets.
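As a rough illustration of how MAD can flag points in a small dataset like the ones Tim describes, here is a minimal sketch. The function name `mad_outliers`, the sample values, the 0.6745 scaling constant, and the 3.5 cutoff are assumptions for illustration (the constant and cutoff follow a common "modified z-score" convention); none of them come from the episode.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    Modified z-score = 0.6745 * (x - median) / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # At least half the values equal the median, so the score is
        # undefined; this sketch flags nothing rather than divide by zero.
        return []
    return [v for v in values if abs(0.6745 * (v - center) / mad) > threshold]


# Hypothetical small dataset: ten points, one of them clearly unusual.
daily_counts = [12, 14, 13, 15, 11, 13, 48, 12, 14, 13]
flagged = mad_outliers(daily_counts)
print(flagged)
```

Because both the center and the spread are medians, the single extreme value barely moves either one, which is part of why the approach tends to hold up on very small samples.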
Val Kroll emphasizes that there is no single definitive definition of an outlier, leading to a multitude of detection methods. She explains (12:39), “One of the consequences of that is there's a lot of different ways to try and find outliers.” This diversity arises because outliers can be context-dependent, varying across different types of data and use cases.
Brett Kennedy shares his journey of developing outlier detection software for financial auditors, explaining the necessity of robust and interpretable methods in environments handling vast amounts of data.
Tim Wilson introduces the book (03:11) as “Outlier Detection in Python, which is 17 chapters of Outlier goodness.” Brett Kennedy's work addresses not just detection but also the interpretability and testability of outlier detection systems, a critical aspect often overlooked in existing literature.
One of the core challenges discussed is the balance between false positives and false negatives:
Val Kroll states (17:16), “Interpretability is very, very important. It's usually much more important than in other areas of machine learning.” This is because understanding why an outlier is flagged is crucial for actionable insights, especially in fields like fraud detection or industrial monitoring.
Tim Wilson shares his early experiences attempting to detect outliers in digital marketing data, highlighting the difficulty of setting appropriate thresholds in trending, non-stationary datasets.
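The threshold problem on trending data can be sketched as follows. This is not the approach described in the episode; it is one possible illustration, assuming a trailing window of prior points as the baseline and a MAD-based score. The series, the window size, the static cutoff of 115, and the function name `rolling_mad_flags` are all hypothetical.

```python
from statistics import median


def rolling_mad_flags(series, window=7, threshold=3.5):
    """Flag each point by scoring it against the `window` points before it."""
    flags = []
    for i, value in enumerate(series):
        history = series[max(0, i - window):i]
        if len(history) < window:
            flags.append(False)  # not enough history to judge yet
            continue
        center = median(history)
        mad = median([abs(h - center) for h in history])
        if mad == 0:
            flags.append(False)  # flat history: score undefined in this sketch
            continue
        flags.append(abs(0.6745 * (value - center) / mad) > threshold)
    return flags


# Hypothetical metric that grows steadily, with one spike at index 9.
series = [100, 102, 104, 106, 108, 110, 112, 114, 116, 150, 120, 122]

# A fixed cutoff starts firing on ordinary points once the trend passes it.
static_hits = [i for i, v in enumerate(series) if v > 115]

# A baseline that moves with the data isolates the spike.
rolling_hits = [i for i, f in enumerate(rolling_mad_flags(series)) if f]
print(static_hits, rolling_hits)
```

A trailing window lags the trend, so a steep enough slope would still inflate the scores; detrending or a seasonal model may be a better fit in those cases.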
The podcast explores how outlier detection varies across industries:
Financial Auditing: Detecting unusual transactions that may indicate fraud or errors.
Marketing: Identifying shifts in consumer behavior, such as delinquencies or underserved segments.
Scientific Research: Discovering rare phenomena in vast datasets, such as astronomical observations.
Industrial Monitoring: Using sensor data to detect anomalies that could indicate machinery malfunctions or safety issues.
Val Kroll explains (26:56), “Anything that's unusual is not necessarily a problem, depending on the context.” For instance, annual payments in financial data might appear as outliers but may be normal upon contextual analysis.
A recurring theme is the necessity for interpretable outlier detection methods:
Brett Kennedy emphasizes the role of human analysts in interpreting outliers, stating (13:25), “There's still a very important role for an analyst in all this.” Automated systems can flag potential outliers, but human expertise is essential to determine their significance.
Val Kroll discusses techniques like SHAP (Shapley Additive Explanations) for enhancing model interpretability, allowing analysts to understand feature contributions to outlier scores.
The hosts and guest share practical examples illustrating the complexities of outlier detection:
Credit Card Fraud Detection: Brett Kennedy recounts a personal experience where his credit card alert was triggered by an unusual purchase location, underscoring the systems' limitations and the importance of contextual understanding.
Market Research Case Study: Brett Kennedy discusses a conjoint analysis study for US Cellular, where a subset of respondents were deemed outliers based on their loyalty to Apple products. Val Kroll reflects on the subjectivity involved in labeling such groups as outliers, highlighting that "there's no real great definition of that."
Val Kroll and Tim Wilson explore the fine line between meaningful outliers and noise:
Val Kroll suggests using multiple frames of reference for outlier detection to reduce false positives. For example, comparing sales data against daily, weekly, and yearly benchmarks can provide better context.
Tim Wilson criticizes automated outlier detection systems that either overwhelm analysts with too many alerts or miss significant anomalies due to rigid threshold settings. He notes (38:50), “...[automated systems] find things that are really legitimate noise.”
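One way to picture the multiple-frames-of-reference idea is to score a value against several reference sets and raise an alert only when it looks unusual in all of them. The sketch below is an assumption-laden illustration: the frame names, the sales figures, the MAD-based score, and the helpers `mad_score` and `unusual_in_all_frames` are hypothetical and not taken from the episode.

```python
from statistics import median


def mad_score(value, reference):
    """Modified z-score of `value` relative to a reference sample."""
    center = median(reference)
    mad = median([abs(r - center) for r in reference])
    if mad == 0:
        # Flat reference: any different value is infinitely far away.
        return 0.0 if value == center else float("inf")
    return abs(0.6745 * (value - center) / mad)


def unusual_in_all_frames(value, frames, threshold=3.5):
    """True only if `value` exceeds the threshold in every frame."""
    return all(mad_score(value, ref) > threshold for ref in frames.values())


# Hypothetical reference frames for a Saturday sales figure.
frames = {
    "previous_six_days": [100, 105, 98, 102, 110, 104],  # mostly weekdays
    "previous_saturdays": [190, 205, 198, 210, 195],
}

# 200 is far from recent weekdays but typical for a Saturday: no alert.
typical_saturday = unusual_in_all_frames(200, frames)

# 400 is unusual against both frames: alert.
extreme_saturday = unusual_in_all_frames(400, frames)
print(typical_saturday, extreme_saturday)
```

Requiring agreement across frames trades some sensitivity for fewer false positives; whether that trade is right depends on how costly a missed anomaly is in the domain.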
Understand Your Data: Grasp the underlying patterns and trends to choose appropriate detection methods.
Choose the Right Technique: Select methods suited to your data type and business context, balancing interpretability and accuracy.
Contextual Analysis: Always interpret outliers within the context of your specific domain to distinguish between meaningful anomalies and irrelevant noise.
Continuous Tuning: Regularly refine your detection systems based on feedback and evolving data patterns to maintain relevance and accuracy.
Human Oversight: Leverage the expertise of analysts to interpret and act upon detected outliers effectively.
The episode concludes with the hosts and guest sharing resources and personal insights:
Brett Kennedy recommends his book, Outlier Detection in Python, praising its readability and practical examples.
Val Kroll mentions a Medium article titled "Your Features Are Important Doesn't Mean They're Good" by Samuel Mazanti, which discusses the nuances of feature importance in model interpretability.
Joy Hoyer shares her experience with the Chatbooks app, pondering whether its AI-driven photo selection uses outlier detection to curate "best" photos, reflecting on the broader implications of AI in everyday applications.
Tim Wilson encourages listeners to explore Brett's book for a thorough understanding of outlier detection techniques and their applications.
Tim Wilson [00:05]: “Median absolute deviation is one of many, many techniques for detecting outliers.”
Val Kroll [12:39]: “If there were, we would just have one outlier detection algorithm and that would be it.”
Tim Wilson [03:11], introducing Brett Kennedy: “...author of a book Outlier Detection in Python, which is 17 chapters of Outlier goodness.”
Val Kroll [17:16]: “Interpretability is very, very important. It's usually much more important than in other areas of machine learning.”
Brett Kennedy [13:25]: “There's still a very important role for an analyst in all this.”
Episode #269 of The Analytics Power Hour provides a comprehensive exploration of outlier detection, emphasizing the importance of selecting appropriate techniques, ensuring interpretability, and maintaining a balance between detecting meaningful anomalies and avoiding noise. Brett Kennedy's expertise offers valuable insights into practical applications and the ongoing evolution of outlier detection methodologies. Whether you're an analyst, data scientist, or business professional, this episode equips you with the knowledge to effectively identify and leverage outliers in your work.
To dive deeper, consider listening to the full episode and exploring Brett Kennedy's book, Outlier Detection in Python. Engage with the community by sharing your thoughts and experiences on platforms like LinkedIn or the Measure Slack group. Keep analyzing!