
Synthetic data: it's a fascinating topic that sounds like science fiction but is rapidly becoming a practical tool in the data landscape. From machine learning applications to safeguarding privacy, synthetic data offers a compelling alternative to...
Winston Lee
Welcome to the Analytics Power Hour. Analytics topics covered conversationally and sometimes with explicit language.
Michael Helbling
Hey, everyone, welcome. It's the Analytics Power Hour, and this is episode 274. Yeah. Today we're diving into a topic that sounds like maybe it came from a sci-fi script, but it's actually very much part of the real-world data landscape. That's right, synthetic data. You know, whether you're using it for machine learning, protecting privacy, or just giving your dashboard something to chew on when the real data won't play nice, it's definitely having a moment in our industry. And unlike original data, it won't ghost you with missing values or weird outliers or those inexplicable rows that look like someone fell asleep on the keyboard. We'll talk about how it's made, where it shines, and where it might fall short. So whether you're deep in data science or just data curious, I think this podcast will be for you. And it'll be kind of like synthetic data, hopefully: generated with purpose and surprisingly useful. But first, let me introduce my co-host, Val Kroll. How you going? Or how you doing? I'm so used to introducing Val.
Val Kroll
Yeah, I'm not Moe.
Michael Helbling
I know. How are you?
Val Kroll
I'm doing good, Michael. Happy to be here.
Michael Helbling
And also joined by Julie Hoyer. Julie, welcome. How are you doing?
Julie Hoyer
I'm doing great. I cannot wait to talk about this topic.
Michael Helbling
I know. I'm excited as well. And I'm Michael Helbling. And so for this show, we absolutely needed a guest. Winston Lee is the founder of Arima, a startup specializing in synthetic data and marketing mix modeling. Prior to that, he led data science teams at PwC Canada and Omnicom Media Group. He's also a lecturer at Northeastern University and sits on their program advisory committee for the Master's in analytics. And today, he is our guest. Welcome to the show, Winston.
Winston Lee
Thank you. Thank you. Great to meet you. Great to meet everyone. And thanks, Michael.
Michael Helbling
Awesome. Well, I think a great place to start on this topic of synthetic data is really just to talk about what it is. So, you know, how would you define synthetic data, and what makes it fundamentally different from, let's say, anonymized or sampled data?
Winston Lee
Yeah, very good question. So, synthetic data, to put it simply, is data sets that are generated by an algorithm as opposed to being collected from some sort of real event. So specific to consumer data, which is the space that we work in, we're not going out to conduct surveys. We're not tracking people, we're not asking people to provide anything to sign up. None of that. Synthetic data simply means we develop a computer algorithm where we can generate data in a way that mimics real data, so to say. So we're not just randomly spreading out numbers, we're generating data based on certain patterns. There are two things that we should note. One, we're not making up data. A lot of people, when they say synthetic data, they think of the word fake. They think it's fake data. It's not. The algorithms that we use to generate synthetic data are indeed trained on real data. So it is based on learnings of patterns from real data that we generate synthetic data. So the first point is that it's not fake, it's just like real data, and it is very useful for statistical analysis. We'll get to the whole discussion of privacy a little later on, which is the main motivation for synthetic data. But from a utility standpoint, in theory, it should be as useful as real data. The other thing I also want to point out is there's actually a lot of synthetic data in our day-to-day lives that we simply don't realize is synthetic data. We call it something different, but it is in fact along the same lines, with the same motivation, so to say. If you imagine something like Midjourney, we're generating synthetic images. If you consider an image to be data, then synthetically generated images, like synthetic faces or, you know, synthetic pictures of different places, animals, sceneries, whatever, that's a form of synthetic data too. 
We're simply generating synthetic pixels, so to say, but based on the pattern so that they look like a picture in the end. Same thing with, you know, even ChatGPT or Gemini. If you consider, again, if you consider words to be data, that too is a form of synthetic data. So the fact that we use computers to generate some form of information, let's say based on learnings from real information, is much more common than we see and it's much more common than we recognize. And broadly speaking, that is synthetic data.
Val Kroll
I was 100%, I'll admit, in the camp of thinking synthetic data was just something you materialized out of thin air. But I have to say that I did have the benefit of getting to see you present, Winston, at MeasureCamp New York/New Jersey a couple months ago, which was a fantastic presentation, and I learned a ton about it. I think one of the other questions I'd love to ask you as we're kicking this off is, what are some of the really common use cases where, you know, people within our field are using synthetic data? Like, what problems is it really solving?
Winston Lee
Yeah, in some ways synthetic data does not pertain to new use cases, so to say. It's not like there is something synthetic data can do that real data cannot do. People consider synthetic data more as a way to, let's say, be able to do things that, you know, privacy laws don't otherwise allow them to do. So in some sense people are doing certain things with synthetic data because trying to do the same thing with real data, while technically possible, is very, very difficult from a procurement or a legal standpoint. So there's actually nothing special about synthetic data other than the fact that it is.
Val Kroll
So should we just wrap it up here?
Winston Lee
You know, I don't want people to think synthetic data is fundamentally different from real data in any way, such that I have to pick one or the other. The best analogy I can think of is to think of it like a photocopied document, if you will. You know, you have an original document, you don't want to use it, you're scared to lose it. For various reasons, you make a photocopy of it. That photocopied version will bring enough utility, just like the original document. However, it's much safer to work with the photocopied version because you can write on it, and you're not afraid that you're going to lose it. In a data scenario, obviously, you don't have to worry about people suing you for doing various things. You don't have to worry about, you know, revealing the identity of somebody. You don't have to worry about people opting out. So it brings you all of the safety benefits while achieving essentially the same thing that real data will help you do.
Michael Helbling
Hey folks, let's talk about data pipelines. You know, nothing says fun like scalable architecture and secure ETL, right? Okay, maybe not. But you know what is fun? Not actually sweating over whether your sensitive data is floating around the cloud uninvited. That's where Fivetran's hybrid deployment comes in. It's the magical middle ground between "we need full control" and "please, somebody else manage this thing for me." You get self-hosted pipelines in your own environment, but with all the ease, monitoring and updates handled by Fivetran. It's like owning a fancy car and having Fivetran's pit crew keeping it running 24/7. Cloud-based, on-premises, secret volcano lair, it doesn't matter. Fivetran's got a unified platform that will manage it all, with governance and security features that keep compliance teams calm and engineers caffeinated. Bottom line, your data's secure, your life is simpler, your pipeline's always on. Go to fivetran.com/aph and start a 14-day trial today. That's F-I-V-E-T-R-A-N dot com slash A-P-H. Go ahead, give your data the treatment it deserves.
Julie Hoyer
So my head always goes to, how do you make sure it reflects the real world enough, though? Because I know you said it's not something that people just make up, it's supposed to be based on the real world. But I can't wrap my head around the use case you're talking about, because I've talked to colleagues about this in the past and I was really uncomfortable with this idea. We didn't call it synthetic data. We called it, like, modeling missing data. And it was all around people not opting in. Okay, well, what would that traffic be doing if they had opted in? Like, can we fill in those blanks in our data sets? And in my head, right or wrong, I was like, I can't compute how this will work. How can it not be biased? And how do we know what people who inherently don't want to opt in, or who are going through the steps to opt out, would do? Maybe they act completely differently from the people we can sample to do the modeling on. So maybe we want to go down that rabbit hole. Like, how do you make it not biased, or try not to?
Winston Lee
Yeah, well, data sources are also changing, are also being updated regularly. So building a synthetic data set is very much like writing software. You don't just write code and kind of leave it there. You know, code will go stale. Somebody has to maintain it. Same thing with the data sets. So first of all, even things like panels, every year or every six months they have an update, and, at least in our case, we pick that up and we retrain the models. In some cases a data source becomes stale over time. Like maybe, you know, a data source loses its reputation or their panel becomes bad. We also had a data source, this is a case in Canada, where they tried to save money by going to a different panel that's less trustworthy, and their data ended up sort of deteriorating over time. So part of the work is not only to generate synthetic data, but also to source the ingredients. Right. It's kind of like cooking, you know, like your job as the chef is not only to cook the food, but also to decide where to buy and what food to buy. That's part of the equation as well.
Julie Hoyer
That makes total sense. But then it kind of makes me feel like the specific use case of modeling missing data that I was referring to still may not be a good candidate for synthetic data, unless you believe your assumption that those that you are tracking are representative enough of those that you're not tracking. And I think, to your point, it sounds like you'd want some other type of research or something to hopefully validate whether that assumption would hold or not. But otherwise you're kind of going under the use case of, like, I'm gaining more data on the population I know. You're not necessarily replacing the unknown population that you have really no data on.
Winston Lee
There's always sort of the consideration of how much you can extrapolate and how much extrapolation is appropriate. And this is not only the case with building synthetic data or the synthetic generation models we develop. This is one of the first things I learned as a statistician when I was in second- or third-year university, when they teach us to build regressions. You know, they might say, if you build a regression model, let's say, where you forecast housing prices based on the square footage of the house, and if the range of your input data was from, let's say, 1,000 square feet to 5,000 square feet, then you cannot ask the model to forecast a house that's, you know, 5 million square feet. That's outside of your scope and that's too much extrapolation. So the same kind of consideration applies here with synthetic data too: how much extrapolation is extrapolation. And obviously, you know, the whole point of building models is we want to extrapolate in some way. If we didn't want to extrapolate, there would be no need for a model. Right. So, you know, an appropriate scenario might be something like, let's say we were looking at pet ownership, and we didn't have any data in the state of Arizona, but we had data from other states on, you know, what kind of people own dogs, what kind of people own cats. And then for the state of Arizona, we have, let's say, the general population data. That's a pretty good extrapolation. You know, if you generally know what kind of people own dogs and what kind of people own cats, especially for the surrounding states, you can make the assumption that people living in Arizona will likely behave the same way. And that sort of extrapolation is appropriate. 
Whereas if your data was like, I don't know, how long will somebody survive on Jupiter? Or something like that, where there's absolutely zero data whatsoever on that particular topic, any extrapolation you're going to make is inappropriate. So there is a little bit of a judgment call as to what is appropriate. And there's definitely no sort of fixed formula to say, well, let's just plug these numbers in and here comes the synthetic data and then we're done. It's definitely not that. There is certainly a level of maintenance both in maintaining the source of the data, in maintaining the methodology, but also in keeping your judgment up to date or keeping your assumptions up to date, so to say.
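The regression example Winston describes can be sketched in a few lines of Python. This is only an illustration, not anything from Arima's tooling: the prices are made-up numbers, and `fit_line` and `predict_in_range` are hypothetical helper names. The sketch fits a straight line to houses between 1,000 and 5,000 square feet and refuses to forecast outside that range.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_in_range(x, xs, slope, intercept):
    """Predict only inside the range of the training inputs; refuse otherwise."""
    lo, hi = min(xs), max(xs)
    if not lo <= x <= hi:
        raise ValueError(f"{x} sq ft is outside the training range {lo}-{hi}")
    return slope * x + intercept


# Made-up training data (an assumption for illustration): size and sale price.
square_feet = [1000, 2000, 3000, 4000, 5000]
prices = [200_000, 310_000, 390_000, 520_000, 600_000]

slope, intercept = fit_line(square_feet, prices)

# A 2,500 sq ft house is inside the observed range, so a forecast is reasonable.
print(predict_in_range(2500, square_feet, slope, intercept))

# A 5,000,000 sq ft house is far outside it, so the helper refuses.
try:
    predict_in_range(5_000_000, square_feet, slope, intercept)
except ValueError as err:
    print("refused:", err)
```

A hard range check is the bluntest possible rule. As the conversation notes, what counts as appropriate extrapolation is usually a judgment call rather than a fixed formula.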
Julie Hoyer
Yeah, it sounds like it.
Val Kroll
Absolutely. So one of the questions that I had, because you kind of teased this a little bit earlier, Winston, is that synthetic data is helpful in some contexts for privacy concerns, or where personally identifiable information is really sensitive. And a lot of times, you know, the solution is like, oh, we'll scrub that data or we'll de-identify it, or we'll aggregate it, but sometimes then that isn't the type of data you need for the problem at hand either. So is synthetic data walking this balance between those two in some way? Or if you could just expound a little bit on that and talk about some of the ways that synthetic data helps overcome some of the challenges that we're seeing more and more with the PII.
Winston Lee
For sure. Sanitizing the data often is not good enough. And it is increasingly so because lawmakers lag technology, so to say. Like, you know, a piece of technology becomes available, or a new way of tracking people becomes available, and then data sets get collected. And then people look at it and go, oh, this is dangerous. Then the lawmakers realize it, you know, many months later, and then they update the privacy laws on using this kind of data and not being able to use this kind of data. So let me give you an example. One of the data sets we have access to is mobility data. Mobility data basically tracks people's location based on your lat and long. So you carry your phone, and it tracks you through a number of apps. They have a whole portfolio of apps where you're asked to share your location. And once you agree to it, they will ping you every couple of seconds or every minute or every few minutes, and they will get your exact lat and long. So this is a type of data that people sell on the market. You can fairly easily acquire that data set. It is not considered PII, in the sense that there are no names attached to it. There's no email, there are no phone numbers. The only identification you get is what's called a device ID, which is usually an IDFA or similar, depending on whether you have an iOS phone or an Android phone. It's usually a hashed device ID, so you would not be able to identify the person, but you can stick it into, you know, a media buying platform and target that person. So that is not considered PII because it's hashed. However, every couple of minutes you're going to get a ping for somebody and you're going to find the exact lat and long of that person. And the phone GPS is pretty accurate these days. I guess it goes down to maybe, you know, 10 or 20 feet of inaccuracy. 
So you could totally look at somebody's phone, plot it on the map, figure out where they are at 3 a.m., and then knock on their door, and you'd know exactly where that person had been at different parts of the day. So that is an example where just anonymization isn't really enough. It's a little bit of a gray area now. There's no policy on how mobility data can be used, what exactly is considered PII, and so on. And then imagine joining that data set with, let's say, something like the census, where you might be looking at a very small area. The smallest census block in the US has less than 50 people. So you could pretty easily identify somebody. If you had someone's device, knowing they live there, and then took the census block with, you know, less than 50 people, you could very easily guess a lot of data about that person. So that's just one example where each of the individual data sources is anonymized and privacy compliant, and yet they each reveal something about that person. And when you piece all this different information together, you get a pretty good idea of who that person is.
Michael Helbling
Yikes.
Winston Lee
This is why just sanitizing data isn't enough in most use cases. And this is the whole point why we sort of recreate synthetic copies of data sets to deal with that problem altogether. So if there's one takeaway from what I just said, that would be to uninstall all of the apps on your phone, if possible. Otherwise you're going to get tracked and your information is going to get bought and sold and it might end up in people's databases.
Julie Hoyer
So I turned all mine off the other week, actually, because I got some, like, scary, you know, reel telling me, like, they have all your locations. So hearing you say that makes me feel good that I did that one. So are you saying a use case for synthetic data could also be that by adding in synthetic data you're almost adding, like, representative volume? I don't want to call it noise, but it's almost like well-designed noise. So would it help with the anonymity of, like, combinations of data sets? Am I understanding that?
Winston Lee
It's not so much noise. That's a different sort of area in data privacy. No, no, not the wrong term at all. There are people who do what's called differential privacy, which is actually along that line. Basically you take some data set, you add some noise to the data, and then people could not go back and identify where that data comes from. So that is a different area, but it's trying to address the same problem too. For us, synthetic data is much more about recreating the data set in such a way that statistical properties are preserved, but the actual cells are different, so to say. So one example of that would be, let's say the only thing I cared about was the average. I have three numbers, let's say 1, 3, 5 as an example, and their average in this case is 3. A synthetic recreation of that might be something like 2, 3, 4. Different numbers, but the average is also 3. So in this case, 1, 3, 5 and 2, 3, 4 are two sets of different numbers. But for the purposes of calculating the average, these two sets serve the exact same purpose because they give you the exact same average. Now, of course, it's more complicated than that because statisticians or data scientists look for many more metrics. You'll look for the mean, or average. You'll look for correlation. There's a whole bunch of checkboxes, and your synthetic data has to kind of tick all of the boxes. But that's just a simple example of saying, I'm not adding any noise. My new data set could be 2, 3, 4. It could be 4, 3, 2. It could be 4, 2, 3. It doesn't matter what order it is. All I care about is the statistical metric. In the end, that doesn't change. So that's really what synthetic data is.
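The 1, 3, 5 versus 2, 3, 4 example is easy to check with Python's standard `statistics` module. The short sketch below uses only the numbers from the conversation. It also computes the standard deviation, which is an added illustration of the "many more metrics" point: a synthetic set can match the average while missing another statistic.

```python
from statistics import mean, pstdev

original = [1, 3, 5]
synthetic = [2, 3, 4]
reordered = [4, 2, 3]  # same synthetic values in a different order

# The average is preserved, and row order does not affect it.
print(mean(original), mean(synthetic), mean(reordered))

# The spread is NOT preserved here, so this toy recreation would fail
# a check on standard deviation even though it passes the check on the mean.
print(pstdev(original), pstdev(synthetic))
```

If the spread mattered as well, the synthetic values would have to be chosen so that both checks pass at once, which is why real generators are validated against a whole list of statistics rather than a single one.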
Julie Hoyer
But you could beef up the 50 in that little town in hopes that, you know, it'd be harder to pin it down to one person.
Winston Lee
Yeah.
Julie Hoyer
Awesome.
Winston Lee
Exactly.
Val Kroll
That was a really good explanation. I'm so glad you asked that question, Julie. One of the things I saw in one of your papers, Winston, was the phrase "assisting with low resolution data," and that framing really connected with me. My interpretation of that is it helps with exactly what Julie is talking about, assisting by kind of increasing that small base size a little bit and dealing with some of the anonymity. But it also helps when the data is aggregated to the point where it's no longer helpful. Right. So it needs to provide more granularity, but not in a way that's identifiable. Is that what is intended by that? Or what are some ways that the low resolution problem, if you will, is solved in some other cases?
Winston Lee
Yeah. To answer that question, let me maybe just discuss a little bit about what we mean by resolution when we talk about data sets. In consumer data, it's very common for people to aggregate data sets. What we mean by that is you have lots of individuals, but because of, again, privacy, you can't give away information about each of the individuals. But we can group them, we can put them into cohorts and give you averages. And a common way of doing that is by using geography. So in the U.S., for example, that could be census blocks. That could be a zip code, that could be a county, that could be a DMA, whatever. Right. So rather than saying, you know, Bob makes $50,000 a year, Sarah makes $30,000 a year, we say, well, in this zip code the average income is, you know, a hundred thousand. There are, you know, 10,000 people living in that zip. That's the average income. In another zip, the average income is $30,000, for example. And there are many advantages of using geography as a way of aggregating data, partly because geography is quite well defined and straightforward. Everybody knows what zip they live in, which geography they belong to. But most importantly, geographies don't complain. A zip code will not complain and say, oh, you know, you infringed my privacy, you need to take me off of your data set or whatever. So it's a very common way for people to aggregate data. Now, the issue with aggregating by geography is what geography to use. Because if the geography is too small, like, hypothetically, let's say every other house was grouped into some kind of a geography, that's probably not very helpful, because my neighbor and I are in the same group. If we're talking about average income, I know mine, so I can immediately calculate what his or her family income is. Right? 
So that's an example of where your geography is too small and people could still infer personal information or personal data. And then on the other hand, you have a problem where if your geography is too big, let's say your geography was the whole country, you're averaging too many people. And that's not very helpful. So, you know, if people say, well, the average age of people living in the United States is, let's say, 48 years old. Well, guess what? That's probably true for every other country out there, because you're simply averaging too many people. And when you average too many people, the variation sort of washes out and you just end up with, well, the average. So finding the right balance is important. And that's what we mean when we say high resolution. We mean the data is more granular, closer to the individual level. Low resolution means a lot of them have been grouped up and averaged, and you lost information in the process. In the US, typically, census block is a pretty common choice. Zip is a pretty good choice as well. So again, depending on the organization, depending on what their risk tolerance is, those are usually the geographic units that people work with.
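The too-small-geography problem comes down to one line of arithmetic, sketched here in Python. The incomes reuse the $50,000 and $30,000 figures mentioned in the conversation, and `infer_other_income` is a hypothetical helper name. When a published average covers exactly two households, either one can recover the other's value.

```python
def infer_other_income(group_average, my_income):
    """For a group of exactly two, the average plus one income reveals the other."""
    return 2 * group_average - my_income


my_income = 50_000
neighbor_income = 30_000  # private; only the average below is published

published_average = (my_income + neighbor_income) / 2

# Knowing only the published average and my own income,
# I can reconstruct my neighbor's income exactly.
print(infer_other_income(published_average, my_income))
```

The same subtraction works for a group of any size whenever one party knows all but one of the values, which is why very small aggregation units offer little real protection.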
Julie Hoyer
So then I guess, to loop back to Val's question, low resolution would not be describing the scenario we were talking about, where there are 50 people in a small census block and you're combining it with another data set so it becomes somewhat more identifiable. So I guess we want to circle back to Val, because you were kind of asking about the low resolution problem. You said synthetic data can help a lot there, right? So what would be an example, I guess, where synthetic data helps when you're using more aggregated, low resolution data sets?
Winston Lee
Well, when you're dealing with low resolution data sets, what typically happens is there's a little bit of a trade-off. High resolution data has more information, it's more granular, but it's, you know, less safe from a privacy standpoint. And then on the other side, you have low resolution data that's more aggregated and less granular, with information being lost in the process. But the information being lost also means identities are lost, and it's harder to connect back to the individual. So especially when working with low resolution data sets, synthetic data could be a very helpful augmentation for getting more information. If you have high resolution data, if you have individual-level data, then you don't need synthetic data. You know, you have what you need. So you can just go do whatever you need to do. The whole point of leveraging synthetic data is when that high resolution, granular data is not available, and then synthetic data becomes your alternative to get to what you need to do, at least in a statistically meaningful way. That's not to say your data set necessarily matches the high resolution data exactly, if you were to have it. But at least if you were to build models or do analysis on the synthetic data set, you can expect it to work just as well as if you were to use the real high resolution data set.
Julie Hoyer
Oh, okay. Sorry, I feel like I'm asking really naive questions, but this is so helpful, selfishly, for me. So when you're doing synthetic data for low resolution data, are you doing it to get more of the low resolution data points or rows of data, whatever it is? Or are you saying you're doing synthetic data to create a synthetic, safe, not-real-people, higher resolution data source to augment your low resolution one? Or maybe you could do both?
Winston Lee
The latter. More of the latter. So the way I describe it is it's a little bit like data compression, where you have high resolution data, let's say you have individual-level data, and people aggregate it into, let's say, zip codes or census blocks. So people aggregate it to preserve privacy. And then, can we now take the aggregated data set and apply some modeling, apply some algorithm, to reconstruct the individuals that have been aggregated? So it's like data compression and decompression, except you're not compressing data to save disk space. Like, you know how you put a bunch of files into a zip folder. You're not doing it to save disk space, but through that compression and decompression process, you took away the identity.
Julie Hoyer
Oh, my gosh. Okay. It's clicking so much more. Thank you.
Val Kroll
For me, too. This is really helpful. And I think about this from, like, a digital marketing perspective, when you're trying to identify and pick audiences. So you know how many people have, like, the pets you were talking about before, and now you also want to know how many of them are interested in, you know, some higher education program. So you have to understand or try to figure out what is the size of my audience if I'm picking, like, this is true, and also this, and this. And those combinations aren't easy to come by when you try to estimate how many people fall into that bucket, because all you have is, you know, 30% of people have dogs in this zip code. Right. So is that kind of where you get better with your estimations or audience sizes, because you're making fake people where, like, the statistics are still... how did you phrase it earlier? I forget what you called it earlier, that you would make sure it's still...
Winston Lee
Yeah, yeah. Statistically equivalent.
Val Kroll
Yes, yes, yes. Okay.
Julie Hoyer
Yes.
Winston Lee
Yeah, yeah, yeah. Exactly. So let me give you a simple scenario. Let's say in a neighborhood we have 10 people. According to the census, let's say half of them are male, half of them are female. And let's say, according to the census again, half of them own dogs, half of them own cats. And then now you ask, how many males own dogs in that neighborhood? Well, just by looking at the aggregated information, you won't know. You could have one extreme where all of the five males own dogs and all of the five females own cats. Or you could have it flipped: all of the five males own cats, all of the five females own dogs. Or you could have a pretty random mix of the two. And in all cases, your aggregated counts are the same, but they would mean very, very different things. So the way synthetic data works, or at least how we develop our synthetic data, is to be able to take that aggregated information and model out 10 individuals, so that we can now look at the 10 individuals and say, you know, of the 10 individuals, three of the five males own dogs and two of the males own cats, or something along that line. Now, again, this is not necessarily exactly true to the population, at least at the individual level. You could go to that neighborhood, pull up three men, and ask them whether they own dogs, and they won't necessarily agree with our synthetically created individuals. But if you look at the national level, if you count up a lot of males and whether they own dogs or don't own dogs, that should agree with, let's say, real stats.
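The neighborhood example can be made concrete by listing every individual-level breakdown that agrees with the aggregated counts. The Python sketch below is an illustration only, not Arima's method. It assumes each of the 10 people owns exactly one pet, either a dog or a cat, and `consistent_tables` is a hypothetical helper name.

```python
def consistent_tables(n_male, n_female, n_dog, n_cat):
    """Return every (male_dog, male_cat, female_dog, female_cat) split
    that reproduces the given totals by gender and by pet."""
    tables = []
    for male_dog in range(min(n_male, n_dog) + 1):
        male_cat = n_male - male_dog
        female_dog = n_dog - male_dog
        female_cat = n_female - female_dog
        counts = (male_dog, male_cat, female_dog, female_cat)
        # Keep only splits with non-negative cells that also match the cat total.
        if min(counts) >= 0 and male_cat + female_cat == n_cat:
            tables.append(counts)
    return tables


# Census-style aggregates from the example: 5 males, 5 females, 5 dogs, 5 cats.
tables = consistent_tables(5, 5, 5, 5)
for male_dog, male_cat, female_dog, female_cat in tables:
    print(f"males: {male_dog} dogs, {male_cat} cats | "
          f"females: {female_dog} dogs, {female_cat} cats")
```

Every row printed has identical aggregates, so the aggregates alone cannot say which one is true. A synthetic generator has to pick among such candidates using patterns learned from other data, which is why the result is expected to hold across large groups rather than for any single neighborhood.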
Val Kroll
Hmm, that's interesting. And I'm also curious, when you're doing some of this modeling, is it all descriptive characteristics? Because I'm thinking about my past life in market research, where we did a lot of work like awareness, trial and usage studies. And so it was like, awareness of certain brands, liking of brands, or, you know, what percentage likelihood to purchase that brand in the future. Can you kind of do the same thing with, like, attitudinal data? Or is it more just descriptive, categorical?
Winston Lee
No, absolutely both. Both work. It could be synthetically modeling categorical data. So yes or no, or, you know, here are five different brands, which ones have you purchased? It could be that, or it could be numerical. It could be a number. For example, how many minutes do you spend on social media every week? It could model that too, you know, from knowing the average. So, yeah, you can do lots of things.
Michael Helbling
Where do you see people using synthetic data in ways that are not useful? Like, what are some pitfalls to avoid with synthetic data?
Winston Lee
So there's a. So there's a, there's a. I think there are at least two that I can think of.
Michael Helbling
Okay.
Winston Lee
One is where they think synthetic data could just miraculously invent some stuff for them. And again, this we talked about a little earlier, in the "how much extrapolation is extrapolation" discussion. Some people might say, oh, you know, I don't have this data set. Can you get a synthetic version of it for me? Well, that'd be pretty hard to do. You need to base it on something. Right. Although in the last little while, people have been discussing the ability to integrate something like an LLM into the synthetic data generation process, with the assumption that if the LLM has been trained on all the resources it could access across the Internet, then it should, in theory, know everything. But that's a little bit of a different topic. Still, understanding what data comes out based on what you input is very important. So I think this is definitely one thing. The other thing that sometimes people don't get with synthetic data is that it's only statistically meaningful, it's only statistically equivalent. And when we say statistically equivalent, you need to look at the data not just by looking at one individual, but across a group of them. And so, for example, when someone looks at our Synthetic Society data set, some people, the first thing they'll do is go to their zip code or go to their census block, pull up the person with their exact name, with their exact age or exact gender, and start to compare that to themselves. And then they say, like, this person's not the same as me. Like, your data is not accurate. And this is true. If you just look at one individual, that individual might not be the person that you really wanted it to be, which in this case is yourself. But again, that's not the idea of synthetic data.
It's almost like insurance. As a statistician, I worked pretty closely with actuarial science back in my days in the university. And in actuarial science, people essentially calculate when you're going to die. That's what people do. So they'll say, you know, oh, this kind of person, they'll die at the age of 50. This kind of person, they'll die at the age of 70. It doesn't actually mean you will die at the age of 50, and that by your 50th birthday, you'll just drop dead. That's not what it means. It just means that across a group of people like that, on average, they're expected to die at the age of 50. So synthetic data is like that too. It's not to say that this exact record is Michael, this exact record is Winston, and so on. It's not that. It's across a bunch of people similar to Michael or similar to Winston. In our synthetic data, we're also going to have people like that. So. Yeah.
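Editor's sketch: a small simulation of "statistically equivalent, not individually identical". It assumes a made-up dog-ownership rate of 0.6 and draws a "real" and a "synthetic" population independently at that rate; nothing here reflects Arima's actual data or method. The group-level rates land close together, while a record-by-record comparison agrees only about half the time (in expectation 0.6 × 0.6 + 0.4 × 0.4 = 0.52).

```python
import random


def simulate(n=10_000, dog_rate=0.6, seed=0):
    """Compare a 'real' and a 'synthetic' population that share only a group-level rate."""
    rng = random.Random(seed)
    # True means the person owns a dog. Both populations use the same rate,
    # but each record is drawn independently of its counterpart.
    real = [rng.random() < dog_rate for _ in range(n)]
    synthetic = [rng.random() < dog_rate for _ in range(n)]

    real_rate = sum(real) / n
    synthetic_rate = sum(synthetic) / n
    # Share of positions where the synthetic record equals the real record.
    match_rate = sum(r == s for r, s in zip(real, synthetic)) / n
    return real_rate, synthetic_rate, match_rate


real_rate, synthetic_rate, match_rate = simulate()
print(f"real group rate:      {real_rate:.3f}")
print(f"synthetic group rate: {synthetic_rate:.3f}")
print(f"record-level matches: {match_rate:.3f}")
```

Looking up "your own" record is the record-level comparison, which is the wrong test; the group rates are what the synthetic data is meant to preserve.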
Julie Hoyer
Well, it's interesting too. Do you find it's hard with, you know, stakeholders you've worked with? Is it ever hard to have the conversation where you're asking them maybe to define and clarify which statistics matter to the questions they're trying to answer? Because it sounds like to Your point? Like, you're trying to create the synthetic data to check certain boxes of which statistics it is like preserving in the data set. So do you ever run into, you know, customers coming to you or stakeholders you've worked with in the past that like, they struggle to define that, and so it makes it hard to like, create useful synthetic data for them?
Winston Lee
Not, not so much. But that said, we do get. So let me, let me backtrack a little bit. When I say not so much, it's because we tend to not create custom synthetic data for every customer we. We work with. You know, we have one Synthetic Society synthetic data product built out, and that particular product is based on, you know, 10, 20 different data sources. And we tell people, this is the one thing we have. So let's say for the United States, this is the one Synthetic Society data set we have. You can use that. You know, there are lots of fields in here. There might be like 50,000 fields. You might not need all 50,000 variables. You might only need a part of that. But generally speaking, what's there is enough for most people's needs. So we don't tend to create custom instances for every individual. There might be a couple of customers that come in and says, oh, I have a couple of variables. I've done a ad hoc, you know, I've commissioned, I don't know, let's say Kantar, to do a custom study for me. And here's several questions. Can I embed that into the Synthetic Society? So there's that kind of request. Now, that said, people do care about data quality. So even though we're not generating data sets ad hoc for every customer, we do get people who ask, like, how do I trust your data? How is it accurate? Or how, you know, how, why should I use your data set versus use something else? So we do get that kind of question, and depending on who the audience is, the level of sophistication of the person who's asking that question, we sometimes give them different things. So sometimes we have very technical customers. And this is when I would go through the methodology with them. I share our paper with them. I share how the method works. We have some customers who want to validate the data through use cases. So maybe they'll say, okay, give me some data. I'm going to build a model. I'm going to use that to enhance my model and see if my lift increases. 
There are also people like that. There are also people who, you know, look at logos and say, oh, what kind of clients do you work with? And if you can give them big logos, they'll be like, okay, fine. So we deal with all kinds of clients. But one of the things I try to do is to be as transparent as possible. You know, I'm very transparent about the methodology, exactly the steps we use to generate the synthetic population, versus saying, oh, this is a trade secret, you just have to trust it, kind of thing. I come from somewhat of an academic background, so to me, this kind of open source approach, let's say, is very important.
Val Kroll
I like it.
Julie Hoyer
Yeah, absolutely.
Val Kroll
Going back to one of your watch-outs, it made me think of where you were talking about just, like, materializing data out of thin air, essentially. Michael, I'm going to put you on the spot, and if you don't recall, then we'll just cut this part out. That company that you sent me because you knew about some of my market research roots, that was like, forget surveying people. Just put your questionnaire into this tool and it will model out the responses based on, like, the different respondent archetypes you're interested in. Right? Like, am I representing that well, from what you were.
Michael Helbling
Yeah, yeah, that's. I've seen more than that. Now there's another one. So they're using LLMs, like you mentioned, Winston, to come up with basically a response to, let's say, a survey. You give it sort of like, here's who I'm targeting. And they'll be like, okay, we'll construct an LLM version of that person who will then respond to your survey and give you the feedback that you want to hear.
Winston Lee
Yeah.
Julie Hoyer
Oi, oi, yoy.
Val Kroll
I would rather every app on my phone track me than make business decisions off of whatever garbage LLMs are spitting out for those different micro segments. I feel like that's dangerous.
Winston Lee
Well, actually, this is something we do as well. And if you do it properly. I mean, properly is a bit of a strong word, because everybody thinks they do it properly, and we certainly think the way we do it is proper. You can get some fairly decent results out of that. Let me describe to you how we do it and why we think it's actually, you know, quite legit. Last year, right before the US election, one of my academic collaborators tagged me and asked me whether I wanted to write up a paper. He's now a professor at the University of Southern California, and I've known him for a long, long time. And so I said, you know, sure, what do you want to write about? And of course he's well aware of what we work on: our synthetic population, all the stuff that we do. And he said, you know, right around the election, a lot of people are talking about using LLMs as a way to predict who somebody will vote for. He says, but you've got this unique advantage in that you have a data set that matches the exact distribution of the population. So in theory we can go through every individual in your population and ask the LLM to impersonate that person. So for example, to say, you are, let's say, a 24 year old black female with this kind of family income, with this many kids, living in the county of X in the state of Y. Now it's election time. Here are the two candidates, here are the policies, and who are you going to vote for? You can do that, and you can repeat that 300 million times.
300 million times, or, well, for everybody who is eligible to vote. Obviously you repeat that many times, and then you can do a bottom-up approach as opposed to just asking the LLM, you know, how will Arizona vote? How will California vote? How will this state vote? Rather than doing it top down, you do it bottom up. And the assumption, of course, is that the LLM may contain some bias, in that the LLM may be trained, let's say, mainly on Reddit data, and Reddit users are generally young people leaning towards maybe voting for one party or voting for one candidate. Now, if we supply the LLM with enough information, like the person's age, the person's financial situation, all these things about this person, it can tell us how they'll vote. So that's how we do it, and that's how we connect LLMs with our existing Synthetic Society: bottom up, ladder up each county, ladder up each state, and go up to see how people will respond to certain things. So that's how we've done it. The end result was interesting.
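Editor's sketch: the bottom-up idea, reduced to a toy. Everything here is hypothetical: the persona fields, the prompt wording, and especially `ask_llm_stub`, which stands in for a real LLM call and answers by an arbitrary rule (even age picks "A") so the example runs offline. It is not the setup used in the paper Winston mentions; it only shows per-person prompting followed by state-level aggregation.

```python
from collections import Counter, defaultdict


def build_prompt(persona):
    """Describe one synthetic individual to the model (illustrative wording only)."""
    return (
        f"You are a {persona['age']}-year-old {persona['gender']} living in "
        f"{persona['county']} County, {persona['state']}, with a household "
        f"income of ${persona['income']:,}. Here are the two candidates and "
        "their policies. Who are you going to vote for? Answer A or B."
    )


def ask_llm_stub(prompt, persona):
    """Stand-in for an LLM call. The rule is arbitrary and carries no meaning."""
    return "A" if persona["age"] % 2 == 0 else "B"


def poll_bottom_up(personas, ask=ask_llm_stub):
    """Ask each persona individually, then ladder the answers up to state shares."""
    tallies = defaultdict(Counter)
    for persona in personas:
        answer = ask(build_prompt(persona), persona)
        tallies[persona["state"]][answer] += 1
    return {
        state: {choice: n / sum(counts.values()) for choice, n in counts.items()}
        for state, counts in tallies.items()
    }


# Four made-up synthetic individuals.
personas = [
    {"age": 24, "gender": "female", "county": "X", "state": "AZ", "income": 48_000},
    {"age": 67, "gender": "male", "county": "X", "state": "AZ", "income": 61_000},
    {"age": 35, "gender": "female", "county": "Y", "state": "AZ", "income": 90_000},
    {"age": 52, "gender": "male", "county": "Z", "state": "CA", "income": 75_000},
]

results = poll_bottom_up(personas)
print(results)
```

Swapping `ask_llm_stub` for a real model call would keep the aggregation unchanged, which is the point of the bottom-up design: the model only ever answers as one described person at a time.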
Val Kroll
I was going to say, are you going to share?
Winston Lee
I will share. The end result is funny. It's both right and wrong, is how I'd describe it. It's wrong in the sense that it did not predict Donald Trump would win. But it was fairly close to what the market research polls were saying. So we had pretty much very similar numbers to what polls were saying. So if you look at this as, does it predict the election correctly? The answer is no. If you then ask a different question, if you say, does this give similar answers to polls but at a cheaper and faster rate, then the answer would be yes. And then along the way we learned, you know, after the fact, we tried different things. Like, for example, instead of just prompting the person to say, do you vote for candidate A or candidate B, you may first have to ask, you know, will you vote in the upcoming election? And then only if the answer is yes, then you prompt the machine again to say, okay, now given you're going to vote, here are the two candidates. Who are you going to vote for? So then the LLM would know this kind of person is more likely to vote, this kind of person is less likely to vote, and so on. So there are nuances to how these kinds of things are done. And yeah, people are always looking at ways of improving market research. And so this is one of the few things that people have talked about a great deal in the last little while.
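Editor's sketch: the two-step prompting described above, with both model calls replaced by hypothetical stand-ins (`will_vote_stub` and `pick_candidate_stub` use arbitrary rules so the code runs without an LLM). The only thing it demonstrates is the control flow: ask about turnout first, and ask for a candidate only when the answer is yes.

```python
def will_vote_stub(persona):
    """Stand-in for the first prompt: 'Will you vote in the upcoming election?'"""
    return persona["age"] >= 30  # arbitrary rule, for illustration only


def pick_candidate_stub(persona):
    """Stand-in for the second prompt: 'Given you will vote, who do you vote for?'"""
    return "A" if persona["age"] % 2 == 0 else "B"  # arbitrary rule


def two_stage_poll(personas, will_vote=will_vote_stub, pick_candidate=pick_candidate_stub):
    """Return (turnout rate, candidate shares among those who said they will vote)."""
    votes = {}
    voters = 0
    for persona in personas:
        if not will_vote(persona):
            continue  # non-voters never receive the candidate prompt
        voters += 1
        choice = pick_candidate(persona)
        votes[choice] = votes.get(choice, 0) + 1
    turnout = voters / len(personas) if personas else 0.0
    shares = {choice: n / voters for choice, n in votes.items()} if voters else {}
    return turnout, shares


# Made-up synthetic individuals; only age is used by the stand-in rules.
personas = [{"age": 24}, {"age": 67}, {"age": 35}, {"age": 52}]

turnout, shares = two_stage_poll(personas)
print(f"turnout: {turnout:.2f}")
print(f"shares among voters: {shares}")
```

Compared with a single "A or B?" prompt, the shares here are computed over likely voters only, which is the nuance Winston says they learned after the fact.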
Val Kroll
That's very cool. I mean, when Michael first sent that to me, I'm like, what type of garbage water is this? But that's legit. That sounds pretty interesting.
Michael Helbling
Well, in the interim, Val, I've stumbled across a couple more companies in the same space. So, like, obviously it's getting some kind of traction. Yeah, it's a thing. It's AI, smarter than all of us.
Julie Hoyer
But I wonder if it comes down, too, to the validity of the questions you're asking or the topic you're wanting to poll them on. Like, kind of back to what we were saying, you'd have to have sufficient data on the factors that maybe would influence that answer. Like, I feel like with politics, right, in that area, there's a lot of data there that would help. And so I'm wondering too, like, being smart, if you're a marketer needing market research and going to a company like that, you'd probably have to think through a lot of the assumptions you have, or again, what are your knowns and unknowns, and kind of balance, like, do I believe that this is going to be representative? So that's interesting to think about.
Winston Lee
Correct. And there are, again, a couple of different layers to that question. For one thing, even how you prompt the LLM will make a difference. But in market research, how you ask the question will also make a difference. So it's not like that's just a problem for LLMs. It's also a problem for people. And then the other thing is, prompting is really important, as it is with almost anything when you deal with LLMs. When you prompt that person, what kind of information, biographic information, you give about that individual will also make a very big difference. So if you want to make your simulated poll more accurate, so to say, you need to prompt it with the right information. So let me give you an example. We have the California Wine Institute as a client, and when the U.S.-Canada trade war first started, all of the American wines got pulled off the shelf. And this client we work with, who is a neighbor of one of our team members, she's the head of California Wine Institute for Canada. And, you know, the next morning she's looking at the news and going, like, how do I even do my job? They don't sell American wines anymore in Canada. So she came to us and said, can you do a poll for me asking who will resume buying California wine when, you know, Trump stops calling Canada the 51st state, or when the trade war is over? And so in answering this question, you can prompt the LLM in very different ways. We have data on, for example, who's currently buying American wine. So that would be a piece of information that we supply to the LLMs.
Obviously, your basic things like your age or gender or your income or where you live would play a role as well. But there could be a lot of, you know, psychographic data that you put in as well. So again, depending on what data you have already and how you prompt the machine, you'll get slightly different answers. So just like how people do real polls, polling an LLM has its own nuances as well, so to say.
Val Kroll
Yeah, so interesting.
Michael Helbling
All right, this is so interesting. And I'm, I think this is a topic that is sort of like, not a lot on people's radars. So thank you so much, Winston, for kind of coming in and sharing some of your Knowledge with us. We do have to, unfortunately, start to wrap up and kind of close up the show. But this has been really fun to talk about and hear from you on, so I really appreciate that. But one thing we do on the show is last call. We just go around, share something that might be of interest to our listeners. Winston, you're our guest. Do you have a last call you'd like to share?
Winston Lee
Sure. You know, I'll be in Chicago. I think at least Val is in Chicago. I'll be in Chicago in September, and we're sponsoring the ANA Measurement Conference. So if any of the listeners of the podcast are based in Chicago, would love to connect.
Michael Helbling
That's awesome. And are you going to come to Measure Camp Chicago, too?
Winston Lee
I'll try.
Val Kroll
I've already started to twist Winston's arm.
Michael Helbling
That would be awesome.
Winston Lee
Stay till Saturday.
Michael Helbling
Oh, okay, good. Because we'll be there. We'll be there. So it'll be great to see you again.
Val Kroll
You have to go to me next because that's going to be my last call. Sorry, I'm teeing myself up for the last call.
Michael Helbling
It's a natural next step, Val, that maybe we should hear what your last.
Val Kroll
Oh, I'm so glad you asked, Michael.
Julie Hoyer
Well.
Val Kroll
I wanted to encourage all of our listeners to consider joining us in Chicago for our second annual Measure Camp. It is on Saturday, September 13th, and we will be downtown again at the Leo Burnett building, which if you're not familiar with Chicago, Chicago was built around being enjoyed from the river, from the water. And it's right on the water. It's a great location. We had just over 200 people join us last year, and we expect even more people there this year to celebrate and to connect and share and learn from. So we would love to see you. So Measure Camp Chicago, you can grab your tickets now.
Michael Helbling
I have my ticket already.
Val Kroll
I am ready.
Michael Helbling
That's awesome. All right, Julie, what about you? What's your last call?
Julie Hoyer
Well, my last call happens to be an article. Cassie Kozyrkov, one of our faves. I know we talk about her a lot, but it's a recent article she wrote and I just found it really interesting. It was How to Pick a College Major in an AI-First World. And I really like the way she framed it. It's funny. She's like, you should get into one of two different types of majors, either a major in clarity or a major in curiosity. And she kind of talks about why each one is going to be really beneficial for AI, and it made me feel better about my major. It technically is one in clarity, so it's cool the way she talks about, like, well, if you go for clarity, how does it help you with AI readiness? And it talks a lot about how it is rigorous, black and white. Like, there is a right answer and a wrong answer. And it helps you a lot with logic and thinking through a process problem and how you frame a problem. And so how can you think about the use of AI more like, not right and wrong, but the logic side of it, instead of soft logic things? It's a more rigorous, you've got to stay within these bounds type of thinking. And then curiosity is kind of on the other side. It's very much around majors that are going to help you with adaptability and lifelong learning. Like, how do you fuel your strengths and what you're curious about? So I just really loved her framing, because I think a lot of people now are like, should I go to college? Do I need to go to college? How do I get ready for a job that's inevitably going to use AI? So it was just a really good read. Whether you're thinking of going to college right now or not, it's a good one.
Michael Helbling
I've been to so many graduation parties over the last couple months, so this is good.
Julie Hoyer
It's timely, right?
Michael Helbling
Yeah.
Julie Hoyer
Yeah. What about you, Michael?
Michael Helbling
Well, I also have something that's AI related. There's a paper that recently came out that took a group of users and had them take a quiz. And then while they're taking the quiz, they compared the ability of an LLM to influence their answers versus a human who was paid money if they were able to correctly influence or incorrectly influence their answers. And the LLM vastly outperformed the humans in its ability to persuade people.
Val Kroll
It's got goosebumps.
Michael Helbling
Yeah. So it's already there that LLMs are extremely persuasive. And so the paper walks through some of the things. But we'll put a link to this on the website if you want to check out that paper. It's a pretty interesting initial read. There's a lot more research to be done, better experiments to do in this field, but as a first little read, it was quite interesting. So, anyway. All right, well, it's been such an honor and a privilege, Winston. Thank you so much for coming on the show, spending some time with us. Really appreciate it.
Winston Lee
Yeah, likewise. Thanks for having me.
Michael Helbling
Synthetic data. I think it's got a big future. So it's really cool to kind of break into this topic for the first time on the show. We've been doing the show for 10 years. We've never talked about synthetic data. So it's finally time. So yeah, thank you so much. And of course, as you've been listening out there, you might have more questions or you might want to join the discussion. We would love to hear from you. The best way to do that is through the Measure Slack chat group or on LinkedIn. Or you can also send us an email at contact@analyticshour.io. We'd love to hear from you. And of course we want to always give a big shout out to Josh Crowhurst, our producer, because of everything he does. And thank you, Josh. And of course, I think I can speak for both of my co-hosts. Whether we're using real data or synthetic data, Val and Julie, I think we all agree people should keep analyzing.
Winston Lee
Thanks for listening. Let's keep the conversation going with your comments, suggestions and questions on Twitter at analyticshour, on the web at analyticshour.io, our LinkedIn group and the Measure Chat Slack group. Music for the podcast by Josh Crowhurst. So smart guys wanted to fit in, so they made up a term called analytics. Analytics don't work. Do the analytics say go for it.
Julie Hoyer
No matter who's going for it.
Winston Lee
So if you and I want in the field, the analytics say go for it. It's the stupidest, laziest, lamest thing I've ever heard.
Val Kroll
For reasoning in competition. Rock flag and low resolution.
The Analytics Power Hour: Episode #274 – Real Talk About Synthetic Data with Winston Lee
Release Date: June 24, 2025
In Episode #274 of The Analytics Power Hour, hosts Michael Helbling, Val Kroll, and Julie Hoyer delve into the burgeoning field of synthetic data with esteemed guest Winston Lee. Synthetic data, a term that might evoke images from a sci-fi narrative, is increasingly integral to today's data-driven landscape. The conversation sets the stage by highlighting synthetic data's role in machine learning, privacy protection, and enhancing dashboards when real data poses challenges like missing values or outliers.
Michael Helbling opens the discussion by emphasizing the practical applications of synthetic data:
"Whether you're using it for machine learning, protecting privacy, or just giving your dashboard something to chew on when the real data won't play nice, it's definitely having a moment in our industry." ([00:14])
Winston Lee provides a foundational understanding of synthetic data, distinguishing it from anonymized or merely sampled data. Unlike data collected from real-world events, synthetic data is generated algorithmically to mimic real data patterns without directly replicating actual records.
Winston Lee clarifies:
"Synthetic data, to put it simple, it's data sets that are generated by an algorithm as opposed to being collected from some sort of real event." ([02:20])
He further demystifies synthetic data by asserting its authenticity and utility:
"We're not making up data. The algorithms that we use to generate synthetic data are indeed trained on real data... it is based on learnings of patterns from real data." ([04:00])
The discussion transitions to practical applications, where Winston elucidates that synthetic data primarily serves as an alternative to real data in scenarios constrained by privacy laws or procurement challenges. It's not about introducing entirely new capabilities but about enabling existing processes in a privacy-compliant manner.
Winston Lee explains:
"People consider synthetic data more as a way to, let's say, be able to do things that, you know, privacy laws don't otherwise allow them to do." ([05:51])
Val Kroll probes deeper into specific use cases, prompting Winston to discuss how synthetic data can augment low-resolution datasets, enhancing their granularity without compromising individual identities.
A critical aspect of the conversation revolves around the limitations and potential misuses of synthetic data. Winston warns against the misconception that synthetic data can "magically" generate information without a real-world basis.
Winston Lee cautions:
"One is where they think synthetic data could just miraculously invent some stuff for them." ([34:35])
He also emphasizes that synthetic data is only as reliable as the statistical properties it preserves:
"It's only statistically meaningful, it's only statistically equivalent... you have to look at it across a group of them." ([34:42])
Julie Hoyer raises concerns about biases, especially when synthetic data is used to model missing data, highlighting the importance of ensuring that the synthetic data accurately represents the underlying population.
The conversation contrasts synthetic data with other data privacy techniques like differential privacy, which involves adding noise to datasets to obscure individual identities. Winston delineates the distinct approaches:
Winston Lee states:
"For us, synthetic data is much more about recreating the data set in such a way that statistical properties are preserved, but the actual sort of cells are different." ([20:28])
This distinction underscores synthetic data's focus on maintaining utility while ensuring privacy, as opposed to merely obfuscating existing data points.
Winston shares insightful examples of synthetic data in action within market research. He describes a scenario where synthetic data enables nuanced audience segmentation, facilitating more accurate estimations of consumer behavior without exposing individual identities.
Winston Lee elaborates:
"So, so, let me give you a simple scenario... synthetic data is like that too. Not to say exactly... but at least if you were to build models or do analysis on the synthetic data set, you can expect it to work just as well as if you were to use the real, real, real low resolution, high resolution data set." ([29:27])
Additionally, Winston touches upon the integration of Large Language Models (LLMs) with synthetic data to enhance predictive analytics, demonstrating innovative intersections between different data technologies.
The episode explores the future trajectory of synthetic data, highlighting its potential to revolutionize data privacy and analytics. Winston underscores the importance of continuous model maintenance and the need for thoughtful application to avoid over-extrapolation.
Winston Lee advises:
"There is a little bit of a judgment call as to what is appropriate. And there's definitely no sort of fixed formula to say, well, let's just plug these numbers in and here comes the synthetic data and then we're done." ([12:22])
As the episode wraps up, the hosts and Winston reflect on the significance of synthetic data in the evolving analytics landscape. They reiterate the necessity of understanding its capabilities and limitations to harness its full potential responsibly.
Michael Helbling concludes:
"Synthetic data. It's sort of going to be. I think it's. It's got a big future. So it's really cool to kind of break into this topic for the first time on the show." ([56:10])
Michael Helbling urges listeners to stay engaged and continue the conversation through various platforms, reinforcing the collaborative spirit essential for advancing the field.
For those interested in exploring synthetic data further or engaging with the community, consider joining discussions on the Measure Slack chat group, LinkedIn, or reaching out via contact@analyticshour.io.
This summary captures the essence of Episode #274, providing a comprehensive overview of the discussions on synthetic data. Whether you're a seasoned data scientist or simply curious about data analytics, this episode offers valuable insights into the current and future state of synthetic data.