Seeing the unseen: combining data to better understand our environment

LSE: Public lectures and events

Wed Oct 29 2025

Summary

Podcast Summary: LSE Public Lectures and Events

Episode: Seeing the Unseen – Combining Data to Better Understand Our Environment

Date: October 29, 2025

Main Theme

This episode brings together perspectives from statistics and economics to explore how combining diverse data sources—ranging from satellites to social media—can help us better understand, monitor, and manage environmental challenges. Professor Claire Miller (University of Glasgow) shares advances in environmental statistics and data fusion. Dr. Sephi Roth (LSE) offers the economic angle, focusing on air pollution, information quality, and policy impact. The discussion underlines the critical importance of integrating multiple data streams, the challenges of data quality, and the implications for research, policy, and individual action.

Key Discussion Points and Insights

1. Introduction & Framing (00:14)

Host: Milan Vojnović, Head of the Department of Statistics, LSE, introduces the event, presenting Professor Claire Miller and highlighting the importance of environmental statistics and the value of cross-disciplinary collaboration.
Emphasis on the event’s goal: bridging statistical methods and environmental economics to address real-world challenges.

2. Professor Claire Miller: Data Fusion in Environmental Monitoring (04:15)

(a) Global Lake Water Quality and Satellite Data

Project: The Global Lakes Project (NERC-funded)
Data Types: Satellite retrievals from the MERIS instrument (on ESA's Envisat platform), processed into chlorophyll estimates that serve as a proxy for water quality across 1,000 large lakes.
Objective: Understanding global patterns and temporal changes in water quality, analyzing how lakes may "cluster" in their response to environmental change.
- “Lakes themselves are thought to be sentinels of change. So if we can understand the processes and changes at work within lakes, then it can help us to understand more about environmental change...” (05:38, Miller)
Key Insight: No single data source provides a full picture; integrating satellite and ground-based data strengthens inferences about environmental change.
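As a rough illustration of what looking for lakes that "cluster" in their response could mean in practice, the sketch below groups time series whose pairwise correlation is high. This is not the Global Lakes Project's method, which the talk does not describe at this level of detail; the function name `coherence_groups`, the correlation threshold, and the toy series are all assumptions made for illustration.

```python
import numpy as np


def coherence_groups(series, threshold=0.8):
    """Group time series that move together (hypothetical helper).

    series: array of shape (n_lakes, n_times), one chlorophyll-like
    series per lake. Lakes are linked when their Pearson correlation
    is at least `threshold`; linked lakes share a group label.
    """
    series = np.asarray(series, dtype=float)
    corr = np.corrcoef(series)  # (n_lakes, n_lakes) correlation matrix
    n_lakes = corr.shape[0]
    labels = [-1] * n_lakes
    next_label = 0
    for start in range(n_lakes):
        if labels[start] != -1:
            continue
        # Flood-fill over all lakes reachable through high correlations.
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n_lakes):
                if labels[j] == -1 and corr[i, j] >= threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Toy data (made up): two seasonal lakes and two trending lakes.
months = np.arange(24)
seasonal = np.sin(2 * np.pi * months / 12)
trend = months / 24.0
toy_lakes = np.array([seasonal, 2 * seasonal + 1, trend, 3 * trend - 2])
print(coherence_groups(toy_lakes))  # → [0, 0, 1, 1]
```

Correlation-based grouping is only a stand-in here; it ignores gaps, noise, and lagged responses that real satellite series would have.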

(b) National River Monitoring & Chemical Mixtures

Project: MOT for Rivers (England, Scotland, Wales)
Data Types: Agency data on nutrients and metals at 13,000+ sites.
Objective: Go beyond single-chemical monitoring by analyzing chemical mixtures and their interactions with landscape and biology.
- “We now have the possibility to think about, is it actually the mixture or the interaction of chemicals in the water that we need to carefully control?” (07:07, Miller)

(c) Urban Environment & Non-Traditional Data

Project: GALLANT (Glasgow as a Living Lab)
Data Sources: Classical (official stats), unstructured (photos, social media), citizen science.
Key Challenge: “The data associated with these challenges can be both diverse and complex… we’re interested in a variety of different data sources.” (09:53, Miller)

(d) Types and Challenges of Environmental Data (13:00)

In-situ measurements, automatic sensors, remote sensing, and citizen-driven data all offer unique perspectives but need careful integration.
Key Challenges Identified:
- Heterogeneity: Data recorded at different times, scales, and qualities.
- Missingness & Connectivity: E.g., network effects in river systems, missing points from sensors.
- Bias: Especially acute when using non-traditional/citizen data (self-selection, design).
- Uncertainty: Must be quantified and communicated in predictions.
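The connectivity point can be made concrete with a small calculation. In the talk, Miller notes that sites on a river network are connected, so there are fewer independent pieces of information than the number of sites suggests. The sketch below uses the textbook effective-sample-size formula for equally correlated observations; the equal-correlation assumption and the function name `effective_sample_size` are illustrative simplifications, not something stated in the talk, and real river networks would have correlation that depends on flow and distance.

```python
def effective_sample_size(n_sites, rho):
    """Effective number of independent observations (hypothetical helper).

    Assumes every pair of sites shares the same correlation `rho`
    (an illustrative simplification). Under that assumption the
    variance of the mean is inflated by 1 + (n_sites - 1) * rho.
    """
    if n_sites < 1:
        raise ValueError("n_sites must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie between 0 and 1")
    return n_sites / (1.0 + (n_sites - 1) * rho)


# 100 monitoring sites under increasing shared correlation (made-up values).
for rho in (0.0, 0.1, 0.5):
    print(rho, round(effective_sample_size(100, rho), 1))
```

Even a modest shared correlation shrinks 100 sites to the information content of roughly nine independent ones, which is why models for network data need to account for the dependence.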

(e) In-Depth Example: Data Fusion for Water Quality (18:30)

Case Study: Lake Balaton (Hungary)—combining nine in-situ sampling points with 7,500 satellite pixels.
- Approach: Statistical modeling aligns more accurate but sparse in-situ measurements with broad-coverage but lower-accuracy satellite data.
- Outcome: Better high-resolution predictions with quantified uncertainty, guiding future sampling strategies.
- “Such approaches enable us to investigate the patterns and the relationships… giving us that uncertainty information.” (23:13, Miller)
Extensions: Methods scale up for higher resolutions, adjust for different variables (wind speed, soil moisture), and incorporate multi-source data even with misaligned space/time stamps.
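A heavily simplified sketch of the fusion idea follows: calibrate the satellite values against the in-situ values at matched pixels, then use that relationship to predict everywhere the satellite sees, with a rough uncertainty range. This is not the model presented in the talk, which works with curves over time, lets the relationship vary across the lake, and accounts for spatial dependence. The function name `fuse_satellite_with_in_situ`, the straight-line relationship, and the normal-approximation interval are all assumptions for illustration.

```python
import numpy as np


def fuse_satellite_with_in_situ(sat_matched, in_situ, sat_all):
    """Calibrate satellite values to in-situ values (hypothetical helper).

    sat_matched: satellite values at pixels matched to in-situ sites
    in_situ:     in-situ measurements at those sites
    sat_all:     satellite values at every pixel to predict for
    Returns (prediction, lower, upper) with a rough 95% interval.
    Assumes a straight-line relationship and independent errors,
    which is a deliberate simplification.
    """
    sat_matched = np.asarray(sat_matched, dtype=float)
    in_situ = np.asarray(in_situ, dtype=float)
    sat_all = np.asarray(sat_all, dtype=float)
    if len(in_situ) <= 2:
        raise ValueError("need more than two matched sites")

    # Least-squares fit of: in_situ ≈ intercept + slope * satellite
    X = np.column_stack([np.ones_like(sat_matched), sat_matched])
    coef, *_ = np.linalg.lstsq(X, in_situ, rcond=None)
    residuals = in_situ - X @ coef
    sigma2 = residuals @ residuals / (len(in_situ) - 2)

    # Predict at all pixels; variance grows away from the matched data.
    X_new = np.column_stack([np.ones_like(sat_all), sat_all])
    pred = X_new @ coef
    xtx_inv = np.linalg.inv(X.T @ X)
    leverage = np.einsum("ij,jk,ik->i", X_new, xtx_inv, X_new)
    half_width = 1.96 * np.sqrt(sigma2 * (1.0 + leverage))
    return pred, pred - half_width, pred + half_width


# Made-up numbers: five matched sites, predictions for two other pixels.
pred, lower, upper = fuse_satellite_with_in_situ(
    sat_matched=[1.0, 2.0, 3.0, 4.0, 5.0],
    in_situ=[3.1, 4.9, 7.1, 8.9, 11.0],
    sat_all=[3.0, 10.0],
)
print(np.round(upper - lower, 2))  # the interval is wider at the pixel far from the data
```

The widening interval mirrors the point made in the talk that uncertainty information can indicate where new in-situ measurements would be most useful.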

(f) In-Depth Example: The GALLANT Project and Non-Traditional Data (28:00)

Framework: Doughnut Economics—sustainable, socially just urban design.
Data Innovations:
- Image Analysis: Using Flickr photos, image captioning, sentiment analysis to understand environmental perceptions and events (e.g., storm damage).
- Community Integration: Linking social media, citizen app (Communimap), and survey data.
- Quote: “We’re interested in these variety of different data sources to see, do we get the same messages, do we get different messages, what could be driving that?” (36:28, Miller)
Event Detection: Potential for near real-time detection of environmental incidents via crowdsourced data.
Key Takeaway: Statistical and analytic innovation are essential for extracting actionable insights from increasingly complex and non-traditional datasets.

(g) Reflections and Future Challenges

Need for careful, question-driven integration of data streams due to risks of bias, misrepresentation, and data overload.
Intersection with AI: AI/foundation models offer new forecasting tools, but require rigorous evaluation and thoughtful combination with established statistical frameworks.

3. Dr. Sephi Roth: The Economist’s Perspective (44:42)

(a) Information as the Foundation for Efficient Policy

Central Thesis: “Economists deeply care about information as we believe that information is one of the core foundation[s] for efficiency. So without it, markets will allocate resources inefficiently, policies are unlikely to be efficient…” (45:00, Roth)

(b) Air Pollution as a Case Study for Data Integration

Spatial Scale: Satellite data offers global coverage but often poor local resolution; station data (e.g., London) has high temporal resolution but limited spatial coverage.
- Illustrative Example: “If you just… measure the pollution on Kingsway… then measure… on the other side of the building…, you are very likely to get very, very different results…” (46:50, Roth)
Temporal Scale: Satellites give momentary snapshots; stations provide continuous time series.
Placement Bias: Monitors are not always randomly distributed; sometimes placed away from pollution “hot spots” for regulatory reasons.

(c) Exposure vs. Pollution Concentration

Exposure, not just ambient concentrations, matters most for public health, education, productivity.
Challenge: Need to combine multiple data sources (outdoor, indoor, mobility) to estimate true exposure.
- London Study Example: Joint Camden Council project with in-home air monitors showed ambient data “predict very badly the indoor environment. And then… the reason is that because there are many, many indoor sources, even in London.” (55:13, Roth)
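One simple way to see why exposure differs from ambient concentration is a time-weighted average over the places a person spends their day. The sketch below is illustrative only: the function name `time_weighted_exposure` and all the hours and concentrations are made-up assumptions, not figures from the Camden study.

```python
def time_weighted_exposure(segments):
    """Time-weighted mean concentration (hypothetical helper).

    segments: list of (hours, concentration) pairs, one per
    micro-environment such as home, commute, or outdoors.
    """
    total_hours = sum(hours for hours, _ in segments)
    if total_hours <= 0:
        raise ValueError("total time must be positive")
    return sum(hours * conc for hours, conc in segments) / total_hours


# Made-up day: values are illustrative, not measurements.
day = [
    (16, 20.0),  # at home (indoor level)
    (2, 40.0),   # commuting near traffic
    (6, 10.0),   # outdoors, at the ambient level a monitor might report
]
print(round(time_weighted_exposure(day), 1))  # → 19.2
```

In this toy day the person's average exposure is nearly double the ambient reading, because most hours are spent indoors where other sources dominate.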

(d) Impact of Better Information on Behavior and Policy (58:18)

RCT Finding: Simply providing real-time indoor pollution data to households reduced their exposure by over 30%.
- “Just providing the information, not telling them what to do… reduced pollution exposure in the home… by more than 30%.” (59:24, Roth)
Policy Angle: High-quality, precise data critical to effective, proportionate regulatory design (e.g., setting pollution taxes at the right level).

(e) Caution: More Data Isn’t Always Better

Importance of data quality: Combining datasets can amplify errors if sources are flawed; must rigorously validate and triangulate whenever possible.
- US Example: High agreement between satellite and ground station data in the US, but poor correlation in many other regions.
- “We need to be very careful and always verify the data source…” (62:54, Roth)

(f) Conclusion

The future lies in integrated, multi-stream data approaches, executed with robust validation and awareness of their limits.
- “We need to do it well. We need to be careful with the data that we use, which source we use and how we use it.” (64:49, Roth)

Notable Quotes and Memorable Moments

Miller on Interdisciplinarity: “It’s all about a question… The approach that we take… is very much driven by the important question that we’re trying to answer… What might be the data associated with that? And therefore, what is an appropriate statistical approach?” (11:18)
Miller on Statistical “Toolbox”: “I always talk about a toolbox of evidence… I want to have that trust in the answers… it’s about that full spectrum into a box of evidence.” (72:01)
Roth on Data Quality: "More data are better. And this is true most of the time, but not all of the time, because data quality really matters." (61:24)
Roth on Empowering Individuals: “Just providing the information, not telling them what to do… reduced pollution exposure in the home… by more than 30%.” (59:23)
Engaging Q&A: Discussion ranged from methodological details about data fusion and model uncertainty, to the difficulties of data access, to the tension between metrics standardization and preserving rich biodiversity information.

Important Q&A Segments with Timestamps

On Uncertainty in Predictions and Machine Learning Applications (67:03)
- Discussion about dispersing uncertainties across spatial models and the potential for image-based, ML-driven environmental predictions in future work.
On Risks of Unstructured Data Swamping Structured Data (70:02)
- Miller emphasizes need for “toolbox of evidence,” careful weighting, and critical appraisal of data streams.
On When to Stop Collecting Data (73:46)
- Miller notes that data collection frequency should be tailored to the question and nature of the variable (e.g., water vs. air pollution).
On Data Access and Data Sharing Barriers (77:06)
- Acknowledgement that even with open access trends, data often remains unavailable at the granular/raw scale needed.
On Standardization versus Complexity in Biodiversity Reporting (79:35)
- Panel recognizes the tension between richness and usability—“standardization can make data less informative, yet not enough standardization risks overwhelming users.”
On Future Impacts of Cheaper Satellite Data and AI (84:59)
- Panel discusses likely growth in data volume and game-changing technical innovations (e.g., the MTG-S1 satellite offering 3D atmospheric data), but underscores the continuing need for validation and human judgment.

Final Thoughts

Integrating multiple, heterogeneous data streams offers the potential for deeper, more accurate insight into complex environmental systems, but requires rigorous methodology and attention to data quality, bias, and appropriate analysis.
Advances in technology—satellites, AI, citizen science—are rapidly shifting the field, but foundational statistical principles and careful, question-driven research remain central.
The economic perspective highlights that better environmental information can directly drive more efficient, effective policy, as well as empower individuals in their daily decisions.

Recommended For:
Researchers, policymakers, data scientists, environmental economists, and anyone interested in the intersection of statistics, technology, and environmental sustainability.

For further reading and details, Professor Miller referenced her collaborators, NERC projects, and data resources; listeners are encouraged to follow up with the references provided during the talk.

Transcript

A (0:02)

Welcome to the LSE Events Podcast by the London School of Economics and Political Science. Get ready to hear from some of the most influential international figures in the social sciences.

B (0:14)

While people are still coming in, let me welcome everyone, including people who are attending online. Thank you for coming to this LSE event. My name is Milan Vojnović, I am the Head of the Department of Statistics. I'm a Professor in Data Science. We host events like this once in a while to engage with the research community and also the public. So the event is always open to anyone. I'm very pleased for this event to be co-hosted with the Department of Geography and Environment.

I'll tell you maybe a few words about the Department of Statistics before moving on to introducing the speaker. We are very proud of our history. The Department of Statistics is one of the founding areas of LSE. We offer teaching at different levels, including undergraduate (Bachelor of Science), Master's and PhD, across different subjects including Data Science, Actuarial Science, Statistics and Quantitative Finance. We also offer degree programs across different departments, that is, we deliver programs jointly with other departments. For instance, the MSc in Health Data Sciences is offered jointly with the Department of Health Policy. And we have a new, exciting program, the BSc in Economics and Data Science, jointly with the Department of Economics. So a lot of exciting things are going on. We conduct research in different areas including data science, probability, finance and insurance, social statistics, and time series and statistical learning. So much about the department.

Let me move on to introduce Claire Miller. I have the privilege to introduce Professor Claire Miller from the University of Glasgow, who will deliver a talk entitled “Seeing the Unseen: Combining Data to Better Understand Our Environment”. This talk addresses the important topic of environmental statistics, and Claire is a world-leading figure in this area. Her work focuses on developing spatial, temporal, statistical and data analytic methodologies to address questions motivated by real-world problems, typically those related to the environment.
She has been involved in a range of projects related to urban sustainability and several projects on water quality monitoring and modeling. Lastly, her contributions have been recognized by the 2025 Barnett Award for Outstanding Contributions to the Field of Environmental Statistics, which is awarded by the Royal Statistical Society.

Following Claire's talk we'll have Sephi Roth, who is actually local, from this building. He's an Associate Professor of Environmental Economics with the Department of Geography and Environment. He is also the founder and co-leader of the Economics of Air Pollution Research Group at the Grantham Research Institute on Climate Change and the Environment. I hope this will lead to an engaging discussion as we bring together an expert in statistical methodologies with an expert in environmental economics. So it's an interesting mix. Without further ado, I will give the floor to Claire, please.

A (4:15)

Many thanks, Milan. I am delighted to be here this evening. Many thanks to the Department of Statistics, to the Global School of Sustainability, and to all the event organisers for making tonight possible. And of course, many thanks to Sephi for joining us later as well. This is joint work with very many colleagues and so I will do my best to acknowledge them throughout the presentation, but will also say more about my collaborators at the end of the presentation. So I wanted to start by telling you about some of the projects and questions that are of interest to us and that we've worked on or are working on just now.

So, globally, what is the status of lake water quality for 1,000 large lakes globally? So this was the focus of our Global Lakes project. It was funded by the Natural Environment Research Council and we were specifically interested in satellite data. So we had data from MERIS, so the Medium Resolution Imaging Spectrometer, that was an instrument on the European Space Agency's Envisat platform. And so we had satellite retrievals from MERIS, and my colleagues at the University of Stirling and also Plymouth Marine Laboratory processed those satellite retrievals. Initially, it gives us information on color reflectance, but they processed the product so that we were working with chlorophyll. So we had information on chlorophyll for these 1,000 lakes that you can see indicated by the dots on the map. Now, chlorophyll is an indicator of water quality and typically we can think of higher levels of chlorophyll as indicating poorer water quality. So lakes themselves are thought to be sentinels of change. So if we can understand the processes and changes at work within lakes, then it can help us to understand more about environmental change and the potential impacts of global changes. So we looked at chlorophyll in each of these lakes. We looked at it across the water quality for the lake, but also over time.
And we were specifically interested in something called temporal coherence. And so that's what we're trying to get a handle on, in the groupings that you can see down in the bottom right. So we wanted to know what the patterns were in the water quality, how was chlorophyll changing over time? But more importantly, did we see common patterns? Were the lakes clustering together, were they grouping together in some way? So that was a global example.

So, next, moving to a national picture, this is a project that we're currently working on, MOT for Rivers. This is also funded by NERC, the Natural Environment Research Council. And we're looking at data now for rivers across England, Scotland and Wales. And we have data from the Environment Agency, from the Scottish Environment Protection Agency and from Natural Resources Wales. Now, the specific question of interest here is about mixtures. Traditionally, when thinking about the water quality for rivers, you might look at individual chemicals or a small number of the chemicals. But here we now have the possibility to think about, is it actually the mixture or the interaction of chemicals in the water that we need to carefully control? So we're looking at data from over 13,000 water quality sites that you can see in the bottom left. We've got information on nutrients, we've got nitrate, we've got phosphates, we've got metals. We're interested in how they're interacting with what we see in the surrounding catchment. So interactions with the land cover, with flow, with precipitation, with altitude, etc. And in terms of impact, it's all about the biology. So for a healthy ecosystem, we want to carefully manage any impacts on these biology metrics.

My third example for now takes us to a more local scale. So this is all about Glasgow. Glasgow is where I work, so I'm at the University of Glasgow, and we are working at the moment with Glasgow City Council on this project, GALLANT, that I'll say more about later on.
But here we're interested in the urban environment. So how is Glasgow's living space changing? What effect might interventions have on the future environment? And in this project, we're interested in non-traditional data. So we might have more classical data sources, we might have official statistics, we might have deprivation indices, but equally we might have information from photos or from social media.

So data, this is what we're interested in. This is the main focus of our conversation this evening. The data associated with these challenges can be both diverse and complex. For example, we talk about both structured and unstructured data. Structured data is maybe more what you would expect: numbers. So the level of chemical concentration in water or the number of species that we can observe. Unstructured data is data that we might typically not be aware of. So it could be contained in photos or contained in text. And so we're interested in a variety of different data sources. These data sources might be recorded at different points in time, they might be recorded at different geographic locations, or they might be an average. They might be telling us about an area of the land, or they might be telling us about a monthly figure or an annual figure, and they could be available or collected in a variety of ways. So the data might be available for a particular designed study, or it might just be automatically streaming. We all know that these days we're surrounded by data, everything that we do.

So we're interested in how we can use multiple data sources in combination. And I've said potentially to provide additional information. And the reason I've said potentially there is because, of course, there's lots of different considerations when we use multiple data sources in terms of using them in an appropriate or sensible way. And so our main interest is to think about how we can use statistical modeling or data analytics approaches to enable us to do that.
And for us, it's typically all about a question. So the approach that we take or what we're interested in doing is very much driven by the important question that we're trying to answer. And in all of my work that comes from the real-world challenge: what's the environmental question? What might be the data associated with that? And therefore, what is an appropriate statistical approach?

So I thought I would take a step back now and just think about environmental data in general. Where do environmental data come from? How do we collect it? Well, the different sources that we have are continually evolving and expanding. As with many things these days, it's very fast paced in terms of the technology. If I take the water quality example, first of all, just to give some examples, we could have information from in situ. And so that's thinking about a classical approach where you might go to a river or a lake, take a manual sample that's then taken to the lab, analyzed, gives you information on the water quality. Or we might have an automatic sensor, we might have some kind of buoy that's giving us information that's maybe recording at a higher temporal frequency, so more information over time. We could have remote sensing. There's various different technology there. It could be satellite, it could be drone, it could be aircraft imagery. So giving us that wider picture over the land, that wider geographic representation. And then as we talked about a little earlier, now we have more non-traditional data sources. So we could have information from citizen science, people inputting information into apps on their phone, taking photos, social media posts or images, public opinion posts. It could be videos, or we might record audio. We could take, say, audio of biodiversity, sounds of animals in the environment. But typically there's no one data source that provides us with a complete picture. Typically we don't have one data source that we think gives us all the information.
It's as accurate as it can be. So if we can combine our data streams, then it can enable us to unlock insights about patterns, changes or interactions and hopefully give us a greater understanding of our environment. So this is my main interest.

Where does the statistics or the data analytics come in to all of this? Well, these are various different aspects of the data and the types of data sources that we've looked at so far. And we can use statistical modeling and data analytics in order to account for these different features of the data. So we mentioned earlier that data might be recorded at different time points or different geographic locations. We often refer to that as just saying recorded over space or spatial locations. Through using modeling, we can think about a sensible way to relate those data streams together. How can we combine the data streams that are giving us similar information, but they're not recorded at exactly the same point?

We might have missing data over time and over space. So if we take the sensor example, a sensor can go offline in terms of Wi-Fi connectivity. It can get kicked by an animal, it can get blocked by leaves. With satellite, depending on the instrument that we're working with, we might have issues of cloud cover, we might have issues of land masking. It passes over at particular times or on particular orbits. So we can take account of these different features.

If we have data that are collected, say, over a river network that we've got here, we've got a lot of points on that river network. It looks like we've got a lot of information, but actually all of those points are connected by the network. So we need to take account of the fact that the data are connected. We don't have as many pieces of independent information as we think that we have. We've talked about data of different structure, structured or unstructured, and the way that that's arising might mean that our data are biased or they're not representative.
Historically, statistics was based around designed studies. We knew the properties of our data, but these days, typically we don't. And we don't want to not use the data just because we haven't collected it from a designed study. But we need to be aware of the potential bias or the potential for it not to be representative. So within our approaches, we can take account of the data quality, the variability or uncertainty.

So I wanted to go into two examples in detail, and this is to give you two examples of work that we've been doing and developing in this area. The first example we refer to as data fusion. And when we refer to data fusion, we're thinking about combining data where we're dealing with the same variable of interest. So here we're thinking about combining, for example, data recorded in situ on the ground with satellite data. And we've looked at various examples: water quality, wind speed, soil moisture. And I'll say a little about these. The second project is our GALLANT project, and I mentioned that briefly earlier. Now this is where we're very much looking at non-traditional data sources and thinking about how we can investigate sustainability.

So first of all, what are the patterns in water quality over the lake and over time for Lake Balaton, Hungary? So the data that we've got here, we've got information from nine in situ locations and you can see those nine locations spread throughout the lake. We've got information from just over 7,500 pixels in terms of satellite data. And these data were from MERIS, so the same instrument that I talked about earlier. For MERIS, it gives you a spatial resolution of 300 meters. So we have a 300 meter by 300 meter pixel, essentially, of information. Each time we're thinking about chlorophyll, that indicator of water quality. So higher levels, typically poorer. In this example, the in situ data are thought to provide more accurate information at a smaller number of spatial locations.
And some of those locations as well have gaps over time. But we can see that from the satellite, we have the full color picture across the lake and we have information over time as well, but it's not thought to be as accurate as the in situ data. If we look at location one here, down the bottom, then we've got a pull out from location one over in the right time series plot here. So the black points are the in situ data, the manual recordings. The red is from the satellite and this is a matched pixel location. So we can see the difference in the variability for the two sets of information. Spatially you can see a difference in the color, so stronger, darker red colors in the in situ, in the middle there, for example, than we see in the satellite.

So we're interested in how we can combine the in situ and the satellite data to provide a higher resolution prediction of the water quality over space and in time. And the way that we do that is to think about the data as a curve. So these two pictures, the bottom left, the top right, are the ones that you've already seen. And the bottom right, this is showing you the in situ data. So those points again from up the top here. But now we've put a curve through those points and this is the way that we are going to combine the in situ and the satellite data in order to use our modeling to do our data fusion for these products.

Now, an equations warning. OK, I do have three slides of equations. If you are not an equations person, just ignore them. I thought the statisticians might like some detail. OK, so first slide of equations. What we're doing is we have a curve over time for our in situ point. We have a curve over time for our matched satellite pixel. So this is our curve over time for our in situ. This is our curve over time for our matched satellite pixel. We then fit a model to describe the relationship between those two curves. And that's what's going on here.
And in that model, we allow that relationship to change as it moves over the surface of the lake. There's an additional feature that we need to build in. That's the idea that the information across the lake is related to one another. So location one and location two are not independent. So we need to take account of the fact that we have a relationship through the water quality. So this is how we combine our in situ and our satellite data.

And when we do that, we get a new set of predictions. So up the top here, we now have a new predicted surface. We've combined together that in situ and the satellite data. We have a more accurate product that we've got in terms of our information, and we also have an indication on the uncertainty. So this just gives us a range of values. So what we're saying is that at each point we have a prediction, but that prediction will lie in a range of values, so giving an indication of how accurate we believe it to be. And this is just showing you one spatial snapshot. Of course, we have this going through time as well.

So what does that enable us to do? Well, such approaches enable us to investigate the patterns and the relationships. It's enabling us to predict anywhere across the lake and over time, but also giving us that uncertainty information. And one of the reasons to do this is to give us greater confidence in areas where we only have the satellite data, because we can't take manual measurements everywhere across the globe. Another reason that we might be interested in here is to think about where you want to take new measurements in the future or where you want to place new in situ sensors in the future.

So this was just one example. We've worked on various developments here and just a couple of aspects that I'll pick up to give you an idea. We've looked at how to scale this up. So I mentioned that for the MERIS data, the spatial resolution was 300 meter by 300 meter.
But these days the technology can take us down to much finer spatial resolution. So down to less than 10 meters, for example. We also might want to incorporate additional data and so not just thinking about matching the same variable in situ and for satellite.

So two more slides of equations. Again, apologies if you're not an equations person. Bear with me. First one, just mentioning about scaling up or speeding up as we move to higher resolution in terms of the spatial data. So with more technological advances, we have much larger data sets that we're dealing with. I mentioned earlier that we need to take account of the fact that information across the lake will be related. Location 1 and Location 2 are not independent. And in the model, we did this through this particular line here. But this becomes very computationally intensive. If we have a lot of data, it can take a long time to fit. And so we've been working with various approximations so that we can estimate this using a smaller set of data and then scale it back up.

Two other examples: we've been looking at wind speed and we've been looking at soil moisture. In the wind speed example, we're still thinking about relating in situ data to potentially lower-accuracy satellite data. We use a slightly different approach here, and this is so that we can capture more complex relationships. And we're also taking account of the fact that wind speed by its nature is a slightly different form of variable. So we can have a small number of extremes, we can have high values, but the majority of the data are at the lower or the more moderate values. So this approach allows us to take account of that. In the soil moisture example that we're working with, we want to build in additional information so we could get more accurate predictions if we don't just look at soil moisture, but also build in, say, rainfall, temperature and elevation.
But again, they are recorded potentially at different times and different geographic locations. And so we need a way in our modeling to account for that. So those are just a couple of the developments that we've been working on. Okay, if you're not an equations person, it's over, you can relax. Going back to something more general. So the second example, GALLANT. GALLANT is Glasgow as a Living Lab Accelerating Novel Transformation. It's a joint project with Glasgow City Council and again funded through the Natural Environment Research Council. The main aim, up the top there, is to use cross-disciplinary expertise to drive systemic transformation, where economic development is responsibly considered through ecological and social lenses. What does that mean? Well, we're interested in Glasgow's living space, as I said before: how it's changing and what effect interventions might have. This is a very large project and I'll go into more details in a minute. But one of the ideas and frameworks that we're working with in the project is something called Donut Economics, and that's related to the idea that we're thinking about a space that is ecologically safe and socially just. So the QR code here gives you a lot more information on what I'm going to refer to as the Glasgow Donut. The Glasgow Donut has not been developed by me, but by other colleagues within the team, particularly our systems transformation colleagues. And the Glasgow Donut is thinking about these ideas of Donut Economics. So this is taking us towards a situation where we want everybody in society to be able to thrive, but we don't want to do anything in society that then overshoots our ecological boundaries and has negative impacts on the environment. So the idea is that we want to be in the ring of the Donut in order for that to happen, so that we can be ecologically safe and socially just. So this is a bit more detail on our GALLANT project. Work Stream 1 I've just referred to.
So this is systems transformation. They're getting us to work in this whole systems approach where we're combining together urban environment and society, very much motivated by these ideas of the Donut. We have another work stream on community collaboration. So my community collaboration colleagues are going out within Glasgow, working directly with communities and citizens, involving them in the research, speaking to them about their perceptions of Glasgow, what they think needs to change, and how we might put in interventions. And then I lead Work Stream 3, which is on data and data analytics. We'll come back to that in a little while. Underpinning all of this, we have five work packages. Work package one is broadly about flooding, work package two is broadly about biodiversity, work package three is broadly about vacant and derelict land, four is about active travel, and five is clean energy at the community scale. And this is just to give you an indication of the size of the project. So a huge number of researchers are working on this right across Glasgow, across all of our different colleges: science and engineering, social science, medical, veterinary and life sciences, and also our arts. So to tell you a little bit more about data. Again, another QR code here. There's lots of information, so please feel free to look if you would like some more. But the type of thing that we're interested in here is a bit different from the first example that we looked at for water quality. The data here are a bit more diverse and non-traditional. We're interested in photos, interview transcripts, deprivation indices, social media posts, citizen science, etc. So within our work stream, we're interested in cutting across those GALLANT themes (flooding, biodiversity, etc.), thinking specifically about aspects of sustainability. And so we're developing data analytics approaches related to that and also developing our GALLANT data hub.
So this is to let you see some of the approaches that my colleague Luigi has been developing. This is thinking about how we can extract information from these non-traditional data sources. So we might be working with photos or social media posts, for example. Luigi's been developing different analytics pipelines. So how could we move, for example, from a photo to a description of that photo, and then think about what that description is telling us? Is it telling us something positive about the environment, or something negative? Can we detect events? So just to give you one example of the type of thing we've been looking at, we've looked at the social media platform Flickr. From Flickr, Luigi managed, through the process of curating a large vocabulary related to GALLANT, to come up with 30,000 posts between 2019 and 2024, not just for Glasgow, but for Glasgow and the wider central Scotland area, as you'll see in a minute. Those Flickr posts are typically giving us photos. We've then used image captioning models to move from the image to a description, and then used ideas of sentiment analysis to think about what that description is telling us about these aspects of sustainability. And then the next stage is all about linking that potentially to other data sources: data zones, deprivation indices, etc. Okay, so one example here. This is all of the Flickr posts, each represented by a dot, 2019 to 2024. Glasgow is around about here, where we see the bulk, and we've got a wider representation around central Scotland and a little bit to the north. Each of these dots was an image. The image has been converted into a description and then sentiment has been attached to that description. The sentiment scale goes from negative down the bottom to positive up the top, so the negative is in the blues and the positive goes up to the red. So we can see one negative blue picked out there.
So let's zoom in for a little bit more detail. This is an example of the image that generated that blue dot, that negative sentiment. In terms of the tags that were attached to the image, they were "fallen", "storm" and "tree". When the image was processed into a description, we got "a tree that has fallen down in a yard". And you can see the representation. Now this image is AI generated. This is not the image from Flickr. And the reason that we've done that is because we want to be really careful and cautious about anonymity. Just because somebody has posted this on Flickr, they didn't necessarily want me to use it in this talk. So this is just to give you an indication of what the original image looked like. This shows you some of that data aggregated up. The specific areas that you can see shaded are data zones, and the definition of a data zone is that it contains approximately 500 to 1,000 people. Within these data zones we've got the number of Flickr posts that we had. So this one here, for example, has 151. This one here, 79. And these particular ones, when we've aggregated it up and looked at the number of posts, have picked out country parks around the area of Glasgow. So for us that's quite nice in terms of validation. It's not a surprise that if we're interested in biodiversity, we might have a lot of posts from people who are around those parks. Now this is really just a snapshot of the work, really just the start, because what we're interested in now is how we link this to other data that we have. The QR code here takes you to CommuniMap. CommuniMap is an app that's been developed by our community consultation colleagues in Work Stream 2. People can go onto the app and post about their surroundings: they can post photos, and they can post routes that they take as they walk through the city.
So we're interested in how we might combine what we see in social media with information that we get from the app and with information that we get from smaller scale conversations. It's obviously very difficult to get a picture of citizens' views of sustainability. So we're looking at this variety of different data sources to see whether we get the same messages or different messages, and what could be driving that. Of course we know that with social media such as Flickr, it's a self-selecting group that has decided to post these images, so we need to be careful about the use. But how does it link with other information and how can we use that? The other thing that we're interested in is event detection. We saw in the picture earlier that we had detected that the tree had fallen in the storm. If we could automate that type of process, it might really help in terms of planning and in terms of sorting out such issues, instead of having to go through a reporting system, etc. So there's lots more that needs to be done here, but there is potential in things that might feed into future city plans. So with that I'm going to start to wrap up. Environmental monitoring is a fast-moving landscape, with rapid innovation in the potential data streams. There is a lot of information out there that we can potentially use to tell us about the environment. But there are challenges and considerations. We need to be careful in the use of the data. So we're interested in how we use statistical modeling and data analytics to combine the sources in an appropriate way, always thinking about answering the particular questions of interest that are really driving it all. But hopefully we might uncover unseen information and get that richer picture to help inform monitoring, planning and management. Just very quickly, about our interests for the future: we want to extend this. We've talked about combining data streams. We can also think about combining models and combining approaches.
There's a lot of work these days on foundation models from AI for the environment, so using things like LLMs, such as ChatGPT, to look at environmental forecasting. But how can we combine that with the benefits that we can get from other modeling approaches as well? To finish up, I want to mention collaborators. I mentioned at the start that this is with many colleagues. I have done none of this work on my own, and a lot of the experts are in all of these teams: environmental science at the University of Stirling, environmental engineering at Edinburgh, geography at Dundee, environmental science at Lancaster, meteorology and remote sensing at Reading, the James Hutton Institute, the UK Centre for Ecology and Hydrology, Plymouth Marine Laboratory, the Environment Agency, and Glasgow City Council. These are just the people involved in the work I presented today. I work with many other people beyond this, and I won't read out all the names, but a very large number of people at Glasgow were directly involved in everything that I've presented today, both in my own school in maths and stats and in the wider GALLANT project, and we're linking up very much with the Urban Big Data Centre. I want to acknowledge the Natural Environment Research Council. They have funded all of the work that I've talked about today in these three specific projects. And a very important disclaimer, in the blue: I very much apologize to anybody that I have missed. I want to acknowledge everybody, and I'll finish with some references if anybody wants a little bit more of the detail. Thank you very much.
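
The aggregation step described in the talk (counting sentiment-scored posts within each data zone) can be illustrated with a minimal sketch. This is not the GALLANT pipeline: the zone identifiers, the sentiment scores and the function name `summarise_by_zone` are all hypothetical, and the scores are assumed to lie between -1 (negative) and +1 (positive).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (data_zone_id, sentiment score in [-1, 1]).
# Zone IDs and scores are invented for illustration only.
posts = [
    ("zone_A", 0.8),
    ("zone_A", 0.4),
    ("zone_A", -0.6),
    ("zone_B", 0.2),
    ("zone_B", 0.6),
]

def summarise_by_zone(records):
    """Return {zone: (post_count, mean_sentiment)} for scored posts."""
    scores = defaultdict(list)
    for zone, score in records:
        scores[zone].append(score)
    return {zone: (len(vals), mean(vals)) for zone, vals in scores.items()}

summary = summarise_by_zone(posts)
for zone, (count, avg) in sorted(summary.items()):
    print(f"{zone}: {count} posts, mean sentiment {avg:+.2f}")
```

A per-zone summary of this kind is what would then be joined to other zone-level data, such as deprivation indices, with the caveat raised in the talk that the posters are a self-selecting group.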

C (44:42)

Thank you very much, Claire, for a really fascinating presentation. I will try to think what I can add. Actually, I wrote a few comments, but now I'm going to have to adjust some of them. But first of all, thank you for that. I'm an economist, so I'm going to look at things in a slightly different way. And I think it's good to first explain how I look at things and how economists look at things, information, etc. Economists deeply care about information, as we believe that information is one of the core foundations of efficiency. Without it, markets will allocate resources inefficiently, policies are unlikely to be efficient, and individuals like all of us are going to make poor decisions. So we all care about it very much. In environmental economics specifically, I think that this is even more important because the environment is incredibly complex. So we really want to understand what's going on. And I don't think it's likely that a single source of data will be able to portray the full picture. So this is why I'm really happy to be here today to discuss this. In the next few minutes, and I'm not going to take too much of your time, that's okay (I also started teaching today at 9am, and I see a few survivors here in the crowd; thank you for coming), I'm just going to share with you some thoughts on how combining different data sets might help us to better understand the environment. I'm going to look at the challenges that are involved more from an economic perspective. Perhaps some of it is very similar to what you were saying, but I think I'm going to be able to add a few more points. And importantly, I'm also going to try to say why I think all of this matters for research, of course, but also for public policy and for everyone that is here in the crowd today as individuals. So let me try to do all of that.
I know it's quite ambitious in the eight minutes or so that I have, but I'll try to do all of that and I'll try to make sense of it. Right. So let me also warn you before I start that Claire was talking a lot about water quality. I'm the air pollution guy, so I'm going to give you lots of examples about air pollution, because the majority of my research is around air pollution, which is kind of clear from your introduction as well, I hope. All right, so let me start with the fact, which Claire already mentioned, that not all environmental data capture the same spatial and temporal dimensions. So, for example, when we study air pollution, we can use different sources. Satellite data, for example, is fantastic in terms of the spatial coverage that you get nowadays. It's really amazing. You can get data on air pollution pretty much anywhere around the world. Now, some of you might think that's kind of obvious. But honestly, when I started working on air pollution, I couldn't even imagine a world where we can have really global data, everywhere that we want, which is really, really great. However, and this is the big however, the spatial resolution in many cases is relatively coarse. You mentioned some amazing data sets that you have for water pollution, which I found incredible. We don't have that, to the best of my knowledge, for air pollution. So, yes, it enables us to look at patterns of air pollution across countries, and across cities as well. But if we really care about what happens within neighborhoods, for example, that's something that we are unlikely to be able to get from satellite data. And if we care about variation within very, very small geographies, for example across the same road, it's unlikely to be something that we'll be able to pick up. And sometimes we have massive variation even within a very, very small space. So you can try it one day.
If you're an LSE student, or even if you're not, you can go and try to measure the pollution on Kingsway, which is on one side of this building, and then measure at the same time, if you have two monitors, the pollution on the other side of the building, which is much greener. And you can see that even in very close proximity, you are very likely to get very, very different results, because Kingsway is a very busy road. And because of the structure of the city, wind corridors, everything around, and the decay function of pollution over space, of course we're going to get different results. And sometimes we really care about that, because my office is on one side of the building and not the other. I'm not going to tell you which side it is. So this is one thing about the spatial scale. I'm an economist, but I'm in the geography department, so I have to talk about that as well. And I'm going to come back to that later today. The other thing that I want to mention is the time dimension, which is really important, and Claire already talked about that. Satellite data tend to provide us with information at a specific point in time. So it's going to give us a snapshot, for example, of what the pollution is at 10:30 in the morning. And then it moves somewhere else, because we need to cover the entire globe. But sometimes we really want to know how things evolve over the course of the day. For example, we're living in London: I don't know if the pollution is going to be the same during rush hour, especially if you're talking about pollution coming from road traffic, as during the nighttime, or other parts of the day, or weekends versus weekdays. So the time dimension that we can get the information for is really, really important as well. Okay.
In contrast to the satellite data that I just mentioned, we can also use monitoring stations. In London, we actually have relatively good coverage of monitoring stations across space. And that's going to give us incredible resolution, sorry, detailed information and also accurate information, I would say, in a great temporal dimension. We can get information every hour, every day, or whatever you want really, and in many cases you can get information even every second. So that's something that is going to be great. So you can already see how these two things together can help. And it's not just that satellite data give us the spatial dimension and the monitors give us the temporal dimension. We need both of them working together to get a really better understanding of what's going on. Before I talk more about the spatial dimension in a slightly different context, I just want to highlight that there are also many drawbacks to monitoring data. Claire was already showing, with a beautiful picture of the lake, that you don't get it everywhere. So obviously there is a spatial issue there. But in fact we have some studies, specifically on air pollution, showing that the allocation of monitors across space is not random. Some actors strategically place monitors in specific locations. So we have a non-random allocation of monitoring across space, and that will create a lot of statistical problems, because we're going to have non-classical measurement error. So that's something that we need to think about. We need to be aware of that.
So basically what I'm trying to say, and what's been shown in previous studies, is that in certain places policymakers specifically put monitors away from hotspots in order to be compliant with regulation. And we have some evidence for that as well. Okay. So I already mentioned that we are likely to see significant variation in pollution even within relatively small geographical areas, like the example that I gave you with the two sides of this building. But as an environmental economist, or as an economist more broadly, I'm not only interested in where pollution occurs and the level of pollution concentration across space. I'm also, and potentially even more, interested in how it affects people and the economy. Okay, and let me explain why I think about that and how I think about that with some examples. Again, I'm going to give you an example for air pollution. For air pollution, what really matters is not the concentration across space or how much we actually emit into the atmosphere. What really matters for our health and for our well-being is exposure: how much we are actually exposed to pollution, how much pollution we inhale over the course of the day. And we know that air pollution affects health, but it also affects education, crime, labor productivity, even the housing market. You name it, air pollution affects it. And measuring exposure is extremely difficult, right, because we talked about the spatial pattern of pollution, but we as humans also move across space. Sometimes we are at home, sometimes we commute to work or to school, sometimes we are in a public lecture. And we need to get an understanding of the pollution in each one of those locations to really get a full picture of exposure. All right? So to get this understanding, we need lots of data and we need to combine all of this data together.
Not just information about, for example, indoor air pollution and outdoor air pollution, but also information about how we move as individuals across space. And let me give you an example of what we used to do before, and also why that might not be the right thing to do. By the way, if I'm running out of time, let me know. The students here know that this is something that happens quite frequently, so I'm just going to give you a warning. So traditionally, the way that we study the relationship between air pollution and human health, for example, is that we take some data about ambient air pollution from monitors that are, as I mentioned before, placed in certain parts of the city, and link it mainly to the residential location of the individual, and try with some fancy statistical methods to get something that might arguably be semi-causal, something like that. I'm trying to be very careful here. I know where I am. So this kind of approach really depends on an implicit assumption that we make: that using ambient pollution as a proxy for exposure is good. So in a recent study here in London, we looked at that, among other things that we were interested in. We said, okay, let's see whether what the ambient air pollution monitor tells us about pollution outside predicts well what we see indoors. So what we did was team up with Camden Council here in London, just around the corner, and we randomly approached residents of Camden to ask them to put an indoor air pollution monitor in their house, so as to get as close as we can to a representative sample of the population in Camden. And then we collected data for two weeks before we intervened with something else, which I'm not going to have time to tell you about later.
And we just collected some information about indoor air pollution, because we don't know much about it. Is it bad? Is it good? I don't know. So we wanted to get some information, and also to test it against another data set that tells us what the pollution is outside, to see how well the data match each other. And what we found is that ambient air pollution in London actually predicts the indoor environment very badly, extremely badly. And then we combined it with some other data sets to really understand behavior, with survey data that we collected and some other data that I'm not going to discuss now. You can read the paper if you want, it's available online. What we found is that the reason is that there are many, many indoor sources, even in London. So we found it's smoking, it's cooking and many, many other things that you might be interested in. So using these two different data sources actually helped us to understand that what we did before is not very helpful, or it's good, but it's not that good. So this is another thing that I wanted to mention here. Combining these two data sets together really, really helped us.
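
The idea that exposure depends on where a person spends their time, not only on the outdoor concentration, can be written as a time-weighted average. The sketch below is a minimal illustration of that arithmetic and not the method of the Camden study; the diary entries, the concentration values and the function name `time_weighted_exposure` are hypothetical.

```python
# Hypothetical daily diary: (location, hours spent, mean PM2.5 in µg/m³).
# All values are invented for illustration only.
diary = [
    ("home (indoor)", 14.0, 20.0),
    ("commute", 2.0, 40.0),
    ("workplace", 8.0, 10.0),
]

def time_weighted_exposure(entries):
    """Average concentration experienced, weighted by hours in each location."""
    total_hours = sum(hours for _, hours, _ in entries)
    if total_hours <= 0:
        raise ValueError("diary must cover a positive amount of time")
    weighted_sum = sum(hours * conc for _, hours, conc in entries)
    return weighted_sum / total_hours

exposure = time_weighted_exposure(diary)
print(f"Time-weighted exposure: {exposure:.1f} µg/m³")
```

In this made-up diary, a single reading from any one location would misstate the person's exposure, which is the point made in the talk about using ambient monitors as a proxy.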

C (58:18)

That brings me to almost my last point: why better information matters and how we can actually use it to improve things. The main point, which is how I started my comments, is that better information can lead to better decision making. That's true for policymaking, but it's also true for individuals like ourselves. So let me explain what I mean and why it's better for policymaking. And then I'm going to give you an example of how it can help individuals as well. For policymakers, if we want to design an efficient environmental policy, we need really good and accurate data on the marginal damages and the marginal abatement cost of reducing pollution. Because if we don't have reliable data on that and we get lots of uncertainty as well, then we risk setting the wrong tax level, and therefore we're going to end up with a policy that might be too weak or too costly. This is not something that we necessarily want. In fact, I would say that we definitely don't want that. Okay. And we spend a lot of time here at LSE thinking about that and trying to get the policy right. But again, it's not just for governments. Maybe some of you work in government, but I assume that most of you don't. Individuals also need accurate information to make better choices. And just to show you how better information in this context can help, I want to go back to the study that I described before. By the way, I forgot to mention that this is with a co-author, Professor Robert Metcalfe at Columbia University, New York. So Rob, if you're watching me, I remembered to give you the credit as well.
So in the same study in Camden, when we put pollution monitors in people's houses, we also ran an RCT, a randomized controlled trial, in which the question was very simple: what happens when we provide people with really good information in real time? That's it. So what we did, as I mentioned before, was put pollution monitors in people's houses. For the first two weeks we just collected baseline information without any intervention. This is what enabled me to compare the outdoor and the indoor environment before we intervened with anything. Okay, this is really pure data. And then after two weeks, we had a treatment and a control group. The control group stayed the same: we just kept collecting information. But for the treatment group, we enabled a screen on their monitor after two weeks, so that they could now see their pollution in real time. And what we found was remarkable. We found that just providing the information, not telling them what to do, just providing them with this information, reduced pollution exposure in their home by more than 30%. And for households that suffered from a lot of pollution to begin with, it was much higher than that. This is on average, and this is during the time that they're actually at home, not when they are at work or something like that. In fact, when they're at work, we don't see any effect, which is what we expect. So you can see how giving accurate information can really affect people's decisions and hopefully can lead to better health and, I would say, a better economy as well, because we have lots of evidence that better air quality also affects labor productivity, and obviously better health is something that we all want. The last point that I want to mention, and I promise to finish after that, is one that I think Claire at some point touched on.
I think that you mentioned that in passing, but I want to make it more explicit here, because economists like me always say more data are better. And this is true most of the time, but not all of the time, because data quality really matters. Okay, so it's great to combine all the data together. And as I mentioned, and as Claire was talking about before, it really gives us a better picture, really understanding things that we couldn't understand with one data source. But I can see a situation where we combine data and actually make things worse. So the point that I'm trying to make here is that before we use any source of data, we need to make sure that the data is good. We need to make sure that the quality is up to the standard that we're hoping for. So for example, and this is the last example, when we use satellite data and measurements coming from stations across London, we can use these two different data sets to see whether the data is good or not. So let me give you another example, from a study that we conducted a few years ago looking at the relationship between population density and air pollution in the United States. In the United States, we have a problem with the allocation of measuring stations across space. This is what I was talking about before, the strategic placement of pollution monitors. So we really wanted to use satellite data that give us the full coverage, because we didn't want to miss any point in the US, but we weren't sure how good the data is. So what we did was check how good the data is by using another source of data, which is the monitoring data that we have on the ground, where we know what the true pollution level is at a specific location at a specific time.
And we just looked at the correlation, the R squared, between what the satellite tells us and what the pollution on the ground actually is, from the monitoring stations that we trust. And we found that in the United States this correlation is extremely high. But then when we tried to do a global study and checked exactly the same product in other parts of the world, we found that the correlation is really, really low. So you need to be very careful and always verify the data source. So just to conclude, I think that combining data helps not only to better understand the environment, but also to move from simply monitoring the environment to understanding how it interacts with people and the economy, and potentially to actually change people's behavior. And I think that this is definitely the future, and we're going to do more and more combining of data. And again, we need to do it well. We need to be careful with the data that we use, which source we use and how we use it. Thank you very much.
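
The validation check described here, comparing a satellite product against trusted ground monitors through R squared, can be sketched as the squared Pearson correlation between paired readings. This is a minimal illustration under stated assumptions: the readings are invented, the two "regions" are hypothetical, and the function name `r_squared` is not taken from the study.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy ** 2 / (sxx * syy)

# Hypothetical paired readings at the same places and times (µg/m³).
monitor = [10.0, 20.0, 30.0, 40.0]
satellite_region_1 = [12.0, 22.0, 32.0, 42.0]  # tracks the monitors closely
satellite_region_2 = [30.0, 10.0, 40.0, 20.0]  # unrelated to the monitors

print(f"Region 1 R^2: {r_squared(monitor, satellite_region_1):.2f}")
print(f"Region 2 R^2: {r_squared(monitor, satellite_region_2):.2f}")
```

A high value in one region says nothing about another, so the same check would need repeating wherever the satellite product is used, which is the caution raised in the talk. Note also that R squared measures linear association only: a product with a constant offset, as in the first made-up series, can still score highly.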

C (79:38)

We can pick up on some of that, because I felt that there were a few questions there. I'll try. So I heard biodiversity, which I really like. I'm going to try to answer that because I'm the first, so I can choose. There is a colleague at the University of Chicago, Professor Eyal Frank, who does amazing work on biodiversity, and he gets, I don't know how, amazing data on biodiversity, so it's worth checking his work. But I agree that a direct measure of biodiversity is something that is much harder to get. What we can get, especially with satellite data, is something that is potentially related to biodiversity, which is tree cover and forest cover. That's something that many colleagues here at LSE work on and look at. And it is only because we have satellite data that we can learn more about that. It's not directly part of biodiversity, but it's obviously related to it, and this is how they try to learn something about it. But yeah, I agree with this point about the lack of sufficient data on biodiversity, and I think that many bodies right now are trying to get more data. The point that I'm surprised I haven't brought up before, actually, is the cost of getting data. Right. So someone in the online crowd asked why it's not all free. By the way, I'm a big fan of providing as much information as we can for free and making it accessible to everyone. Actually in economics, if you publish in a good journal, you have to, if possible, submit all the data sets and all the scripts with the publication, for everyone to be able to replicate and use the data as well. But the point is that getting information is sometimes very, very expensive, specifically in the areas that I think you have in mind.
So we always try to think about how we actually do that in a way that, first, gets data where we want it, but also gets good data, because if we cut costs, we might not get data that is as good as we want. So this is a partial answer to your question. I'll let my other colleagues answer the rest of it. I took the easy one.