Summary

The CGD Podcast: PovcalNet Unchained – Justin Sandefur and Sarah Dykstra

Date: June 23, 2014
Host: Lawrence MacDonald (Center for Global Development)
Guests: Justin Sandefur & Sarah Dykstra

Brief Overview

This episode centers on the "liberation" of the World Bank's PovcalNet data by Justin Sandefur and Sarah Dykstra, their reasons for doing so, and the ensuing debate over data openness in global poverty measurement. They discuss the technical, ethical, and political challenges of making crucial poverty data truly open, their methodology for data extraction, and the implications of new purchasing power parity (PPP) data for interpreting global poverty trends.

Key Discussion Points and Insights

1. The PovcalNet Data "Liberation"

The Issue: The World Bank’s PovcalNet is a widely used, foundational public database for poverty and inequality analysis, yet the full dataset behind it is not open to researchers.
The Motivation:
- Researchers using PovcalNet face practical barriers if they need comprehensive datasets rather than single point queries.
- “If you want to change all the poverty lines in the world for every country and every year by 10%,...that would take you hours...You would start over and spend more hours pointing and clicking and drawing out one number at a time.” – Justin Sandefur [02:43]
The Solution:
- Sandefur and Dykstra, with the programming help of Sarah’s brother, Benjamin Dykstra, wrote a script to automate extraction, running 23 million queries over 9 weeks to reconstruct the whole dataset.
- “We just ran 23 million queries on the bank's website. Technically, computer code did it. And it took nine weeks.” – opening line of their blog post, read aloud by Lawrence MacDonald [04:59] and confirmed by Sarah Dykstra [05:23]
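
The automation idea can be sketched as follows. This is a minimal illustration, not the script Benjamin Dykstra wrote: `query_headcount` is a hypothetical stand-in that reads made-up local data instead of contacting any website, and the country-year entries and incomes are invented for the example.

```python
from itertools import product

# Made-up daily incomes standing in for the remote database (assumption).
FAKE_SURVEYS = {
    ("Brazil", 2009): [1.0, 2.0, 3.0, 4.0, 8.0],
    ("India", 2010): [0.5, 1.0, 1.5, 2.0, 4.0],
}


def query_headcount(country, year, poverty_line):
    """Hypothetical stand-in for one point query.

    Returns the share of people whose income is strictly below the line.
    """
    incomes = FAKE_SURVEYS[(country, year)]
    return sum(1 for income in incomes if income < poverty_line) / len(incomes)


def scrape(surveys, poverty_lines):
    """Run one query per (survey, poverty line) pair and collect a table."""
    table = {}
    for (country, year), line in product(surveys, poverty_lines):
        table[(country, year, line)] = query_headcount(country, year, line)
    return table


lines = [1.25, 2.0, 3.0]
table = scrape(FAKE_SURVEYS, lines)

# Raising every poverty line by 10% is just another automated pass,
# instead of hours of pointing and clicking.
table_plus_10 = scrape(FAKE_SURVEYS, [round(line * 1.1, 4) for line in lines])
```

The point of the design is that the loop, not a person, pays the cost of each query, so a change to every poverty line becomes a single rerun.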

2. Ethics and Politics of Data Scraping

No Formal Request:
- “When you decided that you wanted to have the data, presumably...you contacted the bank?” – Lawrence MacDonald [03:12]
- “No.” – Sarah Dykstra [03:22]
Bank Restrictions:
- It is widely known, inside and outside the Bank, that the underlying raw data is not broadly available, even to many of the World Bank's own researchers; access is limited to a narrow group in the poverty unit.
Clarifying Their Actions:
- They stress that their actions were about making public data more usable, not hacking into private stores.
- “...we’re really repackaging things that are in bits and pieces already in the public domain. This is not hacking in and getting anything that wasn’t already public.” – Justin Sandefur [10:11]
Concerns Raised:
- The World Bank worried countries might misconstrue the scraping as a breach and become more reluctant to share data.

3. The Call for Open Data

Sandefur and Dykstra’s Three Requests to the World Bank:

Embrace Open Data Standards:
- Make data available in machine-readable formats, not PDFs.
- “The data should be available in machine readable format for anybody to freely download on the web.” – Justin Sandefur [07:59]
Post the Code:
- Publish the estimation code used to create poverty figures for reproducibility.
- “You can’t see the code that does that...you can’t actually get that code to replicate it.” – Justin Sandefur [08:11]
Release Enough Microdata:
- Wherever legally possible, share the underlying microdata to allow full replication.
- “For many countries, the bank could legally release the underlying survey data. There’s really nothing stopping them.” – Justin Sandefur [08:51]

4. The Open Data Debate at the World Bank

Not all World Bank data falls under the much-publicized “open data” pledge; PovcalNet remains a gray area.
Tension exists between institutional commitments to openness and the fact that most data originates from individual countries, who have their own reservations.

5. The Purchasing Power Parity (PPP) Release and Its Impact

PPP’s Role:
- PPP is the tool to compare real living standards across countries by accounting for price level differences.
- “Instead of using market exchange rates...we’d like to be able to measure the quantity of stuff that they can actually purchase.” – Justin Sandefur [13:21]
Mashing Data:
- With newly-liberated poverty data and updated 2011 PPP values, they recomputed global poverty rates.
- “We did it quickly in the public domain, which is probably why the blog post generated a bit of attention.” – Justin Sandefur [15:33]
The Result:
- When the new PPP rates are applied to the same $1.25-a-day poverty line, the global absolute poverty rate appears to “fall by half” overnight, from about 19.7% to 8.9%. This is an artifact of recalculation, not a real change.
- “We would be quick to put asterisks...around that verb ‘fallen,’ because nothing changed, nothing fell. This is simply a matter of using different PPP exchange rates…” – Justin Sandefur [16:22]
On Backcasting and Interpretation:
- The World Bank routinely applies new PPPs retroactively to all previous years. This shifts the reported poverty level for earlier years while leaving the rate of progress largely unchanged, and it makes comparisons across PPP rounds difficult.
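
To make the arithmetic behind the recount concrete, here is a minimal sketch. It is not the authors' code: the incomes are made up, `old_rate` and `new_rate` are hypothetical PPP conversion rates (local currency units per PPP dollar), and the function names are illustrative. The incomes are identical in both runs; only the conversion rate changes.

```python
def to_ppp_dollars(local_amount, ppp_rate):
    """Convert local currency to PPP dollars.

    ppp_rate is expressed as local currency units per PPP dollar.
    """
    return local_amount / ppp_rate


def headcount_ratio(incomes_local, ppp_rate, line_ppp=1.25):
    """Share of people whose PPP-converted income is below the poverty line."""
    poor = sum(
        1 for income in incomes_local
        if to_ppp_dollars(income, ppp_rate) < line_ppp
    )
    return poor / len(incomes_local)


# Made-up daily incomes in local currency units (assumption).
incomes = [10, 15, 20, 24, 30, 40, 50, 60, 80, 100]

old_rate = 20.0  # hypothetical rate from the earlier PPP round
new_rate = 14.0  # hypothetical revised rate: local prices lower than thought

old_headcount = headcount_ratio(incomes, old_rate)
new_headcount = headcount_ratio(incomes, new_rate)
```

With these invented numbers the headcount ratio drops from 0.4 to 0.2 even though no one's income changed, which is the sense in which the episode calls the "fall" an illusion on paper.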

6. Institutional Response and Future Directions

Bank’s Caution:
- The World Bank’s research department expressed skepticism about the new PPP numbers, cautioning against hasty adoption.
- “The stance of the research department now is to say we don’t necessarily trust those new numbers and we will reserve our judgment…” – Justin Sandefur [21:06]
Post-2015 Goals:
- With new global poverty reduction targets in sight, Sandefur and Dykstra call for open, transparent methodologies and data pre-commitment in setting future poverty lines.

Notable Quotes & Memorable Moments

On the rationale for 'liberating' the data:
- “We live in a new era where...a million eyeballs can find lots of mistakes. And so let’s put all the data and the code in the public domain and open up that conversation.” – Justin Sandefur [06:22]
On the magnitude of the data scraping:
- “We just ran 23 million queries on the bank’s website. Technically, computer code did it. And it took nine weeks.” – opening line of their blog post, read aloud by Lawrence MacDonald [04:59] and confirmed by Sarah Dykstra [05:23]
On the 'illusion' of global poverty halving:
- “There’s no sense in which anything really fell by half between Tuesday and Wednesday… This is simply a matter of using different PPP exchange rates to count.” – Justin Sandefur [16:22]
On the importance of open standards:
- “Sticking it in a PDF doesn’t count.” – Lawrence MacDonald [08:05]
On methodological challenges:
- “As soon as you move away from the reference year ... we’re just kind of faking it for the other years.” – Justin Sandefur [19:16]
On where the debate is headed:
- “We would love to see the bank clarify the methodology in advance, clarify how they’re going to use future PPP rounds, put all of the code and the data in the public domain and pre commit to a process and make it as transparent as possible.” – Justin Sandefur [21:17]

Important Timestamps

[00:42] – Introduction to the “liberation” of PovcalNet data
[01:38] – Why researchers need better access to PovcalNet
[03:22] – The decision to scrape without official request
[05:23] – Scope and method of scraping (23 million queries)
[06:22] – The push for open data and code
[07:59] – Three formal requests to the World Bank
[09:41] – The World Bank’s concerns and institutional tensions
[13:21] – Explanation of PPP and its significance
[15:33] – Combining PovcalNet and new PPP data
[16:57] – The “halving” of global poverty—illusory or real?
[19:16] – Challenges in backcasting PPP and interpreting poverty trends
[21:17] – Looking ahead: towards methodological transparency with the post-2015 Millennium Development Goals

Conclusion

This episode provides an in-depth, behind-the-scenes look at the challenges of data transparency and methodology in global poverty measurement. Justin Sandefur and Sarah Dykstra’s story not only highlights the technical ingenuity required to “liberate” public data, but also the broader policy and institutional debates about what truly open data means in an era where rapid, reproducible, and transparent research is crucial to informed global development policy.

Transcript

[00:32]
B
Welcome to the Global Prosperity Wonkcast. I'm Lawrence MacDonald. With me in the studio today are Justin Sandefur and Sarah Dykstra. Welcome to the show, both of you.
[00:41]
C
Thank you.
[00:41]
A
Thanks.
[00:42]
B
And we have a saga to share with you that involves the, I was going to say theft in the night of data, but maybe I should say the liberation of data from the World Bank, and the subsequent wonk war that broke out. I think you may have set a record here at CGD with 24 comments back and forth, including comments by Justin and by Martin Ravallion, who's a non-resident fellow here at the Center, one of the world's leading experts on poverty and measurement, who took a very different view than Justin did. I'm going to be totally agnostic here. I don't have a view about these things. I barely understand what is at issue. But suffice to say that tempers flared around this, and it began when Justin and Sarah decided that they were going to scrape the World Bank's Povcal website of data. Justin, unpack that for me. Why did you decide that you needed to scrape the data off the site?
[01:39]
C
All right. Well, the World Bank's Povcal website is, I think, one of the big kind of public goods in development research that the World Bank provides. They take all the available household survey data on income and consumption in developing countries, harmonize it as best they can and provide this kind of global database that you can use for poverty and inequality analysis. And it's all up there kind of underneath the World Bank's website. And you can query the website. They've provided a public tool. You can query the website and ask it specific questions. It's a great resource. It's a nice user interface. But for a lot of researchers, they'd like not just to query one or two things, they'd like the whole data set.
[02:21]
B
So when we hear that there are a billion people who live on less than $1.25 a day, underpinning that statement is Povcal.
[02:29]
C
Exactly, it's this thing called PovcalNet.
[02:32]
B
And researchers all around the world want to use those numbers, poverty researchers for a whole variety of purposes. Can't they just run the query on the website and find out what they want?
[02:44]
C
They can. If you want to know something simple like what's the poverty rate in Brazil if I set the poverty line at $3, it's the perfect tool for that. But if you want to change all the poverty lines in the world for every country and every year by 10% (that's a silly example, but just to pick something), that would take you hours. And if the next day you wanted to do something slightly different, you would start over and spend more hours pointing and clicking and drawing out one number at a time.
[03:12]
B
So, Sarah, I gotta ask. When you decided that you wanted to have the data, presumably before you scraped the website, you contacted the Bank and said, could we please have the data?
[03:22]
A
No.
[03:26]
B
You didn't ask.
[03:27]
A
Maybe you can explain the politics there.
[03:30]
C
I think it's sort of widely known, both within and outside the bank, that it's not available. The underlying raw data isn't even available to many researchers within the World Bank. It's available to a narrow set of people in the poverty unit in the World Bank.
[03:46]
B
So at what point did you think, bing, if we write enough code or the right kind of code, we can run enough queries that we can suck all the data out? Was that immediately obvious, or did it take a while to think of that?
[03:58]
A
That took us a while. Initially, I was the one who was pointing, spending hours pointing and clicking on the website.
[04:05]
B
This would be the research assistant rather than the research fellow who's got to put in these long hours, huh?
[04:11]
A
And Justin came up with the idea of writing a script to run a bunch of preset queries that we were developing as part of a paper.
[04:24]
B
That sounds easy. Lots of people could write that, I guess, Right? Or maybe not.
[04:28]
A
No, not even I could write it. So we turned to Benjamin Dykstra, who's...
[04:32]
B
...an independent programmer who happens to be your brother.
[04:35]
A
That's right.
[04:37]
B
And what did you ask him to do?
[04:38]
A
So he wrote that script, and we used it several times to run these alternate poverty lines that we were looking at. And it became clear that it would be even more useful to have the full data set to work off of rather than to continuously run this script.
[04:59]
B
The first sentence in the blog post in which you released the resulting data set to the world, I should say the post is called “Global Poverty Data Should Be a Data Set and Three Requests.” But the thing that caught my attention was the first sentence. It says, “We just ran 23 million queries on the bank's website. Technically, computer code did it. And it took nine weeks.”
[05:23]
A
That's right. We had a dedicated computer that was running the script for over two months. So what it was doing, it was hunting through the data to find each individual point. And although we only ended up with somewhere around 8 million individual data points, we had to search 23 million times to find them.
[05:47]
B
So a bunch of the data is coming back to you again and again and again. And then the ones you're looking for, the ones you don't have.
[05:52]
A
Yeah.
[05:53]
B
And then you have to reassemble the entire data set or the code helps you do that, put it all back together again.
[06:01]
A
So that was on my end. And all of that code can be found on our Dataverse, which is...
[06:09]
B
I think part of the point here is that research organizations, especially those funded with public or charitable dollars, should not only release the data itself, but release the code that underpins it, which is what you and Justin have now done.
[06:23]
C
Exactly right. And that's kind of our biggest ask from this. We want to make clear, there's no implicit criticism of the PovcalNet data set or the World Bank's poverty numbers in what we're doing. We're just trying to point out that we live in a new era where there's a lot of people who are interested in participating in this analysis, in this conversation, and a million eyeballs can, you know, find lots of mistakes. And so let's put all the data and the code in the public domain and open up that conversation.
[06:55]
B
If I'm recalling correctly, about a year ago the Bank made a big announcement of open data. They were going to put all the data out there and make it available. I guess this applied mostly to the World Development Indicators, parts of which had previously been available for sale or free, and they pushed it all out there. Did PovcalNet get an exemption from that, or why was its data not in the public?
[07:16]
C
That's a very good question. And I think there's a lively internal debate in the World Bank about whether or not this data should be public. But not all data that the World Bank has are covered by the open data policy. And we discussed that a little bit in the paper. It's unclear, reading the policy, which things are covered and which things aren't. But it was pointed out to us that this is not.
[07:40]
B
We're almost done with chapter one about the liberation of this data. After the break, we're going to look at how Sarah and Justin used the data. But first, Justin, I want to just really quickly run through the three requests that you have made of the bank concerning the global poverty data. Number one, embrace open data standards. What's that mean?
[07:59]
C
That means that the data should be available in machine readable format for anybody to freely download on the web.
[08:06]
B
So sticking it in a PDF doesn't count.
[08:08]
C
Exactly.
[08:10]
B
Post the code.
[08:12]
C
Right now, the World Bank uses underlying household survey data to produce parametric distributions, which is really what you're looking at when you go to PovcalNet. There's a bunch of underlying estimation even to get the data that you see on PovcalNet, and you can't see the code that does that. And then to create global poverty numbers, there's a bunch more complicated analysis that goes on, done very carefully, documented in a number of different papers, but you can't actually get that code to replicate it.
[08:46]
B
Number three, release enough microdata to recreate the estimates.
[08:51]
C
For many countries, the Bank could legally release the underlying survey data. There's really nothing stopping them. For other countries, it's more complicated and the country might object. But wherever possible, we are trying to encourage the Bank to put what they're using in the public domain and possibly even to limit themselves in their own analysis to using the data that they feel comfortable sharing with the rest of the world.
[09:19]
B
So, Justin, what has been the Bank's response to this? And by the bank, I think we really mean those researchers, those colleagues who are creating and using these numbers, many of whom are known well to you. It's not necessarily this huge institution per se. It's people. What have those people said?
[09:42]
C
There are a number of concerns about us scraping, you know, doing 23 million queries. I think the most, in my subjective view, legitimate concern would be that we would scare off countries that contribute their data to the World Bank and to PovcalNet. Countries that, for whatever reason, are very sensitive about the confidentiality of their data might misinterpret this paper as having hacked into the World Bank servers and stolen data that wasn't public. So that's something we've tried to really clarify in the final version of the paper after consultations with friends at the Bank, that we're really repackaging things that are in bits and pieces already in the public domain. This is not hacking in and getting anything that wasn't already public.
[10:33]
B
And I imagine, having worked inside the Bank, in fact in the Development Economics vice presidency, where the data group and the research group reside, that there's a lot of sort of chewing over of this and probably a lively internal debate about what to do.
[10:49]
C
I think that is correct. There are people who would like to see all of this data in the public domain. There are people in the World Bank who would like themselves to have better access to parts of this data. There's a tension that I think we can't deny, a legitimate tension between the World Bank having announced its commitment to open data and the Bank not being the producer of very much of this data. Ultimately the Bank, you know, it's a pass-through. Countries produce data, they report to the World Bank, and the World Bank shares its analysis with the world. So it's going to take a while for the World Bank to start insisting in negotiations with countries that they will have to publish the data when they publish their analysis.
[11:33]
B
We're going to take a quick break. When we come back, Justin and Sarah, I want to ask you about the blog post that went up while I was away: “Global absolute poverty fell by almost half on Tuesday.” Seems unlikely. We'll be back in a bit. This is Lawrence MacDonald with the Global Prosperity Wonkcast. My guests are Justin Sandefur and Sarah Dykstra. And we're talking about their effort to liberate 8 million pieces of data from the World Bank, which they have done. It's now on our website. And if you want to go get those numbers and crunch them yourself, knock yourself out. Welcome back to the Global Prosperity Wonkcast. I'm Lawrence MacDonald. I'm with Sarah Dykstra and Justin Sandefur. We're talking about how they liberated PovcalNet, the World Bank's poverty database, and put it out in the public domain. Justin, after you did that, along came a release from the Bank of something that the wonks here call PPP, which is purchasing power parity comparisons. And I'm going to give my simple explanation of it. If a McDonald's hamburger costs five bucks here in Washington and you can get this roughly equivalent hamburger for two bucks in New Delhi, then you adjust for that difference in purchasing power so that if somebody had $2 in Delhi or $5 here, their level of welfare is seen as about the same because they can both buy a hamburger. That's the dummies version. What did I leave out?
[13:22]
C
I think that's a good, that's a good explanation. Instead of using market exchange rates to compare people's income, we need to recognize we'd like to be able to measure the quantity of stuff that they can actually purchase. And to do that, you need to adjust for the differences in price levels between countries, not just for hamburgers, but for the whole basket of stuff that they consume. And it's a really difficult technical exercise because people consume different things in different places. You're trying to aggregate across lots of different prices from hundreds of countries. It's a huge undertaking.
[13:54]
B
And the people who do this are the International Comparison Program.
[14:00]
C
Exactly. The ICP.
[14:02]
B
And I was going to say this is a Bank unit, but maybe, maybe not. It's under dispute now. Is the ICP a World Bank unit or not?
[14:08]
C
That is a topic of some dispute. It is located physically within the World Bank, but I think perhaps not under the direction of the World Bank research department. So there's some rivalry or disagreement there that, to be honest, I can't fully explain.
[14:26]
B
Be that as it may, the ICP releases this purchasing power parity, the PPP, data periodically. It's not every year.
[14:35]
C
It's been historically every six years. And they're going to try to move now to an annual release, but this is the first. What we had here last month is the first release in six years. So we got new numbers that apply to 2011. The previous numbers we had applied to 2005.
[14:56]
B
So this might or might not sound like a big deal to you, but Justin was telling me before the show, it's a big deal to poverty researchers all around the world. They're waiting for this data. Some of them stay up when it comes out and work through the night to apply it to things like the Povcal data to come up with new poverty estimates. So having, I keep wanting to say stolen, having liberated the Povcal data from the World Bank and put it on a public server, lo and behold, along comes the new PPP data. The logical thing is to mash them together and figure out whether global poverty has changed or not in the last six years, which is pretty much what you did.
[15:34]
C
Exactly. So estimating poverty, basically you can think of it as two ingredients. You need the distribution of income or consumption in a country, and then you need an exchange rate to compare across countries. And every six years, the ICP gives you new purchasing power parity exchange rates. So we had the PovcalNet data set, you know, the distributions of income and consumption. Along come the new PPPs, you combine them. We did that. 48 hours later, we put up our blog post. By that time, we know that people at DFID in the UK and USAID here in DC had already privately done it themselves. But we did it quickly in the public domain, which is probably why the blog post generated a bit of attention.
[16:16]
B
And you found, combining these two massive data sets, that poverty had fallen by half.
[16:23]
C
And we would be quick to put asterisks or quotation marks around that verb “fallen,” because nothing changed, nothing fell. This is simply a matter of using different PPP exchange rates to count. So this is an illusion on paper, but yes, if you take the new PPP numbers seriously and you apply them using the same poverty line in US dollars, $1.25 per day in 2005 dollars, then the global absolute poverty rate is 50% lower.
[16:58]
B
And I'm looking here at the post. It says that the absolute poverty rate, i.e., the $1.25 a day, fell from about 19.7% to 8.9%. If something like that were remotely close to true, it would be huge, right?
[17:20]
C
It would be huge. And we do think it's important. Now, there's...
[17:23]
B
Do you think there's some real progress underlying this? I mean, did people's lives of the very poor get better in some substantial sense over the last five years? Do we have any way to know other than looking at these numbers?
[17:36]
C
Well, we have lots of other ways of measuring progress in terms of human welfare around the world. We have child mortality numbers, we have education numbers. And so our colleague Charles Kenny has a whole book on things getting better and the rapid progress we're making.
[17:50]
B
So Charles would look at this and say, darn tootin, it fell by half, right?
[17:54]
C
Exactly.
[17:54]
B
I've been trying to tell you guys life is getting better.
[17:58]
C
But there's no sense in which anything really fell by half between Tuesday and Wednesday, April 29 and April 30.
[18:03]
B
But between the six years prior PPP data and now, there could have been a substantial change.
[18:09]
C
There could have been a substantial change. There are also methodological differences between the two PPP series. So it's really hard to make comparisons over time here. What the World Bank has historically always done is when a new PPP series comes out, all of history gets erased and those new PPP numbers get applied backwards to all the previous years. So you don't change the rate of progress really at all, you're just adjusting the level.
[18:38]
B
I'm thinking about that one. I'm thinking that that's what we do with our Commitment to Development Index. When we make a methodological change, we backcast all the numbers. But it seems to me that backcasting PPP is weird because it's a reflection of relative prices in the present. That might not mean anything about, in fact, by definition wouldn't mean anything about price levels in the past. With the CDI, it kind of makes sense because we decide, you know, we're going to count Iraq, we're not going to count Iraq or something like that. It's okay to backcast it, but these are price levels, they change over time. That's sort of what they're all about, right? Am I missing something here?
[19:17]
C
No, I think that's a very astute observation. So what everyone is forced to do is to use countries' own CPI, their consumer price index, to adjust prices within a given country over time. But our colleague Arvind Subramanian actually has a whole paper pointing out that as soon as you move away from the reference year (it used to be 2005 for the PPPs and now it's 2011), there's a problem. We have global measures of prices for those years, but as soon as you move away from that year, there's really no sense in which we have comparable price series across countries. We're just kind of faking it for the other years.
[19:57]
B
So what was the response inside the Bank when you took the newly liberated Povcal data and their newly released PPP data and mashed them together and found that the numbers seemed to suggest that global poverty had fallen by half? Not since Tuesday, but in the last six years.
[20:17]
C
Annoyance is probably the right word. There was a very interesting blog post that the chief economist at the World Bank, Kaushik Basu, wrote without citing our blog post, but I would say clearly reacting to it, where he encouraged people to use caution and to hold off on embracing these new PPP numbers. And so as I mentioned before, you could say that the ICP that produced the PPP is part of the Bank, maybe, but there's also the research department in the Bank. And I think, reading between the lines, the stance of the research department now is to say we don't necessarily trust those new numbers and we will reserve our judgment until sometime in the future about whether we will use them or not.
[21:07]
B
This has been a fascinating conversation for me. I learned a great deal. I hope listeners could follow along. One final question, Justin. What next?
[21:17]
C
Well, the post-2015 Millennium Development Goals are being worked out right now. The first goal almost undoubtedly is going to be reducing absolute poverty. It's now the World Bank's kind of mission statement, ending poverty. That's going to mean defining probably a new poverty line with these new PPP numbers. So that's where the debate is headed. What is the right poverty line? Now, we have some concerns and would love to see the Bank clarify the methodology in advance, clarify how they're going to use future PPP rounds, put all of the code and the data in the public domain, and pre-commit to a process and make it as transparent as possible. So that's what we'll be pushing for going forward with the MDGs and the poverty goals.
[22:05]
B
Well, I'll be watching with great interest. Sarah, Justin, thanks for joining me on the show.
[22:10]
A
Thanks.
[22:10]
C
Thank you.
[22:11]
B
This has been the Global Prosperity Wonkcast from the Center for Global Development. My guests today are Sarah Dykstra and Justin Sandefur, and we've been discussing their effort to liberate the Bank's PovcalNet. I should say their successful effort. The data is now available on the CGD website for download. We also discussed the uses they put it to in mashing it together with the purchasing power parity data and finding that, lo and behold, global poverty fell by half. You can find the Wonkcast online, on iTunes and on Stitcher. Just search for Wonkcast or CGD and sign up to hear a new interview every week. Until next time, I'm Lawrence MacDonald. Thanks for listening.