The CGD Podcast: PovcalNet Unchained – Justin Sandefur and Sarah Dykstra
Date: June 23, 2014
Host: Lawrence MacDonald (Center for Global Development)
Guests: Justin Sandefur & Sarah Dykstra
Brief Overview
This episode centers on the "liberation" of the World Bank's PovcalNet data by Justin Sandefur and Sarah Dykstra, their reasons for doing so, and the ensuing debate over data openness in global poverty measurement. They discuss the technical, ethical, and political challenges of making crucial poverty data truly open, their methodology for data extraction, and the implications of new purchasing power parity (PPP) data for interpreting global poverty trends.
Key Discussion Points and Insights
1. The PovcalNet Data "Liberation"
- The Issue: The World Bank’s PovcalNet is a widely used, foundational public database for poverty and inequality analysis, yet its raw data is not fully open to researchers.
- The Motivation:
- Researchers who need comprehensive datasets, rather than single-point queries, face practical barriers when using PovcalNet.
- “If you want to change all the poverty lines in the world for every country and every year by 10%,...that would take you hours...You would start over and spend more hours pointing and clicking and drawing out one number at a time.” – Justin Sandefur [02:43]
- The Solution:
- Sandefur and Dykstra, with the programming help of Sarah’s brother, Benjamin Dykstra, wrote a script to automate extraction, running 23 million queries over 9 weeks to reconstruct the whole dataset.
- “We just ran 23 million queries on the bank's website. Technically, computer code did it. And it took nine weeks.” – Sarah Dykstra [05:23]
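To put the quoted scale in perspective, the sketch below works out the average query rate implied by 23 million queries in nine weeks and shows the general shape of a loop over a query grid. It is a minimal illustration under stated assumptions, not the script Benjamin Dykstra wrote: the `fetch_headcount` function, the placeholder country codes, and the grid values are all hypothetical, and no real PovcalNet endpoint is called.

```python
from itertools import product

# Figures quoted in the episode.
TOTAL_QUERIES = 23_000_000
WEEKS = 9

SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def average_query_rate(total_queries: int, weeks: int) -> float:
    """Average queries per second, assuming the script ran continuously."""
    return total_queries / (weeks * SECONDS_PER_WEEK)


def fetch_headcount(country: str, year: int, poverty_line: float) -> float:
    """Hypothetical stand-in for one point query against a web interface.

    A real scraper would issue an HTTP request here; this stub returns a
    placeholder value so the sketch runs offline.
    """
    return 0.0


def scrape_grid(countries, years, poverty_lines):
    """Run one query per (country, year, poverty line) combination."""
    results = {}
    for country, year, line in product(countries, years, poverty_lines):
        results[(country, year, line)] = fetch_headcount(country, year, line)
    return results


if __name__ == "__main__":
    rate = average_query_rate(TOTAL_QUERIES, WEEKS)
    print(f"Average rate: {rate:.1f} queries per second")

    # Tiny illustrative grid (assumed values, not the authors' actual grid).
    demo = scrape_grid(["AAA", "BBB"], [2005, 2010], [1.00, 1.25, 2.00])
    print(f"Demo grid size: {len(demo)} queries")
```

The point of the grid loop is that the number of queries is the product of the number of countries, years, and poverty lines, which is how a point-and-click interface turns into millions of requests once every combination is needed.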
2. Ethics and Politics of Data Scraping
- No Formal Request:
- "When you decided that you wanted to have the data, presumably...you contacted the bank?"
- “No.” – Sarah Dykstra [03:22]
- Bank Restrictions:
- Both inside and outside the World Bank, it is recognized that the underlying raw data is not broadly available, even to the Bank’s own staff.
- Clarifying Their Actions:
- They stress that their actions were about making public data more usable, not hacking into private stores.
- “...we’re really repackaging things that are in bits and pieces already in the public domain. This is not hacking in and getting anything that wasn’t already public.” – Justin Sandefur [10:11]
- Concerns Raised:
- The World Bank worried countries might misconstrue the scraping as a breach and become more reluctant to share data.
3. The Call for Open Data
Sandefur and Dykstra’s Three Requests to the World Bank:
- Embrace Open Data Standards:
- Make data available in machine-readable formats, not PDFs.
- “The data should be available in machine readable format for anybody to freely download on the web.” – Justin Sandefur [07:59]
- Post the Code:
- Publish the estimation code used to create poverty figures for reproducibility.
- “You can’t see the code that does that...you can’t actually get that code to replicate it.” – Justin Sandefur [08:11]
- Release Enough Microdata:
- Wherever legally possible, share the underlying microdata to allow full replication.
- “For many countries, the bank could legally release the underlying survey data. There’s really nothing stopping them.” – Justin Sandefur [08:51]
4. The Open Data Debate at the World Bank
- Not all World Bank data falls under the much-publicized “open data” pledge; PovcalNet remains a gray area.
- Tension exists between institutional commitments to openness and the fact that most data originates from individual countries, who have their own reservations.
5. The Purchasing Power Parity (PPP) Release and Its Impact
- PPP’s Role:
- PPP conversion factors are used to compare real living standards across countries by adjusting for differences in price levels.
- “Instead of using market exchange rates...we’d like to be able to measure the quantity of stuff that they can actually purchase.” – Justin Sandefur [13:21]
- Mashing Data:
- Combining the newly liberated poverty data with the updated 2011 PPP values, they recomputed global poverty rates.
- “We did it quickly in the public domain, which is probably why the blog post generated a bit of attention.” – Justin Sandefur [15:33]
- The Result:
- Under the new PPP rates, the global absolute poverty rate appeared to “fall by half” overnight, from 19.7% to 8.9%. This was an artifact of recalculation, not a real change in living standards.
- “We would be quick to put asterisks...around that verb ‘fallen,’ because nothing changed, nothing fell. This is simply a matter of using different PPP exchange rates…” – Justin Sandefur [16:22]
- On Backcasting and Interpretation:
- The World Bank routinely applies new PPPs retroactively, which changes reported poverty for previous years, further complicating real-time or historical comparisons.
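A toy calculation can show why a PPP revision moves the headcount even though nobody's consumption changes. A poverty line expressed in PPP dollars is converted into local currency by multiplying by the country's PPP conversion factor, and the headcount ratio is the share of people whose consumption falls below that converted line. In the sketch below, the consumption values, the poverty line, and both PPP factors are invented for illustration; only the 19.7% and 8.9% figures come from the episode.

```python
def headcount_ratio(consumption_lcu, line_ppp_dollars, ppp_factor):
    """Share of people whose daily consumption is below the poverty line.

    consumption_lcu: daily consumption per person, in local currency units.
    line_ppp_dollars: poverty line in PPP dollars per day.
    ppp_factor: local currency units per PPP dollar (assumed value).
    """
    line_lcu = line_ppp_dollars * ppp_factor
    poor = sum(1 for c in consumption_lcu if c < line_lcu)
    return poor / len(consumption_lcu)


def relative_change(old, new):
    """Proportional change from old to new (negative means a decline)."""
    return (new - old) / old


# Invented consumption data for ten people; it is identical in both runs.
consumption = [20, 30, 40, 50, 60, 70, 80, 90, 100, 110]
LINE_PPP = 1.0  # illustrative poverty line in PPP dollars per day

old_rate = headcount_ratio(consumption, LINE_PPP, ppp_factor=50.0)
new_rate = headcount_ratio(consumption, LINE_PPP, ppp_factor=35.0)

if __name__ == "__main__":
    print(f"Headcount with old PPP factor: {old_rate:.0%}")
    print(f"Headcount with new PPP factor: {new_rate:.0%}")

    # Relative size of the revision quoted in the episode.
    change = relative_change(19.7, 8.9)
    print(f"Quoted change, 19.7% to 8.9%: {change:.1%}")
```

A lower PPP factor means the same PPP-dollar line converts into fewer local currency units, so fewer people fall below it. The measured rate drops while the underlying consumption data stays exactly the same, which is the sense in which "nothing fell."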
6. Institutional Response and Future Directions
- Bank’s Caution:
- The World Bank’s research department expressed skepticism about the new PPP numbers, cautioning against hasty adoption.
- “The stance of the research department now is to say we don’t necessarily trust those new numbers and we will reserve our judgment…” – Justin Sandefur [21:06]
- Post-2015 Goals:
- With new global poverty reduction targets in sight, Sandefur and Dykstra call on the Bank to publish its methodology, code, and data, and to pre-commit to a transparent process for using future PPP rounds when setting poverty lines.
Notable Quotes & Memorable Moments
- On the rationale for 'liberating' the data:
- “We live in a new era where...a million eyeballs can find lots of mistakes. And so let’s put all the data and the code in the public domain and open up that conversation.” – Justin Sandefur [06:22]
- On the magnitude of the data scraping:
- “We just ran 23 million queries on the bank’s website. Technically, computer code did it. And it took nine weeks.” – Sarah Dykstra [05:23]
- On the 'illusion' of global poverty halving:
- “There’s no sense in which anything really fell by half between Tuesday and Wednesday… This is simply a matter of using different PPP exchange rates to count.” – Justin Sandefur [16:22]
- On the importance of open standards:
- “Sticking it in a PDF doesn’t count.” – Lawrence MacDonald [08:05]
- On methodological challenges:
- “As soon as you move away from the reference year ... we’re just kind of faking it for the other years.” – Justin Sandefur [19:16]
- On where the debate is headed:
- “We would love to see the bank clarify the methodology in advance, clarify how they’re going to use future PPP rounds, put all of the code and the data in the public domain and pre commit to a process and make it as transparent as possible.” – Justin Sandefur [21:17]
Important Timestamps
- [00:42] – Introduction to the “liberation” of PovcalNet data
- [01:38] – Why researchers need better access to PovcalNet
- [03:22] – The decision to scrape without official request
- [05:23] – Scope and method of scraping (23 million queries)
- [06:22] – The push for open data and code
- [07:59] – Three formal requests to the World Bank
- [09:41] – The World Bank’s concerns and institutional tensions
- [13:21] – Explanation of PPP and its significance
- [15:33] – Combining PovcalNet and new PPP data
- [16:57] – The “halving” of global poverty—illusory or real?
- [19:16] – Challenges in backcasting PPP and interpreting poverty trends
- [21:17] – Looking ahead: towards methodological transparency for the post-2015 development goals
Conclusion
This episode provides an in-depth, behind-the-scenes look at the challenges of data transparency and methodology in global poverty measurement. Justin Sandefur and Sarah Dykstra’s story highlights not only the technical ingenuity required to “liberate” public data but also the broader policy and institutional debates about what truly open data means at a time when rapid, reproducible, and transparent research is crucial to informed global development policy.
