Summary

Eye On A.I. – Episode #325

Guest: Phelim Bradley (Co-founder & CEO, Prolific)
Host: Craig S. Smith
Topic: Why AI's Future Depends on Human Judgement
Date: March 9, 2026

Episode Overview

This episode explores the crucial – yet often overlooked – role humans play in the AI development lifecycle. Host Craig S. Smith interviews Phelim Bradley, the co-founder and CEO of Prolific, a leading human data platform. Bradley discusses how high-quality, diverse, and verified human judgment is essential to the progress and reliability of AI systems, why the "human-in-the-loop" remains indispensable, and how Prolific is addressing quality and scale challenges in academic research and AI evaluation.

Key Discussion Points & Insights

1. The “Dirty Little Secret” of AI: Human Labor (00:00–05:59)

Human Labelers at AI's Core:
Both host and guest frame the conversation around the fact that AI systems, despite their technological veneer, rely heavily on large numbers of human evaluators and data labelers.

"It's kind of the dirty little secret of AI that it's built with humans... armies of human evaluators or labelers out there." – Craig S. Smith (02:55)
Historical Context:
References to milestones like ImageNet and Mechanical Turk highlight how crowdsourced human annotation powered early AI advancements.

"Mechanical Turk ... is a reference to a German automaton ... what they didn't know was there was a chess master curled up underneath ... So it's kind of like that, you have AI, but under the table there are all these humans working at it." – Craig S. Smith (03:39)
Prolific’s Origin:
Bradley describes starting Prolific in response to poor data quality and lack of verification in existing platforms, aiming for higher methodological rigor and user experience.

2. The Prolific Model: High-Quality Human Judgment at Scale (05:59–12:29)

Difference from Other Platforms:
Bradley situates Prolific between commoditized tools like Mechanical Turk and more managed services like Appen, focusing on quality and representativeness:

"Mechanical Turk was great at applications like ImageNet ... fairly commoditized task ... That's changed ... now the audience ... their background, their expertise ... really matter." – Phelim Bradley (06:29)
Behavioral Science Roots:
Prolific’s methodology is anchored in behavioral research, applying rigorous sampling and verification practices.
Business Split:
AI-related work makes up about half of Prolific's activity, but both AI and academic research share overlapping demands for representative, high-quality human data (08:15).

3. Vetting & Representativeness: Beyond the Gig Workforce (10:00–16:12)

Participant, Not Workforce:
Bradley distinguishes Prolific's flexible participants from traditional contract workers, stressing real-world diversity and supplementary income.

"Our participants ... reflect kind of real world users ... not the contractor style workforce ..." – Phelim Bradley (10:38)
Rigorous Vetting:
Layers include identity checks, repeat verifications, deep profiling, behavioral analysis, and qualification gating.

"Identity verification ... deep profile information ... behavioral assessment to validate that you're engaged, intent, attentive and trustworthy." – Phelim Bradley (10:38)
Scale:
Prolific hosts “a couple of million” registered participants, with “several hundred thousand active in any given month.” (12:48)
Recruitment Mechanisms:
Organic growth through word of mouth, targeted referrals, and community engagement.

4. The Spectrum of Human Judgment: From Generalists to Experts (14:21–18:56)

Demographic Breadth:
Sampling aims to match real-world populations (e.g., US, UK).
Types of Work:
- General audience: No specialization required, general consumer testing.
- Taskers: Screened/trained crowdworkers for more nuanced evaluation.
- Experts: Subject-matter specialists for high-complexity or domain-specific tasks.
Participant Onboarding:
Five-step process: basic info, background interview, skill declaration, identity/KYC, behavioral assessment. (16:23)

5. Project Types and AI Evaluation Evolution (18:26–27:09)

Project Range:
From high-rigor academic research and government studies to nuanced AI model evaluation.
Not Focused on Basic Data Labeling:
Prolific prioritizes projects where participant selection and rigor matter, ceding commoditized labeling to other platforms.
Case Studies:
- AI Security Institute Project: Assessing the persuasive power of top AI models via interactive, demographically-matched studies.
  
  "...how politically persuasive can these AI models be ... with a representative selection of an audience..." – Phelim Bradley (19:38)
- HUMAINE Benchmark: Comparing model preferences across demographic lines with double-blind, A/B model matchups (see 20:15-22:10).
  
  "The ranking of models does change based on the demographics and audience behind the models..." – Phelim Bradley (22:10)

6. The Rise of Model Evaluation – Why Humans Still Matter (24:26–36:50)

Shift from Basic Labeling to Rigor in Evaluation:
Human evaluators now focus on subtler model assessments, beyond what can be reliably automated.
Trust in Benchmarks Declining:
Standardized datasets and benchmarks are increasingly gamed, making real-world, human-centered evaluation more valuable.

"There's such a strong incentive ... to unintentionally game these benchmarks... models are able to all pass with flying colors." – Phelim Bradley (28:40)
Enterprise Demand Grows:
As more companies build AI applications, many now turn to Prolific to determine which models perform best in specific use cases.
Human-in-the-loop Endures:
Even as AI agents become more capable, human input remains irreplaceable where ambiguity, subjectivity, or trust/safety is involved.

"The human judgment is where the alpha is. If you're looking to push capability ... you are not able to build an automated evaluator ..." – Phelim Bradley (35:07)

7. Industry & Platform Future: Humans and AI Together (37:37–40:47)

Human-AI Collaboration:
Human-Computer Interaction research is increasingly about Human-AI (or Human-Agent) Interaction—optimizing their combined strengths.
Prolific’s Own AI Usage:
Using AI/LLMs to improve matching of project needs to participants, streamline onboarding, and eventually automate more of the workflow.

"Increasingly products are evolving to be magic AI boxes..." – Phelim Bradley (38:51)
Vision & Roadmap:
Full-stack human data platform with rich toolsets for both sides. Current AI-assisted features include natural language expression of requirements and qualitative background assessment.
Compensation & Business Model:
Usage-based pricing; contributors generally earn supplementary, not primary, income; transparency for both researchers and participants.

8. Looking Ahead: Opportunities and Ethical Involvement (44:00–47:15)

Expansion Areas:
- Scaling as more AI models/applications require robust, trustworthy evaluation.
- Opportunities in polling and analytics layered on respondent data.
- Providing analytics and methodology support, not just raw respondents.
Humanoid Robots & World Models:
Prolific is exploring participation in data collection for embodied AI/robotics, including via VR and world models.

"We've done some very interesting work on integrating Prolific into virtual environments so participants can take part in data collections and in VR." – Phelim Bradley (46:30)
The Future of Human Judgment:
Millions are now or soon will be involved in shaping and evaluating AI systems worldwide—directly or passively—ensuring that human values and context are reflected in future AI.

Notable Quotes & Moments

On Prolific’s Mission:

"Our purpose as a company really is to accelerate the frontier of human centered or transformative research and AI." – Phelim Bradley (09:06)
On the Changing Nature of Evaluation:

"We want to bring a layer of objectivity and rigor to this evaluation, which I think is going to be particularly important in enterprise applications and particularly in regulated or sensitive domains like healthcare, finance, law..." – Phelim Bradley (33:21)
On the Irreplaceability of Human Judgment:

"Wherever there is ambiguity or subjective opinion required, human judgment is going to be in the development lifecycle for a long, long time…" – Phelim Bradley (35:41)
On Real-World Impact:

"Understanding which model is best for which context is going to be a more interesting question perhaps than which model is objectively state of the art on an overall basis." – Phelim Bradley (22:10)

Timestamps for Important Segments

00:00–05:59: Framing the problem: hidden human labor in AI
05:59–10:38: Where Prolific fits in the human data/labeling landscape
10:38–12:29: Vetting, verification, scale of participant pool
18:56–23:22: AI model evaluation case studies (persuasion/benchmarking)
27:09–29:43: Shift from benchmarks to real-world, human-informed evaluation
35:07–36:50: Future of evaluation: humans vs. AI agents
38:51–40:47: Prolific’s AI-powered platform features & roadmap
44:00–47:15: Future opportunities: polling, analytics, robotics, and VR data collection

Conclusion

Phelim Bradley argues passionately that as AI's technical capabilities grow, the importance of diverse, high-integrity human judgment—rigorously recruited and methodologically supported—has never been greater. Platforms like Prolific are not just “the chess master under the board,” but an essential partner in ensuring that AI systems are robust, trustworthy, and meaningfully connected to real-world human values and realities.

Prolific’s ambition: to become a full-stack, globally representative, and scientifically rigorous human data platform, authentically embedding the “depth and breadth of humanity” into the next generation of AI.

This detailed summary aims to encapsulate the key themes, breakthroughs, and debates of the episode for listeners seeking a comprehensive grasp of why and how human judgment remains central to the future of artificial intelligence.


Transcript

[00:00]
A
...as I have in past interviews with people working with human resources for AI, is that it's kind of the dirty little secret of AI that it's built with humans, or that there are these armies of human evaluators or labelers out there.
[00:22]
B
My name is Phelim Bradley. I'm the co-founder and CEO of Prolific, and Prolific is a human data platform. So we bring together the infrastructure, methodology and a high quality global pool of humans and connect them with data collectors for purposes of research and human data for AI, in particular post-training and evaluation use cases. I started Prolific back when I was doing a PhD at the University of Oxford, doing my PhD in bioinformatics or computational biology. But while there I realized that there were core pain points in running high quality, academically rigorous human subjects experiments online. So platforms that were popular at the time had problems of poor quality data and poor verification of the humans who were taking part in research. The tooling and infrastructure offered an extremely poor user experience for both sides of the platform, again impacting both the ease of use and the quality of the resulting data. And ultimately it created a kind of unhealthy platform dynamic. We set out to build Prolific in order to solve that problem. And originally it was a side project while I was completing my PhD, and I grew the company to reasonable traction before completing my PhD. And the story since then really has been rediscovering that same core pain point across a range of different industries: how do I recruit high quality data online from an audience of highly verified, trustworthy participants? This is a pain point across research applications. So user research, polling, academic research, which was kind of our origin story, and now more recently also human data in the AI development lifecycle. And I think we try to apply some of this methodological and academic rigor of the behavioral sciences to some of the problems in evaluation of generative AI applications.
[02:55]
A
Yeah, and this is an interesting area, and I know some of your other interviewers have pointed this out, as I have in past interviews with people working with human resources for AI, is that it's kind of the dirty little secret of AI that it's built with humans, or that there are these armies of human evaluators or labelers out there. My experience or my knowledge of this goes back to interviewing Fei-Fei Li, who I just had on the program. But the first time I interviewed her was about ImageNet, which was this massive database, really the first image database for training supervised learning programs. And it was her data set that allowed Geoff Hinton to validate deep learning in 2012 with AlexNet and kick off the deep learning revolution. And she turned to Mechanical Turk to do the labeling of her images. At that point I hadn't heard of Mechanical Turk, but it's kind of funny because it's a reference to a 17th or 18th century, I guess 18th century, German automaton mannequin that played chess, and famous people came and played chess with it. I think Ben Franklin played chess with it, and it would win and people were just amazed. What they didn't know was there was a small-statured chess master curled up in the cabinet underneath the chessboard who was working the mannequin with levers and things. So it's kind of like that: you have AI, but under the table there are all these humans working at it. So that was Mechanical Turk. And then I worked for a long time with, and had on the podcast, guys who had a platform called Labelbox. Are you familiar with them?
[05:08]
B
Yes, Labelbox.
[05:10]
A
Yeah, they're a labeling platform, and their thesis sounded similar to yours, that there isn't a good platform. They built this platform and you can plug third-party, you know, BPO teams into it or you can have your own labelers, but it's a platform where they don't manage the human resources, or they didn't at that time. And then, as you mentioned, I spoke recently to Appen, which fields teams, or rather coordinates teams, to do reinforcement learning with human feedback on foundation models. So where do you fit in that tradition, besides being the chess master underneath the chessboard?
[06:00]
B
That's a great, great question, and I think a great framing of the space. I think we fit most in the tradition of Mechanical Turk: trying to abstract away the complexity of dealing with the messiness of real humans in the real world and provide labs and data collectors the kind of high quality data that they want. I think in some ways you could describe Prolific as what Mechanical Turk could and should have become. And I think there's a key shift in the market which we've invested heavily in. In contrast, a platform like Mechanical Turk was great at applications like ImageNet that you mentioned: labeling cats in images, "is this a hot dog"-style questions that any human could do. So it's a fairly commoditized task, and as a result every human on the other end who's doing that labeling is somewhat fungible or replaceable with any other person. That's changed quite a lot since the days of ImageNet. And now the audience, the people who are annotating your data or providing your post-training data, providing your RLHF data, really matter: their background, their expertise. And I think, coming from a behavioral research context, the representativeness or the generalizability of the audience is where we have focused to date in providing this human intelligence layer, but also maximizing the breadth and depth of the audience that you're able to access through the platform, with an aspiration that the breadth and depth of humanity is ultimately reflected and encoded in these AI models.
[07:56]
A
Yeah. How much of your business is working with AI models and how much is. Because I know you provide manpower or research manpower, I should say for academic researchers. How much of it is working with AI models?
[08:15]
B
Roughly 50/50 in terms of the business, in terms of the investment and manpower we've built behind the platform. Although I think crucially we focus on areas where there is strong synergy between the two sides of the platform. So the audience choice and the representativeness and the ability to tap into real world participants in order to understand real world behavior is a shared requirement and kind of pain point across both industries, as well as the quality of the data collection tooling and the infrastructure that we provide. I think there are two core dimensions to data quality: the quality of the contributors or the participants who are providing that data, and the quality of the infrastructure and the methodology that you use in order to analyze and abstract that data. So those two components are core investments that we think provide synergy across the markets that we operate in. And then obviously there are going to be differences in terms of how we serve those customers, whether they're a large enterprise that needs deep integration into their on-prem tools or they're a PhD student who needs, you know, easy, flexible, self-serve access. And I think crucially, fundamentally, pushing the frontier of AI capabilities and AI models is a research problem. So there's a strong overlap between academic research universities and the talent that's driving forward the capability of these labs. So our purpose as a company really is to accelerate the frontier of human centered or transformative research and AI. And there's a lot of shared investment that supports both of those challenges.
[10:01]
A
Yeah. And a part of that investment and what you're selling is a vetted workforce. Right. I mean this isn't like Mechanical Turk, which has very little vetting as I understand. I've never used it, but it's kind of self serve. You sign up and pick a task and execute it and consequently the tasks are very simple. But you're further up the food chain dealing with more nuanced tasks. So you vet your workforce. How do you vet them?
[10:38]
B
First, I might slightly correct the term workforce, in that our participants are folks who participate in Prolific in a flexible manner, and as a result we're able to reflect kind of real world users, either consumer samples or people who have full time jobs in other professions and are able to participate in Prolific in an ad hoc manner. So slightly different from the contractor style workforce of some of the other platforms that you mentioned. And the verification and the vetting side of things is one of our core investments and core IP. We have many layers of protection, from the fairly obvious things in terms of doing identity verification of all of the participants. We know who they are. Not only do we know who they are when they sign up, but we recheck this on a periodic basis so we know that person is still the same person. Deep profile information to make it easy to route the right tasks to the right person, based on either independently verified data points or qualification style gates that we put folks through in order to run them through an exam, as well as lots of behavioral analysis and understanding how these folks are actually interacting within the data collection or experimental workflows. But I think the latter is increasingly crucial in the world of agentic fraud. So agents or AI systems being able to replicate the behavior of humans very accurately is kind of a new threat to online data collection and primary data collection more generally. And one that we feel like we're right at the cutting edge of being able to defend against, while still providing the scale of audience choice and global participation that we're able to power.
[12:30]
A
Yeah. And so, first of all, how many participants do you have on the platform? Not the people using the workers, but the people providing the human talent. How many of them are there?
[12:48]
B
Yeah, so we, we have a couple of million people in the, on the platform in general with typically several hundred thousand active in any given, any given month.
[12:58]
A
Wow, that is, that's impressive. And I mean obviously you're not going through LinkedIn and reaching out to individuals. If you, you know, a million people, how are you, how are you reaching them?
[13:13]
B
Yeah, it's a combination of factors. The core growth of the platform comes through word-of-mouth growth and referrals. So because we offer a positive experience and meaningful work and meaningful pay, this is something that people tend to share online, share on social media, share with their family and friends. That's a core growth engine for us. Secondly, we complement that with some incentivized referrals. So where we have gaps in our audience that our customers want to tap into, let's say we're looking for, you know, a PhD in a particular aspect of biology, we typically will have a small number in our pool and we can use that to bootstrap a larger sample through folks' networks. And then third, we do community and event based marketing in order to supplement the first two channels. But I'd say our core growth and our core engine on both sides of the platform is the trust and integrity, the high quality experience that we're able to offer, leading people to share this amongst their colleagues or their friends.
[14:22]
A
Yeah, and if I hear about it, I mean you mentioned PhD students, but you know, you certainly don't have a million PhD students on the platform. What's the range of sort of talent level and can you give a breakdown like we've got, you know, 80% are, you know, college educated doing simple tasks and 20% are PhD students providing more difficult responses.
[14:57]
B
I don't have the precise numbers to hand, but I would say we generally optimize the overall platform to reflect the dimensions that exist within real world populations. So we try to reflect the demographic breakdown that exists within a representative US sample or representative UK sample, et cetera. And then in terms of the AI work specifically, I'd say it's maybe even roughly a third each. There's general audience sampling, so this is kind of consumer based testing where the representativeness and the generalizability of the audience is key but no particular specialism or skills are required. And then we have Tasker based work, where we either qualify or train people within the crowd to be high-taste evaluators or people who have some skills in general AI evaluation and data labeling, which requires a different level of attention and understanding of how these models work. And then the final third would be expert level workflows, where the subject matter expertise that folks bring from their prior experience is the critical dimension of whether they are appropriate for the project.
[16:13]
A
Yeah, and so if I wanted to make some money on your project platform, what would I do? What's the onboarding process?
[16:23]
B
Yeah, so the onboarding process from a contributor or participant view is pretty straightforward. So it's kind of a five-step process in terms of filling in some basic information, giving us a short interview on your background and experiences and where you think you might have unique skills or experience that would be valuable either to researchers or to the AI labs, and then some verification of both your background and your identity, this KYC style of check (know your customer), as well as a short behavioral assessment to validate that you're engaged, intent, attentive and trustworthy.
[17:04]
A
Yeah, and then I'm on. And then is it a list of tasks that I can choose from or do you reach out when you get a query from a customer that wants a certain profile person working on their project that you pull together that cohort and then send that cohort an offer? I mean self serve or.
[17:29]
B
Yeah, yeah, it's self serve, between kind of one-off projects or tasks, which you'll be notified of on your dashboard, or longer running projects where you need to agree to a certain level of commitment upfront. Maybe these are going to last for several weeks or months and require a little bit more time involvement, where perhaps there's going to be an initial assessment and then, if you get past that initial assessment, you're able to participate in the longer running project. Yeah, it's kind of a double opt-in marketplace. So every task is, you know, advertised to you. You're able to assess whether or not that's something you're interested in at the reward that is being offered, and that you understand how your data is being used. If you're interested you can then accept. If not, you reject the place, and that opens it up then to another participant on the platform to choose if they want to take part.
[18:26]
A
Yeah. What are the range of projects that people use the platform for? I mean again, you know, you've got PhD students that are doing research projects. What kinds of. Can you give us an example of that? And then on the low end, I guess the low end would be data labeling or. And what kinds of projects do you. Can you give us an example there?
[18:56]
B
Yes. So one of the beauties of the platform, and I think why platforms like Mechanical Turk became popular in the first place also, is the flexibility of use case. So we've run nearly half a million projects over the last year with a vast range of different requirements and outcomes, resulting in thousands of publications as well as many capability improvements to models that don't get as well publicized or cited. But maybe to give you a few recent examples. We actually don't do that much work on, as you described, the low end of the data labeling. As I mentioned at the start of the conversation, we focus on high quality projects where the audience and who is behind your data is a critical component. And most of the more commoditized data labeling use cases don't have that requirement and may be a better fit for a platform that optimizes for lower cost of labor. So the types of projects we optimize for maybe have a slightly higher complexity bar and a higher bar of methodological rigor. So we've run a project recently which was investigating the persuasive capabilities of state of the art models, which was run by the AI Security Institute in the UK. And I think it's a nice example of a combination of the behavioral research skills that we bring and the capabilities we bring in model evaluation. It's really understanding, from a safety perspective, how politically persuasive can these AI models be in the real world with a representative selection of an audience. Another example is a project we've run ourselves called HUMAINE, which is a user preference model evaluation benchmark. So there are popular platforms like LMArena or Chatbot Arena where, in short, two state of the art models are pitted against each other and users need to select which model they prefer and why, and ultimately the preferences of these participants are analyzed and this results in a leaderboard of which models are preferred by users.
One of the challenges with Chatbot Arena is there's no control on the audience or the people who provide these judgments or preferences, which I think is analogous to running a political poll without controlling for the audience selection. So we've run some projects which take the same idea, battle Model A against Model B, but add some methodological rigor in terms of the demographic representativeness of the audience behind the leaderboard. All of the models are double blinded. Which essentially then allows you to ask the question of all of these models: is there a difference between the preferences of a US population, a UK population, between left leaning participants or right leaning participants? Is age a factor in how people rank the performance of these models? In which context do people prefer model A or model B? It adds a lot more nuance to what otherwise is a fairly simplistic leaderboard. And the short version of the results of that project is that the ranking of models does change based on the demographics and audience behind the models, which I think is going to be an increasingly important factor as these models are rolled out in the real world in different geographic locales, with people with different cultures using them for different use cases. Understanding which model is best for which context is going to be a more interesting question perhaps than which model is objectively state of the art on an overall basis.
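[Editor's illustration] The idea of splitting a blinded A/B leaderboard by audience can be sketched in a few lines of Python. This is a minimal sketch and not Prolific's method: the vote records, the group labels, the model names and the helper functions `win_rates_by_group` and `ranking` are all hypothetical, and a real leaderboard would likely use a rating model and much larger samples rather than raw win rates.

```python
from collections import defaultdict

# Hypothetical blinded A/B votes: (participant group, model A, model B, preferred model).
votes = [
    ("US", "model_x", "model_y", "model_x"),
    ("US", "model_x", "model_y", "model_x"),
    ("US", "model_x", "model_y", "model_y"),
    ("UK", "model_x", "model_y", "model_y"),
    ("UK", "model_x", "model_y", "model_y"),
    ("UK", "model_x", "model_y", "model_x"),
]

def win_rates_by_group(votes):
    """Return {group: {model: share of its matchups that the model won}}."""
    wins = defaultdict(lambda: defaultdict(int))
    games = defaultdict(lambda: defaultdict(int))
    for group, model_a, model_b, winner in votes:
        for model in (model_a, model_b):
            games[group][model] += 1
        wins[group][winner] += 1
    return {
        group: {model: wins[group][model] / played for model, played in models.items()}
        for group, models in games.items()
    }

def ranking(rates):
    """Order models from most to least preferred within one group."""
    return sorted(rates, key=rates.get, reverse=True)

rates = win_rates_by_group(votes)
for group, group_rates in rates.items():
    print(group, ranking(group_rates))
```

With this toy data the two groups rank the same pair of models in opposite orders, which is the kind of audience-dependent result described above.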
[22:44]
A
Yeah, particularly as sovereign AI develops. I mean, these different regions or countries are developing models that are trained on local data, so they're more culturally attuned. But I'm interested in the Persuasiveness project. How did that work? I mean, can you describe that? I mean, you have people in your database and they talk to a model or they read outputs from the model or what's happening there?
[23:23]
B
Yes, there were many factors there, and it was an experimental workflow. So essentially people were split into randomized conditions. And as a result, there was kind of an A/B-style test, where participants were exposed to a back and forth conversation with one of, I think, about 20 models. And these models were instructed to persuade the participants to agree with one of the political stances using different rhetorical strategies. So the experimental methodology was fairly sophisticated. And then the participants were measured on whether they agreed with the issue before they were exposed to the model and the conversation, and after. And then as a result, you're able to measure the difference in how different models and different strategies informed how much these participants changed their minds on the topics that were at hand.
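[Editor's illustration] The before/after measurement described here amounts to computing an attitude shift per participant and averaging it by condition. The sketch below assumes a 0 to 100 agreement scale, made-up records, and hypothetical strategy labels ("emotional", "evidence"); none of these details come from the study itself, and a real analysis would also need significance testing.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: agreement with a stance before and after the conversation (0-100).
records = [
    {"model": "model_a", "strategy": "emotional", "before": 40, "after": 55},
    {"model": "model_a", "strategy": "evidence", "before": 60, "after": 62},
    {"model": "model_b", "strategy": "emotional", "before": 50, "after": 50},
    {"model": "model_b", "strategy": "evidence", "before": 30, "after": 36},
]

def mean_shift(records, key):
    """Average (after - before) shift, grouped by the given field."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(record["after"] - record["before"])
    return {group: mean(shifts) for group, shifts in grouped.items()}

shift_by_model = mean_shift(records, "model")
shift_by_strategy = mean_shift(records, "strategy")
print(shift_by_model)
print(shift_by_strategy)
```

Grouping the same records by model and by strategy is what lets the two effects be compared separately.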
[24:26]
A
Yeah, but the interaction between the participant and the model, is it like, you know, have a conversation with this model for 30 seconds and you know about some prompt, some topic, or is it that there's a prompt and the model responds and the participant reads the response, or is the participant actively engaging in a conversation with the model?
[24:59]
B
It was, I believe it was the latter. So an active engagement, multi-turn conversation. I don't have the exact constraints to mind. I think there was perhaps a minimum number of interactions that were required in order to accept a data point. But yeah, it was certainly more sophisticated than just reading the output of a model. It was kind of a live, interactive style of experiment.
[25:22]
A
Yeah, but that's a really interesting use case. So you need a certain level of political awareness. I mean, how did you pull together that cohort?
[25:37]
B
Yeah, in this case, this, I think, goes back to the representation and the generalizability of the audience. So trying to get a sample of the real world so that the outcomes of that experiment are more likely to generalize to real world scenarios and to a real world context was a crucial criteria of that kind of project. We have the tooling built into the platform, which allows you to do census matched sampling, which means that the resulting participants you get broadly match the population on key criteria such as age, ethnicity, political affiliation, things of that nature.
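[Editor's illustration] Census-matched sampling can be approximated with simple quota sampling: fix target shares for each stratum, then draw that many participants at random from each. The sketch below is an assumption-laden toy, not the platform's tooling: the `quota_sample` helper, the age bands and the target shares are hypothetical (not real census figures), and it matches on a single variable, whereas the sampling described above matches on several criteria at once.

```python
import random
from collections import Counter

def quota_sample(pool, targets, n, key, seed=0):
    """Draw about n people so that each stratum's share matches `targets`.

    Quotas are rounded per stratum, so they can sum to slightly more or
    less than n when the shares do not divide evenly.
    """
    rng = random.Random(seed)
    by_stratum = {}
    for person in pool:
        by_stratum.setdefault(person[key], []).append(person)
    sample = []
    for stratum, share in targets.items():
        quota = round(share * n)
        candidates = by_stratum.get(stratum, [])
        if len(candidates) < quota:
            raise ValueError(f"not enough participants in stratum {stratum!r}")
        sample.extend(rng.sample(candidates, quota))
    return sample

# Hypothetical pool that skews young, and illustrative target shares.
pool = (
    [{"id": i, "age_band": "18-34"} for i in range(20)]
    + [{"id": 100 + i, "age_band": "35-54"} for i in range(6)]
    + [{"id": 200 + i, "age_band": "55+"} for i in range(4)]
)
targets = {"18-34": 0.3, "35-54": 0.4, "55+": 0.3}

sample = quota_sample(pool, targets, n=10, key="age_band")
counts = Counter(person["age_band"] for person in sample)
print(counts)
```

The pool over-represents the youngest band, but the drawn sample follows the target shares instead of the pool's own mix.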
[26:20]
A
Yeah, I mean you said you don't do much sort of labeling anymore, is that right?
[26:27]
B
That's right. We never did a huge amount of, let's say, the simple, commoditized style of labeling. One could call the post-training evaluation data sets a form of data labeling.
[26:37]
A
Yeah, yeah. Actually that's what I was getting at. I mean, you know, in the early days of supervised learning, the focus was on labeling, particularly for computer vision. And then just in the last year, as these models have matured and become more widely used, and as there are more models competing, evaluation has sort of come to the fore. How long have you been doing model evaluation?
[27:09]
B
Yeah, we've been working in that area, I would say, for two to three years. So I think post the ChatGPT boom, user-driven model evaluation has shifted towards, as I said, more complex audience requirements, more expert participants and a requirement for more scientific rigor. And that's been a natural fit for our platform capabilities, which was the original wedge within this space. We've obviously invested pretty heavily in additional capabilities and frameworks and the ability to convert human judgment into useful signal for these model developers.
[27:55]
A
Yeah, yeah. I mean, people have soured on evaluation on benchmarks to a certain extent because there are all sorts of problems. You can train your model to the benchmark or, you know, the benchmark somehow ends up in the training data, and all sorts of things. So people don't trust, you know, a model winning the Math Olympiad anymore. That at one time was a big deal. Now they're more focused on this human evaluation. Is the market blowing up for you guys on that end of the work?
[28:41]
B
Yeah, I think that is a key driver of our expansion in the market: this shift away from the academic benchmarks or exam-based benchmarks which the frontier labs use to advertise their relative performance. There's such a strong incentive, as you said, to either intentionally or unintentionally game these benchmarks and leaderboards. So their usefulness is decreasing as they become saturated and essentially the models are all able to pass with flying colors. But then it's becoming harder to judge the real-world performance. And I think that's where platforms like ours come into their own. Trying to simulate the real-world environments in which these models and agents are interacting with real people, trying to solve real problems, and then evaluating that performance is ultimately the real goal, and that's ultimately going to drive the economic impact that these models have in practice.
[29:43]
A
Yeah, you guys, you know, are part of an industry. I've forgotten the number you said you have on the platform. Was it a million or multiple millions or something? On the evaluation side, if you were to look globally, do you have a sense of how many people are involved in evaluation through platforms like yours?
[30:11]
B
I don't have a strong estimate of the global number of people involved in this type of work, though I would say it's increasing. And also the types of people who are likely to be involved in it over the next couple of years are likely to shift considerably as the labs shift their focus towards knowledge work and understanding the capabilities that these models have in all of the areas of economically productive labor. Folks within all of those industries are going to have some involvement in evaluating and providing feedback to the tools that they use day in, day out in order to start to improve their performance, not only on these economic benchmarks, but actually how they're performing in real-world conditions on real-world problems.
[31:00]
A
Yeah, I mean there's evaluation of the frontier models, that's one thing. But then there are all these applications built on top. Is it companies building applications that come to Prolific to evaluate their applications, or are there enterprises that are adopting solutions who want to evaluate?
[31:26]
B
Increasingly both. I would say that the labs are leading the way, as they have the talent in house to understand the methodology and the frameworks that are required to build out benchmarks. And the majority of labs will have in-house teams and in-house evaluations that are customized to their requirements and do not necessarily map one to one onto the public benchmarks. But as enterprises start to build on top of this infrastructure and build agents for their specific use case or their specific context, they increasingly also need this evaluation, not only of their own tools, but also because there is such a wide choice now of which open source or frontier model to use. Even evaluating which base model to use in the first instance is a useful piece of data and is not necessarily obvious upfront. But then as they build their application, in particular understanding safety and reliability and trustworthiness in real-world conditions, they often have a higher bar for that kind of robustness in specific applications than perhaps the infrastructure layer is going to have.
[32:29]
A
Yeah, yeah. So enterprises, I mean, if I'm an enterprise and I'm building an application and I have a choice of, you know, eight different models to hit for my inference through an API, you know, I've seen sort of casual rubrics where Claude is good for this, ChatGPT is good for that and Gemini is good for something else. But I mean, who knows on your specific use case. So are you having enterprises come to you and say we need to figure out which model works best in our application, and do A/B tests, as you say, or A/B/C/D tests or something?
[33:22]
B
Yeah, I mean, I think you've precisely laid out one of the challenges that we want to tackle, which is that much of this evaluation is based on vibes and on intuition that domain experts have. We want to bring a layer of objectivity and rigor to this evaluation, which I think is going to be particularly important in enterprise applications and particularly in regulated or sensitive domains like healthcare, finance, law, things of that nature. And then on the second part of your question, that choice of base model is just the first step. Right. I think there are many more steps then in terms of testing for safety and having a continuous flywheel of feedback, including from automated evaluations. We've talked mostly about human judgment evaluations, but there's also a whole strand of technical and automated evaluations which is a complement to the human judgment. And then also understanding the delta between the performance and outcome that they're driving towards for their application and what their perhaps fine-tuned or customized model delivers. And then, critically, the tooling and the products to help them bridge that gap between the capability that they're after and the baseline capability that they get with the existing model.
[34:39]
A
You mentioned agents at the very beginning, in the context of ensuring that you have a human on the other end doing the evaluation and not an agent. But agents are being employed in evaluation and they're getting better. Is there a reason why your industry wouldn't eventually fade away as agents play that role?
[35:08]
B
Just to be clear, we are active users of agents and LLM-as-a-judge and similar tools in order to augment human performance and make sure that we're using the relatively expensive component of human judgment in an intentional, cost-effective way. I think there are a few reasons why, at minimum, it's going to be a long time before the human component dies out. Firstly, I think the human judgment is where the alpha is. So if you're looking to push the capability beyond what is already available, almost by definition you are not able to build an automated evaluator that is able to assess that gap in capability. So you need human judgment there. Of course, in cases like chess, the models have gotten to a stage where the human judgment is no longer a net benefit to improving the models. But I think chess and similar verifiable domains are a fairly small set of the useful tasks. And wherever there is ambiguity or subjective opinion required, human judgment is going to be in the development lifecycle for a long, long time. There's also, I think crucially, the trust and safety component, where at minimum you want an escalation path for when the model confidence is low, or for when you want to cherry pick and just make sure that the models and evaluators are performing as expected. So certainly I think the human judgment and the human's role in this life cycle will change, but the relative value that it provides is going to increase. That makes sense.
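The escalation path mentioned here, where an automated judge handles confident cases and humans handle low-confidence ones plus occasional checks, might look roughly like the following. The threshold, the check rate, and the function name are illustrative assumptions, not a description of any Prolific product.

```python
import random


def route_evaluation(judge_confidence, threshold=0.8, spot_check_rate=0.05, rng=random):
    """Decide whether a human or the automated judge handles an item.

    All parameters are hypothetical defaults chosen for illustration.
    """
    if judge_confidence < threshold:
        return "human"  # escalate: the automated judge is unsure
    if rng.random() < spot_check_rate:
        return "human"  # occasional human check on confident items
    return "auto"
```

For example, `route_evaluation(0.55)` always returns `"human"`, while an item scored at 0.95 is handled automatically except for the small random fraction set by `spot_check_rate`.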
[36:50]
A
Yeah, absolutely. And I didn't ask that question as a challenge to your viability. I'm just imagining the future. But that is one of the really interesting things about AI. I mean, these large models encode human knowledge off of text, and there's a lot of ambiguity and a lot of misinformation included in the text. And then we'll ultimately need humans to guide the models. And, you know, you didn't come back with a number, but it sounds like there must be millions of people around the world engaged in this. Do you think that's an overstatement? Millions?
[37:38]
B
I don't think that's an overstatement, for sure. And also I think many more people are going to be involved, at minimum in a passive way, as more and more of these models are rolled out as products. And just to touch on your other point, we've worked with academics in the field of human-computer interaction for a while. I think that field is shifting, maybe towards human-AI interaction or human-agent interaction. So I think there's a lot of work to be done in actually understanding how to get the best of both humans and AI systems, and how you make the collaboration and the engagement between them better than the sum of its parts, as it were. It's not necessarily obvious how it's going to work a priori, and I think it will require quite a bit of intentional effort in order to understand how to develop systems that work well for humans and make sure these agents and systems are human centered and are supportive of folks' work.
[38:36]
A
How much AI do you use in Prolific? I mean, either in, you know, managing the platform or sourcing participants or. Yeah, I mean, is there AI in your platform?
[38:52]
B
Yes, and an increasing amount. I think there's an interesting fact that AI is quite a convergent force in technology, in the sense that increasingly products are evolving to be magic AI boxes where you type in your request and you expect the AI to do its magic and return its result. I think the very early version of this that we've rolled out is on the data collector side, the audience requirement tool. So you just express who you're looking for in natural language, and we go away and use the data that we have on our participants in order to find the right people, rather than you needing to really think about the AND/OR requirements of how you select for those people. And similarly on the contributor side, instead of asking them a battery of questions to understand who they are, we're able to offer them a path where they're able to interact in natural language through audio and video, and then we're able to use that richer qualitative data in order to extract what's important to the other side of the platform and improve that matching. And I think that abstraction will increase. Instead of asking the magic box to find you these people, it will be: OK, design me the data collection tool, and please go away and actually execute on the work, and then ultimately maybe automate the full workflow. So the researchers or data collectors stay at the outer loop of the project, designing the hypotheses and the outcomes that they want. But increasingly the agents and the AI tools that we build allow them to automate a lot of the details.
[40:27]
A
Yeah. And you're doing that now, or that's on the roadmap?
[40:32]
B
This is the roadmap. The two examples I gave at the start, defining the right audience through natural language and understanding the contributor's background through AI interaction, are both live products and, I think, early examples of more to come.
[40:47]
A
Yeah. How do the customers, whether it's an individual PhD student or a frontier model company, how do they pay? Is this a subscription or pay as you go or.
[41:05]
B
Yeah, it's a fully usage-based billing cycle. So we will recommend the pay to the contributors, and then we take a portion of that payment. But it's fully usage based. So you can go from tens or hundreds of dollars through to tens of millions. And it's purely just the amount of data that you collect and the amount of time that you ask from the contributors.
[41:28]
A
And how much of your business is kind of retail in that sense, that, you know, people are coming onto the platform for a short project and then they're off? And how much is it big enterprises that have longitudinal studies going on that they need you guys for?
[41:54]
B
Yeah, we span the spectrum of budgets and wallet sizes and use cases. I'd say the vast majority of users have recurring use cases. If you're in the business of collecting data or running research, typically you have many projects, and even if you move companies or universities, you bring your preferred tools with you. So yeah, we see the majority of customers come back on a repeated basis.
[42:20]
A
Yeah. You mentioned polling also. Do people use you to run surveys?
[42:28]
B
Yes, again this is a use case that is supported, and again I think something that we're increasingly interested in, as running high-quality polls is becoming more and more challenging for a variety of reasons. Harder to reach people, harder to incentivize people. So I think it's an interesting role that we can play there.
[42:53]
A
Yeah. And on the compensation side, the contributors are paid according to the project. So there isn't a standard hourly fee or task fee that they're earning. And is it enough that they can supplement their income? Certainly. But can people earn a living income as a contributor?
[43:22]
B
We don't optimize for folks who want to earn a living on the platform. So primarily it's a side hustle or supplementary income. And yes, the pay on all of the projects is based on some combination of the length and complexity of the project and the complexity of the requirements on the audience side. But all of that is made transparent to both sides of the platform. People have that openness. It's not a black box in terms of us going away and not showing how the sausage is made. The data collectors and the participants are able to interact with each other and see both sides of the platform as well.
[44:01]
A
Yeah. Where do you see Prolific going? I mean, as I said, I would guess that there's growing demand as more and more companies release these models. But then it also seems to me there would be opportunities, for example, on the polling, where not only do you source the respondents, but you could run analytics on top of those responses. I mean, if the data's flowing through your pipeline. So where do you see Prolific going?
[44:41]
B
Yeah, totally. I think there are exciting use cases across a range of applications, and ultimately our ambition is to become a full-stack human data platform. So provide a global, high-quality participant pool where you're able to access both the breadth and depth of humanity; the tooling, whether that's through partnerships or through our own tools, to tap into those humans for a wide range of different applications; and then the methodology and the tooling really to advance the frontiers of research and AI. And yeah, much more work to be done on all of those strands. And many exciting opportunities, both in developing transformative or human-centered AI and then also in really understanding how this is changing and influencing human behavior, and really being the platform that provides this authentic, high-integrity human data in the age of lots of AI-generated data as well.
[45:42]
A
Yeah, yeah. And I know I'm past the hour. Do you have time for another question or two?
[45:51]
B
Unfortunately I probably need to drop.
[45:54]
A
Oh, do you have time for one more question?
[45:57]
B
Absolutely, yeah.
[45:59]
A
In the past year there's been this boom in humanoid robots. But the critical bottleneck for humanoid robots is data collection. You know, because with LLMs you've got the Internet full of text, but for humanoid robots you need, you know, data from humans manipulating objects or identifying things in scenes. Are you doing any work on data collection for humanoids?
[46:31]
B
Yes. It's not the core of the work that's done on the platform to date, but it is, I would say, an area of open discovery. You mentioned you had Lynn on the podcast recently, and she probably talked quite a lot about world models being a kind of required environment in order to accelerate the training of embodied AI. So we've actually done some very interesting work on integrating Prolific into virtual environments, so participants can take part in data collections in VR. So I think we've gone from text, images, video, and I think world models and virtual reality are an interesting modality that we'll definitely explore in the coming years.
[47:15]
A
Okay.