
Elo ratings work for chess (κ=0.92) but fail catastrophically for AI agents (κ=0.31). Random users aren't chess arbiters, and code quality isn't a win/loss outcome. We explore psychometric failures, the cognitive biases that destroy data validity, and why quantitative metrics (McCabe complexity, test coverage) achieve 2.18x better reliability than human preferences.
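For listeners unfamiliar with the mechanics, here is a minimal sketch of the standard Elo update applied to a single pairwise comparison. It is not taken from the episode: the function names and the K-factor of 32 are illustrative assumptions, and the point is only to show that the model consumes nothing but a win/loss/draw score for each matchup.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (A, B) ratings after one comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    k is the K-factor; 32 is an assumed, commonly used value.
    """
    e_a = expected_score(rating_a, rating_b)
    e_b = 1.0 - e_a
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - e_b)
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated players; A is judged the winner.
    print(update(1500.0, 1500.0, 1.0))
```

With two players at 1500 and K=32, the winner gains 16 points and the loser drops 16. Everything the model learns comes from that single scalar outcome, which is why the quality of whoever assigns the outcome matters so much.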