
Elo ratings work for chess (κ=0.92) but fail catastrophically for AI agents (κ=0.31). Random users aren't chess arbiters, and code quality isn't a win/loss outcome. We explore the psychometric failures, the cognitive biases that destroy data validity, and why quantitative metrics (McCabe cyclomatic complexity, test coverage) achieve 2.18x better reliability than human preference judgments.
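The κ figures above are inter-rater agreement statistics; assuming they denote Cohen's kappa, the metric is simple to compute: observed agreement between two raters, corrected for the agreement you'd expect by chance alone. A minimal sketch (the judge labels and data are hypothetical, for illustration only):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, assuming each rater's labels are drawn
    # independently from that rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical judges labelling ten agent outputs as win ("w") or loss ("l"):
a = ["w", "w", "l", "w", "l", "w", "l", "l", "w", "w"]
b = ["w", "w", "l", "l", "l", "w", "w", "l", "w", "l"]
print(round(cohens_kappa(a, b), 2))  # 7/10 raw agreement shrinks to κ=0.4
```

Raw agreement here is 0.7, but chance alone predicts 0.5, so κ drops to 0.4: the same correction that separates a κ=0.92 panel of chess arbiters from κ=0.31 crowdsourced preference votes.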