The Testing Psychologist Podcast
Episode 501: How to Be More Confident in Our Data (w/ Dr. Ulrich Mayr)
Date: March 31, 2025
Host: Dr. Jeremy Sharp
Guest: Dr. Ulrich Mayr, Professor of Neuroscience, University of Oregon
Episode Overview
In this episode, Dr. Jeremy Sharp welcomes Dr. Ulrich Mayr to explore a crucial and often underappreciated topic in psychological assessment: the limits of our testing data and the common pitfall of over-interpreting results, particularly profile-based scores in cognitive batteries. Drawing on Dr. Mayr's expertise in the neuroscience of cognitive measurement and his collaborative discussions with his wife, a practicing psychologist, the conversation bridges theory and clinical pragmatism. The episode addresses the reliability (and unreliability) of different test scores, what clinicians can realistically and ethically infer from testing data, and how practitioners can adhere to best practices, despite inherent measurement constraints.
Key Discussion Points & Insights
1. Why Focus on Data Confidence? (04:00)
- Dr. Mayr’s Perspective:
- Comes from basic science and measurement, focused on cognitive tests, especially executive control.
- Candidly notes he hasn’t worked clinically but engages in deep, practical discussions on testing with his psychologist wife.
- Identifies a gap between what test manuals suggest, what clinicians actually do, and what robust measurement science supports.
- Sees value in clarifying where data supports strong inferences—and where it simply doesn’t.
Quote:
"In these conversations, it often becomes clear that there is a bit of a tension between what appears to be sort of a regular practice among testing psychologists in how to interpret these profiles, these test results... and what from a more basic science side, where you recognize the methodological constraints, would seem allowable as safe and sound inferences." — Dr. Mayr (05:20)
2. The Inherent Limits of Cognitive Testing (08:34)
Quote:
"You basically went through all this process of identifying... profile based scores and you generated in the end meaningless information." — Dr. Mayr (11:38)
3. The Paradox of Reliability: Why Good ‘G’ Harms Subtest Reliability (13:08)
- Core Problem:
- The more reliable and saturated the FSIQ/G factor, the less independent information is left in the subscales or index scores.
- After partialing out G, index score reliability drops dramatically (often between 0.2 and 0.6), far below the threshold usually considered acceptable for clinical inference (≥0.8).
- Pie Chart Analogy:
- For any subtest or index, a large proportion (often 60%+) of its variance is shared with G; after removing G, little unique, reliable variance remains—and much of that is noise.
- Measurement error eats up about half the remaining slice; the proportion left for actual ability is small and indeterminate.
Quote:
"Once you remove the information that is specific to the general factor... about 60% of the pie chart... is eaten up by G... The remaining 40%, half of that is measurement error... That leaves us with... 20% that's actually the ability we're thinking we're measuring."
— Dr. Sharp and Dr. Mayr (20:20)
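The arithmetic behind the pie-chart analogy can be sketched in a few lines of Python. The 60% and 50% figures below are the episode's illustrative numbers, not constants of any particular test:

```python
# Illustrative arithmetic for the "pie chart" variance decomposition
# discussed in the episode. The inputs are example figures, not
# properties of any specific test battery.

def unique_ability_share(g_share: float, error_share_of_rest: float) -> float:
    """Fraction of a subtest's variance left for the specific ability
    after removing variance shared with G and measurement error."""
    remainder = 1.0 - g_share                 # variance not explained by G
    error = remainder * error_share_of_rest   # measurement-error slice
    return remainder - error                  # unique, reliable variance

# Episode example: ~60% shared with G, half of the rest is error.
share = unique_ability_share(g_share=0.60, error_share_of_rest=0.50)
print(f"Unique ability variance: {share:.0%}")  # 20%
```

Only that final slice, roughly a fifth of the total, reflects the specific ability the subtest is nominally measuring.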
4. The Group vs. Individual Interpretation Fallacy (22:49)
- Group Studies:
- Profile and subtest scores (even with moderate reliability) can distinguish groups (e.g., ADHD vs. controls).
- Individual Clinical Cases:
- The same data is not reliable enough for individual diagnostic decisions due to large measurement error and low unique variance.
Quote:
"There's enough information there to do that [group analyses]... It's unfortunately just not enough in most cases to draw inferences about individuals." — Dr. Mayr (21:24)
5. Avoiding Over-Interpretation: Best Practice Recommendations (28:48)
Quote:
"You have one hypothesis and that's what you test. You don't let sort of bottom up overwhelm you with differences popping up in the profile." — Dr. Mayr (29:13)
6. Interpreting Change Over Time (35:13)
- Longitudinal Interpretation is Tricky:
- Most test scores (especially subscales/profiles) do not reliably replicate across time.
- Detectable, meaningful change is hard to prove—especially for subtests.
- Change estimates become more reliable with repeated, frequent measures and an individualized baseline (the dream scenario: regular cognitive checks tracked over time per patient).
Quote:
"Having an individual testing history for people... would get around that problem. Now this is of course a dreamland right now, but it's doable in principle." — Dr. Mayr (39:45)
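The episode doesn't name a specific method, but the standard Jacobson-Truax reliable change index (RCI) makes the point concrete: the lower a score's reliability, the larger an observed change must be before it exceeds measurement error. A minimal Python sketch, with illustrative scores and reliability values:

```python
import math

def reliable_change_index(score1: float, score2: float,
                          sd: float, reliability: float) -> float:
    """Jacobson-Truax RCI: observed change divided by the standard
    error of the difference. |RCI| > 1.96 suggests change beyond
    measurement error at the 95% level."""
    sem = sd * math.sqrt(1.0 - reliability)  # standard error of measurement
    se_diff = math.sqrt(2.0) * sem           # SE of a difference score
    return (score2 - score1) / se_diff

# With IQ-style scaling (SD = 15): a 10-point drop on a highly
# reliable composite (r = .95) exceeds the 1.96 cutoff ...
print(reliable_change_index(100, 90, sd=15, reliability=0.95))
# ... but the same drop on a low-reliability index score (r = .50)
# does not, so it cannot be distinguished from noise.
print(reliable_change_index(100, 90, sd=15, reliability=0.50))
```

The same 10-point change is interpretable on the composite but not on the weak index score, which is the episode's warning in formula form.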
7. Practical, Evidence-Based Strategies for Clinicians (42:17)
Quote:
"...stay as much as you can with the overall level score, the full-scale IQ... try to extract as much meaningful information relative to the other things you know about that patient from that score... but I would stay almost completely away from the zigzags in the profiles." — Dr. Mayr (42:17)
- If you must interpret an index/subtest:
- Do so only once per assessment, with awareness of reliability, and only in line with a solid, pre-existing hypothesis.
- To gauge how trustworthy a scale is, look up omega-hierarchical reliability values for the scales of interest.
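Omega hierarchical is the share of total composite variance attributable to the general factor alone, after accounting for group factors and error. A minimal Python sketch with made-up bifactor loadings (not values from any published test manual), assuming each subtest loads on G plus exactly one group factor:

```python
# Sketch of omega-hierarchical from a bifactor loading pattern.
# All loadings below are hypothetical illustrative numbers.

def omega_hierarchical(general, groups):
    """general: general-factor loadings, one per subtest (in order).
    groups: group-factor loadings, partitioned by factor, in the
    same subtest order. Assumes standardized subtests, so each
    uniqueness is 1 minus the subtest's communality."""
    g_var = sum(general) ** 2                   # variance due to G
    grp_var = sum(sum(g) ** 2 for g in groups)  # variance due to group factors
    flat_groups = [l for grp in groups for l in grp]
    uniq = sum(1 - (lg**2 + ls**2)              # per-subtest uniqueness
               for lg, ls in zip(general, flat_groups))
    total = g_var + grp_var + uniq
    return g_var / total

# Six subtests, two group factors of three subtests each:
oh = omega_hierarchical([0.7] * 6, [[0.4] * 3, [0.4] * 3])
print(round(oh, 2))  # 0.78
```

A high omega hierarchical for the composite, paired with low values for the group factors, is exactly the pattern that makes FSIQ trustworthy and the profile scores fragile.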
8. Implications for the Testing Field: Time for Change? (48:14)
Quote:
"We basically, you know, are riding a bicycle, even though we could be driving a Porsche. And it seems like there has been very little pressure... on the testing industry to do better." — Dr. Mayr (48:33)
9. Final Takeaways and Bias Awareness (55:26)
- Confirmation Bias Alert:
- Manuals and reporting conventions tempt clinicians to overinterpret.
- Be disciplined; constantly remind yourself to avoid “seeing” meaningful patterns without adequate reliability and evidence.
Quote:
"We have to be aware of the confirmation bias that haunts everything we do and think about. And... the testing manuals... are designed to work with that confirmation bias and give it something to work with... Don't fall for that." — Dr. Mayr (55:26)
Timestamps for Key Segments
- [04:00] Why data confidence matters—Dr. Mayr’s background & motivation
- [08:34] Limits of cognitive test batteries—FSIQ vs. profile scores
- [11:25] Watkins’ study on reproducibility of profile scores
- [13:08] The paradox: why G-factor dominance undermines profiles
- [18:20] Pie chart analogy for unique vs. shared vs. error variance
- [22:49] Why group-level findings don’t generalize to individuals
- [28:48] How to interpret profiles responsibly; dangers of multiple comparisons
- [35:13] Challenges in interpreting change over time
- [42:17] Concrete recommendations for clinical data interpretation
- [48:14] The case for modernizing cognitive testing
- [55:26] The importance of bias vigilance
Notable/Memorable Quotes
- "You basically went through all this process of identifying... profile based scores and you generated in the end meaningless information." — Dr. Ulrich Mayr (11:38)
- "There is enough information there to do [group comparisons]... It’s unfortunately just not enough in most cases to draw inferences about individuals." — Dr. Ulrich Mayr (21:24)
- "You have one hypothesis and that's what you test. You don't let sort of bottom up overwhelm you with differences popping up in the profile." — Dr. Ulrich Mayr (29:13)
- "We basically, you know, are riding a bicycle, even though we could be driving a Porsche..." — Dr. Ulrich Mayr (48:33)
- "Don't fall for that [confirmation bias]." — Dr. Ulrich Mayr (55:26)
Resources & Further Reading
- Omega Hierarchical Reliability:
Look for "omega hierarchical" in the reliability sections of test manuals to assess unique subtest reliability after G is extracted (see Marley Watkins' work).
- Watkins, Marley W.:
Research on the reproducibility and value of profile-based scores (e.g., "Psychometric Perspectives on the Assessment of Learning Disabilities").
Practical Takeaways for Clinical Practice
- Ground your interpretations in FSIQ/general ability whenever possible.
- Avoid over-relying on subtest/index “scatter” for decisions about individual cases.
- Do not scan profiles for interesting differences—formulate and test a single hypothesis, and verify it with additional, targeted measures.
- Be cautious interpreting change across time, especially for profiles; prioritize repeated, standardized tracking for reliable trends.
- Advocate for modern assessment tools and transparency in test development; reward publishers that provide detailed reliability metrics.
Episode summary prepared for The Testing Psychologist Podcast, Episode 501, “How to Be More Confident in Our Data.”