The Testing Psychologist Podcast
Episode 501: How to Be More Confident in Our Data (w/ Dr. Ulrich Mayr)
Date: March 31, 2025
Host: Dr. Jeremy Sharp
Guest: Dr. Ulrich Mayr, Professor of Neuroscience, University of Oregon
Episode Overview
In this episode, Dr. Jeremy Sharp welcomes Dr. Ulrich Mayr to explore a crucial and often underappreciated topic in psychological assessment: the limits of our testing data and the common pitfall of over-interpreting results, particularly profile-based scores in cognitive batteries. Drawing on Dr. Mayr's expertise in the neuroscience of cognitive measurement and his collaborative discussions with his wife, a practicing psychologist, the conversation bridges theory and clinical pragmatism. The episode addresses the reliability (and unreliability) of different test scores, what clinicians can realistically and ethically infer from testing data, and how practitioners can adhere to best practices, despite inherent measurement constraints.
Key Discussion Points & Insights
1. Why Focus on Data Confidence? (04:00)
- Dr. Mayr’s Perspective:
- Comes from basic science and measurement, focused on cognitive tests, especially executive control.
- Candidly notes he hasn’t worked clinically but engages in deep, practical discussions on testing with his psychologist wife.
- Identifies a gap between what test manuals suggest, what clinicians actually do, and what robust measurement science supports.
- Sees value in clarifying where data supports strong inferences—and where it simply doesn’t.
Quote:
"In these conversations, it often becomes clear that there is a bit of a tension between what appears to be sort of a regular practice among testing psychologists in how to interpret these profiles, these test results... and what from a more basic science side, where you recognize the methodological constraints, would seem allowable as safe and sound inferences." — Dr. Mayr (05:20)
2. The Inherent Limits of Cognitive Testing (08:34)
- Two Types of Information:
- Highly reliable: Full Scale IQ (FSIQ) / General Level/‘G’ factor. “Used pretty much as advertised.”
- Much less reliable: Profile-based scores (index and subtest deviations—the “tap dancing” around the mean).
- Empirical Evidence:
- Reiterates findings from Marley Watkins and others:
- The strengths/weaknesses or index deviations identified in one test often fail to replicate even two years later—"their reliability was essentially zero." (11:25) (A short sketch of the difference-score reliability math follows this section.)
- Thus, clinical recommendations based on these profile differences “are, in the end, meaningless information.”
Quote:
"You basically went through all this process of identifying... profile based scores and you generated in the end meaningless information." — Dr. Mayr (11:38)
3. The Paradox of Reliability: Why Good ‘G’ Harms Subtest Reliability (13:08)
- Core Problem:
- The more reliable and saturated the FSIQ/G factor, the less independent information is left in the subscales or index scores.
- After partialing out G, index score reliability drops dramatically (often between 0.2 and 0.6), far below the threshold usually considered acceptable for clinical inference (≥0.8).
- Pie Chart Analogy:
- For any subtest or index, a large proportion (often 60%+) of its variance is shared with G; after removing G, little unique, reliable variance remains—and much of that is noise.
- Measurement error eats up about half of the remaining slice; the proportion left for the actual ability of interest is small and indeterminate (a worked version of this arithmetic follows this section).
Quote:
"Once you remove the information that is specific to the general factor... about 60% of the pie chart... is eaten up by G... The remaining 40%, half of that is measurement error... That leaves us with... 20% that's actually the ability we're thinking we're measuring."
— Dr. Sharp and Dr. Mayr (20:20)
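To make the pie-chart arithmetic concrete, here is a minimal sketch using the rough proportions quoted above (illustrative numbers, not figures from any specific battery):

```python
# Illustrative decomposition of a single index score's variance, using the
# rough proportions discussed in the episode (not values from any battery).
total_variance    = 1.00
shared_with_g     = 0.60                              # variance shared with the general factor
remainder         = total_variance - shared_with_g    # 0.40 left after removing G
measurement_error = remainder / 2                     # ~0.20 of the total is error
unique_ability    = remainder - measurement_error     # ~0.20 is the ability of interest

print(f"shared with G:     {shared_with_g:.0%}")
print(f"measurement error: {measurement_error:.0%}")
print(f"unique ability:    {unique_ability:.0%}")
# Only about 20% of what the index score carries reflects the specific
# ability its label suggests you are measuring.
```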
4. The Group vs. Individual Interpretation Fallacy (22:49)
- Group Studies:
- Profile and subtest scores (even with moderate reliability) can distinguish groups (e.g., ADHD vs. controls).
- Individual Clinical Cases:
- The same data is not reliable enough for individual diagnostic decisions due to large measurement error and low unique variance (an illustration of this gap follows this section).
Quote:
"There's enough information there to do that [group analyses]... It's unfortunately just not enough in most cases to draw inferences about individuals." — Dr. Mayr (21:24)
5. Avoiding Over-Interpretation: Best Practice Recommendations (28:48)
- Don’t chase every pattern or scatter in the data.
- The more comparisons you make, the greater the chance of finding a spurious difference—confidence intervals are built for single comparisons, not dozens per assessment (see: multiple comparisons problem).
- The odds of seeing spurious highs or lows increase with every additional subtest inspected.
- Be Hypothesis-Driven:
- Formulate a specific hypothesis (e.g., “I expect a weakness in processing speed due to history and prior literature”), and test that alone.
- Avoid the “fishing expedition” approach of searching for any possible deviation.
Quote:
"You have one hypothesis and that's what you test. You don't let sort of bottom up overwhelm you with differences popping up in the profile." — Dr. Mayr (29:13)
- If you spot an unexpected difference:
- Don’t overinterpret; gather convergent evidence via additional, targeted testing before drawing conclusions.
- Statistical Correction:
- Confidence intervals must be widened when you make multiple comparisons, which typically leaves almost nothing in the profile scatter that is statistically reliable (a short illustration follows this section).
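The multiple-comparisons arithmetic is easy to make explicit. Assuming independent comparisons at a nominal 5% error rate (a simplification; real subtests are correlated), the familywise false-positive risk and the Bonferroni-corrected threshold look like this:

```python
alpha = 0.05  # nominal per-comparison false-positive rate

for k in (1, 5, 10, 15):
    # Chance of at least one spurious "significant" deviation across k
    # independent comparisons, each run at the 5% level.
    familywise = 1 - (1 - alpha) ** k
    # Per-comparison alpha needed to keep the familywise rate near 5%
    # (simple Bonferroni correction), i.e., how much stricter each test must be.
    corrected = alpha / k
    print(f"{k:2d} comparisons: P(at least one false positive) ≈ {familywise:.0%}, "
          f"Bonferroni per-test alpha = {corrected:.4f}")
# Ten comparisons already carry a ~40% chance of a purely spurious "finding",
# and the corrected threshold leaves little profile scatter standing.
```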
6. Interpreting Change Over Time (35:13)
- Longitudinal Interpretation is Tricky:
- Most test scores (especially subscales/profiles) do not reliably replicate across time.
- Detectable, meaningful change is hard to demonstrate—especially for subtests (a standard reliable-change calculation is sketched after this section).
- Change is easier to interpret if you can collect repeated, frequent measures and build an individualized baseline (dream scenario: annual or regular cognitive checks tracked over time for each patient).
Quote:
"Having an individual testing history for people... would get around that problem. Now this is of course a dreamland right now, but it's doable in principle." — Dr. Mayr (39:45)
7. Practical, Evidence-Based Strategies for Clinicians (42:17)
- Focus on What’s Reliable:
- Rely on FSIQ (or equivalent), the general ability score.
- Use behavioral and history-based measures (questionnaires) as corroborative data.
- Profile Scores:
- Avoid, unless you have robust a priori reason and statistical justification.
- Know the (low) reliability of what you’re reporting. Seek out Omega Hierarchical statistics (see resources below).
Quote:
"...stay as much as you can with the overall level score, the full scale iq... try to extract as much meaningful information relative to the other things you know about that patient from that score... but I would stay almost completely away from the zigzags in the profiles." — Dr. Mayr (42:17)
- If you must interpret an index/subtest:
- Do so only once per assessment, with awareness of reliability, and only in line with a solid, pre-existing hypothesis.
- For a reliable evaluation, look up the Omega Hierarchical reliability values for the scales you intend to interpret (a sketch of how omega hierarchical is computed follows this section).
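For readers curious what omega hierarchical actually quantifies, here is a minimal sketch of the bifactor-based formula with hypothetical loadings (not values from any published battery): the squared sum of general-factor loadings divided by the total composite variance.

```python
# Omega hierarchical: proportion of a composite's variance attributable to the
# general factor alone. Loadings below are hypothetical, and all group-factor
# loadings are treated as belonging to a single group factor for simplicity.
general_loadings = [0.70, 0.65, 0.72, 0.60, 0.68, 0.64]   # subtest loadings on G
group_loadings   = [0.30, 0.35, 0.00, 0.40, 0.25, 0.00]   # loadings on the group factor
uniqueness = [1 - g**2 - s**2 for g, s in zip(general_loadings, group_loadings)]

general_variance = sum(general_loadings) ** 2
group_variance   = sum(group_loadings) ** 2
total_variance   = general_variance + group_variance + sum(uniqueness)

omega_hierarchical = general_variance / total_variance
omega_total        = (general_variance + group_variance) / total_variance
print(f"omega hierarchical = {omega_hierarchical:.2f}")   # ~0.78 for these loadings
print(f"omega total        = {omega_total:.2f}")          # ~0.86 for these loadings
# The gap between omega total and omega hierarchical is the reliable variance
# left for interpretation once G is accounted for -- often uncomfortably small.
```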
8. Implications for the Testing Field: Time for Change? (48:14)
- Testing technology is outdated:
- The dominance of G in batteries makes meaningful profile interpretation almost impossible.
- Calls for modern test development:
- Batteries with unique, reliable measures beyond G.
- Adaptive, individualized testing through big data and Bayesian tools ("we could be riding bicycles while Porsches are available").
- Systemic Barriers:
- The “locked” nature of most commercially available test batteries and the lack of cross-platform data hurt progress.
- More pressure is needed from professional associations and practitioners for improved, evidence-based test development.
Quote:
"We basically, you know, dry riding and driving a bicycle, even though we could be driving a Porsche. And it seems like there has been very little pressure... on the testing industry to do better." — Dr. Mayr (48:33)
9. Final Takeaways and Bias Awareness (55:26)
- Confirmation Bias Alert:
- Manuals and reporting conventions tempt clinicians to overinterpret.
- Be disciplined; constantly remind yourself to avoid “seeing” meaningful patterns without adequate reliability and evidence.
Quote:
"We have to be aware of the confirmation bias that haunts everything we do and think about. And... the testing manuals... are designed to work with that confirmation bias and give it something to work with... Don't fall for that." — Dr. Mayr (55:26)
Timestamps for Key Segments
- [04:00] Why data confidence matters—Dr. Mayr’s background & motivation
- [08:34] Limits of cognitive test batteries—FSIQ vs. profile scores
- [11:25] Watkins’ study on reproducibility of profile scores
- [13:08] The paradox: why G-factor dominance undermines profiles
- [18:20] Pie chart analogy for unique vs. shared vs. error variance
- [22:49] Why group-level findings don’t generalize to individuals
- [28:48] How to interpret profiles responsibly; dangers of multiple comparisons
- [35:13] Challenges in interpreting change over time
- [42:17] Concrete recommendations for clinical data interpretation
- [48:14] The case for modernizing cognitive testing
- [55:26] The importance of bias vigilance
Notable/Memorable Quotes
- "You basically went through all this process of identifying... profile based scores and you generated in the end meaningless information." — Dr. Ulrich Mayr (11:38)
- "There is enough information there to do [group comparisons]... It’s unfortunately just not enough in most cases to draw inferences about individuals." — Dr. Ulrich Mayr (21:24)
- "You have one hypothesis and that's what you test. You don't let sort of bottom up overwhelm you with differences popping up in the profile." — Dr. Ulrich Mayr (29:13)
- "We basically, you know, dry riding and driving a bicycle, even though we could be driving a Porsche..." — Dr. Ulrich Mayr (48:33)
- "Don't fall for that [confirmation bias]." — Dr. Ulrich Mayr (55:26)
Resources & Further Reading
- Omega Hierarchical Reliability: Look for “omega hierarchical” in the reliability sections of test manuals to assess how much unique, reliable subtest variance remains once G is extracted (see Marley Watkins’ work).
- Watkins, Marley W.: Research on the reproducibility and value of profile-based scores (e.g., “Psychometric Perspectives on the Assessment of Learning Disabilities”).
Practical Takeaways for Clinical Practice
- Ground your interpretations in FSIQ/general ability whenever possible.
- Avoid over-relying on subtest/index “scatter” for decisions about individual cases.
- Do not scan profiles for interesting differences—formulate and test a single hypothesis, and verify it with additional, targeted measures.
- Be cautious interpreting change across time, especially for profiles; prioritize repeated, standardized tracking for reliable trends.
- Advocate for modern assessment tools and transparency in test development; reward publishers that provide detailed reliability metrics.
Episode summary prepared for The Testing Psychologist Podcast, Episode 501, “How to Be More Confident in Our Data.”
