
In this episode, I talk with Dr. Ulrich Mayr, a cognitive neuroscientist, about the fundamental limitations of current cognitive and neuropsychological testing.
A
Hello everyone and welcome to the Testing Psychologist Podcast. I'm your host, Dr. Jeremy Sharp, licensed psychologist, group practice owner, and private practice coach. Many of y'all know that I have been using TherapyNotes as our practice EHR for over 10 years now. I've looked at others and I just keep coming back to TherapyNotes because they do it all. If you're interested in an EHR for your practice, you can get two free months of TherapyNotes by going to thetestingpsychologist.com/therapynotes and entering the code "testing." This podcast is brought to you by PAR. Use the Feifer Diagnostic Achievement Tests to home in on specific reading, writing, and math learning disabilities and figure out why academic issues are occurring. Learn more at parinc.com/feifer. That's F-E-I-F-E-R.
Hey everyone, welcome to the Testing Psychologist Podcast. I'm here with a clinical episode today, a clinical topic anyway, and we're talking about overinterpreting our data. This is a problem that I think a lot of us are aware of, and some of us certainly practice accordingly, based on best practices. But a lot of us forget, and I think it's easy to fall into the temptation to overinterpret data when we don't necessarily have the statistical grounding to do so. So my guest today, Dr. Ulrich Mayr, is going to talk with me all about that. He is a Robert and Beverly Lewis Professor for Neuroscience at the University of Oregon, where he was department head for nearly 10 years, and he does NIH- and NSF-funded research on cognitive functioning and decision making across the adult lifespan. He has also been editor-in-chief of the scientific journal Psychology and Aging. While his research is on the basic science of cognitive functioning, his partner runs a psychological testing practice, which often leads to fantastic conversations where the theory and the pragmatics of assessment clash in interesting and often productive ways.
So I think today's conversation is a good example of that. We really tried to marry the neuroscience, the mathematics, and the statistics behind test development and measurement with clinical practice, bring it home, and give some suggestions for what we as clinicians can do, given that, as you'll hear, a lot of the measurements and scores from our batteries cannot be interpreted or generalized the way we think they can. So we talk about a lot of different things. We do some basics on measurement and test development. We talk about what we can pull from the data reliably. We do get into some math and a little bit of statistics around reliability and so forth. So there's a little something for everyone. And then we conclude with a discussion of, well, given the situation and the measures that we have right now, what can we do to adhere to best practices with interpretation and with gathering the data that we want to gather? Fascinating conversation. So stay tuned, and I hope that you can take some things away from this discussion with Dr. Ulrich Mayr. Ulrich, hello. Welcome to the podcast.
B
Hello. Very happy to be here.
A
Yeah, thank you for being here. Thank you for being here and willing to dive into what I think seems like a relatively complicated but important topic for those of us who are doing testing. We haven't really visited this topic in a long time, so I'm grateful to have you.
B
I believe so. I believe it's important. Nothing I'm saying is completely new to most people, but I think it deserves repeating every once in a while.
A
Yeah, yeah, I totally agree. I think it's one of those things that we probably learn to some degree in grad school, revisit periodically, but ultimately I think forget about honestly in the day to day work that we do because there's a lot of cognitive dissonance. If we were to fully confront it.
B
There is. And the testing manuals kind of invite you to go down these routes that are not always completely kosher.
A
Right? Right. Yes. Oh yes. So we got a lot to get into. But I'll start with the question that I always start with, which is, you know, of all the things that you could spend your time and energy on in your life, why care about this topic so much?
B
It's actually a little bit of a hobby of mine. By profession, I'm a cognitive neuroscientist. What I really care about, my actual interest, is the building blocks of the mind. I'm interested in how to measure executive control functions in particular. That's basically what I've spent my career on. And so I come at this really from a basic science perspective, from a measurement perspective. That is what I know. That is what I do for a living. And I should also make clear that, except for a very short stint as a student intern in a psychiatric hospital in Munich, I have never actually tested patients. So, you know, why am I here? Well, it's mostly thanks to my wife and life partner, who actually is a testing psychologist. She has her own practice and specializes in testing and diagnosing ADHD and related syndromes. And we very frequently have these dinner conversations that I really enjoy, a little bit like what they call a case round, where she presents somebody who showed up in her practice with a unique profile: what to do with it, how should I interpret that? In these conversations, it often becomes clear that there is a bit of a tension between what appears to be regular practice among testing psychologists in how to interpret these profiles and test results, including what the handbook suggests you should do, and what, from a more basic science side where you recognize the methodological constraints, would seem allowable as safe and sound inferences. These discussions have been really interesting for both of us. I think I've been able to keep her from going down some rabbit holes every once in a while. But it also got me to think more seriously about how to use what I know most productively: not just saying, no, you cannot do this, but maybe, this is how far you can go given what we know.
And so, you know, combining both sort of this relatively restrictive, pessimistic view and getting also every once in a while to a yes is what I'm trying to do.
A
Yeah, yeah. I mean, I think this is. People ask me sometimes because my wife is also. She's a therapist, you know, and people will say, oh, my gosh, like, your conversations must be so fascinating. I'm like, well, they're actually, like, pretty boring. But I get the sense, you know, y' all have some of the same thing going on with these conversations that.
B
We both started on the science side and then she at some point transitioned. So we have sort of this common interest and, you know, the basic issues like that.
A
Yes, yes, that's always nice.
B
Most people would think we are complete nerds and would not spend.
A
Yeah, that's totally okay. You found you're doing it together, and that's the important thing, being nerds together. Well, let's start. This is super important. I want to lay a little bit of groundwork just for folks who. Who maybe haven't tapped into this in. In a while or have. Have forgotten or whatever it may be, but maybe we just start talking about some of the limits of our current testing methods to provide some context. And, you know, we could start with just this question of, like, what are some of the inherent limitations of our. Of the cognitive tests that we're using or neuropsych tests that we're using.
B
Well, it really comes down to one fundamental issue, and maybe I'll start with the top-line conclusion about test batteries. Let's take the Wechsler as an example. That's the one that is being discussed a lot in our household, so I'm going to work with that. These test batteries really provide two categorically different types of information. The first is the general level, which is best captured in the full scale IQ. That is highly reliable, very meaningful, and can be used pretty much as advertised. So we have no beef with that. That's the good news. Okay, then around that level, the battery offers, you know, the tap dancing of scores around that mean level: the strengths and weaknesses, the differences between the indices. I'll just subsume all of that under the label "profile-based scores." Those are almost always misused and should be treated with the greatest caution. And I can back that up empirically. Again, I really want to highlight that none of what I'm saying here is new. There are other people who have done research on that. In particular, I went back and read work by a professor at Baylor called Marley Watkins, who I think spent much of his career addressing these issues. In one of the studies that he reports, 400 participants are tested on the Wechsler. Then he uses the handbook-based rationale to pick out, for each individual, the strengths and weaknesses and the critical differences in the index scores that you would reasonably interpret if this was a patient in your practice. Then the same group of participants is tested again. I think it was two and a half years later. And so now you can ask: if I identified this particular weakness and this particular strength in a particular participant, will that show up again two and a half years later? You might be interested in that. That is something that you want to see, because you're not just making an inference about this individual for right now.
You hope you capture something more general about that individual. And the bad news here is that the reliability of these inferences was essentially zero. And so that's something to grapple with. If you really take this one result seriously, and there are others, you basically went through this whole process of identifying the profile-based scores and in the end you generated meaningless information. And people may make recommendations and placement decisions on the basis of this information. So that is something that I think needs to be taken seriously.
A
I completely agree. Yes. And I would guess that at this point, now about five minutes into our interview, the entirety of the audience is completely freaked out and wondering what we are doing with our careers. I'm kidding.
B
Well, I mean just to, you know, I do want to go back, I Mean, there is still the mean level score, which is a completely reasonable piece of information. And in my understanding, that is what most people start with. Yeah, so then that's true. Yeah. But you know, I do think it's important to understand why the, these profile scores are so problematic.
A
Yeah.
B
And so that's.
A
Yeah, I think that'd be a good place to go. Yeah. So just establishing like, okay, we know that the full scale IQ is, is largely stable or you know, we can rely on that, we can make some inferences from that. But you're saying like basically anything else within that, the index scores, the strengths and weaknesses, those are not going to be reliable over time from what we know.
B
Yes. And you know, the paradoxical aspect in all of this is that it's essentially exactly the strength of the overall score that harms the interpretability, the degree to which you can interpret the individual scores, the profile-based scores.
A
Ooh, Say more about that. What do you mean when you say it's a strength of the overall score that harms the others?
B
Yes. So I mean there are different ways in which you can elaborate or develop this. And I may have to try it coming from different corners because it's. Especially if you don't have a whiteboard where I can just draw some patterns to really get this across is not a trivial issue. So I hope that listeners stay with me here. But one way to think about that is we already established the overall general ability factor is something that is very strongly expressed in these test batteries. And that's all test batteries have this in common. And so this is a piece of information that is inherently independent of the profile based wiggling around the mean. So you can take, take that information out and you can literally do that. For example, if you have a bunch of profiles in front of you and you subtract the mean level out of each one of them, they collapse onto each other, the profile, the wiggling is still there, but you have taken out the mean level. So that sort of demonstrates that you can treat them as completely independent information. Now the problem is that when you look at the reliability of a particular, let's say the working memory index, highly reliable in itself. But because so much of that working memory index score is actually driven by general ability, most of the reliability in that working memory score is also dependent on that general ability score. And once you've taken that out, there's actually much less reliability left for the individual. The specific score that you then might use to detect weaknesses and strengths.
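The subtraction described here can be sketched in a few lines of Python. This is only an illustration: the index scores are hypothetical, not real test data, and `center` is a helper name introduced for this example. Three people sit at different overall levels, but once each person's own mean is removed, the remaining profile shapes collapse onto each other.

```python
# Hypothetical index scores (not real data): four indices for three people
# whose profiles share a shape but sit at different overall levels.
profiles = {
    "person_1": [95, 105, 90, 110],
    "person_2": [110, 120, 105, 125],
    "person_3": [80, 90, 75, 95],
}


def center(scores):
    """Subtract the person's own mean so only the profile shape remains."""
    level = sum(scores) / len(scores)
    return [score - level for score in scores]


centered = {name: center(scores) for name, scores in profiles.items()}

for name, deviations in centered.items():
    print(name, deviations)
```

The overall level and the shape are handled as two separate pieces of information here, which is the sense in which the speaker treats them as independent.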
A
I see.
B
And so, and you can, you know, you can put numbers on that. So, you know, the reliability of the full scale IQ is very high.
A
It's.
B
I think it's something like 0.93 or 0.94. Once you go down to the index score and take out the reliability of the general ability factor, and you can do that mathematically, the remaining reliability is somewhere between 0.2 and, if you're lucky, up to 0.6. That's not a good range to be in if you want to draw useful information about individual patients. You know, we usually say that we want to have at least 0.8 reliability to draw inferences about individual people.
A
Yeah.
B
And so that's basically what you have to work with when you're dealing with profiles. And that's really the crux of it. It's an annoying problem. I even feel for the test designers, because a test designer wants to have reliability. The best way to get reliability is to saturate all the different tests with general ability. That's how you drive up overall reliability. Yet that same process gets in your way when you want to interpret the individual scores. To put it a different way, assume the Wechsler battery was one where there was no relationship between the individual index scores. They were completely unrelated. Not that we necessarily want that, because then you don't have a general ability score anymore. But that would be a case where nobody would ever have a problem with interpreting profiles and differences, because now all of the information is actually in the individual index scores and you don't have to worry about that problem anymore. So really, the more the individual index scores are related to general ability, the less you have to work with in terms of interpreting the wiggling of the profile.
A
Yeah, yeah. The way that you described it to me during our pre. Pre interview chat, I think really resonated. And we can maybe we can dive into that a little bit. I mean, you framed it like a pie chart where, like you said, about 50% of the pie is occupied by G or the general ability.
B
Yeah, yeah. No, that's exactly it. That's one of the other ways to get at it: every index score, every subtest contains sort of a bucket of information. You can think of a bucket, or you can think of a pie chart. The pie chart is the whole amount of information. Removing the information that belongs to the general factor typically removes about 50% to 80% of the overall pie chart. Well, 80% might be a little exaggerated; say up to 60% or 70%. So to give you one concrete example, and I just looked that up, the correlation between the reasoning index score and full scale IQ is 0.8. You square that 0.8 to get to the common variance, so it's 0.64. So 64% of the pie chart that belongs to the reasoning score is taken up by the full scale IQ. Once you take that out, there's only a small sliver of the pie chart left. Now that is potentially what you can work with in terms of identifying individual strengths or weaknesses. However, not all of that is pure reasoning information. At least half of it is likely to be measurement error. And you don't actually know which part of it that is. So you have essentially an unknown, relatively small quantity of meaningful information left to work with to establish profile-based scores or profile-based information. Does that help?
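The pie-chart arithmetic can be written out as a short Python sketch. It follows the back-of-envelope logic from the conversation and is not a formal variance decomposition: the 0.8 correlation is the figure quoted above, the 50% error share is the rough "at least half" estimate, and `variance_breakdown` is a name made up for this illustration.

```python
def variance_breakdown(r_with_fsiq, error_share_of_remainder=0.5):
    """Split one index score's variance into three rough slices.

    r_with_fsiq: correlation between the index score and full scale IQ.
    error_share_of_remainder: assumed fraction of the leftover slice that
        is measurement error (a rough guess, not an estimated value).
    """
    shared = r_with_fsiq ** 2          # slice shared with full scale IQ
    remainder = 1.0 - shared           # what is left once that is removed
    error = remainder * error_share_of_remainder
    specific = remainder - error       # what might reflect the specific ability
    return {"shared_with_g": shared, "error": error, "specific": specific}


parts = variance_breakdown(0.8)
for label, share in parts.items():
    print(f"{label}: {share:.0%}")
```

Under these assumptions, a little under a fifth of the index score's variance is left as potentially specific information, and nothing in a single profile tells you which part of the leftover slice is signal and which is error.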
A
Yeah, yeah it does. I mean I'm a super concrete person and I think the visual does help. And in the absence of a whiteboard, I'm just going to like belabor this a little bit to try to cement it for folks. So yeah, I mean thinking about this pie chart, like we said, let's just call it I don't know, 60% of the pie chart, you know, is eaten up by G. Right. So that we take that away and the remaining 40%, you said half of that is measurement error or like noise, give or take.
B
Yeah.
A
So then that leaves us with again, give or take 20ish percent or you know, a little more, a little less. That's actually the ability we're thinking we're measuring.
B
Yes. And you don't know which part of that pie chart is meaningful and which is not. I mean, there's no way to determine that anymore, because you've already taken out all the meaningful information by taking out g, which we know we can measure with high reliability. So it really leaves you with not much to work with. Now, it sometimes gets hard to understand why this is such a problem, particularly given that an optimistic reliability for this remaining piece that we would like to work with is maybe around 0.5. This is not nothing. There is meaningful information there and it can be used, and that's completely legitimate. For example, let's say you do a scientific study where you have a group of people with ADHD and a group of controls, and now you want to compare profiles. The 0.5 reliability of each of the individual index scores is sufficient to detect differences between groups. And so potentially meaningful information about cognitive functioning can be derived from profiles in this type of setting, where you compare groups with each other. There's enough information there to do that. It's unfortunately just not enough in most cases, except for some exceptions that we might talk about later, to draw inferences about individuals. That's the crux of the problem: jumping from a group comparison to clinical practice, where you have just one patient in front of you at a time and need to draw inferences about that patient, not about a group of people. That's where you start running into problems.
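A rough numerical sketch shows why a reliability of 0.5 can be enough for group comparisons but not for a single patient. It uses the classical standard error of measurement, SD × sqrt(1 − reliability), and assumes measurement errors are independent across people. The standard deviation of 15 and the group size of 100 are assumptions chosen for illustration, and both function names are made up for this example.

```python
import math


def sem_individual(sd, reliability):
    """Classical standard error of measurement for one person's score."""
    return sd * math.sqrt(1.0 - reliability)


def sem_group_mean(sd, reliability, n):
    """Measurement-error part of a group mean, assuming independent errors."""
    return sem_individual(sd, reliability) / math.sqrt(n)


SD = 15.0           # assumed standard-score metric
RELIABILITY = 0.5   # the optimistic figure mentioned in the conversation

one_person = sem_individual(SD, RELIABILITY)
group_of_100 = sem_group_mean(SD, RELIABILITY, n=100)

print(f"one person:         +/- {one_person:.1f} points")
print(f"mean of 100 people: +/- {group_of_100:.1f} points")
```

Averaging across 100 people shrinks the measurement error by a factor of ten, while the error attached to any one person's score stays exactly as large as it was.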
A
I think that's where people probably get tripped up. I mean, it's again, one of those things that's easy to cognitively know and understand but then hard to implement when we're sitting in front of patients and have that pressure to come up with something meaningful in the evaluation. Right. And so I mean that to me leads to, leads to a couple, couple areas of discussion. Maybe the first is just to, just to dive into that a little bit more if you can, and explain, you know, why don't those group level differences translate to individual, just to make that super clear. And then it leaves the question of are we over interpreting, you know, with these individuals?
B
Well, you know, again, it's really just a question of measurement error. If you measure a group of individuals, the measurement error shrinks because you aggregate across individuals, whereas for one individual it remains relatively large. And so with large measurement errors, you need very high reliability to be able to draw inferences. That's really the crux of it. If something is imprecise, you need high reliability, and otherwise you just can't draw inferences. And the second question, are we overinterpreting? Probably we are often overinterpreting if we use these profile-based scores. I would like to add that there may be a way to get to a better place by being very highly disciplined and understanding which of the scores that you might be interested in interpreting can be assessed reliably. And there's a second problem that comes in when you try to interpret profiles. That second problem is that you're essentially looking at all of these wiggling ups and downs and saying, oh, here's something interesting, this looks high or this looks low. So essentially what you're doing is looking at all possible combinations of ups and downs at the same time. And that ignores the fact that the confidence interval that is generated from a certain reliability is always meant for a single comparison. Essentially, a confidence interval means that I am willing to accept a 5% risk of accepting something as a true difference that actually is not. Now that only works once. If you do that twice, the confidence interval has to increase to protect yourself, because every time you look, you again add the potential for making this error. So you have to adjust your confidence interval accordingly.
If you want to do that for a whole battery of 10 different tests and all possible configurations of differences, your confidence interval would have to increase so much that you basically leave no opportunity anymore for any difference to come out as reliable and robust. So what that means is that the way to look at profiles is in a very highly disciplined and a priori manner. In a particular case, you might already have some inkling, because of the history and the background information, that the patient may have a particular weakness in X. Let's say, and I'm just making that up, processing speed. Maybe you believe, based on the literature, that for people with ADHD one diagnostic sign may be slow processing speed. I've heard that; I don't know about it. So you say, I want to confirm that this patient potentially has this diagnostic sign associated with ADHD, and that is low processing speed. Then I constrain myself to a single inspection of the profile and say, I'm going to accept a potential drop in processing speed, if it's big enough, as diagnostically meaningful information. But I'm only going to do that once. I'm not going to sift through the whole profile and look for differences, because there's so much opportunity for differences to just pop up randomly if we give them that much opportunity to show themselves.
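How quickly the required interval grows can be sketched with a Bonferroni correction, which is one common way to adjust for multiple comparisons. The speaker does not name a specific method, so treating it as Bonferroni is an assumption made here. The sketch uses `statistics.NormalDist` from the Python standard library; `critical_z` is a helper name introduced for this example, and 45 is simply the number of pairs among 10 subtests.

```python
from statistics import NormalDist


def critical_z(n_comparisons, familywise_alpha=0.05):
    """Two-sided critical z after splitting alpha across the comparisons."""
    per_comparison_alpha = familywise_alpha / n_comparisons
    return NormalDist().inv_cdf(1.0 - per_comparison_alpha / 2.0)


z_single = critical_z(1)       # one planned, a priori comparison
z_ten = critical_z(10)         # ten looks at the same profile
z_all_pairs = critical_z(45)   # every pair among 10 subtests

for label, z in [
    ("1 planned comparison", z_single),
    ("10 comparisons", z_ten),
    ("45 pairwise comparisons", z_all_pairs),
]:
    print(f"{label}: difference must exceed {z:.2f} standard errors")
```

With one planned comparison the usual threshold of about 1.96 standard errors applies. Every additional look pushes the threshold up, which is the argument for deciding on a single comparison before inspecting the profile.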
A
Yes.
B
So that's essentially one way to use the statistical, the information about statistical limitations and how much we can learn from potentially from such a profile and getting to sort of minimum allowable inferences from this type of information.
A
Sure. Yeah. This is, I think, important to talk about for sure. And just to make it super clear that this is, I don't know, backwards reasoning, if you want to call it that. But, you know, you come in with a hypothesis and then you look at the data to test that hypothesis, versus just, hey, let's see what's showing up.
B
Yeah, I think that's. So that's really the key. You know, you, you have one hypothesis and that's what you test. You don't let sort of bottom up overwhelm you with differences popping up in the profile.
A
Right. This might be a good time to mention just the likelihood that there are going to be some outliers in a profile. Like, there are going to be some pretty significant differences in index scores or subtest scores or whatever it may be. Can you speak to that at all? Just the likelihood.
Let's take a break to hear from a featured partner. Y'all know that I love TherapyNotes, but I am not the only one. They have a 4.9 out of 5 star rating on trustpilot.com and Google, which makes them the number one rated electronic health records system available for mental health folks today. They make billing, scheduling, note taking, and telehealth all incredibly easy. They also offer custom forms that you can send through the portal. For all the prescribers out there, TherapyNotes is proudly offering ePrescribe as well. And maybe the most important thing for me is that they have live telephone support seven days a week, so you can actually talk to a real person in a timely manner. If you're trying to switch from another EHR, the transition is incredibly easy. They'll import your demographic data free of charge so you can get going right away. So if you're curious, or you want to switch, or you need a new EHR, try TherapyNotes for two months absolutely free. You can go to thetestingpsychologist.com/therapynotes and enter the code "testing." Again, totally free, no strings attached. Check it out and see why everyone is switching to TherapyNotes. The Feifer Diagnostic Achievement Tests are comprehensive tools that help you help struggling students. Use the FAR, FAM, and FAW to home in on specific reading, writing, and math learning disabilities and figure out why academic issues are occurring. Instant online scoring is available via PARiConnect, and in-person e-stimulus books allow for more convenient and hygienic administration via tablet. Learn more at parinc.com/feifer. That's F-E-I-F-E-R.
All right, let's get back to the podcast.
B
Oh yeah. Well, I mean, I can't give you exact numbers, but it's simply the case that if you look at all the possible strengths and weaknesses and difference scores, the scatter, these are all opportunities for things that look interesting to pop up.
A
It's a good way to put it.
B
And the confidence intervals that the handbook gives you are geared towards a 5% error probability. So if you have 10 different opportunities for something like that to pop up, then you already have a 50% chance that, you know, 10 times 5 is 50%. So now you have a 50% chance that something will show up. And I'm sure there are about 10 different opportunities for differences in everything the handbook lists about what you could potentially do with these types of profiles. So if you stay with just looking at one comparison, then you accept the 5% threshold and don't move that around. That's then what you work with, and that's probably more acceptable.
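The "10 times 5 is 50%" figure is a quick additive estimate. It can be compared with the chance of at least one false alarm if the ten comparisons were independent, which is an assumption: real profile comparisons are correlated, so neither number is exact. Both function names below are made up for this sketch.

```python
def familywise_error(n_comparisons, alpha=0.05):
    """Chance of at least one false alarm across independent comparisons."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


def additive_bound(n_comparisons, alpha=0.05):
    """Simple upper bound: add up the per-comparison error rates."""
    return min(1.0, n_comparisons * alpha)


fw10 = familywise_error(10)
bound10 = additive_bound(10)

print(f"independent comparisons: {fw10:.0%}")
print(f"additive upper bound:    {bound10:.0%}")
```

Under independence the chance comes out at roughly 40% rather than 50%. The additive figure is an upper bound, and the practical point is unchanged: with ten looks at a profile, something "interesting" will turn up by chance a large share of the time.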
A
Yeah, that makes sense. I. This comes up in supervision a lot. You know, working with. We have interns and postdocs. And I think, you know, a lot of us, even as licensed clinicians, get tempted by these major differences like, oh my gosh, you know, how could this subtest be so much different than this other subtest, you know, within the same index, for example. And that's pretty typical.
B
It is. And again, even for that, if you really believe this one difference that pops up, that you didn't expect but that's so unusual, is a potentially important and diagnostically important question, there is a way to deal with it, and the way would be to do further testing. So let's say a verbal comprehension deficit pops up. You add additional tests that get at verbal comprehension and see whether that hypothesis is confirmed. So that would be an adaptive approach to using the information you get: don't just run with it, but, as a practitioner, use further tests to confirm this potential hypothesis.
A
Yeah, I want to dig into that a little bit more in a bit. I think that's the, you know, that's the. The optimism here or the, the solution, which people, I'm sure. Yeah, yeah, yeah, yeah. But. Right. One of them. One of them. Certainly. I did want to go back to something that, you know, you talked about in the beginning and just the difficulty of comparing results over time. And I think a lot of us do that. You know, a lot of us test kids multiple times. Maybe two years apart, three years apart, four years apart, or we get an evaluation from a previous practitioner from maybe six months ago, and we get different results within that. And you know, and then it. We get Stuck with this job of oh, how do we explain that? And you know, we're trying to reverse engineer like what those differences are about. I mean, do you think that's a worthwhile pursuit? And if so, how do we do it? If not, you know, how do we ignore it?
B
You know, my own background, among other things, is in lifespan aging research. And there this problem obviously comes up all the time. From the diagnostic perspective, for example, diagnosing something like beginning Alzheimer's should ideally depend on seeing trends, where you have to make a decision: is this downhill trend more than what you would expect? So you're trying to interpret differences across test occasions. Yes, this is an extremely hard problem for which there's currently no good solution in the testing literature. But in essence it's exactly the same problem, because now the profile that we're looking at is not the profile across different tests at one test occasion. It's a profile across the same tests at different points in time. And why is this the same problem? Because, as we want them to be, these tests will be highly reliable. So the general factor here is the correlation from one measurement occasion to the next, and once you take out that common factor, the remaining information that can encapsulate the change over time is again very unreliable and very difficult to interpret. So that's why I generally would be very, very cautious about interpreting changes at all. And now if you add in not just interpreting a change in the overall score, like a full scale IQ, but a change in profile, given what we already talked about at the beginning, namely that they just don't replicate, I would be very, very careful with doing that, because you basically add a difference score on top of difference scores. It's a difference score in the profile, and then those might change over time. So this is an explosion of difference-score uncertainty.
A
Right, right. Are there any circumstances you can think of where it is advisable or doable to interpret change over time in our evaluations?
B
To some degree. It's of course always a matter of degrees. But again, if I see something that really deviates from an expected pattern, then at that point I would add additional testing to confirm whether this is sort of a one-shot occurrence that then reverts back to the mean, or rather a true effect. Now, if you think in terms of long-term, real-world potential solutions, the whole problem could be fixed in principle if we had relatively frequent assessments of individuals over time. So let's come up with an ideal testing world where everybody gets a short but highly reliable cognitive assessment once a year. Now you have an individual's timeline, and each individual is captured with their specific timeline. You don't have to compare it to norms anymore, you just compare it to that individual. And if that individual at measurement point 35 all of a sudden shows a drop, that is potentially really meaningful, because now you compare it to the standard error that this individual has generated for himself through his testing history. Of course, again, I probably would do additional testing and see whether there's something real, but that is a real signal that I would take seriously, because it's based on information generated within that individual. This is a somewhat separate problem, but people who deal with diagnosing deficits in older age often see a patient being tested for the first time in their practice. And then a university professor like me might have an above-average score, but potentially that individual was way above average in his early years. And so you would not necessarily interpret that individual as having a deficit, even though relative to his own standard he actually had a drop.
And so having an individual testing history for people, for individual people, that gets around that problem. Now this is of course a dreamland right now, but it's doable in principle.
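The idea of comparing a new score against a person's own testing history, rather than against norms, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the example scores, and the cutoff of two standard deviations are hypothetical and not from the conversation, and the sketch assumes a stable history with no trend such as normal aging.

```python
from statistics import mean, stdev

def deviates_from_own_history(history, new_score, z_threshold=2.0):
    """Compare a new score with the person's own earlier scores.

    Returns (z, flagged), where z is the distance of the new score from the
    person's own mean, in units of the person's own standard deviation.
    The threshold of 2.0 is an illustrative assumption.
    """
    if len(history) < 3:
        raise ValueError("need at least three earlier scores")
    spread = stdev(history)  # sample SD of this person's own history
    if spread == 0:
        raise ValueError("history shows no variability")
    z = (new_score - mean(history)) / spread
    return z, abs(z) >= z_threshold

if __name__ == "__main__":
    # Hypothetical yearly scores for one person (made-up numbers).
    yearly_scores = [112, 110, 113, 111, 112, 109, 111]
    z, flagged = deviates_from_own_history(yearly_scores, 98)
    print(f"z = {z:.2f}, follow up with more testing: {flagged}")
```

A flagged score here is a prompt for the additional testing described above, not a diagnosis in itself.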
A
Sure. Well, I did an interview with some folks from a company called Boston Cognitive Assessment maybe six months ago or something. I don't know if you know them or know the test, but yeah, it's a very brief, 10-minute assessment that you can repeat really as frequently as you would like.
B
Yeah.
A
And, you know, they're not the only ones in that space by any means. But there are some options coming on the market, I think, that can tackle that. And it's, you know, this, I'm going to ignore the whole thing.
B
Yes. Say that again. Yeah, I would very much recommend doing something like that, especially in situations where there's some likelihood that you see a patient repeatedly. I mean, ideally, I think it would be something that basically can happen in a family practice while people are in the waiting room, just to get that type of information. That would be so useful, much more useful than any of the large norming studies that we are basing our information on right now.
A
I was just going to say that, yeah, this opens the whole can of worms of mental health and, you know, keeping it on the same level as physical health. But yeah, if we were doing an annual or semiannual cognitive assessment with our primary care doc...
B
We're spending time and money on so many, so many things, why not on that?
A
Yeah, that's true. I mean, it's true. I'm with you. I'm with you. So before we transition, I mean, we've given, we've taken little dips into strategies that can help with our interpretation, but just on a, I guess a broad level or kind of big picture, given the state of things now and how most people, I think, are doing assessment, what is the most sound way to interpret our test data at this point?
B
Yeah, well, I mean, it really comes down to two things. The first one is: stay as much as you can with the overall level score, the full scale IQ or whatever that is in the battery that you're using, and try to extract as much meaningful information as you can from that score relative to the other things you know about that patient. I know that in ADHD diagnostic practice, the BRIEF or, you know, the questionnaire-based scores are very informative, very important, highly valid, and highly reliable. And so comparing that to the full scale IQ, I think, can be very meaningful. I'm talking a little bit beyond what I should actually know here, I'm sort of parroting what I learned from my wife, but in those cases something like the full scale IQ can really be very informative about people's potential for compensating for the deficits they have. But I would stay almost completely away from the zigzags in the profiles, with the exception that I mentioned before of trying to be well informed about the reliability of the indices that you're really interested in. And I could talk a little bit more about that, but that gets very messy. There are ways to get at that information. Unfortunately, I checked in the Wechsler handbook yesterday to see where I could find that information, and I was not able to do so. The handbooks really want you to know the overall reliability of the full scale IQ and of the indices, and those are all great. But that doesn't help you with that particular problem. There's a reliability score called Omega hierarchical, and that reliability score tells you what the specific reliability is of, let's say, verbal comprehension after extracting out the full scale IQ. That is what you need.
When you know that, then you can construct a kind of confidence interval: the minimum size of the difference between, let's say, the full scale IQ, the general level, and verbal comprehension that you need in order to accept it as real. And let's say that is 15 points, which is, I think, somewhat realistic if you assume a reliability of 0.5: a 15-point difference. But that's only for the first time you look. So that gets back to: don't use that criterion of 15 for every single comparison that you can make. Use it once and then stop. So that would be, from my world, the still allowable, maybe already somewhat shaky, approach. If you carefully apply that, I think you're still on somewhat safe ground. But I wouldn't go further.
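As a rough illustration of how a minimum-difference criterion can follow from reliability, here is a minimal Python sketch using the textbook classical-test-theory formula for the standard error of a difference between two scores. It is not necessarily the calculation Dr. Mayr has in mind: the function names, the reliability values, and the choice of a Bonferroni adjustment for repeated comparisons are assumptions made for this example. The resulting threshold depends on the confidence level and the reliabilities plugged in, so it need not match the rough 15-point figure mentioned above.

```python
import math
from statistics import NormalDist

def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def critical_z(alpha=0.05, n_comparisons=1):
    """Two-sided critical z, Bonferroni-adjusted for the number of comparisons."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_comparisons))

def critical_difference(sd, rel_a, rel_b, alpha=0.05, n_comparisons=1):
    """Smallest difference between two scores that exceeds measurement noise.

    Assumes both scores share the same SD and have uncorrelated errors.
    """
    se_diff = math.sqrt(
        standard_error_of_measurement(sd, rel_a) ** 2
        + standard_error_of_measurement(sd, rel_b) ** 2
    )
    return critical_z(alpha, n_comparisons) * se_diff

if __name__ == "__main__":
    # Hypothetical values: overall score reliability 0.96, index-specific 0.5.
    one_look = critical_difference(15, 0.96, 0.5)
    ten_looks = critical_difference(15, 0.96, 0.5, n_comparisons=10)
    print(f"one planned comparison: {one_look:.1f} points")
    print(f"ten comparisons:        {ten_looks:.1f} points")
```

The second call shows why "use it once and then stop" matters: the more comparisons are made on the same profile, the larger a difference has to be before it can be distinguished from noise.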
A
I see. Now, you mentioned the behavioral questionnaires, behavior checklists. Just briefly, I know we've been talking primarily about cognitive measures and we've used, you know, the, the Wechsler measures as the example. But do you have a sense of how this all applies to the behavioral questionnaires that we administer?
B
Well, I think it's important to understand that what I just discussed is in no way specific to cognitive tests. This is a general methodological issue that has nothing to do with whether the measure is cognitive or questionnaire based. If you want to go in and interpret specific facets of your questionnaire, and I really know very little about these, so I should...
A
Sure.
B
Yeah. I'm talking very abstractly now. But if you want to interpret specific aspects, again you would have to be very mindful of how reliable these are relative to the general factor that I'm sure is also expressed in these questionnaire-based measures. So the problem stays the same. You would have to look very carefully at the relationship between what the equivalent of the full scale IQ in something like the BRIEF might be and the individual scores. So the problem doesn't go away. I was just using the BRIEF as an example of an additional piece of information outside the cognitive assessment that can be brought to bear.
A
That's fair. Thank you. And then one other component I just wanted to touch on again, to make sure anybody who missed it catches it: what is the term you used for the statistic, or the measurement, that we're looking for that would capture the reliability specific to an index?
B
It's called Omega hierarchical. And I don't know whether you have something like show notes, but I can send some references. There's actually one paper by the person I mentioned before, Marley Watkins, who actually presents that type of information for the Wechsler, in fact for one version of the Wechsler.
A
Great, great. That sounds good.
B
There's also a software package that you can use to extract that information from published information about the tests.
A
Fantastic. Great.
B
Somebody you should have on your show sometime.
A
Yeah, yeah, I'm bookmarking that certainly. Yeah, I'm going to look them up. So I think we've kept people in suspense for long enough. I would love to dive into. Yeah, how can we do better, essentially? So given what we've, everything we've just talked about, you've mentioned, you know, additional measures kind of validating the results. Let's. Let's dive into that for a bit.
B
Yeah, I mean, this is now a lot more speculative, and it's also in some ways political and about the markets, because, you know, the testing industry is a big market, and the technology that is being used is pretty much the same as 50 years ago.
A
Right.
B
We're basically, you know, riding a bicycle even though we could be driving a Porsche. And it seems like there has been very little pressure from, you know, the psychological associations and so forth on the testing industry to do better. And I don't know why that is. That's not my field. But I think there is work to be done there to put more pressure on doing things better. And that can go in different directions. So one, maybe the most difficult one: as I said before, the main problem is that our cognitive tests are saturated with G. Now, it is possible that there isn't anything beyond G, and it's actually very, very difficult to go beyond it. That's sort of the field that I'm in, in my basic science. It is true, it's really hard to find specific, meaningful individual-difference variance beyond the G factor. So that's hard work. But I think it's worth trying to get to measurement instruments that measure individual aspects reliably and reduce the relationship to the G factor. So that would be one way to design instruments that actually give you meaningful profiles. And so, ideally, you would then have a much shorter battery to get at the general G factor, and then you have a bunch of satellite measures that assess the things that are still interesting but not already captured by G. So you broaden your perspective that way. With the other aspect, we get more into some methodological details that I probably don't want to bore anybody with. But there are now statistical methods that could be used to much more meaningfully and adaptively design how you select tests for a given individual, where you test somebody, and the information that you gain from that individual is immediately used to suggest the most meaningful next test that you should be doing to address or test certain hypotheses. That is something for which the technology absolutely exists.
You know, the use of Bayesian modeling. I don't know how much people know about Bayesian modeling, but it means essentially that you use the information you already have to make the best possible search for the next relevant piece of information. And that can be done adaptively. That's a little bit like the idea that I suggested before: ideally, don't go with a full-scale 12-test battery. Pick a few tests that really get at general cognitive ability, then test specific hypotheses about what might be going on, and oversample those tests where you think that something interesting might be happening. That would be a tailored, adaptive way to do that. But of course our instruments right now are not geared towards doing that. So you can't ask a current practitioner just to go and do that. You would have to have different testing technology to do that.
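To make the idea of Bayesian adaptive selection concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not a description of any existing instrument: it assumes a single ability dimension, a standard normal prior, a two-parameter logistic item model, and made-up item parameters, and every function name is hypothetical. The current belief about ability is updated after each response, and the next item is the one expected to reduce the remaining uncertainty the most.

```python
import math

# Grid of candidate ability values from -4 to 4 (illustrative resolution).
GRID = [x / 10.0 for x in range(-40, 41)]

def p_correct(theta, a, b):
    """Two-parameter logistic model: a = discrimination, b = difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def normal_prior():
    """Standard normal prior over ability, normalized on the grid."""
    weights = [math.exp(-0.5 * t * t) for t in GRID]
    total = sum(weights)
    return [w / total for w in weights]

def update(belief, item, correct):
    """Bayes' rule: multiply the belief by the likelihood, then renormalize."""
    a, b = item
    unnorm = [
        p * (p_correct(t, a, b) if correct else 1.0 - p_correct(t, a, b))
        for t, p in zip(GRID, belief)
    ]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def belief_mean(belief):
    return sum(t * p for t, p in zip(GRID, belief))

def belief_variance(belief):
    m = belief_mean(belief)
    return sum(p * (t - m) ** 2 for t, p in zip(GRID, belief))

def expected_variance_after(belief, item):
    """Average remaining uncertainty over the two possible responses."""
    a, b = item
    p_yes = sum(p * p_correct(t, a, b) for t, p in zip(GRID, belief))
    return (p_yes * belief_variance(update(belief, item, True))
            + (1.0 - p_yes) * belief_variance(update(belief, item, False)))

def pick_next_item(belief, items):
    """Choose the item expected to leave the least uncertainty."""
    return min(items, key=lambda item: expected_variance_after(belief, item))

if __name__ == "__main__":
    items = [(1.0, -3.0), (1.0, 0.0), (1.0, 3.0)]  # made-up (a, b) pairs
    belief = normal_prior()
    first = pick_next_item(belief, items)
    belief = update(belief, first, correct=True)
    print("first item:", first, "ability estimate:", round(belief_mean(belief), 2))
```

The same loop generalizes in principle to choosing between whole tests rather than single items, but that would require the large pooled data sets and different testing technology discussed in the conversation.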
A
Yeah, I think that's where things get frustrating and I, you know, I don't know a lot honestly about the testing industry, but what seems to be on the surface is the fact that, you know, a lot of tests are sort of locked behind different publishing houses makes this kind of difficult. Right. So it's hard to sample from each of these, you know, these different measures and put together a truly comprehensive or meaningful battery. I suppose because you're, you know, you have to switch between different platforms and the data isn't housed in the same place and then you're doing your own calculations on the results. And you know, that, that seems hard. That seems to be a component of this.
B
Yeah, I... Oh God, I'm losing my thread here.
A
There's a lot of threads.
B
I think some hope here comes from the big data technology side, because this is essentially a big data problem where, in order to get these types of Bayesian estimates, you really need lots of data. You don't need to have one sample of a thousand participants that are tested, you know, for norming studies. You just need a lot of people who do different types of tests where you can collect information and then gear these procedures based on these data. And it's a problem that is solved in principle. It just needs somebody who wants to make the R&D investment in this.
A
Yeah, yeah, yeah. And I know that there. So we're talking essentially about computerized adaptive testing here. Right. I mean, so. And just to again, make it super concrete, the theory is like, you know, you give someone like a relatively, I don't know, brief set of subtests or something, and then if they do poorly on a verbal subtest, then it triggers, hey, we're going to administer these additional 10 to 20 items looking at verbal comprehension, you know, to go deeper into whatever.
B
Exactly, yes.
A
And that technology is there. I mean, that's essentially what they use for the GRE, I think, and the SAT, you know.
B
Yes. And they basically have to come up with new versions every year. So it really is doable.
A
Yeah.
B
So, yeah, somebody has to lobby, you know, Pearson or whatever they're called.
A
Someone has to do it. Yeah, there's a lot, there's a lot to consider there. And there's some downsides. I know. You know, capitalism is important and making money and selling different tests is important. And sometimes that comes up against best practices. Goodness. Are there other strategies that we can use? Anything else that can be helpful with what we've got right now in terms of interpreting and using our data in a meaningful way?
B
You know, I think the few things that I've said are sort of the ones that I feel comfortable with right now. I think, more generally, everybody, as scientists, as practitioners, we have to be aware of the confirmation bias that haunts everything we do and think about. And, you know, psychological practice is not free of that confirmation bias. And the testing manuals that present you with that ready-to-go information about strengths and weaknesses and so forth are sort of designed to work with that confirmation bias and give it something to work with. And I think that's, you know, if nothing else, a take-home message to get out of this: don't fall for that.
A
I like that. I like that. Yeah, we've talked about bias on the podcast a few times in the past, and I'm currently trying to schedule another guest just to talk about, you know, bias and diagnostic impressions. So it's important, it's important. I'm glad you highlighted that. Well, it's been a great discussion. I know in some ways we could see this as a little bit of a bleak discussion, but there are some ways that we can combat, you know, the problems here, and I appreciate that you highlighted those. And it's just important to keep front and center. Right. It's easy, like I said at the beginning, to fall into the temptation to over-interpret our data and succumb to the pressure of making meaning out of things to, quote, unquote, help our clients.
B
So, pleasure.
A
Yeah, thanks for being here. Thank you.
B
Bye Bye.
A
All right, y'all, thank you so much for tuning into this episode. Always grateful to have you here. I hope that you take away some information that you can implement in your practice and in your life. Any resources that we mentioned during the episode will be listed in the show notes, so make sure to check those out. If you like what you hear on the podcast, I would be so grateful if you left a review on iTunes or Spotify or wherever you listen to your podcasts. And if you're a practice owner or aspiring practice owner, I'd invite you to check out the Testing Psychologist Mastermind Groups. I have mastermind groups at every stage of practice development: beginner, intermediate, and advanced. We have homework. We have accountability. We have support. We have resources. These groups are amazing. We do a lot of work and a lot of connecting. If that sounds interesting to you, you can check out the details at thetestingpsychologist.com/consulting. You can sign up for a pre-group phone call and we will chat and figure out if a group could be a good fit for you. Thanks so much. The information contained in this podcast and on the Testing Psychologist website is intended for informational and educational purposes only. Nothing in this podcast or on the website is intended to be a substitute for professional psychological, psychiatric or medical advice, diagnosis or treatment. Please note that no doctor-patient relationship is formed here and, similarly, no supervisory or consultative relationship is formed between the host or guests of this podcast and listeners of this podcast. If you need the qualified advice of any mental health practitioner or medical provider, please seek one in your area. Similarly, if you need supervision on clinical matters, please find a supervisor with an expertise that fits your needs.
Date: March 31, 2025
Host: Dr. Jeremy Sharp
Guest: Dr. Ulrich Mayr, Professor for Neuroscience, University of Oregon
In this episode, Dr. Jeremy Sharp welcomes Dr. Ulrich Mayr to explore a crucial and often underappreciated topic in psychological assessment: the limits of our testing data and the common pitfall of over-interpreting results, particularly profile-based scores in cognitive batteries. Drawing on Dr. Mayr's expertise in the neuroscience of cognitive measurement and his collaborative discussions with his wife, a practicing psychologist, the conversation bridges theory and clinical pragmatism. The episode addresses the reliability (and unreliability) of different test scores, what clinicians can realistically and ethically infer from testing data, and how practitioners can adhere to best practices, despite inherent measurement constraints.
Quote:
"In these conversations, it often becomes clear that there is a bit of a tension between what appears to be sort of a regular practice among testing psychologists in how to interpret these profiles, these test results... and what from a more basic science side, where you recognize the methodological constraints, would seem allowable as safe and sound inferences." — Dr. Mayr (05:20)
Two Types of Information:
Empirical Evidence:
Quote:
"You basically went through all this process of identifying... profile based scores and you generated in the end meaningless information." — Dr. Mayr (11:38)
Core Problem:
Pie Chart Analogy:
Quote:
"Once you remove the information that is specific to the general factor... about 60% of the pie chart... is eaten up by G... The remaining 40%, half of that is measurement error... That leaves us with... 20% that's actually the ability we're thinking we're measuring."
— Dr. Sharp and Dr. Mayr (20:20)
Quote:
"There's enough information there to do that [group analyses]... It's unfortunately just not enough in most cases to draw inferences about individuals." — Dr. Mayr (21:24)
Don’t chase every pattern or scatter in the data.
Be Hypothesis-Driven:
Quote:
"You have one hypothesis and that's what you test. You don't let sort of bottom up overwhelm you with differences popping up in the profile." — Dr. Mayr (29:13)
If you spot an unexpected difference:
Statistical Correction:
Quote:
"Having an individual testing history for people... would get around that problem. Now this is of course a dreamland right now, but it's doable in principle." — Dr. Mayr (39:45)
Focus on What’s Reliable:
Profile Scores:
Quote:
"...stay as much as you can with the overall level score, the full scale iq... try to extract as much meaningful information relative to the other things you know about that patient from that score... but I would stay almost completely away from the zigzags in the profiles." — Dr. Mayr (42:17)
Testing technology is outdated:
Systemic Barriers:
Quote:
"We basically, you know, dry riding and driving a bicycle, even though we could be driving a Porsche. And it seems like there has been very little pressure... on the testing industry to do better." — Dr. Mayr (48:33)
Quote:
"We have to be aware of the confirmation bias that haunts everything we do and think about. And... the testing manuals... are designed to work with that confirmation bias and give it something to work with... Don't fall for that." — Dr. Mayr (55:26)
Omega Hierarchical Reliability:
Look for “Omega hierarchical” to assess the reliability that is specific to an index after G is extracted. Dr. Mayr was not able to find it in the Wechsler handbook, so see Marley Watkins’ work for published estimates.
Watkins, Marley W.
Research on the reproducibility and value of profile-based scores (e.g., “Psychometric Perspectives on the Assessment of Learning Disabilities”).
Episode summary prepared for The Testing Psychologist Podcast, Episode 501, “How to Be More Confident in Our Data.”