
A
This is Endocrine Feedback Loop. I am your host, Chase Hendrickson, and I welcome you to this Journal Club podcast series brought to you by the Endocrine Society. Thanks for joining us as we explore an important article recently published in one of the Society's clinical journals. Hello and welcome again to the Endocrine Feedback Loop podcast for our 49th episode and the first of our fifth season. We hope that you will join us in the audience this season as we record our next episode at ENDO 2024 in Boston. But for this month's episode, we delve into the world of machine learning to look at how such an approach might help us clinically determine the cause of a patient's Cushing's syndrome. These days we all have questions about how artificial intelligence will affect the care we provide to our patients, and this study might give us a glimpse into that new world. As you would guess, we will need to be particularly critical about how AI might interface with a diagnostic approach that has been developed over many decades, especially since the nuances of advances like machine learning are not in the wheelhouse of many, if any, endocrinologists. We will at least try to ask some of those questions, if not give you all of the answers. Before I introduce our team today, I will remind you that I host the Endocrine Feedback Loop and work at Vanderbilt University Medical Center in Nashville, Tennessee, as a general endocrinologist and medical director. Back again today as a regular contributor is the podcast's resident pituitary expert, Katie Gutenberg. She works at the University of Texas at Houston, where she is the director of their Endocrinology Fellowship Program and focuses her clinical care on pituitary disorders. She is a master educator at McGovern Medical School, where she teaches extensively. Our guest expert today comes to us from Cedars-Sinai in Los Angeles, California, and will be well known to many of you.
Odelia Cooper is internationally known for her expertise in pituitary disease, with numerous publications to testify to that. At Cedars-Sinai, she directs their Fellowship Program in Endocrinology and their Clinical and Translational Research Center, with her own research focusing on invasive pituitary tumors. So as you can easily tell, the perfect pair of endocrinologists joins me to dissect a paper on Cushing's syndrome. As is always the case, everything we say is our opinion only and not that of our respective institutions or the Endocrine Society. Today we review "Machine Learning May Be an Alternative to BIPSS in the Differential Diagnosis of ACTH-Dependent Cushing's Syndrome," a forthcoming article in the Journal of Clinical Endocrinology and Metabolism. Ahmet Nouman Demir at Istanbul University-Cerrahpaşa served as the first author of this paper and was joined by authors at multiple universities in Istanbul. Now I will turn our discussion over to Katie. She will walk us through the introduction and the key points that the authors make therein, as well as get Odelia to help us understand some of the key aspects of the diagnostic evaluation of Cushing's syndrome.
B
Thanks, Chase. First, in terms of some background, ACTH-dependent Cushing's syndrome is typically caused by an ACTH-producing pituitary tumor. A smaller proportion of cases are due to ectopic ACTH syndrome. Several tests, including dynamic testing and imaging, are used to differentiate Cushing's disease from ectopic ACTH syndrome. However, bilateral inferior petrosal sinus sampling, which was first introduced in the 1980s, remains the gold standard. Its use is limited by a few factors. It is an invasive procedure, it's expensive, and a high level of expertise is required, so it's typically limited to large medical centers, which can limit access for some populations. I'll stop there and let Odelia comment on IPSS and perhaps some of the limitations of that test.
C
Thank you, Katie and Chase, for the introduction. So as you mentioned, IPSS is an invasive test, and the reason we have to rely on it is that pituitary MRI in Cushing's disease is often negative and does not show a distinct adenoma, while patients with ectopic ACTH syndrome tumors can also have nonfunctioning pituitary microincidentalomas. So we really do have to rely on more invasive approaches to localize the source of ACTH production. The sensitivity of IPSS for Cushing's disease ranges from 88 to 100%, with a specificity of 67 to 100%. It is relatively safe; there are some complications, but the rates tend to be low in expert centers. The more common complications, like groin hematomas and vasovagal reactions, occur in less than 5%. The more serious complications are thankfully extremely rare, and they include pulmonary embolism, sixth cranial nerve palsy, venous subarachnoid hemorrhage, and brainstem infarction. In terms of accuracy, especially as we no longer have CRH available for IPSS, at least in this country, we now use desmopressin. The reason we use desmopressin is that it also stimulates ACTH secretion, but it does so by binding the pituitary vasopressin type 3 receptors, and corticotroph tumors express vasopressin receptors, while for the most part ectopic tumors do not. There are rare exceptions where ectopic tumors do have vasopressin receptors, and in those cases we can actually get a false positive result. So that could be a limitation to the effectiveness of desmopressin stimulation for IPSS. However, clinically it still performs as well as CRH. But there are other limitations to IPSS, and one can encounter false negative results. A big limitation is the anatomic variation that can be seen.
Up to 40% of patients have variations such as anomalous venous drainage, hypoplastic petrosal sinuses, or unequal drainage of the cavernous sinuses, and that can lead to samples being diluted by non-pituitary venous drainage. However, venography at the time of IPSS can help with these limitations. Further, we are now using prolactin levels to normalize ACTH levels and compensate for some of the unsuccessful catheterizations we may encounter. Adding prolactin normalization can take the sensitivity of IPSS from 94% to as high as 99%. But there are some other limitations to IPSS as well. Sometimes corticotroph adenomas are not in the sella; they're actually in the sphenoid sinus, and they can behave almost like an ectopic source. A big one is lab error: if the ACTH is not collected properly, if it's not on ice, if it's not in the EDTA tube, or if there's delayed handling or processing, you can get spuriously low ACTH levels and mistakenly interpret the IPSS as indicating an ectopic source. If a pituitary adenoma lacks CRH receptors, for those who use CRH stimulation, that's another limitation. And I think a big point is that if patients are not actively hypercortisolemic at the time of the IPSS, such as in cyclical Cushing's, you can also get a false negative result. So you really should be confirming active hypercortisolemia at the time of the IPSS. Sometimes you do get false positives. For instance, an ectopic ACTH syndrome tumor can have intermittent ACTH production that can cause a false positive. Or low peripheral ACTH levels, such as in patients who have adrenal Cushing's, can also lead to an inaccurate interpretation. So there are these limitations that need to be kept in mind every time we send patients for IPSS. And yes, it is invasive, but again, in the hands of expert centers, it can be done well and give us a fairly high sensitivity for localizing the source.
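The gradient and prolactin-normalization ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the protocol used by the authors or by any specific center: the `PetrosalSample` record and the numbers in the usage example are hypothetical, concentrations are assumed to be in consistent units per hormone, and no diagnostic cutoff for the normalized ratio is implied.

```python
from dataclasses import dataclass


@dataclass
class PetrosalSample:
    """Hypothetical record: one inferior petrosal sinus (IPS) draw plus the
    simultaneous peripheral draw. Units only need to be consistent within
    each hormone, because both quantities below are unitless ratios."""
    ips_acth: float
    peripheral_acth: float
    ips_prolactin: float
    peripheral_prolactin: float


def central_to_peripheral_acth(s: PetrosalSample) -> float:
    """Plain IPS:peripheral ACTH gradient."""
    return s.ips_acth / s.peripheral_acth


def prolactin_normalized_acth(s: PetrosalSample) -> float:
    """ACTH IPS:peripheral ratio divided by the prolactin IPS:peripheral ratio.

    Prolactin comes from the normal pituitary, so a low prolactin gradient
    suggests the sample was diluted by non-pituitary venous blood; dividing
    by it adjusts the ACTH gradient for that dilution."""
    prolactin_ratio = s.ips_prolactin / s.peripheral_prolactin
    return central_to_peripheral_acth(s) / prolactin_ratio


# Hypothetical values for illustration only.
sample = PetrosalSample(ips_acth=45.0, peripheral_acth=30.0,
                        ips_prolactin=12.0, peripheral_prolactin=10.0)
print(round(central_to_peripheral_acth(sample), 2))  # 1.5
print(round(prolactin_normalized_acth(sample), 2))   # 1.25
```

Working with ratios rather than raw concentrations is the design choice here: dilution that lowers both hormones in the same sample largely cancels out.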
B
Thank you for that excellent overview. So, a little bit about machine learning. Machine learning is a branch of artificial intelligence that automatically captures patterns from data rather than following explicitly programmed rules. The aim of the current study was to develop machine learning algorithms that can help in the differential diagnosis of ACTH-dependent Cushing's syndrome. And with that, I'll hand it over to Chase, who will walk us through the methods.
A
Okay, thanks. So we will do what we normally do and start with the methodology for this paper. I would describe this as a study of diagnostic accuracy. It's exactly what Katie wrapped up with: these authors were trying to understand whether a machine learning algorithm would perform reasonably well in the differential diagnosis of Cushing's syndrome. With these studies of diagnostic accuracy, I think a couple of things are helpful to keep in mind. One is that, if you want to think about them in terms of a classic study design, they are just variations on cross-sectional studies. An important aspect of a cross-sectional study is that all of the data are collected at a single point in time. That can be a little confusing, particularly because with these sorts of studies, it may take several months for all of the data to be collected. But the way the study functions, all of the data are treated as if they came from one point in time. An alternative would be a cohort study, where you follow people over time, either prospectively or retrospectively. That's not what's done here. All of the data are collected and compared at a single point in time. And if you want to go further with that observational study design and think about it in terms of exposures and outcomes, it gets a little weird, but the exposure would be what's on the differential. The different groups are split based on whether they have a pituitary source for their ACTH, that is, Cushing's disease, or an ectopic source. The entire group consists of those with Cushing's syndrome, but then you subdivide it, and that's how you define the exposure. The outcome would then be whether they tested positive or negative on whatever test you are evaluating. So that feels a little bit weird.
And it's maybe not always the most helpful way to do it, but the structure is the same. The other thing that's helpful to keep in mind is the idea of a gold standard. We're going to come back to this on a couple of occasions. Most studies of diagnostic accuracy are done when a new test is introduced, and the reason you're doing this is that you have a new test that in some way is better than the current test. There can be a variety of reasons for that. The easiest ones to think about are tests that are cheaper, tests that are faster, or tests that, as we're going to talk about here, require less expertise to perform. That's the reason you might want a new test. But then the question is: does it perform as well as the test we're using right now? The current test is called the gold standard. By the way these studies are structured, it is assumed that the gold standard is 100% accurate. As we all know from medicine, virtually no test is 100% accurate. So we have to think about that. When you look at the gold standard and how it is used, you at least have to ask how likely it is that what we're calling the gold standard was in fact erroneous on some occasions. We're going to come to that in a few minutes; Odelia is going to help us think about whether that's something we need to be concerned about here. Okay, that's enough about the basics of the study design. Now we'll get into this particular evaluation. This was a multicenter retrospective study with three referral centers. It was hard for me to determine this for sure, but I assume they're all in Istanbul, since that's where all the authors come from. They included everybody who underwent IPSS between 2016 and 2022.
And you had to be followed for at least a year. Again, this was not a cohort study, so this was not following patients and collecting data; this was just to confirm the diagnosis. As you'll see in just a second, they excluded patients who did not have a pathologically confirmed diagnosis of Cushing's disease and people who had missing data. They included all available patients, except for the handful who were excluded. Now, Odelia pointed out as we were doing our analysis initially that this was not a predetermined number, at least not as presented in this paper. It's all the patients that were available, so there was no sample size calculation to determine how many individuals needed to be included to get meaningful results. Something to keep in mind. We'll move on now to the definitions for these endocrine tests, what counted as a positive test, and some details of the radiological assessment. Many of these things were standard, but not everything was. The initial tests, which will be familiar to you, included a urine free cortisol, a late-night salivary cortisol, and a 1 mg dexamethasone suppression test. Importantly, at these centers, if at least two of those tests were positive, the next step was a 2-day, 2 mg dexamethasone suppression test, and a positive result, using the same cortisol cutoff of 1.8 µg/dL, confirmed the diagnosis of Cushing's syndrome. So, Odelia, I thought this would be a helpful place for you to give us some insight. In the US at least, this is not a frequently performed test, or at least not used in this way. Could you give us some insight into the pros and cons of the approach here? And something else you mentioned is that we don't have dexamethasone levels for any of the dexamethasone suppression tests.
So what limitations might be introduced with that? So a couple questions for you.
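To make the diagnostic-accuracy framing Chase describes concrete, here is a minimal Python sketch of the 2x2 table such a study builds, treating Cushing's disease as the positive class. The class name and the counts in the usage example are hypothetical, not the study's data, and every metric inherits the assumption discussed above that the gold-standard labels are correct.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts of index-test results against the gold-standard diagnosis.
    Positive class: pituitary source (Cushing's disease)."""
    tp: int  # gold standard: pituitary; index test: pituitary
    fn: int  # gold standard: pituitary; index test: ectopic
    fp: int  # gold standard: ectopic;   index test: pituitary
    tn: int  # gold standard: ectopic;   index test: ectopic

    def sensitivity(self) -> float:
        """Share of true Cushing's disease cases the test calls pituitary."""
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        """Share of true ectopic cases the test calls ectopic."""
        return self.tn / (self.tn + self.fp)

    def accuracy(self) -> float:
        """Share of all patients classified the same way as the gold standard."""
        return (self.tp + self.tn) / (self.tp + self.fn + self.fp + self.tn)


# Hypothetical counts for illustration only.
table = TwoByTwo(tp=45, fn=5, fp=4, tn=16)
print(table.sensitivity())         # 0.9
print(table.specificity())         # 0.8
print(round(table.accuracy(), 3))  # 0.871
```

Sensitivity and specificity are each computed within one gold-standard group, whereas accuracy pools both groups, so accuracy also depends on how many Cushing's disease versus ectopic cases are in the sample.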
C
Yes, as you pointed out, we're used to the standard 24-hour urine free cortisols, the late-night salivary cortisols, and the 1 mg overnight dex suppression test. The low-dose dex suppression test they used for confirmation, the 2-day, 2 mg test, does have greater specificity than the overnight 1 mg dex suppression test. But as you point out, in the US we generally don't do this test because there are issues with compliance. Patients have to take the dexamethasone exactly on time and then come in at exactly the right time to have the levels drawn. And we find that with the UFCs and the late-night salivaries, we do multiple tests; we don't rely on just one, we repeat and repeat, so we get really good data from those initial screening tests. Where we sometimes use the 2-day low-dose dex suppression test is when we want to rule out non-neoplastic physiologic hypercortisolism, what we used to call pseudo-Cushing's. But in general, I can't recall the last time I ordered the 2-day dex test in my own practice. So I do question why the authors proceeded with this 2-day test if they already had two out of three positive tests. Does it really add to the sensitivity? I just want to highlight one meta-analysis, a systematic review of all dynamic tests for Cushing's syndrome, conducted by Elamin et al. in 2008 in JCEM. The 2-day dex suppression test alone had a likelihood ratio for Cushing's syndrome of 7.3, with a diagnostic odds ratio of 50. However, when urine free cortisol, late-night salivary cortisol, and the 1 mg dex suppression test were combined, the diagnostic odds ratio went up to 7,965, with a likelihood ratio of 174. And even with just two of the three common tests, you still get a high diagnostic odds ratio of either 149 or 3,000, depending on which combination you use.
So really, the 2-day low-dose dex suppression test doesn't add any additional information. Another point, as you mentioned, is that they didn't report measuring dexamethasone levels to confirm the validity of the cortisol results. The reason that's important is that we often see false positive results if patients are on medications that affect dexamethasone metabolism, such as CYP3A4 inducers, or if they have malabsorption, say patients with celiac disease or chronic diarrhea, which can reduce dexamethasone absorption. You can also sometimes get false negative results from medications that inhibit dex metabolism. And anything that raises corticosteroid-binding globulin or albumin levels can also give you a falsely high total cortisol. A big one is women on oral contraceptive pills, which can certainly give us a high total cortisol. According to our guidelines, we take patients off their OCPs for at least six weeks before doing any dex testing. That wasn't mentioned here, and given that this was a strongly female-predominant cohort in both groups, I think it would have been a very important point to highlight when they were going through their testing.
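The likelihood ratios and diagnostic odds ratios quoted from the Elamin et al. meta-analysis follow standard formulas, sketched below in Python. The function names and the 90%/90% sensitivity and specificity in the usage example are illustrative assumptions, not figures from that meta-analysis. The formulas divide by (1 - specificity) and by specificity, and the odds ratio additionally by (1 - sensitivity), so a test with perfect sensitivity or specificity yields an infinite ratio; the sketch does not handle that edge case.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-).

    LR+ = sensitivity / (1 - specificity): how much a positive result
    multiplies the odds of disease.
    LR- = (1 - sensitivity) / specificity: how much a negative result
    multiplies (shrinks) the odds of disease."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def diagnostic_odds_ratio(sensitivity: float, specificity: float) -> float:
    """DOR = LR+ / LR-, a single-number summary of discrimination."""
    lr_positive, lr_negative = likelihood_ratios(sensitivity, specificity)
    return lr_positive / lr_negative


def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Bayes' theorem in odds form: post-test odds = pre-test odds x LR."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Illustrative test with 90% sensitivity and 90% specificity.
lr_pos, lr_neg = likelihood_ratios(0.9, 0.9)
print(round(lr_pos, 2), round(lr_neg, 3))            # 9.0 0.111
print(round(diagnostic_odds_ratio(0.9, 0.9), 1))     # 81.0
print(round(post_test_probability(0.5, lr_pos), 2))  # 0.9
```

The last function shows why a likelihood ratio of 174 is so much more persuasive than one of 7.3: it multiplies the pre-test odds directly.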
A
Thank you. We'll think about that more as we come to the results. Katie is going to walk us through how important that test, the 2-day, 2 mg test, turns out to be in this machine learning algorithm, so we're definitely going to have to think about that more and about the applicability of these results. Back to the different cutoffs used here. As far as ACTH goes, a level less than 10 indicated ACTH-independent Cushing's syndrome, while a level greater than 20 was suggestive of ACTH-dependent Cushing's syndrome. For intermediate values, a CRH or desmopressin stimulation test was performed. As for imaging, as was alluded to already, if you had an adenoma larger than 6 millimeters, you were referred to surgery. But if the adenoma was less than 6 millimeters in size, or if one was not seen, that was the trigger for an IPSS. We'll come back to the IPSS shortly, but first we wanted to get a little insight from Odelia. This is something you think about a lot as far as the MRIs go, and there have been a lot of advances there, with newer protocols. The authors themselves report that they used two different types of MRI with advancing technology. So for those of us who are not nearly as familiar with this as you and Katie are, could you give us a high-level overview of the differences in the MRI machines that were used and some of the advantages of the newest technology, including some that may not have been used in this evaluation?
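The triage cutoffs described above can be written as a small decision sketch in Python. It encodes only what is stated in the episode; the function names are hypothetical, ACTH units are not given in the discussion (pg/mL is assumed), and an adenoma of exactly 6 mm is not addressed, so the sketch sends it to BIPSS as an arbitrary choice.

```python
from typing import Optional


def acth_category(acth_pg_ml: float) -> str:
    """Classify a plasma ACTH level using the cutoffs described in the episode.
    Units assumed to be pg/mL; the discussion does not state them."""
    if acth_pg_ml < 10:
        return "ACTH-independent"
    if acth_pg_ml > 20:
        return "ACTH-dependent"
    return "indeterminate: CRH or desmopressin stimulation test"


def next_step_after_mri(adenoma_mm: Optional[float]) -> str:
    """Route an ACTH-dependent case by MRI finding.
    None means no adenoma was seen. An adenoma larger than 6 mm goes to
    surgery; smaller or absent goes to BIPSS (exactly 6 mm: assumed BIPSS)."""
    if adenoma_mm is not None and adenoma_mm > 6:
        return "refer for pituitary surgery"
    return "BIPSS"


print(acth_category(35))          # ACTH-dependent
print(next_step_after_mri(4.0))   # BIPSS
print(next_step_after_mri(None))  # BIPSS
print(next_step_after_mri(8.0))   # refer for pituitary surgery
```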
C
So, as we alluded to at the beginning, we often don't see adenomas in our Cushing's disease patients. Up to 40%, even 50%, may not show any adenoma, so there is a push to find more sensitive imaging approaches to detect those small lesions. With 1.5-tesla MRIs, the sensitivity is fairly poor; you will miss about half of the microadenomas. So usually we push for a dynamic contrast-enhanced MRI if we don't see a distinct lesion. But there are other technical advances. We now have what we call spoiled gradient recalled echo, or SPGR, which gives us 1-millimeter slice intervals, so we can look even more closely for those microadenomas. There's the fluid-attenuated inversion recovery method and constructive interference in steady state. These can all really enhance the detection of those adenomas. There are also variants of the T1-weighted turbo spin echo sequences. And now 3-tesla magnets are really standard, and we're even using ultra-high-field 7-tesla magnets, to find those microadenomas. This really boosts the sensitivity for these lesions, to close to 80%. Not perfect; we still have a ways to go, but these techniques are definitely pushing the envelope.
A
And now back to the technique the authors used for their IPSS. Specifically, this is bilateral inferior petrosal sinus sampling, or BIPSS, as it's abbreviated here, and the authors did it with CRH. As standard, peripheral and petrosal samples were obtained at baseline and then at 1, 2, 5, and 10 minutes. I want to give some definitions for how the gold standard is used here, that is, what counted as having disease. Cushing's disease was diagnosed if the central-to-peripheral ratio was greater than three, and those patients were referred for pituitary surgery. Ectopic ACTH syndrome was confirmed with a pathologic examination, or, if there was no positive pathology, the patient had to meet all of the following criteria: no gradient on IPSS, no adenoma on MRI, no suppression on the high-dose dexamethasone suppression test, and no stimulation with CRH. So Odelia, help us think about this again. This is the gold standard, and the way the test works, we have to assume that the gold standard is always right. Do you have any worry that these definitions might have introduced some inaccuracy in the diagnosis of the cause of these individuals' Cushing's syndrome?
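Without pathology, the ectopic definition described above is a strict AND of four findings. A minimal Python sketch makes that structure explicit. The field names are hypothetical, and "no gradient" is assumed to mean a peak post-CRH central-to-peripheral ratio of 3 or less, mirroring the greater-than-three rule used for Cushing's disease.

```python
from dataclasses import dataclass


@dataclass
class EctopicWorkup:
    """Hypothetical summary of the findings used in the ectopic definition."""
    pathology_confirms_ectopic: bool
    peak_ratio_after_crh: float  # highest central:peripheral ACTH ratio
    adenoma_on_mri: bool
    suppressed_on_hddst: bool    # high-dose dexamethasone suppression test
    rose_with_crh: bool


def meets_ectopic_definition(w: EctopicWorkup) -> bool:
    """Pathology alone suffices; otherwise ALL four clinical criteria must hold.
    'No gradient' is assumed here to mean a peak ratio of 3 or less."""
    if w.pathology_confirms_ectopic:
        return True
    no_gradient = w.peak_ratio_after_crh <= 3
    return (no_gradient
            and not w.adenoma_on_mri
            and not w.suppressed_on_hddst
            and not w.rose_with_crh)


# Same hypothetical patient, with and without a pituitary incidentaloma.
clean = EctopicWorkup(False, 1.4, False, False, False)
with_incidentaloma = EctopicWorkup(False, 1.4, True, False, False)
print(meets_ectopic_definition(clean))               # True
print(meets_ectopic_definition(with_incidentaloma))  # False
```

Under this rule, a single incidental pituitary lesion or a spurious suppression on the high-dose test is enough to keep an otherwise ectopic case out of the ectopic group unless pathology is available.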
C
Yes, it's a good question. It's interesting that they require all these criteria. To us, an IPSS that does not show a gradient is what would point to an ectopic tumor; requiring no adenoma on MRI, I'm not sure, is really quite as helpful. We certainly know MRI is not that helpful for Cushing's disease tumors, but we also know that patients with ectopic tumors can often have these microincidentalomas. In fact, if you look at their Table 1, 31% of their ectopic cases had an adenoma on MRI and still seemed to be included in their cohort. So I wasn't quite sure how to interpret their cohort and whether it included some patients who didn't quite meet their inclusion criteria. As for the high-dose dexamethasone suppression test, which is the 8 mg test, it's not really an optimal test to rule in or rule out an ectopic source. It has low accuracy and a number of confounders: it's affected by patient age, sex, and the severity of hypercortisolism, and studies show overlap between ectopic and Cushing's disease tumors. The cutoff values are generally felt to be controversial for using this test to distinguish between ectopic and Cushing's disease, and it's certainly questionable when patients are markedly hypercortisolemic. And again, as we mentioned before, without dexamethasone levels its validity would be questionable. If you look again at their Table 1, 11 of the 26 patients in the ectopic cohort, 42%, actually suppressed on the high-dose dex suppression test, and in the Cushing's disease cohort, 15% did not suppress. That just shows how inaccurate this test is; you get marked overlap, and it's not consistent. So relying on this test was a big concern for me. The CRH stimulation test, as we mentioned, relies on the corticotroph tumors expressing CRH receptors, but not all Cushing's disease tumors will respond the same way, and ectopic tumors can express CRH receptors and might respond as well, so its sensitivity is also questionable.
And really, none of these biochemical tests has 100% specificity, and results can be discordant in up to 65% of patients. My final thought was that I didn't see any mention of whole-body imaging to look for an ectopic source. That's actually a key step for us when we're trying to determine whether this is an ectopic tumor: not only do patients lack a gradient on IPSS, but we also have to do whole-body imaging and look for a lesion. They didn't mention that. So there were just some questions that came to mind about their definition of their ectopic cohort.
A
So a few things for us to keep in mind. These studies always have this limitation of the gold standard: it's often not going to be correct 100% of the time, and that's something we need to recognize. We can't get away from it entirely, though we can potentially mitigate it by being very careful with the diagnosis and its requirements. So back to the methodology, and we're going to wrap up with a few words on the machine learning. As I suggested before, none of us are experts in machine learning, so we're going to focus on some of the areas that I think are a bit more understandable. The authors do a good job of being very thorough in the details they provide, so if there are any endocrinologists in the audience who are experts in AI, the details are in the paper and I would point you to those. We're going to stick with some of the high-level points that I think are helpful to wrestle with. Where we're going to start is what features were fed into this machine learning algorithm. The first pass the authors had to do was to identify all of the variables they could feed into the algorithm that could potentially be used to determine the differential diagnosis here, that is, which form of Cushing's syndrome these individuals had. So, all of the data they started with: first, age and sex. There were quite a few biochemical variables next, and I'll just list them: ACTH, cortisol, UFC, late-night salivary cortisol, the 1 mg dexamethasone suppression test, the 2-day, 2 mg dexamethasone suppression test, the presence of suppression on the high-dose dexamethasone suppression test, the cortisol after the high-dose dexamethasone suppression test, and finally, the potassium level.
There were several radiologic variables, including whether or not you had a pituitary adenoma and, if you did, the largest diameter of that adenoma and its intensity on T2-weighted MRI. They also fed into the algorithm a couple of what they call dummy variables: fasting plasma glucose and total cholesterol. They tried several different methods of feature selection and compared these variables to figure out which ones were important, keeping the most important ones and removing the others. I'll list the features that were selected, and then I'll tell you which ones were shown not to be helpful and so were removed. The features that were selected were ACTH and all of the results of the formal diagnostic tests: the UFC, the late-night salivary cortisol, the 1 mg dex suppression, the 2-day, 2 mg dex suppression, whether or not you suppressed on the high-dose dex, and the cortisol after the high-dose dex; the potassium; and, as far as radiologic data go, the diameter of the adenoma on MRI. The features found not to be important were age, sex, cortisol level, presence of an adenoma, the intensity of the adenoma on T2-weighted MRI, and, as expected, the dummy variables, so those were removed. To wrap up our methodology discussion, Odelia, one last question for you. I was curious whether you found any of the features that were not found to be helpful interesting. At first blush, I found it interesting that age and sex were not helpful. As a general endocrinologist, I often think of those as possible clues to what type of Cushing's syndrome somebody might have, but they certainly weren't here. Could you give us some insight from your perspective on that?
C
I echo what you say. I was also surprised that some of these key parameters, age, sex, and cortisol, were removed. We know that Cushing's disease tends to be female predominant, while ectopic is more male predominant. But again, if you look at their Table 1, the proportion of females in their ectopic group was 88.5%. So again, a very different, unusual cohort, and that's probably why they didn't see sex as a significant variable. But I don't think that's typical of what we see clinically, and I definitely wonder about that. In terms of age, there's also some concern; that is definitely something we consider with our ectopic group. As for the presence of an adenoma, we know ectopics can have adenomas, so that is not a variable I would have removed. Diameter was considered significant, but I think the presence of an adenoma alone is also something to consider. The intensity on T2 I haven't encountered as an important factor, so I felt fine with removing that. I do want to mention that last year, in 2023, a group at Peking Union published a similar machine learning model in the Endocrine Journal, and there they had 11 features, including age, sex, and cortisol, all of which were considered significant variables. So it's not quite jibing with what we've seen in prior studies. So again, that raised some questions for me about this model.
A
I think the first point Odelia made is a particularly good one: potentially the reason they didn't see a sex difference is that, if you actually look at the two groups, they had virtually the same male-to-female breakdown. So it's not surprising that a machine learning algorithm didn't see a difference based on sex, because the groups were virtually the same. But then that raises the question of whether that is typical of other populations. We're going to come back to that at the very end as we think about what other data might need to be obtained. But I think that's an excellent point and a good way for us to be thinking about that. All right, we will wrap up the methodology there. Again, there is much more in the paper for those of you who really want to get into the weeds about exactly how this AI algorithm was built, but we will pass over that for the time being and move on to the results. I will hand things back over to Katie to walk us through what the authors present.
B
Starting with the patient characteristics, 131 patients underwent IPSS, and the final analysis included 106 patients. So 25 patients were excluded for a variety of reasons, including insufficient image quality, missing biochemical evaluation, or inability to confirm an adenoma on pathologic examination. In the final cohort, 75% of patients had Cushing's disease and 25% had ectopic ACTH syndrome. Next we'll talk about the performance characteristics of IPSS in this study. Using a central-to-peripheral ratio of more than 3 after CRH, IPSS had a sensitivity of 91%, a specificity of 72%, and an accuracy of 85% for detecting Cushing's disease. Performance increased with a higher ratio: with a ratio greater than 6.95, the sensitivity and specificity improved, and the accuracy rose from 85% to 91%. But I'll hand this over to Odelia; I know she commented a little on this in light of the previously reported accuracy of IPSS.
C
I was a little surprised that their accuracy was only 85%. That's fairly low compared to what is published in the literature, as well as our experience at our center, especially now with venography, better visualization, and the addition of prolactin normalization. They didn't comment on whether they did that in their IPSS. There was a fairly recent meta-analysis of 25 studies, in 2020, with 1,249 patients with Cushing's disease and 152 with ectopic ACTH syndrome, and the sensitivity was 97% under CRH stimulation, with a specificity of 100%. So again, it makes me wonder how accurate their IPSS was. Were these patients actively hypercortisolemic during the IPSS? Was the catheterization successful?
B
The next point the authors make concerns the peak ACTH value. They report that a peak ACTH value of more than 215 yielded a sensitivity of 88% and a specificity of 90% to detect Cushing's disease. So while it's not typically used in clinical practice, the peak ACTH value could potentially be used in the differential diagnosis of ACTH-dependent Cushing's syndrome. So, Odelia, I'll ask you to comment on that too.
C
They've tried to see if there are other ways to determine the accuracy of IPSS. You had also mentioned, for instance, that with a higher ratio, greater than 6.95, the ratio certainly makes it more sensitive, but, you know, that lowers the specificity. So we do have our cutoff of greater than 3, you know, as the key gradient. But when you look at just the absolute levels of ACTH, they're actually more helpful for determining whether it was an accurate cannulation. This has been shown and written up in the guidelines: an absolute inferior petrosal sinus ACTH level of less than 200 at baseline, and less than 400 pg/mL post stimulation, suggests that it wasn't a successful catheterization. That's where the absolute ACTH levels can be helpful. So a peak ACTH cutoff of 215 brings to mind that maybe there was not a successful cannulation in some of their patients, which might have given false negative results.
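Odelia's reasoning can be written as a few explicit checks. The minimal Python sketch below uses the thresholds as spoken in this discussion (200 at baseline and 400 pg/mL after stimulation), plus the authors' peak-ACTH cutoff of 215. It is illustrative only, not clinical guidance.

The spoken phrasing does not say whether both low-level criteria must be met, so the sketch returns separate flags rather than a single verdict. The `gap` flag marks Odelia's concern: a peak above the paper's 215 cutoff would count toward Cushing's disease, yet it can still sit below the 400 post-stimulation level she quotes as a sign of possibly unsuccessful cannulation. All names are hypothetical.

```python
from typing import Dict

# Thresholds as quoted in the discussion (pg/mL); illustrative only.
BASELINE_IPS_ACTH_MIN = 200.0
STIMULATED_IPS_ACTH_MIN = 400.0
PAPER_PEAK_ACTH_CUTOFF = 215.0  # peak-ACTH cutoff reported by the authors


def cannulation_flags(baseline_ips_acth: float, peak_ips_acth: float) -> Dict[str, bool]:
    """Return separate flags so the reader decides how to combine them."""
    low_stimulated = peak_ips_acth < STIMULATED_IPS_ACTH_MIN
    meets_paper_cutoff = peak_ips_acth > PAPER_PEAK_ACTH_CUTOFF
    return {
        "low_baseline": baseline_ips_acth < BASELINE_IPS_ACTH_MIN,
        "low_stimulated": low_stimulated,
        "meets_paper_cutoff": meets_paper_cutoff,
        # Counted toward Cushing's disease by the 215 cutoff, yet below the quoted 400 level.
        "gap": meets_paper_cutoff and low_stimulated,
    }


print(cannulation_flags(baseline_ips_acth=150.0, peak_ips_acth=300.0))
```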
B
So we'll move on a little bit to the machine learning algorithm piece. They used four models in this study. Logistic regression was the best-performing model, with an accuracy, again for the machine learning algorithm, of 86%. The mean area under the curve was 0.85. They then used something called SHAP values, which people may or may not be familiar with. SHAP stands for SHapley Additive exPlanations, and it helps determine the importance of the features used to predict a particular outcome. In this study, we're looking at the importance of each feature in accurately identifying Cushing's disease. So we'll go over the top four features that the SHAP analysis identified as most important. The first two were dexamethasone suppression tests. The first was the 2-day, 2 mg dexamethasone suppression test, and of note, the cortisol values were higher in patients with ectopic ACTH syndrome. Following that was the presence of suppression after the high-dose dexamethasone suppression test. Now, the authors do state that the high-dose dexamethasone suppression test alone had low sensitivity and specificity in the differential diagnosis of ACTH-dependent Cushing's syndrome, which is, I think, consistent with the literature. But I'll stop here. I know we've spoken a little bit about dexamethasone suppression testing already, but Odelia, if you want to comment on the high importance that the high-dose dexamethasone suppression test in particular had in this algorithm.
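SHAP values are simple to compute for a logistic regression. Assume features are independent, as SHAP's linear explainer does. Then each feature's SHAP value, in log-odds units, is its coefficient times how far the patient's value sits from a background (cohort-mean) value. The base value plus the SHAP values adds back up to the model's log-odds output.

The minimal Python sketch below is not the published model. The feature names mirror the top four features described above, but every coefficient, mean and patient value is hypothetical.

```python
import math
from typing import Dict

Features = Dict[str, float]

# Hypothetical coefficients (log-odds of Cushing's disease per unit); NOT the published model.
WEIGHTS: Features = {
    "ldst_2day_cortisol": -0.30,            # higher post-dex cortisol favored ectopic
    "hddst_suppression": 1.20,              # 1 if suppressed on high-dose dex, else 0
    "late_night_salivary_cortisol": -0.15,  # higher values favored ectopic
    "adenoma_diameter_mm": 0.25,            # larger adenoma favored Cushing's disease
}
BIAS = 0.5

# Hypothetical cohort means, used as the SHAP background (baseline) values.
BACKGROUND: Features = {
    "ldst_2day_cortisol": 8.0,
    "hddst_suppression": 0.6,
    "late_night_salivary_cortisol": 3.0,
    "adenoma_diameter_mm": 5.0,
}


def logit(x: Features) -> float:
    """Linear predictor (log-odds of Cushing's disease)."""
    return BIAS + sum(WEIGHTS[k] * x[k] for k in WEIGHTS)


def probability_cd(x: Features) -> float:
    """Logistic transform of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-logit(x)))


def linear_shap(x: Features) -> Features:
    """Per-feature contributions in log-odds units: w_k * (x_k - background_k)."""
    return {k: WEIGHTS[k] * (x[k] - BACKGROUND[k]) for k in WEIGHTS}


patient: Features = {
    "ldst_2day_cortisol": 4.0,
    "hddst_suppression": 1.0,
    "late_night_salivary_cortisol": 2.0,
    "adenoma_diameter_mm": 8.0,
}
contributions = linear_shap(patient)
base_value = logit(BACKGROUND)
# Additivity: base value + sum of SHAP values reproduces the model's log-odds output.
assert math.isclose(base_value + sum(contributions.values()), logit(patient))
for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:30s} {value:+.2f}")
print(f"P(Cushing's disease) = {probability_cd(patient):.2f}")
```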
C
As we mentioned, there are just a number of limitations to relying on this test, certainly for the differential diagnosis. Even just using the cortisol levels themselves, which is what they're using in their algorithm, there's a lot of overlap between Cushing disease and ectopic ACTH syndrome. Then there are the confounders of age, sex, and cortisol, which they took out of their model. These are confounders for the 2-day dex test, as is the severity of hypercortisolism, and none of these were accounted for in their algorithm. So relying on the 2-day dex test in this model does raise some concerns. And again, we don't have confirmation with dexamethasone levels, so it's hard to put a lot of stock in their absolute cortisol cutoff levels. Those are some limitations in interpreting this data.
B
And then, rounding out the top four most important features: again, the dexamethasone suppression tests, both the 2-day, 2 mg test and the high-dose test, were the top two. Following those was late-night salivary cortisol, whose values were higher in patients with ectopic ACTH syndrome, and then a larger adenoma diameter, which favored Cushing's disease. With that, I will hand it back over to Chase, who will walk us through the discussion.
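A "top four" feature ranking is usually a global summary of per-patient SHAP values, most often the mean absolute SHAP value per feature, which is what SHAP summary plots sort by. Whether the authors used exactly this summary is an assumption. The minimal Python sketch below applies it to a small, entirely hypothetical table of per-patient SHAP values. Taking absolute values matters because the same feature can push one patient toward Cushing's disease and another toward ectopic ACTH syndrome.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical per-patient SHAP values (log-odds units); one dict per patient.
shap_rows: List[Dict[str, float]] = [
    {"ldst_2day_cortisol": 1.2, "hddst_suppression": -0.8,
     "late_night_salivary_cortisol": 0.4, "adenoma_diameter_mm": 0.3},
    {"ldst_2day_cortisol": -1.5, "hddst_suppression": 0.6,
     "late_night_salivary_cortisol": -0.5, "adenoma_diameter_mm": 0.2},
    {"ldst_2day_cortisol": 0.9, "hddst_suppression": 0.7,
     "late_night_salivary_cortisol": 0.3, "adenoma_diameter_mm": -0.4},
]


def global_importance(rows: List[Dict[str, float]]) -> List[Tuple[str, float]]:
    """Rank features by mean absolute SHAP value across patients, largest first."""
    features = rows[0].keys()
    scores = [(f, mean(abs(row[f]) for row in rows)) for f in features]
    return sorted(scores, key=lambda kv: kv[1], reverse=True)


for name, score in global_importance(shap_rows):
    print(f"{name:30s} {score:.2f}")
```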
A
We'll start where the authors start, which is a summary of their findings. They say that the discrimination of Cushing disease versus ectopic Cushing's syndrome by a machine learning algorithm was successfully performed with an accuracy of more than 85%, and that their results show that in locations where BIPSS cannot be performed, machine learning algorithms developed with simple clinical tests can be used to identify the ACTH source. The authors then state that they have made their algorithm accessible through a user-friendly interface. That doesn't answer the question of whether the algorithm should be widely used, but I do think it was a nice thing the authors did. The authors go on to talk more about the limitations of IPSS. Odelia walked us through all of these already in the introduction, so we won't go through them again. But I do think it's an important piece: if you're making an argument for a new test, you have to build a good case for why it is superior to the one that's already out there. That's what the authors are doing here, and I think they point out the limitations of IPSS nicely. Next, the authors talk a little bit about machine learning: how it enables consistent analysis by processing larger datasets, how it's not constrained by what they call rigid rules, and how it's less susceptible to user influence. They also thought a strength of their approach was the SHAP analysis that Katie already mentioned, which lets us see what is driving the algorithm, so we can identify and potentially further investigate the factors behind its conclusions. The authors spent a little bit of time talking about the 2-day, 2 mg dexamethasone suppression test and how it was identified as the most important feature.
And again, they point out that this test is traditionally used, at least at their center, as a confirmatory test, and that it may be used in the differential diagnosis of ACTH-dependent Cushing's syndrome: values in patients with ectopic ACTH syndrome were more than twice as high as in those with Cushing's disease. Interestingly, they point out that with a cutoff of 7.74, they would have been able to correctly diagnose two-thirds of the patients with that test alone. Odelia has already walked us nicely through some of the concerns about relying on that test, so we won't rehash those here, but they are certainly important to keep in mind. The authors point out that several factors have to be considered in the differential diagnosis of ACTH-dependent Cushing's syndrome. Their argument is that machine learning considers all of those features together: it performs a comprehensive analysis and makes predictions, which helps offset the limitations of each of the individual tests we've already discussed. They do point out some of their limitations. They mention that they have a fairly small number of patients with ectopic Cushing's syndrome, which is certainly not surprising given that it's a fairly rare condition. They also point out that the ectopic source couldn't be determined by pathologic evaluation in all of those patients. Again, Odelia has walked us nicely through why that's a concern: it introduces some uncertainty, given the criteria that were used. Another important point the authors make is that their algorithm needs all of its input variables to work, and it includes tests that this center uses routinely but many others might not.
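The single-test rule described here can be sketched as one threshold on the 2-day, 2 mg dexamethasone suppression cortisol. Values above the cutoff point toward ectopic ACTH syndrome, since those values were higher in that group. The cutoff of 7.74 comes from the discussion; its units are not stated here. Whether the published rule uses ">" or ">=" is an assumption.

The minimal Python sketch below uses hypothetical function names and invented patient values, chosen only to show how overlap between the groups produces misclassifications in both directions. This overlap is the weakness a multivariable model is meant to offset.

```python
from typing import Iterable, Tuple

LDST_CUTOFF = 7.74  # from the discussion; units as reported in the paper (not stated here)


def classify_by_ldst(cortisol_after_2day_dex: float, cutoff: float = LDST_CUTOFF) -> str:
    """Single-test rule: cortisol above the cutoff points toward ectopic ACTH syndrome."""
    return "ectopic" if cortisol_after_2day_dex > cutoff else "cushings_disease"


def fraction_correct(cases: Iterable[Tuple[float, str]], cutoff: float = LDST_CUTOFF) -> float:
    """Share of (cortisol value, final diagnosis) pairs the single-test rule gets right."""
    cases = list(cases)
    hits = sum(classify_by_ldst(value, cutoff) == truth for value, truth in cases)
    return hits / len(cases)


# Invented patients (cortisol value, final diagnosis) illustrating overlap between groups.
demo = [
    (2.0, "cushings_disease"),
    (9.5, "cushings_disease"),  # Cushing's disease with poor suppression: misclassified
    (5.0, "cushings_disease"),
    (15.0, "ectopic"),
    (6.0, "ectopic"),           # ectopic case below the cutoff: misclassified
    (20.0, "ectopic"),
]
print(fraction_correct(demo))
```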
Finally, something we're going to come back to: the authors suggest that large-scale studies are needed to determine the robustness and reproducibility of this model. So, finally, the authors' conclusions. They state that the algorithm showed high performance in predicting the etiology of ACTH-dependent Cushing syndrome in both the training and test datasets, and that it can be used as a clinical decision support tool in centers where IPSS is not available. In a second, we're going to think about whether we should change our practice based on this information. But before we get there, I want us to spend a couple of minutes thinking about the quality of this report overall. So, Katie, let's start with you. What was your sense of the quality of this study as reported by the authors?
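The authors report performance on both training and test datasets, including the AUC of 0.85 mentioned earlier. One standard way to compute AUC is its rank (Mann-Whitney) form: the probability that a randomly chosen Cushing's disease case gets a higher score than a randomly chosen ectopic case, with ties counting as half.

The minimal Python sketch below uses hypothetical predicted probabilities, not the study's outputs. The point is that the same calculation is applied separately to the training set and to the held-out test set. A large drop from training to test would suggest overfitting, which is one reason the authors call for larger studies.

```python
from typing import Sequence


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability a positive case outranks a negative one; ties count as 0.5."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted probabilities of Cushing's disease (positives) vs. ectopic (negatives).
train_auc = auc([0.95, 0.9, 0.8, 0.7], [0.2, 0.3, 0.75])
test_auc = auc([0.9, 0.8, 0.4], [0.3, 0.5])
print(f"train AUC = {train_auc:.2f}, test AUC = {test_auc:.2f}")
```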
B
I certainly understand their goal and the value of developing an algorithm that could potentially replace IPSS, given its invasive nature and limited access in certain populations. I think we've already pointed out several potential limitations of the study, including their heavy reliance on dexamethasone suppression testing. There are also things they didn't include that Odelia already pointed out: they didn't include dexamethasone levels, and they didn't tell us what they did for patients who might have high CBG levels that could affect those results. Other limitations potentially relate to the accuracy of their IPSS, which we also mentioned before: other parameters like prolactin can now be used to help validate the IPSS result and decrease the number of false negatives. So I think those are some of the potential limitations in the data that was fed into this algorithm, and they may affect its accuracy.
A
Odelia, your thoughts on the quality of the study overall?
C
I similarly laud the authors for trying to find a more cost-effective, accessible way to distinguish Cushing disease from ectopic ACTH syndrome. I like that they did a lot of comprehensive testing on all their patients. I mean, they really walked each patient through so many different dynamic tests. Clinically, I would find it hard to get my patients to comply with all that, but I really am impressed with what they were able to accomplish. And it's important to recognize that, true, not all centers have access to IPSS or the expertise. Some papers in the literature suggest you need to do at least 10 IPSS procedures per year to be considered expert and to have an unsuccessful cannulation rate below 10%. So it can be hard to find those centers, and I do understand that we need to find alternative ways. The recent guidelines on Cushing disease do discuss combining dynamic testing and imaging to see whether that can predict Cushing disease and differentiate it from ectopic ACTH syndrome. Nevertheless, by the same token, not all centers are able to perform all these biochemical tests, as I mentioned. Another point I do want to mention is that they used CRH. We don't have that available in the US anymore, so whether these results are similarly applicable to desmopressin stimulation at our centers is unclear. The final point is the definition of the ectopic ACTH syndrome cohort, which raises some reservations about whether this is generalizable to our centers. This is an atypical cohort, and that can dampen the quality of the study.
A
Odelia, let's stay with you and have you wrap things up for us. Give us your thoughts on whether we should be changing practice. Say you happen to be at a center where IPSS is not readily available, and you have to send patients far away to get it. Maybe you have a patient who doesn't really want to do that, and you do actually have access to this tool, since the authors have made the algorithm accessible. So if you wanted to, you could use it, provided you did all the tests that need to be fed into the algorithm to make it work. Do you think we're ready to start using something like that? Or do you have enough concerns to agree with the authors' own suggestion that we need more large-scale studies, or, I might also add, studies of different populations, before we go forward with this? So what are your thoughts on that?
C
It would be great if we could just do the biochemical tests and plug the results into their calculator. I went online and tried it on some of my patients, and I realized I never do the high-dose dexamethasone suppression test, so I don't have those values. It would be tempting if I weren't able to access IPSS, but I don't think we're ready yet for prime time. I don't think we can change our practice yet. One of the issues with machine learning models is that it's very dependent on the nature and characteristics of the data and the performance of the learning algorithms that have to be suitable for the target population. Based on our discussion, there are questions about their data, the population they used, and what they fed into their algorithm. Just to summarize a couple of the key points, we talked about our concerns with the dynamic testing they used. Those include the reproducibility of the tests and their reliance on cortisol levels, which are quite variable, as the key inputs to their calculator. I think that can make it difficult to diagnose patients accurately. We talked about the lack of availability of CRH, which they relied on for their testing, and about the cutoffs. We also talked about variables like age, sex, and degree of hypercortisolism that were taken out of their algorithm. Again, a recent study showed that those are actually important parameters to build into a machine learning model, so I think taking them out is a limitation. Another thing is the data cohort, the dataset you feed into a machine learning algorithm. It has to be diverse and adequately sized, because otherwise these models are subject to uncontrollable bias and don't account for other factors unless they're specifically programmed to. This was a pretty small sample size: only 26 ectopic cases went into this algorithm.
Then it was a limited geographic location, and there's limited diversity in the dataset. And as we mentioned, the cohort of ectopic cases didn't seem quite typical of our experience. They also don't have validation of tumors in the ectopic cases, or pan-imaging to really rule a lesion in or out. So again, the quality of the dataset being fed into this algorithm gives us some reservations. Overall, I think we need prospective validation and a broader dataset, larger and more diverse, before we can really take this to prime time and say that this machine learning model is non-inferior to IPSS. I think that's really what we would want, to give us that confidence before we send a patient to surgery.
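Odelia's concern about only 26 ectopic cases can be quantified with a confidence interval on any proportion estimated from that group. The minimal Python sketch below uses the Wilson score interval, one standard choice for binomial proportions. It compares a hypothetical 19 of 26 ectopic cases classified correctly with the same observed rate in a tenfold larger group. The counts are invented for illustration. The interval for 26 cases is roughly 0.54 to 0.86, about half again as wide as the one for 260 cases.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical: the same ~73% observed rate with 26 vs. 260 ectopic patients.
for correct, total in [(19, 26), (190, 260)]:
    lo, hi = wilson_interval(correct, total)
    print(f"{correct}/{total}: 95% CI {lo:.2f} to {hi:.2f}")
```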
A
And with that, I would like to thank Katie Gutenberg and Odelia Cooper for joining me for this month's edition of Endocrine Feedback Loop. I hope that you all learned as much as I did. Please join us in person next month as we record at Endo 2024 in Boston on June 1st at 10 a.m. local time in the Endo Expo Theater. And now you're in the loop. This has been Endocrine Feedback Loop. Endocrine Feedback Loop is brought to you by the Endocrine Society with production oversight by Brandy Brown and Andrew Harmon. If you want to like and subscribe, you can find us on Apple, Spotify, or wherever you get your podcasts. We'd love to hear your feedback on this episode or the podcast itself. Please email us@podcastren.org. Endocrine Feedback Loop is a free service of the Endocrine Society. To learn more or to become a member, visit the society's website at www.endocrine.org.
Artificial Intelligence in the Diagnosis of Cushing Syndrome
Date: May 16, 2024
Host: Dr. Chase Hendrickson (Vanderbilt University Medical Center)
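Guests: Dr. Odelia Cooper (Cedars-Sinai, Los Angeles), Dr. Katie Gutenberg (University of Texas at Houston)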
This episode examines a forthcoming article in the Journal of Clinical Endocrinology and Metabolism exploring whether a machine learning (ML) algorithm can aid or even replace the gold standard, bilateral inferior petrosal sinus sampling (BIPSS), in differentiating subtypes of ACTH-dependent Cushing syndrome. With BIPSS being invasive, costly, and not widely available, the possibility of an AI-driven, less invasive diagnostic alternative sparks significant interest but also skepticism about accuracy, generalizability, and clinical applicability.
"IPSS is an invasive test... Pituitary MRIs in Cushing’s disease are often negative... Therefore, we have to rely on invasive approaches."
— Dr. Odelia Cooper (03:52)
"These authors were trying to understand if a machine learning algorithm would perform reasonably well in the differential diagnosis of Cushing's syndrome."
— Dr. Chase Hendrickson (08:07)
"I do question really why the authors did proceed with this two day test if they already had two out of three tests that were positive... The two day low dose dex suppression test doesn't add any additional information."
— Dr. Odelia Cooper (13:18)
"I was also surprised that some of these key parameters, age and sex and cortisol, were removed..."
— Dr. Odelia Cooper (26:30)
"I was a little surprised at their accuracy of being only 85%. That’s fairly low."
— Dr. Odelia Cooper (30:00)
Peak ACTH > 215: sensitivity 88%, specificity 90% for Cushing's disease (absolute ACTH levels also help judge whether cannulation was successful)
"Both the 2-day, 2mg dexamethasone test and high-dose dex test were the top important features... but there are just a number of limitations to relying on this test."
— Dr. Odelia Cooper (33:47)
"One of the issues with machine learning models is that it's very dependent on the nature and characteristics of the data..."
— Dr. Odelia Cooper (42:28)
On the challenge of AI in diagnostic medicine:
"We all have questions about how artificial intelligence will affect the care we provide... we will need to be particularly critical about how AI might interface with a diagnostic approach that has been developed over many decades."
— Dr. Chase Hendrickson (00:56)
On BIPSS limitations:
"[False negatives could result if] patients are not actively hypercortisolemic at the time of IPSS, such as in cyclical Cushing..."
— Dr. Odelia Cooper (05:50)
On generalizability:
"One of the issues with machine learning models is that it's very dependent on the nature and characteristics of the data and the performance of the learning algorithms that have to be suitable for the target population."
— Dr. Odelia Cooper (42:28)
On current clinical utility:
"I went online and tried to do it on some of my patients and realized, I never do the high dose dex [suppression] test... I don't think we're ready yet for prime time."
— Dr. Odelia Cooper (42:28)