Objective
To evaluate the performance of an internally developed and previously validated artificial intelligence (AI) algorithm for magnetic resonance (MR)–derived total kidney volume (TKV) in autosomal dominant polycystic kidney disease (ADPKD) when implemented in clinical practice.
Patients and Methods
The study included adult patients with ADPKD seen by a nephrologist at our institution between November 2019 and January 2021 who underwent an MR imaging examination as part of standard clinical care. Thirty-three nephrologists ordered MR imaging, requesting AI-based TKV calculation for 170 cases in 161 unique patients. We tracked implementation and performance of the algorithm over 1 year. A radiologist and a radiology technologist reviewed all cases (N=170) for quality and accuracy. Manual editing of algorithm output occurred at the radiologist's or radiology technologist's discretion. Performance was assessed by comparing AI-based and manually edited segmentations via measures of similarity and dissimilarity to ensure expected performance. We analyzed ADPKD severity class assignment of algorithm-derived vs manually edited TKV to assess impact.
Results
Clinical implementation was successful. Artificial intelligence algorithm–based segmentation showed high levels of agreement and was noninferior to interobserver variability and other methods for determining TKV. Among manually edited cases (n=84), the AI-algorithm TKV output showed a small mean percent volume difference of –3.3%. Agreement in disease class between AI-based and manually edited segmentation was high (only five cases differed).
Conclusion
Performance of an AI algorithm in real-life clinical practice can be preserved if there is careful development and validation and if the implementation environment closely matches the development conditions.
With the rapid advancement and increasing availability of artificial intelligence (AI) algorithms in medicine and in radiology specifically, there has been growing interest and investigation into their potential clinical implementation. Much of the literature to date is focused on pre-implementation topics, including algorithm development and validation, usually in a controlled setting far removed from the clinical workflow. Full clinical implementation has not yet been widely achieved among radiology practices as it requires not only algorithm development and validation, but also integration into an already complex clinical imaging environment. Process evaluations regarding translating AI innovations from discovery and validation to an integrated component of the clinical workflow are currently lacking. This process involves new challenges, including how the algorithm is ordered, how it is triggered, how it is routed, how it is monitored, and how to educate all those who will be involved at various stages of the workflow. It is important that real-life performance, which exposes the process to a myriad of unpredictable variables, matches that of a more controlled pre-implementation environment.
At our institution we have investigated a previously validated AI algorithm for magnetic resonance (MR)–derived measurement of total kidney volume (TKV) in autosomal dominant polycystic kidney disease (ADPKD) in clinical practice. Autosomal dominant polycystic kidney disease is the most common genetic cause of chronic kidney disease and TKV is an important prognostic biomarker.
Along with age, TKV reliably predicts estimated glomerular filtration rate (eGFR) decline and is used to identify patients who would benefit from specific novel therapies.
The process of clinical implementation of an AI algorithm, such as MR-derived measurement of TKV in ADPKD, involves multiple intersecting systems and people, including but not limited to patients, imaging equipment, technologists, digital data, radiologists, and referring clinicians. Successful exam ordering, image acquisition, algorithm processing, output reporting, and continuous quality assurance are all necessary for successful execution of the AI-assisted workflow.
The potential for clinical implementation of AI algorithms is what drives scientific inquiry in this field but remains an understudied step. The purpose of this study is to evaluate the performance of an internally developed and previously validated AI algorithm for TKV in ADPKD when implemented in clinical practice.
Patients and Methods
The study was performed with institutional review board approval. Details regarding this AI algorithm have been published previously.
Referring providers had the option to order AI-based TKV measurements when placing an abdominal imaging exam order (Table 1). One sequence in the exam, a routine clinical coronal single-shot fast spin echo sequence with fat saturation (Siemens HASTE or GE SSFSE), was used by the AI algorithm for TKV calculation. AI-based segmented images were first reviewed by a medical image analyst (a certified computed tomography or MR technologist with additional training and expertise in three-dimensional image analysis and anatomic segmentation), who visually compared the output segmentation overlay to the organ borders slice by slice. Output was either accepted without manual editing if the AI segmentation was deemed optimal ("pass") or manually edited ("rework"). This step was performed despite prior algorithm validation because of our commitment to extract and evaluate real-life performance metrics. A second quality control (QC) check of the output was performed by the reading radiologist, who could trigger the manual rework pathway if the medical image analyst had not, or could accept the algorithm output or the analyst-triggered manually edited segmentation if it had already been reworked. Segmentations were then approved and used to provide a report of right, left, and total kidney volumes.
Table 1. Scanner, Location, and Demographic Information
For inclusion, patients were required to be older than 18 years of age, have a previous diagnosis of ADPKD, and have an MR imaging examination ordered as part of standard clinical care. International Classification of Diseases, Ninth and Tenth Revision (ICD-9 and ICD-10) diagnosis codes were extracted from a Mayo Clinic internal database to confirm the ADPKD diagnosis. A small subset of patients in whom an ADPKD diagnosis could not be confirmed was grouped as "other"; this group included other cystic and noncystic kidney diseases, patients without polycystic kidney disease (PKD), kidney transplant recipients, and patients with autosomal recessive PKD. There are two main subclassifications of ADPKD based on presentation: typical and atypical.
For typical disease, the five-group classification scale ranges from least severe (class 1A) to most severe (class 1E). Subtype and classification of ADPKD were assigned by a trained observer according to previously published criteria.
Demographic information, including age, sex, race, and ethnicity, was collected from Digital Imaging and Communications in Medicine (DICOM) metadata and/or an internal patient database. Patient kidney function data, including eGFR, serum creatinine, blood urea nitrogen (BUN), and albumin/creatinine ratio, were also extracted. All patient research authorizations were confirmed before inclusion in the study.
Statistical Analysis
Statistical analyses were performed to determine both the performance of the AI-based segmentation tool compared with manually edited AI segmentation and any variables potentially associated with a manually edited segmentation. The Shapiro-Wilk test (SciPy v1.5.4) was used to determine whether data were normally distributed. All statistical analyses were performed using Python (v3.8.3) and the following modules: SciPy (v1.5.4), statsmodels (v0.12.2), pydicom (v2.1.1), SimpleITK (v2.0.2), seaborn (v0.11.0), and matplotlib (v3.2.2).
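As a minimal illustration of this normality check, the sketch below applies the Shapiro-Wilk test to hypothetical per-case percent TKV differences; the variable names and values are ours, not the study's.

```python
# Minimal sketch of the normality check described above, assuming per-case
# metrics (eg, percent TKV differences) are held in a NumPy array.
# Values and variable names are hypothetical, for illustration only.
import numpy as np
from scipy import stats  # SciPy v1.5.4 in the study

pct_tkv_diff = np.array([-3.1, 0.4, -5.2, 1.8, -2.7])  # hypothetical data

stat, p = stats.shapiro(pct_tkv_diff)
if p < 0.05:
    print(f"W={stat:.3f}, p={p:.3f}: reject normality; prefer nonparametric tests")
else:
    print(f"W={stat:.3f}, p={p:.3f}: no evidence against normality")
```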
Algorithm Performance
Algorithm performance was determined by comparing AI-based and manually edited AI segmentations for the manually edited data only. Common image metrics of similarity (Dice coefficient [two times the area of overlap divided by the total number of pixels in both segmentations; minimum 0, maximum 1] and Jaccard index [size of the intersection divided by size of the union; minimum 0, maximum 1]) and dissimilarity (volume difference, percent volume difference, mean surface distance [mean of all distances between surface voxels of the two segmentations; values close to zero represent near-perfect overlap], and Hausdorff distance [the greatest of the distances from any surface point of one segmentation to the closest point of the other; values close to zero represent near-perfect overlap]) were computed (SimpleITK v2.0.2). Bland-Altman plots (pingouin v0.4.12) were constructed to examine agreement, fixed bias, and outliers, whereas linear regression (SciPy v1.5.4) assessed correlation between AI and manually edited AI TKV measurements. A scatter plot of the Dice coefficient vs corrected AI TKV was constructed to determine whether kidney volume was related to AI-based segmentation performance. Finally, a one-sided Welch's t-test was computed to determine whether the AI-based segmentation was noninferior to manually edited AI segmentation (SciPy v1.5.4). Power calculations were performed to determine the sample size needed to observe a given delta value for the noninferiority test. Tests were run across a range of clinically relevant delta values to arrive at a minimum significant delta of noninferiority.
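A hedged sketch of how these overlap metrics can be computed with SimpleITK's built-in filters is shown below; the file paths and helper name are hypothetical, and the average Hausdorff distance is used as a close stand-in for the mean surface distance.

```python
# Sketch of the similarity/dissimilarity metrics named above, using
# SimpleITK (v2.0.2 in the study). Paths and the helper name are
# hypothetical; both segmentations are assumed to be binary masks on
# the same image grid.
import SimpleITK as sitk

def segmentation_metrics(ai_mask_path: str, edited_mask_path: str) -> dict:
    ai = sitk.ReadImage(ai_mask_path, sitk.sitkUInt8)
    edited = sitk.ReadImage(edited_mask_path, sitk.sitkUInt8)

    overlap = sitk.LabelOverlapMeasuresImageFilter()
    overlap.Execute(ai, edited)

    hausdorff = sitk.HausdorffDistanceImageFilter()
    hausdorff.Execute(ai, edited)

    return {
        "dice": overlap.GetDiceCoefficient(),        # 2|A∩B| / (|A|+|B|)
        "jaccard": overlap.GetJaccardCoefficient(),  # |A∩B| / |A∪B|
        "hausdorff_mm": hausdorff.GetHausdorffDistance(),
        # Average Hausdorff distance: a close analogue of the mean
        # surface distance described in the text
        "mean_surface_mm": hausdorff.GetAverageHausdorffDistance(),
    }
```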
Scanner characteristics, patient demographics, and disease severity markers were investigated for association with the AI-based segmentation accept (pass) pathway vs the manually edited rework pathway. For discrete variables, a χ2 test of independence compared distributions across the pass and rework workflows (SciPy v1.5.4). For continuous variables, a two-sided Kolmogorov-Smirnov test assessed distributional differences between accepted and reworked images (SciPy v1.5.4). No adjustment for multiple comparisons was performed.
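The sketch below illustrates one way to run these association tests with SciPy, assuming a pandas DataFrame with one row per case; the column names are illustrative assumptions, not the study's schema.

```python
# Illustrative sketch of the pass-vs-rework association tests. Assumes a
# DataFrame `df` with a 'pathway' column ('pass'/'rework'), a discrete
# column such as 'manufacturer', and a continuous column such as 'age'.
import pandas as pd
from scipy import stats

def discrete_association(df: pd.DataFrame, col: str) -> float:
    # Chi-squared test of independence on the pathway-by-category table
    table = pd.crosstab(df["pathway"], df[col])
    chi2, p, dof, expected = stats.chi2_contingency(table)
    return p

def continuous_association(df: pd.DataFrame, col: str) -> float:
    # Two-sided Kolmogorov-Smirnov test between pathway distributions
    passed = df.loc[df["pathway"] == "pass", col]
    rework = df.loc[df["pathway"] == "rework", col]
    stat, p = stats.ks_2samp(passed, rework)
    return p
```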
Results
Participants and Imaging
From November 2019 to January 2021, a total of 33 nephrologists across three sites within our institution ordered MR imaging, requesting AI-based TKV calculation for 170 cases in 161 unique patients. Seven patients were imaged more than once during the study: two patients were imaged three times, and the remaining five were imaged twice. For these cases, the time span between exams was 184±82 days (minimum, 105 days). Of the total 170 cases, the output of AI-based segmentation was accepted without manual editing (pass) in 86 cases, whereas 84 cases were manually edited (rework). The workflow diagram is shown in Figure 1. In total, 12 medical image analysts and 49 radiologists were involved in this study. The mean patient age was 45.2±14.5 years, and 65.3% (N=105) were female. Nephrologist-confirmed ADPKD subtype was typical in 88.2% (N=142) of patients and atypical in 4.7% (N=8). The remaining patients (7.1%, N=11) were excluded from classification because of non-PKD diagnoses, kidney transplant, or autosomal recessive PKD. Images were acquired across two scanner manufacturers, GE Medical (61%, N=104) and Siemens (39%, N=66), and nine different models in total. Coronal Half-Fourier Acquisition Single-shot Turbo spin Echo (HASTE, Siemens) or Single-Shot Fast Spin Echo (SSFSE, GE) scan protocols were used. Images were collected across two field strengths, 1.5 T (57.1%, N=97) and 3 T (42.9%, N=73), and two different slice thicknesses, 4 mm (84.1%, N=143) and 5 mm (15.9%, N=27). The breakdown of scanner, site location, and demographics across pathways is shown in Table 1.
Figure 1. Workflow diagram showing the roles the clinicians (green), radiologists (blue), magnetic resonance (MR) technologists (brown), and image analysts (yellow) played in the study. Arrows indicate the sequence of steps and the direction of the workflow. The clinician is positioned at the start and end of the workflow. PACS, picture archiving and communications system; QC, quality control; TKV, total kidney volume.
To determine how well the AI algorithm for TKV performed, AI-based and manually edited AI segmentations were compared. Most commonly, the corrections were minor segmentation alterations. Figure 2 presents exemplar MR images with TKV segmentation overlays (AI-based or manually edited) for the cases with maximum (Dice = 0.99; Figure 2A), minimum (Dice = 0.77; Figure 2B), and median (Dice = 0.98; Figure 2C) agreement. Dice coefficients are shown in Table 2. The mean TKV difference was –34.0 mL (range, –413.8 to 415.4 mL), and the mean percent difference was –3.3% (range, –41.0% to 22.2%) (Table 2; Figures 3A and 3B). AI-based and manually edited TKVs (mL) were highly correlated with a small volumetric offset, suggesting that most rework cases involved very minor corrections (slope = 1.0; intercept = –41.08; r2 = 0.99; P<.0001) (Figure 3C). Furthermore, the intraclass correlation coefficient (ICC) between AI-based and manually edited TKV (mL) indicated excellent agreement (inter-rater ICC = 0.997). Dice scores were more variable with smaller corrected AI TKVs (Figure 3D). The mean Jaccard index was 0.926 (range, 0.63-0.99) (Table 2), the mean Hausdorff distance was 30.51 mm (range, 5.27-174.29 mm), and the mean surface difference was 1.68 mm (range, 0.06-18.43 mm) (Figure 3E).
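For reference, agreement statistics of this kind can be produced with pingouin (v0.4.12, the version listed in the Methods); the data layout, column names, and values below are hypothetical, a sketch rather than the study's code.

```python
# Hedged sketch of the ICC and Bland-Altman agreement analyses, using
# pingouin. All data shown are hypothetical.
import pandas as pd
import pingouin as pg

# Long-format table: one row per (case, method) measurement
df = pd.DataFrame({
    "case":   [1, 1, 2, 2, 3, 3],
    "method": ["ai", "edited"] * 3,
    "tkv_ml": [1510.0, 1542.0, 980.0, 1001.0, 2210.0, 2195.0],
})

# Inter-rater ICC between AI-based and manually edited TKV
icc = pg.intraclass_corr(data=df, targets="case", raters="method",
                         ratings="tkv_ml")
print(icc[["Type", "ICC"]])

# Bland-Altman plot comparing the two measurement methods
wide = df.pivot(index="case", columns="method", values="tkv_ml")
ax = pg.plot_blandaltman(wide["ai"], wide["edited"])
```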
Finally, to confirm that the AI approach was noninferior to previous non-AI–assisted segmentation approaches, a noninferiority test was conducted. The test was powered to a percent delta of 4.97% (2.5% one-sided type I error; 80% power). The percent TKV difference between AI-based and manually edited TKV was noninferior at a minimum percent delta of 4.80% and was noninferior to previously determined inter-rater (6.21%), stereology (9.12%), and ellipsoid (22.27%) percent delta values (Figure 3F; inter-rater P<.001, stereology P<.0001, ellipsoid P<.0001).
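The article reports a one-sided Welch's t test but does not spell out its exact inputs, so the sketch below uses a standard one-sample margin formulation of a noninferiority test instead; treat it as an assumption-laden illustration, not the authors' method.

```python
# Hedged reconstruction of a noninferiority comparison of per-case percent
# TKV differences against a margin delta. One standard formulation only;
# the study's exact test construction is not given in the text.
import numpy as np
from scipy import stats

def noninferiority_p(pct_diff: np.ndarray, delta: float) -> float:
    """H0: mean |percent difference| >= delta (inferior);
    H1: mean |percent difference| < delta (noninferior)."""
    t, p_two = stats.ttest_1samp(np.abs(pct_diff) - delta, 0.0)
    # Convert the two-sided p to one-sided in the hypothesized direction
    return p_two / 2.0 if t < 0 else 1.0 - p_two / 2.0

# For example, against the previously reported inter-rater delta:
# p = noninferiority_p(pct_tkv_diff, delta=6.21)
```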
Only 7.05% (12 of 170) of the total cases recorded differences outside the inter-rater delta (6.21%) range, yielding an approximation of the performance of the algorithm without a rework pathway. No increase in the number of rework cases was seen over time (Figure 4A). In addition, only a small number of cases (N=7) changed image class pre-/post-rework (Figure 4B). These results indicate that our algorithm performs well and is noninferior to manual medical image analyst–corrected segmentations at an experimentally derived and clinically relevant delta value.
Figure 2. Example images with artificial intelligence (AI)–generated and medical image analyst–corrected segmentations. A, The original magnetic resonance image (left), the original image with AI-generated total kidney volume segmentation overlay (middle), and the original image with the medical image analyst–corrected AI overlay (right) from the case with the maximum Dice score (0.99). Left kidney segmentation is shown in yellow and right kidney segmentation in green. B, The minimum Dice score (0.77) example. C, The median Dice score (0.98) example.
Figure 3. Overall performance of artificial intelligence (AI)–generated total kidney volume (TKV) segmentation compared with medical image analyst–corrected AI-generated TKV segmentation. A, Bland-Altman plot evaluating absolute agreement between AI-generated segmentation and medical image analyst–corrected AI-generated segmentation. Mean difference between measures (blue dashed line); 95% CI for the mean difference (shaded blue band); 95% limits of agreement (green dashed lines; mean ± 1.96 standard deviations of the difference); 95% CI for the limits of agreement (shaded green bands). B, The same plot as A, but for percent difference between AI-generated TKV and medical image analyst–corrected AI-generated TKV. C, Linear regression of AI-generated TKV against medical image analyst–corrected AI-generated TKV (slope = 1.00; intercept = –41.08; r2 = 0.99; P<.0001). D, Scatter plot of medical image analyst–corrected AI-generated TKV (mL) by Dice score. E, Box plots with individual case scatter of similarity and dissimilarity metrics, including Dice, Jaccard, Hausdorff distance (mm), mean surface distance (mm), and surface distance standard deviation. F, Noninferiority plot of the mean percent difference (±95% CI) between AI TKV and corrected AI TKV (gray dashed line, zero difference between methods; dark blue dashed line, delta from a prior inter-rater agreement study; teal dashed line, delta from stereology measurements; pink dashed line, delta from ellipsoid measurements). The mean difference between AI TKV and corrected AI TKV is noninferior to the inter-rater, stereology, and ellipsoid deltas (one-sided t test; inter-rater P<.0001, stereology P<.0001, ellipsoid P<.0001).
Figure 4. Comparison of study date distributions between the pass and rework pathways, and classification of typical autosomal dominant polycystic kidney disease (ADPKD) pre- vs post-rework. A, Kernel density–estimated distributions of study dates for the pass (light blue) and rework (light green) pathways, which are not significantly different (two-sample Kolmogorov-Smirnov test, P=.08). B, Agreement heatmap between artificial intelligence (AI) (pre-rework) and corrected AI (post-rework) typical ADPKD classification for all patients (weighted Cohen's kappa = 0.86). The diagonal represents perfect agreement; darker shades of blue represent greater counts. QC, quality control.
Determining Factors Associated With Rework Pathway
To identify factors associated with a case being sent for rework, we compared scanner information across the pass and rework pathways. No significant differences in scanner manufacturer (P=.20), manufacturer model (P=.43), field strength (P=.86), slice thickness (P>.99), or pixel spacing (P=.25) were observed (Supplemental Table 1, found online at http://www.mayoclinicproceedings.org). Additional imaging parameters, including repetition time, echo time/train length, flip angle, percent sampling, image size, number of images in the acquisition, field of view, and patient position, also did not differ significantly (Supplemental Table 2, found online at http://www.mayoclinicproceedings.org).
Furthermore, patient demographic factors were compared across the pass and rework pathways. Age (P=.26), body mass index (BMI) (P=.06), race/ethnicity (P=.64), and study imaging date (P=.08) (Supplemental Figure A, found online at http://www.mayoclinicproceedings.org) did not differ significantly between the AI (pass) and corrected AI (rework) pathways (Supplemental Table 1). Sex was the only measure that differed significantly across pathways (P=.03) (Supplemental Table 1). Females were overrepresented in the corrected AI pathway (73.8% vs 57.0% in the AI pathway) and had significantly lower BMI (female mean ± SD, 22.42 ± 1.68; male mean ± SD, 25.11 ± 1.42; Kolmogorov-Smirnov test, statistic = 0.779, P<.0001) (Supplemental Figure A) and smaller total kidney volumes (female mean ± SD, 1299.69 ± 1072.60 mL; male mean ± SD, 2153.25 ± 1835.19 mL; Kolmogorov-Smirnov test, statistic = 0.26, P=.008) (Supplemental Figure B) than males.
Typical ADPKD is classified at Mayo Clinic using height-adjusted TKV and age to identify patients with the highest risk of disease progression.
The five-group classification scale ranges from least severe (class 1A) to most severe (class 1E). Pre-rework and post-rework classifications were compared to determine changes in classification and the degree of change. Only a small percentage of rework cases (10.4%) changed classification assignment after rework. Agreement between pre-rework and post-rework classification across all cases was high (weighted Cohen's kappa = 0.86) (Supplemental Figure C). Pre-rework and post-rework classification agreement was higher in females (weighted Cohen's kappa = 0.90) (Supplemental Figure C) than in males (weighted Cohen's kappa = 0.74) (Supplemental Figure D). Overall, no reclassification changes of greater than one class were observed (Supplemental Figure B).
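As an illustration of how such agreement can be quantified, the sketch below computes a weighted Cohen's kappa with scikit-learn; note that scikit-learn is not among the modules listed in the Methods, and the class labels and weighting scheme shown are assumptions.

```python
# Sketch of the pre- vs post-rework classification agreement check.
# scikit-learn provides a weighted Cohen's kappa; labels and weighting
# here are illustrative assumptions, not the study's configuration.
from sklearn.metrics import cohen_kappa_score

pre_rework  = ["1B", "1C", "1C", "1D", "1B"]   # hypothetical AI-based classes
post_rework = ["1B", "1C", "1D", "1D", "1B"]   # hypothetical edited classes

# Linear weights penalize one-class disagreements less than larger jumps
kappa = cohen_kappa_score(pre_rework, post_rework, weights="linear")
print(f"weighted kappa = {kappa:.2f}")
```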
Kidney function was assessed by eGFR, serum creatinine, BUN, and albumin/creatinine ratio.
We evaluated whether kidney disease severity was associated with images routing to the rework pathway (Supplemental Table 3, found online at http://www.mayoclinicproceedings.org). Total kidney volume distributions were not significantly different between pass and rework groups (P=.23). Measurements of eGFR (P=.87), creatinine (P=.56) (Supplemental Table 3), BUN (P=.81), and albumin/creatinine ratio (P=.45) were not significantly different between pass and rework pathways.
Discussion
Advances in AI in medicine remain weighted toward algorithm development and validation with large-scale clinical implementation still unrealized. Barriers to broad clinical adoption of AI algorithms include poor understanding of the steps involved in their implementation within a practice and a lack of data on their real-world performance. Coordinated interdisciplinary efforts to integrate algorithms into clinical workflows are necessary to drive the work of AI scientists to their full potential and to use algorithms for their intended purpose.
We have shown the potential for successful clinical implementation of an AI algorithm in a complex radiology practice, which required coordination of technical deployment, education of interdisciplinary stakeholders, extraction of real-life performance metrics, and analysis of impact on the intended clinical question. Our internally developed algorithm for MR-derived measurement of TKV in ADPKD was effectively integrated and performed as expected in the real-life clinical setting, proving to be noninferior to non-AI–assisted segmentation. In addition, manual TKV processing without the AI tool takes 60 to 90 minutes; with the algorithm, final metrics were obtained in only a few minutes, even in cases that required editing.
Technical Deployment
Technical deployment of the algorithm into the clinical workflow relied on an integrated information technology team that could set up image filtering and routing rules based on specific inclusion criteria. In this study, routing rules were set up based on the MR series description, so that only a single series was sent for AI processing. Images moved downstream through our institutional orchestration engine and eventually to the medical image analysts for review before the output was routed to the radiologist and the picture archiving and communication system.
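A minimal sketch of this kind of series-description filter, using pydicom (listed in the Methods), might look as follows; the matching strings and function name are hypothetical, and the actual routing was handled by the institutional orchestration engine.

```python
# Illustrative series-description routing check with pydicom (v2.1.1 in
# the study). Matching strings and function name are hypothetical.
import pydicom

ELIGIBLE_DESCRIPTIONS = ("COR SSFSE", "COR HASTE")  # assumed examples

def route_for_ai(dicom_path: str) -> bool:
    # Read header only; pixel data is not needed for routing decisions
    ds = pydicom.dcmread(dicom_path, stop_before_pixels=True)
    description = getattr(ds, "SeriesDescription", "").upper()
    return any(key in description for key in ELIGIBLE_DESCRIPTIONS)
```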
Education of Stakeholders
Communication and education for those involved in the AI algorithm clinical implementation are critical to success, both before any change and throughout implementation. For our algorithm, those primarily involved in the clinical workflow are the MR image–ordering clinician (nephrologist), the radiologist protocolling and interpreting the exam (including report of algorithm output), the MR technologist acquiring the images, and the medical image analysts responsible for review and possible segmentation editing.
Educational materials were developed for each role. Learning modules were available electronically and included both text and graphic presentation of the background, rationale, and steps involved for algorithm implementation. Leaders from each stakeholder group (physicians and technologists) were identified to disseminate the information and act as resources for questions. For example, the radiologist proponent sent informational emails with links to modules, presented information at divisional meetings (including history of pre-implementation algorithm validation), communicated with residents and fellows, and fielded inquiries from radiologists and trainees in real time as cases arose in the clinical practice. Throughout the educational efforts, two messages were critical to adoption of this initiative: an emphasis on real-world patient benefits of this algorithm’s implementation; and a reassurance that it would not be onerous for the radiologist, despite the inherent discomfort that accompanies workflow change.
Performance Metrics
Our extraction of real-life performance metrics relied on review of each AI-based segmentation by a medical image analyst and a radiologist. Although approximately half of the cases (84 of 170; 49.4%) during the study period were manually edited, the mean percent volume difference was just –3.3%, indicating that the corrections were minor and that the image analysts had a very low threshold for editing. The roughly 50% of cases that were not reworked were therefore accepted at a very high standard. The percent TKV difference between AI-based segmentation and manually edited segmentation was noninferior to the previously determined inter-rater difference and to other clinically accepted methods for determining TKV (eg, stereology-based and ellipsoid-based measurements).
Bias Analysis
We investigated the rework cases in which the class changed pre-/post-rework to determine whether an underlying characteristic led to the change. The variables investigated included manufacturer, scanner model, field strength, location, sex, age, race, height, weight, continuous BMI, discrete BMI interpretation, algorithm TKV value (mL), eGFR (mL/min per body surface area), creatinine (mg/dL), BUN (mg/dL), and presence of polycystic liver disease (PLD). Because rework shifted the class in only 7 of 67 reworked cases, this investigation was unlikely to yield firm conclusions. Histograms were generated for all variables. For the continuous variables, the values observed in rework cases with a class change tended to be distributed throughout the range, without obvious clustering in a given region.
We investigated the influence of PLD in more detail. Of the seven cases that switched image class, four had PLD (two with severe PLD) and three did not. Notably, PLD prevalence in patients with PKD is approximately 70%. We believe severe PLD can cause segmentation difficulty, for example, in correctly assigning cysts adjacent to both the right kidney and the liver.
Impact on Intended Clinical Question
Another critical step in assessing the success of algorithm implementation is analysis of its impact on the intended clinical question. Total kidney volume as an imaging biomarker in ADPKD is a major variable for assignment of a disease severity class, a reliable and widely used predictor of future eGFR decline, and an important determinant of eligibility for certain therapies. In our study, agreement in disease class assignment between AI-based and manually edited segmentation was high (only five cases were assigned a different class). In the few cases of reclassification after manual editing, no changes of greater than one class occurred. Given that the AI-based segmentations were shown to be noninferior to inter-rater difference and other methods of TKV calculation, we would expect a similar rate of reclassification if those methods were similarly investigated.
Next Steps
Whereas AI algorithm discovery, development, and initial validation can occur in isolation of a practice’s clinical workflow and real-time patient care, the application of these algorithms for true clinical impact cannot. Future work will include implementation of a workflow where the radiologist first reviews the cases and then triggers a pass or rework pathway, as well as the incorporation of additional analytics (eg, liver segmentation for total liver volume assessment).
Conclusion
Performance of an AI algorithm in a large radiology clinical practice can be preserved if careful attention is paid to validation of the algorithm during development and if the implementation environment closely matches the development conditions.
Potential Competing Interests
Drs Harris and Torres have received research support from Otsuka. The remaining authors report no potential competing interests.
Acknowledgments
The authors thank Lucy Bahn, PhD, for her assistance in the preparation of this manuscript.
Grant Support: This project and publication were supported in part by funding from the Department of Radiology’s Framework for AI Software Technology (FAST) at Mayo Clinic. Research reported in this publication was supported by the National Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health under Award Numbers K01DK110136, and R03DK125632. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.