
Machine Learning Techniques Differentiate Alcohol-Associated Hepatitis From Acute Cholangitis in Patients With Systemic Inflammation and Elevated Liver Enzymes

      Abstract

      Objective

      To develop machine learning algorithms (MLAs) that can differentiate patients with acute cholangitis (AC) and alcohol-associated hepatitis (AH) using simple laboratory variables.

      Methods

A study was conducted of 459 adult patients admitted to Mayo Clinic, Rochester, with AH (n=265) or AC (n=194) from January 1, 2010, to December 31, 2019. Ten laboratory variables (white blood cell count, hemoglobin, mean corpuscular volume, platelet count, aspartate aminotransferase, alanine aminotransferase, alkaline phosphatase, total bilirubin, direct bilirubin, albumin) were collected as input variables. Eight supervised MLAs (decision tree, naive Bayes, logistic regression, k-nearest neighbor, support vector machine, artificial neural networks, random forest, gradient boosting) were trained and tested for classification of AC vs AH. External validation was performed with patients with AC (n=213) and AH (n=92) from the MIMIC-III database. A feature selection strategy was used to choose the best 5-variable combination. A total of 143 physicians completed an online quiz to distinguish AC from AH using the same 10 laboratory variables alone.

      Results

The MLAs demonstrated excellent performance, with accuracy up to 0.932 and area under the curve (AUC) up to 0.986. In external validation, the MLAs showed comparable accuracy up to 0.909 and AUC up to 0.970. Feature selection based on information-theoretic measures was effective, and the best 5-variable subset produced high performance, with an AUC up to 0.994. Physicians performed less well, with a mean accuracy of 0.790.

      Conclusion

      Using a few routine laboratory variables, MLAs can differentiate patients with AC and AH and may serve valuable adjunctive roles in cases of diagnostic uncertainty.

      Abbreviations and Acronyms:

AASLD (American Association for the Study of Liver Diseases), AC (acute cholangitis), AH (alcohol-associated hepatitis), Alb (albumin), ALT (alanine aminotransferase), AP (alkaline phosphatase), AST (aspartate aminotransferase), AUC (area under the curve), CBC (complete blood count), Db (direct bilirubin), DT (decision tree), ERCP (endoscopic retrograde cholangiopancreatography), Hgb (hemoglobin), HSIC (Hilbert-Schmidt independence criterion), J-S (Jensen-Shannon), kNN (k-nearest neighbor), LR (logistic regression), MCV (mean corpuscular volume), MIMIC-III (Medical Information Mart for Intensive Care III, v1.4), MLA (machine learning algorithm), NB (naive Bayes), Plt (platelet count), PPV (positive predictive value), SIRS (systemic inflammatory response syndrome), SVM (support vector machine), Tb (total bilirubin), WBC (white blood cell count)
Acute cholangitis (AC) is a life-threatening infection caused by obstruction of the biliary system, most commonly due to gallstones.[1] It is classically taught that Charcot triad (right upper quadrant pain, fever, jaundice)[2] and Reynolds pentad (hypotension and altered mental status in addition to Charcot triad)[3] are pathognomonic signs of AC. According to the Tokyo Guidelines, a definitive diagnosis of AC requires the presence of systemic inflammation (temperature >38 °C; elevated white blood cell count [WBC] or C-reactive protein level), cholestasis (total bilirubin concentration ≥2 mg/dL [to convert bilirubin values to μmol/L, multiply by 17.104]; alkaline phosphatase [AP], alanine aminotransferase [ALT], and aspartate aminotransferase [AST] activity more than 1.5 times the upper limit of normal), and imaging evidence of biliary obstruction. Because imaging evidence of biliary obstruction may be absent in more than one-third of patients, the systemic inflammation and cholestasis criteria alone are considered sufficient to suspect a diagnosis of AC.[4] Early biliary drainage by endoscopic retrograde cholangiopancreatography (ERCP) is essential for the treatment of AC.[5] As a result, patients meeting the systemic inflammation and cholestasis criteria alone may be referred to a gastroenterologist for urgent ERCP.

However, another disease can produce symptoms and biochemical abnormalities similar to those of AC: alcohol-associated hepatitis (AH). Alcohol-associated hepatitis, a clinical syndrome occurring in patients with chronic heavy alcohol consumption, is characterized by jaundice, right upper quadrant pain, and elevated liver enzymes as well as the systemic inflammatory response syndrome (SIRS).[6] Patients with AH exhibit SIRS with or without superimposed bacterial infection, and those with severe AH can develop hemodynamic instability and hepatic encephalopathy.[7] Thus, individuals with AH can demonstrate Charcot triad or Reynolds pentad and meet the Tokyo Guidelines criteria for a suspected diagnosis of AC. A history of alcohol use is not always forthcoming in these patients, and ultrasound evaluation in AH may show gallbladder changes such as pericholecystic fluid and wall thickening,[8] further raising concern for gallstone disease and AC. Patients with AH are not candidates for ERCP, and unnecessary ERCP places them at risk for pancreatitis. Therefore, it is essential to distinguish AH from AC among patients who present with liver enzyme abnormalities and SIRS.
Machine learning is a field of artificial intelligence concerned with computer algorithms that learn from data. These algorithms use mathematical models to find patterns within data and to predict the target variable for a new observation without explicit intervention by human analysts.[9] In medicine, machine learning algorithms (MLAs) are particularly well suited to associating patient data with the causes of symptoms or with clinical outcomes, and they are emerging as state-of-the-art techniques for clinical prediction in patients with liver disease.[10] The aim of our study was to train supervised MLAs to classify patients with SIRS and cholestatic liver enzyme abnormalities as having AC or AH using routinely available laboratory variables from the complete blood count (CBC) and liver biochemistry and to compare their predictive performances against those of human physicians.

      Methods

      Data Sources and Study Population

We performed a retrospective analysis of Mayo Clinic’s electronic health records to identify 459 patients older than 18 years who were admitted to Mayo Clinic, Rochester, between January 1, 2010, and December 31, 2019, with either AH (n=265) or AC due to ERCP-confirmed choledocholithiasis (n=194). Patients with AH and AC were screened using International Classification of Diseases, Ninth Revision (ICD-9) and Tenth Revision (ICD-10) codes for alcoholic hepatitis and acute cholangitis. The diagnosis of AH was confirmed by manual chart review to ensure that all patients with AH met the National Institute on Alcohol Abuse and Alcoholism definition of AH.[11] For the AC cohort, manual chart review was performed to include only patients who met the definitive diagnosis criteria in the Tokyo Guidelines and received ERCP for biliary drainage. For all of the AH and AC patients, we collected 10 routinely available laboratory values from the CBC and liver biochemistry at the time of admission: WBC, hemoglobin (Hgb), mean corpuscular volume (MCV), platelet count (Plt), AST, ALT, AP, total bilirubin (Tb), direct bilirubin (Db), and albumin (Alb).
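For illustration only, the following is a minimal sketch of assembling such an admission panel from a tabular export; the file name, the diagnosis column, and the is_AH label are hypothetical, and dropping incomplete records mirrors the handling of missing data described in the next subsection.

```python
import pandas as pd

# The 10 admission labs used as input variables.
LABS = ["WBC", "Hgb", "MCV", "Plt", "AST", "ALT", "AP", "Tb", "Db", "Alb"]

def build_panel(path: str) -> pd.DataFrame:
    """Return one row per admission with the 10 labs and a binary chart-review label."""
    df = pd.read_csv(path)                                # hypothetical export, one row per admission
    df = df.dropna(subset=LABS)                           # drop records with missing lab elements
    df["is_AH"] = (df["diagnosis"] == "AH").astype(int)   # 1 = alcohol-associated hepatitis, 0 = acute cholangitis
    return df[LABS + ["is_AH"]]

panel = build_panel("admission_labs.csv")                 # file name is illustrative
X, y = panel[LABS].to_numpy(), panel["is_AH"].to_numpy()
```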

      Model Development and Assessment

After removal of the data with missing elements, we used data from 260 patients with AH and 194 patients with AC to train 8 supervised MLAs in Python 3 with the scikit-learn library: k-nearest neighbor (kNN), logistic regression (LR), support vector machines (SVMs) with Gaussian kernels, decision tree (DT), naive Bayes (NB) with Gaussian class-conditional density functions, artificial neural networks, random forest, and gradient boosting classifiers. The MLAs’ performances for classifying AH vs AC with the 10 chosen laboratory variables were evaluated using 5-fold cross-validation. Additional details on hyperparameter setting and performance reporting are provided in the Supplementary Material, available online at http://www.mayoclinicproceedings.org.
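A minimal sketch of this training setup with the scikit-learn estimators named above is shown below, assuming the feature matrix X and labels y from the preceding sketch; default hyperparameters are used here rather than the tuned settings described in the Supplementary Material.

```python
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# X, y: the 10-lab feature matrix and AH/AC labels from the preceding sketch.
models = {
    "kNN": KNeighborsClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="rbf", probability=True),            # Gaussian (RBF) kernel
    "DT": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),                                    # Gaussian class-conditional densities
    "ANN": MLPClassifier(max_iter=2000, random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, clf in models.items():
    pipe = make_pipeline(StandardScaler(), clf)            # scaling is harmless for the tree models
    scores = cross_validate(pipe, X, y, cv=cv, scoring=["accuracy", "roc_auc"])
    print(f"{name}: accuracy={scores['test_accuracy'].mean():.3f}, "
          f"AUC={scores['test_roc_auc'].mean():.3f}")
```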
Although all 10 variables are relevant for discriminating AH from AC, some may carry more discriminative information than others. Data-driven feature selection is attractive for both practical and theoretical reasons: practically, the selected features make the decision interpretable and can be checked against the reasoning of physicians; theoretically, performance can exceed that obtained with all features because simpler models generalize better. In this work, we employed 2 state-of-the-art information theory–based feature selection techniques to find the most relevant subset of 5 variables: the Jensen-Shannon (J-S) divergence and the Hilbert-Schmidt independence criterion (HSIC). The J-S divergence corresponds to the mutual information between the input features and the target variable and measures the relevance of the selected features for prediction; we estimated it with the Kozachenko-Leonenko nearest-neighbor estimator.[12,13] The HSIC is the Hilbert-Schmidt norm of the cross-covariance operator between 2 reproducing kernel Hilbert spaces of the features and the target variables; because the data are nonlinearly mapped into the reproducing kernel Hilbert space, this method captures nonlinear relationships between features and targets.[14,15]

We calculated the J-S divergence and HSIC for all unique 5-variable combinations among the original 10 variables, and the 5-variable subsets with the highest and second highest values were used for classification. Their performances were compared with those of the 5-variable subset with the smallest criterion estimate.
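As one concrete illustration of this subset search (not the authors' exact estimators, which include the Kozachenko-Leonenko estimator for the J-S divergence), the sketch below scores every 5-variable subset with a biased HSIC estimate, using a Gaussian kernel on the standardized features and a delta kernel on the AC/AH label; X, y, and LABS are assumed to come from the earlier sketches.

```python
import numpy as np
from itertools import combinations

def rbf_kernel(X, sigma=None):
    """Gaussian kernel matrix with the median heuristic for the bandwidth."""
    sq = np.sum(X ** 2, axis=1, keepdims=True)
    d2 = sq + sq.T - 2 * X @ X.T
    if sigma is None:
        sigma = np.sqrt(np.median(d2[d2 > 0]) / 2)
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(X, y):
    """Biased HSIC estimate between a feature matrix X and class labels y."""
    n = X.shape[0]
    K = rbf_kernel(X)                                   # kernel on the features
    L = (y[:, None] == y[None, :]).astype(float)        # delta kernel on the AC/AH label
    H = np.eye(n) - np.ones((n, n)) / n                 # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def rank_subsets(X_all, y, feature_names, k=5):
    """Score every k-variable subset by HSIC and return them from best to worst."""
    scored = []
    for idx in combinations(range(X_all.shape[1]), k):
        Z = X_all[:, idx]
        Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)        # standardize before the kernel
        scored.append((hsic(Z, y), [feature_names[i] for i in idx]))
    return sorted(scored, key=lambda s: s[0], reverse=True)

# Example: ranking = rank_subsets(X, y, LABS); ranking[0] is the top-scoring 5-variable subset.
```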

      External Validation

For external validation of our MLAs, we used data from the Medical Information Mart for Intensive Care III, v1.4 (MIMIC-III) database. The MIMIC-III is a large data set of deidentified data associated with more than 46,000 patients who were admitted to the intensive care units at the Beth Israel Deaconess Medical Center in Boston between 2001 and 2012.[16,17] To best replicate our internal cohort, we restricted the MIMIC-III cohort to admissions for adults aged 18 years and older, with billing diagnosis codes for alcoholic hepatitis (ICD-9 code 571.1), cholangitis (ICD-9 code 576.1), or obstruction of bile duct (ICD-9 code 576.2) in the first 10 diagnosis codes. Clinical documentation was manually reviewed to restrict the AC subset to only those with evidence of choledocholithiasis-related disease. In addition, the 10 laboratory variables of interest were extracted for each patient to establish a laboratory panel on presentation. This resulted in an external validation cohort composed of 92 admissions for AH and 213 admissions for AC.
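A minimal sketch of this external validation step is shown below, assuming a hypothetical CSV extract of the MIMIC-III panel (mimic_panel.csv with an is_AH column) and a model refit on the full Mayo cohort from the earlier sketches; the threshold line corresponds to the operating-point adjustment mentioned in the Results.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

LABS = ["WBC", "Hgb", "MCV", "Plt", "AST", "ALT", "AP", "Tb", "Db", "Alb"]

# Refit one of the models on the full Mayo cohort (X, y from the earlier sketches).
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0)).fit(X, y)

# Hypothetical extract of the MIMIC-III admission labs with chart-review labels.
ext = pd.read_csv("mimic_panel.csv")
X_ext, y_ext = ext[LABS].to_numpy(), ext["is_AH"].to_numpy()

proba = pipe.predict_proba(X_ext)[:, 1]                 # predicted probability of AH
print("external AUC:", roc_auc_score(y_ext, proba))

threshold = 0.5                                         # the operating point can be re-chosen externally
pred = (proba >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_ext, pred).ravel()
print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp), "PPV:", tp / (tp + fp))
```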

      Testing in Human Physicians

To compare the MLAs’ predictive performances with those of human providers, we tested physicians’ ability to classify AH vs AC using the same 10 chosen laboratory variables only. An online survey in a quiz format was created and advertised on the Twitter accounts of the Mayo Clinic Division of Gastroenterology and Hepatology and the American Association for the Study of Liver Diseases (AASLD), the alcohol-associated liver disease special interest group of the AASLD, and the Facebook group for AASLD Ambassadors and Emerging Liver Scholars. Based on the platforms chosen, it was assumed that the participants would be practicing gastroenterologists and hepatologists or trainees interested in the field. The physicians were asked to identify their level of training as one of the following: resident, fellow, or attending. Each physician was given a quiz consisting of 15 randomly chosen patients and their 10 laboratory values without any other clinical context and asked to decide whether each patient had AH or AC. At the end of the quiz, the physicians were also asked to identify the combination of the top 5 variables they thought to be most influential in guiding their decisions. The human performances were compared with the performance of the ensemble of the MLAs on the same set of patients. The top 5 variables identified by physicians were compared with the Shapley Additive Explanations (SHAP)[18] measure of the ensemble.
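The exact ensembling scheme is described in the Supplementary Material; purely as an illustration, the sketch below builds a soft-voting ensemble from a few of the classifiers above and computes model-agnostic SHAP values for its predicted AH probability on a hypothetical matrix X_quiz holding the 15 quiz cases.

```python
import shap
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Soft-voting ensemble over a few of the trained classifiers (an illustrative choice only).
ensemble = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
).fit(X, y)                                             # X, y: Mayo cohort features and labels

# Model-agnostic SHAP values for the predicted AH probability on the quiz cases.
background = shap.sample(X, 50)                         # small background sample for KernelExplainer
explainer = shap.KernelExplainer(lambda z: ensemble.predict_proba(z)[:, 1], background)
shap_values = explainer.shap_values(X_quiz)             # X_quiz: hypothetical matrix of the 15 quiz cases
shap.summary_plot(shap_values, X_quiz, feature_names=LABS)
```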

      Results

      Baseline Characteristics of Patients

      Table 1 shows the distributions of baseline characteristics in the 2 cohorts of patients with AC and AH. Overall, the 2 groups significantly differed in the distributions of age, sex, and the variables of interest with the exception of WBC. On average, patients with AH were younger (49.6 years vs 64.3 years) and less likely to be female (36.0% vs 49.1%) compared with patients with AC. Compared with patients with AC, patients with AH had lower mean Hgb, Plt, Alb, AST, ALT, and AP while having higher mean MCV, Tb, and Db.
Table 1. Baseline Characteristics of Patients With Acute Cholangitis and Alcohol-Associated Hepatitis

Variable | Alcohol-associated hepatitis (n=265) | Acute cholangitis (n=194) | P value
Age (y) | 49.6 (11.2) | 64.3 (15.7) | <.001
Female sex | 36.0% | 49.1% | <.001
WBC | 12.7 (7.8) | 13.4 (6.2) | .31
Hgb | 10.9 (2.3) | 13.2 (2.0) | <.001
MCV | 101.5 (12.5) | 89.8 (7.7) | <.001
Plt | 154.6 (98.7) | 211.2 (82.9) | <.001
Alb | 2.9 (0.55) | 4.3 (0.87) | .015
AST | 240.8 (210.1) | 297.7 (285.5) | .014
ALT | 79.5 (62.8) | 298.4 (261.8) | <.001
AP | 257.6 (169.8) | 317.4 (228.9) | .002
Tb | 15.3 (11.4) | 5.4 (2.4) | <.001
Db | 11.1 (8.6) | 4.3 (2.2) | <.001

Alb, albumin; ALT, alanine aminotransferase; AP, alkaline phosphatase; AST, aspartate aminotransferase; Db, direct bilirubin; Hgb, hemoglobin; MCV, mean corpuscular volume; Plt, platelet count; Tb, total bilirubin; WBC, white blood cell count.
All values are expressed as mean (standard deviation), except for sex, which shows the proportion of female patients.

      Performance of MLAs Using 10 Laboratory Variables

Overall, all 8 supervised MLAs showed excellent performance (reported as value ±95% CI) for classification of AC and AH within the Mayo Clinic cohort using the 10 chosen laboratory variables (Figure 1A; Table 2). Random forest had the highest accuracy (0.932±0.023) and sensitivity (0.955±0.027), whereas SVMs had the highest area under the curve (AUC; 0.986±0.008), and kNN had the highest positive predictive value (PPV; 0.946±0.036) and specificity (0.939±0.014). Artificial neural networks, gradient boosting, and LR also performed well, with nearly all performance measures above 0.900. Naive Bayes had lower accuracy (0.868±0.033), PPV (0.869±0.038), and specificity (0.810±0.070) but a high AUC (0.948±0.023) and sensitivity (0.908±0.073). The simplest of the MLAs, DT, showed the lowest performance, with accuracy of 0.870±0.027, AUC of 0.893±0.013, PPV of 0.855±0.042, sensitivity of 0.931±0.020, and specificity of 0.786±0.056.
Figure 1. Machine learning algorithm performances for classification of acute cholangitis and alcohol-associated hepatitis using the 10 chosen laboratory variables. A, Performance of the machine learning algorithms in the Mayo Clinic cohort. B, Performance of the machine learning algorithms in the MIMIC-III cohort (external validation). ANN, artificial neural network; DT, decision tree; GB, gradient boosting; kNN, k-nearest neighbor; LR, logistic regression; NB, naive Bayes; RF, random forest; SVM, support vector machine.
Table 2. Machine Learning Algorithm Performances for Classification of Acute Cholangitis and Alcohol-Associated Hepatitis

Mayo Clinic cohort
Algorithm | Accuracy | AUC | PPV | Sensitivity | Specificity
k-Nearest neighbor | 0.925±0.014 | 0.975±0.014 | 0.946±0.036 | 0.921±0.023 | 0.939±0.014
Logistic regression | 0.912±0.021 | 0.979±0.007 | 0.924±0.025 | 0.919±0.035 | 0.904±0.017
Support vector machines | 0.927±0.023 | 0.986±0.008 | 0.922±0.037 | 0.949±0.023 | 0.901±0.036
Decision tree | 0.870±0.027 | 0.893±0.013 | 0.855±0.042 | 0.931±0.020 | 0.786±0.056
Naive Bayes | 0.868±0.033 | 0.948±0.023 | 0.869±0.038 | 0.908±0.073 | 0.810±0.070
Artificial neural networks | 0.927±0.010 | 0.953±0.014 | 0.930±0.021 | 0.944±0.015 | 0.908±0.018
Random forest | 0.932±0.023 | 0.982±0.010 | 0.930±0.019 | 0.955±0.027 | 0.897±0.036
Gradient boosting | 0.927±0.025 | 0.982±0.008 | 0.924±0.017 | 0.950±0.028 | 0.894±0.025

MIMIC-III cohort (external validation)
Algorithm | Accuracy | AUC | PPV | Sensitivity | Specificity
k-Nearest neighbor | 0.873±0.014 | 0.942±0.010 | 0.928±0.036 | 0.688±0.023 | 0.971±0.035
Logistic regression | 0.905±0.021 | 0.970±0.007 | 0.863±0.025 | 0.864±0.035 | 0.926±0.017
Support vector machines | 0.896±0.023 | 0.967±0.008 | 0.829±0.037 | 0.888±0.023 | 0.901±0.036
Decision tree | 0.748±0.027 | 0.791±0.013 | 0.595±0.042 | 0.905±0.020 | 0.665±0.056
Naive Bayes | 0.748±0.033 | 0.876±0.023 | 0.595±0.038 | 0.905±0.073 | 0.665±0.070
Artificial neural networks | 0.888±0.010 | 0.947±0.014 | 0.821±0.021 | 0.871±0.015 | 0.897±0.018
Random forest | 0.909±0.023 | 0.953±0.008 | 0.863±0.019 | 0.878±0.027 | 0.926±0.036
Gradient boosting | 0.879±0.025 | 0.958±0.010 | 0.772±0.017 | 0.925±0.028 | 0.854±0.025

AUC, area under the curve; MIMIC-III, Medical Information Mart for Intensive Care III; PPV, positive predictive value.
Performances of machine learning algorithms for classifying acute cholangitis and alcohol-associated hepatitis in the Mayo Clinic cohort and the MIMIC-III cohort (external validation) using the 10 chosen laboratory variables. Values are reported with their 95% CIs. Boldface values in the original table indicate the highest value in each performance category.
      In external validation using the MIMIC-III database, the prediction performances of the MLAs with appropriate threshold setting were close to those for the initial Mayo Clinic data set (Figure 1B; Table 2). Random forest had the highest accuracy (0.909±0.023), whereas LR and SVM had high AUCs of 0.970±0.007 and 0.967±0.008, and kNN demonstrated the highest PPV (0.928±0.036) and specificity (0.971±0.035). Similar to the results seen in the Mayo Clinic cohort, DT and NB overall performed poorly, with lower accuracies, PPVs, and specificities than those of other algorithms as shown in Table 2.
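The tables report each metric with a 95% CI. One common way to obtain such an interval, shown below for the AUC, is a percentile bootstrap; this sketch assumes that convention and the external-validation scores from the earlier sketch, and it is not necessarily the authors' exact procedure (which is described in the Supplementary Material).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap 95% CI for the AUC."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    n, aucs = len(y_true), []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, n)                     # resample admissions with replacement
        if len(np.unique(y_true[idx])) < 2:             # need both classes to compute an AUC
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)

# Example with the external-validation scores from the earlier sketch:
auc, (lo, hi) = bootstrap_auc_ci(y_ext, proba)
print(f"AUC {auc:.3f} (95% CI, {lo:.3f}-{hi:.3f})")
```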

      Feature Selection for Optimal 5-Variable Combinations

To narrow our target variables using a feature selection strategy, the J-S divergence and HSIC were calculated for all 252 possible 5-variable subsets (the number of ways to choose 5 of the 10 variables). The 5-variable subsets were ranked according to their estimated J-S divergence and HSIC, which are presumed to reflect the amount of information they held regarding the diagnosis of AC vs AH. The subsets with the highest J-S divergence and HSIC were the sets of MCV, AST, ALT, Tb, and Db and MCV, Alb, ALT, Tb, and Db, respectively. The subsets ranked the lowest were the sets of WBC, Plt, AP, Tb, and Db for J-S divergence and WBC, Hgb, Plt, AST, and AP for HSIC.
When we tested the ability of our MLAs to classify AC vs AH using these 5-variable subsets instead of the original 10 variables, we found that the performances of the MLAs using the best combination in terms of the J-S divergence (MCV, AST, ALT, Tb, and Db) were similar to or even better than those using all 10 variables, with all algorithms except DT achieving extremely high AUCs between 0.948 and 0.994 (Table 3). On the other hand, when the MLAs used the 5-variable combination with the lowest J-S divergence (WBC, Plt, AP, Tb, and Db), their performances were significantly worse, and no MLA achieved an AUC above 0.900. The subsets chosen with HSIC yielded performances marginally lower than those chosen with the J-S divergence but showed a similar pattern.
Table 3. Performances (AUCs) for Classification of Acute Cholangitis and Alcohol-Associated Hepatitis Using the Selected Subsets of 5 Laboratory Variables

Jensen-Shannon divergence criterion
Algorithm | Best set: MCV, AST, ALT, Tb, Db | Second best set: MCV, Plt, AST, ALT, Tb | Worst set: WBC, Plt, AP, Tb, Db
k-Nearest neighbor | 0.994±0.005 | 0.965±0.032 | 0.823±0.036
Logistic regression | 0.989±0.012 | 0.979±0.018 | 0.826±0.042
Support vector machines | 0.984±0.018 | 0.981±0.017 | 0.858±0.068
Decision tree | 0.887±0.058 | 0.900±0.081 | 0.681±0.037
Naive Bayes | 0.948±0.043 | 0.951±0.037 | 0.817±0.042
Artificial neural networks | 0.963±0.032 | 0.980±0.021 | 0.827±0.100
Random forest | 0.978±0.027 | 0.978±0.016 | 0.836±0.049
Gradient boosting | 0.978±0.016 | 0.984±0.018 | 0.858±0.023

Hilbert-Schmidt independence criterion
Algorithm | Best set: MCV, Alb, ALT, Tb, Db | Second best set: Hgb, Alb, ALT, Tb, Db | Worst set: WBC, Hgb, Plt, AST, AP
k-Nearest neighbor | 0.955±0.046 | 0.909±0.075 | 0.808±0.094
Logistic regression | 0.963±0.043 | 0.923±0.048 | 0.806±0.139
Support vector machines | 0.944±0.064 | 0.909±0.079 | 0.838±0.094
Decision tree | 0.844±0.102 | 0.766±0.087 | 0.656±0.167
Naive Bayes | 0.940±0.040 | 0.905±0.054 | 0.814±0.122
Artificial neural networks | 0.947±0.059 | 0.905±0.045 | 0.817±0.047
Random forest | 0.964±0.041 | 0.926±0.042 | 0.847±0.114
Gradient boosting | 0.970±0.040 | 0.908±0.044 | 0.848±0.076

Alb, albumin; ALT, alanine aminotransferase; AP, alkaline phosphatase; AST, aspartate aminotransferase; AUC, area under the curve; Db, direct bilirubin; Hgb, hemoglobin; J-S, Jensen-Shannon divergence; MCV, mean corpuscular volume; Plt, platelet count; Tb, total bilirubin; WBC, white blood cell count.
Values are AUCs with their 95% CIs.

      Performances of Physicians

The quiz classifying AH vs AC on the basis of the same 10 laboratory variables used by the MLAs was completed by 143 physicians (40 residents, 41 fellows, 62 attendings). Overall, the physicians had a mean accuracy of 0.790 for classifying AH vs AC (Table 4). The fellows and the attendings performed significantly better than the residents, with mean accuracies of 0.834 and 0.807 vs 0.720 (P<.01). Performance varied considerably within every group according to the difficulty of discrimination, with individual accuracies ranging from 0.400 to 1.00. When asked to identify the variables that were most helpful in guiding their answers, residents chose AST (90.0%), ALT (80.0%), AP (75.0%), Tb (52.5%), and WBC and MCV (tied at 47.5%) as their top 5 variables (Figure 2A). The fellows chose AST (91.9%), ALT (86.5%), Tb (67.6%), MCV (56.8%), and Plt (43.2%), and the attendings similarly chose AST (95.2%), ALT (79.0%), MCV (63.0%), Plt (63.0%), and Tb (53.2%). The ensemble average of the MLA outputs gave only 1 incorrect answer out of 15 questions (accuracy of 0.933), and a McNemar test comparing the decisions of physicians and MLAs on the same questions showed that the difference was significant (Table 4). The SHAP values for the relative contributions of the variables to the ensemble model are shown in Figure 2B.
Table 4. Accuracy of Physicians for Classifying Acute Cholangitis and Alcohol-Associated Hepatitis Using 10 Routine Laboratory Variables

Accuracy | Total (N=143) | Residents (n=40) | Fellows (n=41) | Attendings (n=62) | Machine learning algorithms | P value (McNemar test)
Mean (SD) | 0.790 (0.147) | 0.720 (0.165) | 0.834 (0.128) | 0.807 (0.128) | 0.933 | <.01
Median (Q1-Q3) | 0.867 (0.733-0.867) | 0.733 (0.533-0.867) | 0.867 (0.733-0.933) | 0.867 (0.733-0.933) | |
Lowest | 0.400 | 0.400 | 0.467 | 0.400 | |
Highest | 1.000 | 0.933 | 1.000 | 1.000 | |

McNemar test for comparison between majority voting of machine learning algorithms and physicians.
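A sketch of the McNemar comparison described in the table footnote is shown below, using statsmodels and illustrative (not actual) paired correctness indicators for the physicians and the MLA ensemble on the same questions.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Paired correctness indicators on the same questions (1 = correct, 0 = incorrect).
# The values below are illustrative, not the study data.
correct_physician = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
correct_mla       = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1])

# 2x2 agreement table: rows = physician correct/incorrect, columns = MLA correct/incorrect.
table = np.zeros((2, 2), dtype=int)
for p, m in zip(correct_physician, correct_mla):
    table[1 - p, 1 - m] += 1

result = mcnemar(table, exact=True)                     # exact binomial test on the discordant pairs
print(f"McNemar P value: {result.pvalue:.3f}")
```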
Figure 2. Variables that were used as important features by physicians and machine learning algorithms. A, Variables that physicians thought were influential in guiding their decisions. B, Feature contributions by SHAP values. Alb, albumin; ALT, alanine aminotransferase; AP, alkaline phosphatase; AST, aspartate aminotransferase; Db, direct bilirubin; Hgb, hemoglobin; MCV, mean corpuscular volume; Plt, platelet count; Tb, total bilirubin; WBC, white blood cell count.

      Discussion

In this study, supervised MLAs trained on routine laboratory variables from well-defined cohorts of patients with AC and AH were able to apply the same information to previously unseen patients and make the diagnosis of AC vs AH with impressive performance. Of the 8 MLAs, 6 had accuracies above 0.90, and even the worst-performing NB model had an accuracy of 0.868. These performances were significantly better than those of human physicians. The MLAs continued to perform excellently when applied to an external cohort of intensive care unit patients, supporting their generalizability.
Machine learning algorithms are data-driven approaches that analyze the underlying patterns within data. Human physicians rely on an aggregation of their knowledge about the effects of individual variables on the target; MLAs, in contrast, usually also consider the associations among the variables themselves in addition to the variable-target associations.[19] All of the algorithms used in this study are paradigmatic methods whose advantages and disadvantages are well understood across many applications. Logistic regression has been widely used in medicine; it fits a linear separating hyperplane to the data in a discriminative way.[20] Discriminative learning is more flexible but at the same time less generalizable than generative methods such as NB, which place more emphasis on learning the class-conditional density functions.[21] Support vector machines, which performed well in both our Mayo and MIMIC-III cohorts, use kernels to perform nonlinear classification of high-dimensional features and are particularly well suited for unstructured, complex, and high-dimensional data.[22] Artificial neural networks are collections of artificial “neurons” connected across layers by mathematical functions and can detect complex, nonlinear relationships.[23] Decision tree evaluates the variables one by one to build a tree structure that separates data with different labels as quickly as possible,[24] and random forest uses an ensemble of DTs built with bagging and random subsets of features.[25] Finally, kNN is simple and is guaranteed to approach classification optimality with infinitely many data, but it usually does not generalize well with finite samples.[26]
The application of the information-theoretic measures J-S divergence and HSIC was effective in selecting the 5 most important input features. By examining all possible combinations of 5 of the 10 variables, we were able to conduct feature selection and further simplify the MLAs without sacrificing, and indeed while increasing, their predictive performance. Interestingly, the 5-variable combinations chosen as best by the J-S divergence (MCV, AST, ALT, Tb, Db) and HSIC (MCV, Alb, ALT, Tb, Db) were similar to the best 5-variable combination chosen by attending and fellow gastroenterologists and hepatologists according to their clinical insight (MCV, Plt, AST, ALT, Tb); in fact, this combination was the second best set in terms of the J-S divergence. However, even when the selected sets of features were roughly the same, the accuracies of the physicians and the MLAs differed significantly. This discrepancy shows that the precise settings of decision boundaries and thresholds matter even when the important variables have been correctly chosen.
In real-life clinical practice, no physician would rely solely on the CBC and liver biochemistry to make a diagnosis of either AC or AH. In most circumstances, clinicians obtain a detailed history and perform additional laboratory or imaging tests to make the final diagnosis in a patient with liver enzyme abnormalities and SIRS. The physicians’ performance on the quiz most likely would have been considerably higher had they been provided with the relevant history and imaging information. Nevertheless, there are many clinical instances of gastroenterologists receiving consults from the emergency department, the inpatient wards, or the intensive care unit for urgent ERCP in patients with hyperbilirubinemia and SIRS who initially deny a history of alcohol use but later turn out to have AH. In some situations, the inability to obtain a reliable history from patients with altered mental status or the lack of access to imaging modalities in underserved areas may force providers to make the determination on the basis of a limited amount of objective data. If made easily accessible in the form of an online calculator or a smartphone application, our MLAs might also be helpful to midlevel providers or subspecialists who are less familiar with the approach to an acutely ill patient with liver enzyme abnormalities.
      Most important, our study is a proof of concept demonstrating how MLAs using a few simple variables and routinely available structured clinical information may serve as highly potent prediction tools. With a similar approach to our AC vs AH prediction model, it might be feasible to develop comprehensive MLA-based software that can guide medical providers treating patients with liver enzyme abnormalities and aid in decision-making.

      Conclusion

      Supervised MLAs can successfully distinguish patients with AH from patients with AC on the basis of a small number of simple, inexpensive, and routinely available laboratory variables and may serve valuable adjunctive roles in the presence of diagnostic uncertainty or for expedited triage when accurate histories and imaging are not readily available. The MLAs based on simple structured numeric data may serve as highly accurate predictors in various clinical scenarios.

      Potential Competing Interests

      The authors report no competing interests.

      Acknowledgments

      Joseph C. Ahn and Yung-Kyun Noh contributed equally to this study.

      Supplemental Online Material

      References

1. Lan Cheong Wah D, Christophi C, Muralidharan V. Acute cholangitis: current concepts. ANZ J Surg. 2017;87:554-559.
2. Charcot M. De la fievre hepatique symptomatique—comparison avec la fievre uroseptique. Lecons sur les maladies du foie des voies biliares et des reins, Paris. Bourneville et Sevestre; 1877:176-185.
3. Reynolds BM, Dargan EL. Acute obstructive cholangitis; a distinct clinical syndrome. Ann Surg. 1959;150:299-303.
4. Takada T. Tokyo Guidelines 2018: updated Tokyo Guidelines for the management of acute cholangitis/acute cholecystitis. J Hepatobiliary Pancreat Sci. 2018;25:1-2.
5. Mukai S, Itoi T, Baron TH, et al. Indications and techniques of biliary drainage for acute cholangitis in updated Tokyo Guidelines 2018. J Hepatobiliary Pancreat Sci. 2017;24:537-549.
6. Ahn JC, Shah VH. Alcoholic hepatitis and alcohol-related acute on chronic liver failure. In: Pyrsopoulos N, ed. Liver Failure: Acute and Acute on Chronic. Springer International Publishing; 2020:281-302.
7. Singal AK, Louvet A, Shah VH, Kamath PS. Grand rounds: alcoholic hepatitis. J Hepatol. 2018;69:534-543.
8. Tsaknakis B, Masri R, Amanzada A, et al. Gall bladder wall thickening as non-invasive screening parameter for esophageal varices—a comparative endoscopic-sonographic study. BMC Gastroenterol. 2018;18:123.
9. Darcy AM, Louie AK, Roberts LW. Machine learning and the profession of medicine. JAMA. 2016;315:551-552.
10. Ahn JC, Connell A, Simonetto DA, Hughes C, Shah VH. The application of artificial intelligence for the diagnosis and treatment of liver diseases. Hepatology. 2021;73:2546-2563.
11. Crabb DW, Bataller R, Chalasani NP, et al. Standard definitions and common data elements for clinical trials in patients with alcoholic hepatitis: recommendation from the NIAAA Alcoholic Hepatitis Consortia. Gastroenterology. 2016;150:785-790.
12. Leonenko N, Pronzato L, Savani V. A class of Renyi information estimators for multidimensional densities. Ann Statist. 2008;36:2153-2182.
13. Noh YK, Sugiyama M, Liu S, Plessis MCD, Park FC, Lee DD. Bias reduction and metric learning for nearest-neighbor estimation of Kullback-Leibler divergence. Neural Comput. 2018;30:1930-1960.
14. Gretton A, Bousquet O, Smola A, Schölkopf B. Measuring statistical dependence with Hilbert-Schmidt norms. Springer; 2005:63-77.
15. Song L, Smola A, Gretton A, Borgwardt KM, Bedo J. Supervised feature selection via dependence estimation. Paper presented at: Proceedings of the 24th International Conference on Machine Learning; June 20-24, 2007; Corvalis, OR. Accessed May 1, 2021. https://doi.org/10.1145/1273496.1273600
16. Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035.
17. Johnson AEW, Pollard TJ, Mark RG. MIMIC-III clinical database (version 1.4). PhysioNet; 2016. Accessed May 1, 2021. https://doi.org/10.13026/C2XW26
18. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Paper presented at: Proceedings of the 31st International Conference on Neural Information Processing Systems; December 4-9, 2017; Long Beach, CA. Accessed May 1, 2021. https://dl.acm.org/doi/10.5555/3295222.3295230
19. Sidey-Gibbons JAM, Sidey-Gibbons CJ. Machine learning in medicine: a practical introduction. BMC Med Res Methodol. 2019;19:64.
20. Meurer WJ, Tolles J. Logistic regression diagnostics: understanding how well a model predicts outcomes. JAMA. 2017;317:1068-1069.
21. Zhang Z. Naïve Bayes classification in R. Ann Transl Med. 2016;4:241.
22. Ben-Hur A, Ong CS, Sonnenburg S, Schölkopf B, Rätsch G. Support vector machines and kernels for computational biology. PLoS Comput Biol. 2008;4:e1000173. https://doi.org/10.1371/journal.pcbi.1000173
23. Renganathan V. Overview of artificial neural network models in the biomedical domain. Bratisl Lek Listy. 2019;120:536-540.
24. Karalis G. Decision trees and applications. Adv Exp Med Biol. 2020;1194:239-242.
25. Doupe P, Faghmous J, Basu S. Machine learning for health services researchers. Value Health. 2019;22:808-815.
26. Abu Alfeilat HA, Hassanat ABA, Lasassmeh O, et al. Effects of distance measure choice on k-nearest neighbor classifier performance: a review. Big Data. 2019;7:221-248.