Improving Models to Predict Care Utilization Using Machine Learning: Retrospective Observational Study

Abstract

Background

The use of artificial intelligence and machine learning (ML) tools is now common in the advancement of health care services and clinical risk estimation. Legacy systems make use of highly informative feature sets developed from years of clinical expertise and research to estimate different outcomes, but only recently have they been tested against novel statistical approaches. One such system, the Johns Hopkins Adjusted Clinical Group (ACG) System, is a long-standing and widely used approach to categorizing clinical risk factors, and it is amenable to ML techniques.

Objective

This study aims to test the ACG System under two contrasting optimization strategies, one maximizing the area under the receiver operating characteristic curve (AUROC) and one maximizing the F₁-score, and to compare the resulting ML performance against traditional logistic regression methods. If selected ML algorithms can be tuned to enhance overall measures of performance, this would strengthen arguments for incorporating them into ACG-related workflows.

Methods

Using a retrospective observational design, prospective year estimates of all-cause hospitalization and elevated total cost were modeled using a cross-validation framework. Patients with elevated costs were identified as those falling above the 95th percentile of total amounts billed, including pharmacy costs. Hyperparameter settings for XGBoost (Extreme Gradient Boosting), random forest, and elastic net were determined using average cross-validated performances for F₁ and AUROC in a grid search aimed at maximizing either statistic. Additional iterated cross-validation was used to compare point-estimated average AUROC and F₁-scores between models, further decomposed by sensitivity, positive predictive value, and F-beta statistics.
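As an illustration of the outcome definition and the decomposed statistics named above, the following is a minimal sketch and not the study's code. It assumes a nearest-rank percentile rule for the cost cutoff; the function names (percentile_threshold, label_high_cost, f_beta) are hypothetical and chosen only for this example.

```python
import math
from typing import List, Sequence, Tuple


def percentile_threshold(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (an assumed rule; the source does not specify one)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def label_high_cost(costs: Sequence[float], pct: float = 95.0) -> List[int]:
    """Flag patients whose total billed amount falls above the pct-th percentile."""
    cutoff = percentile_threshold(costs, pct)
    return [1 if c > cutoff else 0 for c in costs]


def f_beta(y_true: Sequence[int], y_pred: Sequence[int],
           beta: float = 1.0) -> Tuple[float, float, float]:
    """Return (sensitivity, positive predictive value, F-beta) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    denom = beta ** 2 * ppv + sensitivity
    score = (1 + beta ** 2) * ppv * sensitivity / denom if denom else 0.0
    return sensitivity, ppv, score


if __name__ == "__main__":
    # Toy data only: 100 synthetic "costs", not patient records.
    toy_costs = list(range(1, 101))
    labels = label_high_cost(toy_costs)
    print("patients flagged as elevated cost:", sum(labels))
    print("sensitivity, PPV, F1:", f_beta([1, 1, 1, 0, 0, 0, 0, 0],
                                          [1, 0, 0, 1, 0, 0, 0, 0]))
```

Setting beta above 1 weights sensitivity more heavily than positive predictive value, and beta below 1 does the reverse, which is why F-beta is a useful companion to F₁ when the positive class is rare.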

Results

A total of 350,463 patients were selected from the Johns Hopkins Health System in 2019. Model features identified by the ACG System for predicting prospective year hospitalization and total cost were included in these analyses. Findings suggest small but statistically significant improvements in cross-validated AUROC and F₁-scores over logistic regression, using either optimization strategy and XGBoost. Logistic models achieved average AUROC values of 0.886 and 0.841 for cost and hospitalization, respectively, whereas XGBoost achieved 0.891 and 0.849, respectively. F₁ optimization yielded similar findings: logistic models achieved average F₁-scores of 0.367 and 0.341 for hospitalization and cost, respectively, whereas XGBoost exceeded the logistic value for cost but not for hospitalization (0.411 and 0.328, respectively).

Conclusions

The clinical implications of these findings and the effect of class imbalance on model calibration are explored, along with the limitations of these data and approach. The core finding is that logistic regression remains very well-suited to these tasks, especially in situations where the efficiency or interpretability of models is critical. Under conditions of imbalance, regressions tended to yield high-precision estimates for the outnumbered class. Nevertheless, the findings also underscore a diversity of suitable models depending on clinical use cases, each having its own tradeoffs for evaluating performance. As such, health systems must clearly identify the needs and expectations of a model before calibrating one for use.

artificial intelligence, machine learning, logistic regression
