Development and Validation of a Machine Learning Model Using Administrative Health Data to Predict Onset of Type 2 Diabetes

Published by Daniel on June 1, 2023

JAMA Network Open – Public Health

Importance

Systems-level barriers to diabetes care could be improved with population health planning tools that accurately discriminate between high- and low-risk groups to guide investments and targeted interventions.

Objectives

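To develop and validate a machine learning model that uses administrative health data to predict the onset of type 2 diabetes.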

Design, Setting...

This decision analytical model study used linked administrative health data from the diverse, single-payer health system in Ontario, Canada, between January 1, 2006, and December 31, 2016. A gradient-boosting decision tree model was trained on data from 1,657,395 patients, validated on 243,442 patients, and tested on 236,506 patients. Costs associated with each patient were estimated using a validated costing algorithm. Data were analyzed from January 1, 2006, to December 31, 2016.

Exposure

A random sample of 2,137,343 residents of Ontario without type 2 diabetes was obtained at the study start time. More than 300 features from data sets capturing demographic information, laboratory measurements, drug benefits, health care system interactions, social determinants of health, and ambulatory care and hospitalization records were compiled over 2-year patient medical histories to generate quarterly predictions.
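As a quick consistency check, the training, validation, and test cohort sizes reported under the design section can be summed and compared with the sample size given here. The sketch below is illustrative only; the names `SPLITS` and `split_fractions` are ours and do not come from the study.

```python
# Cohort sizes as reported in the text (patients, not prediction instances).
SPLITS = {"train": 1_657_395, "validation": 243_442, "test": 236_506}


def split_fractions(splits):
    """Return each split's share of the total cohort."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


total_patients = sum(SPLITS.values())
fractions = split_fractions(SPLITS)

print(total_patients)  # compare with the 2,137,343 residents sampled
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```

The three splits add up to the full sample, with roughly three quarters of patients used for training.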

Outcomes and Measures...

Discrimination was assessed using the area under the receiver operating characteristic curve statistic, and calibration was assessed visually using calibration plots. Feature contribution was assessed with Shapley values. Costs were estimated in 2020 US dollars.
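To make the two evaluation measures concrete, here is a minimal sketch on toy data. It computes the area under the ROC curve from its pairwise definition (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one) and the per-bin numbers that a calibration plot would display. The study does not describe its implementation; the function names, the equal-width binning, and the toy labels and scores are all assumptions for illustration.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def calibration_bins(labels, scores, n_bins=5):
    """Return (mean predicted risk, observed rate, count) per non-empty bin.

    Assumes scores are probabilities in [0, 1] and uses equal-width bins.
    """
    bins = [[] for _ in range(n_bins)]
    for y, s in zip(labels, scores):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((y, s))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(s for _, s in members) / len(members)
            observed = sum(y for y, _ in members) / len(members)
            rows.append((mean_pred, observed, len(members)))
    return rows


# Hypothetical toy data, not taken from the study.
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]

print(roc_auc(toy_labels, toy_scores))
print(calibration_bins(toy_labels, toy_scores, n_bins=2))
```

A well-calibrated model has observed rates close to the mean predicted risk in every bin. The pairwise loop is quadratic and only suitable for small examples; at the scale of this study a rank-based or library implementation would be needed.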

Results

This study trained a gradient-boosting decision tree model on data from 1,657,395 patients (12,900,257 instances; 6,666,662 women [51.7%]). The developed model achieved a test area under the curve of 80.26 (range, 80.21-80.29), demonstrated good calibration, and was robust to sex, immigration status, area-level marginalization with regard to material deprivation and race/ethnicity, and low contact with the health care system. The top 5% of patients predicted as high risk by the model represented 26% of the total annual diabetes cost in Ontario.
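The cost finding is a statement about how much of the total cost falls on the patients the model ranks highest. A minimal sketch of that calculation, assuming one risk score and one annual cost per patient, is shown below. The function name, the rounding rule for the group size, and the toy numbers are hypothetical and are not the study's costing algorithm.

```python
def top_fraction_cost_share(risk_scores, costs, fraction=0.05):
    """Share of total cost carried by the top `fraction` of patients by risk."""
    if len(risk_scores) != len(costs) or not risk_scores:
        raise ValueError("risk_scores and costs must be non-empty and equal length")
    total = sum(costs)
    if total <= 0:
        raise ValueError("total cost must be positive")
    # Number of patients in the high-risk group (at least one).
    k = max(1, round(fraction * len(risk_scores)))
    ranked = sorted(zip(risk_scores, costs), key=lambda rc: rc[0], reverse=True)
    return sum(cost for _, cost in ranked[:k]) / total


# Hypothetical toy data: 10 patients, so we look at the top 20% (2 patients).
toy_risk = [0.9, 0.1, 0.5, 0.3, 0.8, 0.2, 0.4, 0.6, 0.7, 0.05]
toy_cost = [400, 10, 50, 20, 300, 10, 40, 60, 100, 10]

share = top_fraction_cost_share(toy_risk, toy_cost, fraction=0.2)
print(f"{share:.0%} of total cost")
```

In the study's terms, the same calculation with `fraction=0.05` over the test population would correspond to the reported 26% figure.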

Model Prediction Risk Level

In the first setup, we rank patients by the model's output in decreasing order, then bin them into 4 categories: top 1%, next 5% (between top 1% and top 6%), next 15% (between top 6% and top 21%), and the remaining 79%. For each bin, we display statistics pertaining to general demographic factors (mean age, fraction of women, fraction of immigrants, and time in Canada for immigrants) and socioeconomic factors (race/ethnicity and deprivation marginalization scores of the neighborhood), as well as the mean HbA1c. Means are computed across non-missing values from patients within each bin. For instance, time in Canada is computed only for the immigrants in each model output bin, as the value is missing for long-term residents. The second setup evaluates the same variables with patients split according to their label (positive or negative) instead.
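The first setup can be sketched as follows: rank patients by model output, assign each to one of the four bins, and average a variable over the non-missing values in each bin. This is an illustrative sketch rather than the study's code; the function names and the toy scores are assumptions, and missing values are represented as `None`.

```python
def bin_by_rank(scores, cutoffs_pct=(1, 6, 21)):
    """Assign each patient a bin index by descending rank of model output.

    Bin 0 = top 1%, bin 1 = next 5%, bin 2 = next 15%, bin 3 = remaining 79%.
    Integer arithmetic avoids floating-point edge cases at the cutoffs.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    bins = [0] * n
    for rank, i in enumerate(order):
        # (rank + 1) / n is the share of patients ranked at or above this one.
        bins[i] = sum((rank + 1) * 100 > c * n for c in cutoffs_pct)
    return bins


def mean_non_missing(values):
    """Mean over values that are not None; None if every value is missing."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def summarize_by_bin(bins, values, n_bins=4):
    """Per-bin mean of `values`, ignoring missing entries."""
    return [
        mean_non_missing([v for b, v in zip(bins, values) if b == k])
        for k in range(n_bins)
    ]


# Hypothetical toy data: 100 patients with distinct model outputs.
toy_scores = [i / 100 for i in range(100)]
toy_bins = bin_by_rank(toy_scores)
print([toy_bins.count(k) for k in range(4)])
```

For a variable such as time in Canada, long-term residents would carry `None`, so `summarize_by_bin` averages over immigrants only, as described above.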

Conclusions

In this decision analytical model study, a machine learning model approach accurately predicted the incidence of diabetes in the population using routinely collected health administrative data. These results suggest that the model could be used to inform decision-making for population health planning and diabetes prevention.

Relevance to Healthcare Field

According to the latest Global Burden of Disease (GBD) dataset, 462 million individuals are affected by type 2 diabetes (6.28% of the world’s population). Incidence and prevalence are increasing, placing a heavy burden on the healthcare system. The machine learning model developed in this study predicted disease onset 5 years in advance from routinely collected health data, with predictions generated every three months. Even though the model did not use body mass index (BMI), it showed consistent calibration across age, sex, immigration status, race/ethnicity, and material deprivation when predicting whether a patient would develop the disease. Among the subsets studied, immigration and material deprivation were shown to be associated with an increased risk of developing type 2 diabetes. The machine learning model is a cost-effective method that can improve disease detection, support prevention, and reduce the burden on both patients and the overall healthcare system.