We present a machine learning (ML) method that uses the Medical Expenditure Panel Survey (MEPS) dataset to cluster patients into groups with similar utilization profiles, where a profile is characterized by how many times a patient used each type of healthcare service. This allows us to identify dominant utilization patterns and to assess the general characteristics of high utilizers.
Methods: Our research sample included 12,652 elderly and middle-aged adult respondents in the 2013 MEPS. The MEPS is a nationally representative survey that collects comprehensive data on healthcare utilization and expenditures in the United States. We first used a Random Forest (RF) regression model to predict expenditures from patients’ healthcare utilization profiles. A utilization profile consists of the number of office-based, outpatient, emergency room, and inpatient visits, the number of home care days, and the number of prescription medications. As a by-product of their construction, RF models yield a similarity measure between samples. We therefore applied Hierarchical Agglomerative Clustering (HAC) to these similarity measures to identify clusters of utilization profiles.
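The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the common definition of RF proximity (the fraction of trees in which two samples fall in the same leaf), average linkage for the HAC step, and synthetic count data standing in for the six MEPS utilization variables. The function names `rf_proximity` and `cluster_profiles`, the hyperparameters, and the simulated cost weights are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.ensemble import RandomForestRegressor


def rf_proximity(forest, X):
    """Fraction of trees in which each pair of samples shares a leaf."""
    leaves = forest.apply(X)  # shape: (n_samples, n_trees)
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    return same_leaf.mean(axis=2)  # shape: (n_samples, n_samples)


def cluster_profiles(X, y, n_clusters=10, n_trees=100, seed=0):
    """Fit an RF regressor, then run HAC on 1 - proximity."""
    forest = RandomForestRegressor(
        n_estimators=n_trees, min_samples_leaf=5, random_state=seed
    )
    forest.fit(X, y)

    # Turn the similarity into a dissimilarity for clustering.
    distance = 1.0 - rf_proximity(forest, X)
    np.fill_diagonal(distance, 0.0)

    # SciPy expects a condensed (upper-triangle) distance vector.
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="average")  # assumed linkage

    # Cut the dendrogram into at most n_clusters groups (labels start at 1).
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return forest, labels


# Synthetic stand-in for the six utilization counts (NOT MEPS data):
# office-based, outpatient, ER, inpatient, home care days, prescriptions.
rng = np.random.default_rng(0)
X = rng.poisson(lam=[4, 1, 0.3, 0.2, 2, 6], size=(300, 6)).astype(float)
weights = np.array([150, 400, 1200, 9000, 100, 60])  # made-up unit costs
y = X @ weights + rng.normal(0, 200, size=300)

forest, labels = cluster_profiles(X, y, n_clusters=10)
print(np.bincount(labels)[1:])  # cluster sizes
```

The dense pairwise comparison needs memory proportional to n_samples² × n_trees, so a sample of roughly 12,000 respondents would likely require accumulating the proximity tree by tree or in blocks.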
Results: Evaluated following ML best practice, the learned RF regression model performs well for both sub-populations, i.e., elderly (r²=0.532, NRMSE=1.24) and middle-aged (r²=0.411, NRMSE=2.07) adults, and for the combined population (r²=0.478, NRMSE=1.66). As expected, the defined healthcare utilization profile is a stronger predictor of expenditures for the elderly than for the middle-aged population. The derived RF variable importance measures indicate that the numbers of inpatient visits, physician visits, and prescription medications (in that order) are strong predictors. When examining each variable’s individual predictive performance (i.e., training an RF model on that variable alone), the number of emergency room visits is strongly correlated with expenditures for the elderly population. Further, the clusters (k=10) derived by applying HAC to the RF similarity measures provide a meaningful segmentation.
Conclusions: We present a novel method for healthcare utilization analysis that combines RF regression and HAC, with promising results. The learned clusters can be used to understand the utilization patterns of high utilizers, in support of a learning health system that leads to better health policy making and practice.