Integrated Data Analysis for Early Warning of Lung Failure
Winner of the Geisinger Health Collider Project: Stage 2
The Outliers: Rebecca Barter and Shamindra Shrotriya
Department of Statistics, UC Berkeley
June 28, 2016
Abstract
Chronic obstructive pulmonary disease (COPD) is a major cause of mortality worldwide; approximately 12 million adults in the U.S. have been diagnosed with it. Our aim is to develop methods that effectively predict undiagnosed COPD among patients whose primary reason for hospitalization was pneumonia. Most existing algorithmic approaches to similar prediction problems use only clinical information. Our approach instead incorporates external environmental data sources that are not captured in the clinical records, through a process called “data blending”.
We also investigate several leading supervised machine learning algorithms, including Random Forest, Gradient Boosting Machines (GBM), and eXtreme Gradient Boosting (XGBoost), to improve COPD classification accuracy. We find that adding smoking and weather information significantly improves the predictive power of these algorithms for identifying COPD among pneumonia patients.
Keywords. COPD, pneumonia, Random Forest, Gradient Boosting Machines (GBM), eXtreme Gradient Boosting (XGBoost)
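The “data blending” idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the field names (`patient_id`, `region`, `admit_date`, `mean_temp_c`), the join key, and all values below are hypothetical, chosen only to show how an external weather table might be left-joined onto clinical records before model fitting.

```python
# Hypothetical clinical records: one row per pneumonia admission.
clinical = [
    {"patient_id": 1, "region": "R1", "admit_date": "2015-01-10", "smoker": True},
    {"patient_id": 2, "region": "R2", "admit_date": "2015-01-11", "smoker": False},
    {"patient_id": 3, "region": "R1", "admit_date": "2015-02-01", "smoker": True},
]

# Hypothetical external weather observations, keyed by region and date.
weather = [
    {"region": "R1", "date": "2015-01-10", "mean_temp_c": -4.0},
    {"region": "R2", "date": "2015-01-11", "mean_temp_c": -2.5},
]


def blend(clinical_rows, weather_rows):
    """Left-join weather features onto clinical rows by (region, date).

    Every clinical row is kept; rows with no matching weather
    observation get None for the weather feature.
    """
    lookup = {(w["region"], w["date"]): w for w in weather_rows}
    blended = []
    for row in clinical_rows:
        match = lookup.get((row["region"], row["admit_date"]))
        merged = dict(row)  # copy so the input rows are not mutated
        merged["mean_temp_c"] = match["mean_temp_c"] if match else None
        blended.append(merged)
    return blended


blended = blend(clinical, weather)
for row in blended:
    print(row)
```

A left join is assumed here so that no admission is dropped when external data is missing; how such gaps are handled (imputation, exclusion) would be a separate modeling decision.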
DOWNLOAD THE FULL PAPER (.PDF): report_ODBMS