Variable selection in social-environmental data: sparse regression and tree ensemble machine learning approaches

Elizabeth Handorf; Yinuo Yin; Michael Slifker; Shannon Lynch

Abstract

Social-environmental data obtained from the US Census is an important resource for understanding health disparities, but the full dataset is rarely used for analysis. A barrier to incorporating the full data is a lack of solid recommendations for variable selection, with researchers often hand-selecting a few variables. Thus, we evaluated the ability of empirical machine learning approaches to identify social-environmental factors having a true association with a health outcome. We compared several popular machine learning methods, including penalized regressions (e.g., lasso, elastic net) and tree ensemble methods. Via simulation, we assessed the methods' ability to identify census variables truly associated with binary and continuous outcomes while minimizing false positive results (10 true associations, 1000 total variables). We applied the most promising method to the full census data (p = 14,663 variables) linked to prostate cancer registry data (n = 76,186 cases) to identify social-environmental factors associated with advanced prostate cancer. In simulations, we found that elastic net identified many true-positive variables, while lasso provided good control of false positives. Using a combined measure of accuracy, sparse group lasso regression with groups formed by hierarchical clustering on Spearman's correlation performed best overall. Bayesian Additive Regression Trees outperformed the other tree ensemble methods, but not the sparse group lasso. In the full dataset, the sparse group lasso successfully identified a subset of variables, three of which replicated earlier findings. This analysis demonstrated the potential of empirical machine learning approaches to identify a small subset of census variables that have a true association with the outcome and that replicate across empirical methods. Sparse clustered regression models performed best, as they identified many true positive variables while controlling false positive discoveries.
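The grouping step mentioned in the abstract (hierarchical clustering based on Spearman's correlation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synthetic data, the distance 1 - |rho|, average linkage, the cut height of 0.5, and the helper name `spearman_groups` are all assumptions made for the example. The resulting group labels are the kind of input a sparse group lasso solver would take; fitting that model is left out here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr


def spearman_groups(X, cut=0.5):
    """Assign each column of X to a group via hierarchical clustering.

    Distance between two variables is 1 - |Spearman rho| (an assumption);
    `cut` is a hypothetical height at which the dendrogram is cut.
    """
    rho, _ = spearmanr(X)               # (p, p) matrix when X has > 2 columns
    dist = 1.0 - np.abs(rho)
    np.fill_diagonal(dist, 0.0)         # remove floating-point noise on the diagonal
    dist = (dist + dist.T) / 2.0        # enforce exact symmetry
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=cut, criterion="distance")


# Synthetic stand-in for census variables (assumption):
# 3 latent factors, each observed through 4 noisy copies.
rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=(n, 3))
X = np.repeat(latent, 4, axis=1) + 0.3 * rng.normal(size=(n, 12))

groups = spearman_groups(X)
print(groups)  # one integer group label per column of X
```

Using a rank-based correlation keeps the grouping insensitive to skewed or nonlinearly scaled variables, which is one plausible reason to prefer it for census-style data.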

