Bayesian Variable Selection in Linear Regression in One Pass for Large Datasets

CARLOS ORDONEZ; CARLOS GARCIA-ALVARADO; VEERABHADARAN BALADANDAYUTHAPANI

Abstract

Bayesian models are generally computed with Markov Chain Monte Carlo (MCMC) methods. The main disadvantage of MCMC methods is the large number of iterations they need to sample the posterior distributions of model parameters, especially for large datasets. Variable selection, in turn, remains a challenging problem because of its combinatorial search space, for which Bayesian models are a promising solution. In this work, we study how to accelerate Bayesian model computation for variable selection in linear regression. We propose a fast algorithm based on the Gibbs sampler, a widely used MCMC method, that incorporates several optimizations. We use a Zellner prior for the regression coefficients, an improper prior on the variance, and a conjugate Gaussian prior distribution, which together enable dataset summarization in one pass by exploiting an augmented set of sufficient statistics. Thereafter, the algorithm iterates in main memory. Sufficient statistics are indexed with a sparse binary vector to efficiently compute matrix projections based on the selected variables. The probabilities of discovered variable subsets, obtained when selecting and discarding each variable, are stored in a hash table for fast retrieval in future iterations. We study how to integrate our algorithm into a Database Management System (DBMS), exploiting aggregate User-Defined Functions for parallel data summarization and stored procedures to manipulate matrices with arrays. An experimental evaluation with real datasets assesses accuracy and time performance, comparing our DBMS-based algorithm with the R package. Our algorithm is shown to produce accurate results, scale linearly with dataset size, and run orders of magnitude faster than the R package.
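The pipeline described in the abstract (one-pass summarization, then in-memory Gibbs iterations with a hash table of visited subsets) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `summarize`, `log_marginal`, and `gibbs_select`, the default choice g = n, the uniform prior over subsets, and the closed-form g-prior marginal likelihood used here are all assumptions made for illustration; the paper's exact formulation and its DBMS integration may differ.

```python
import numpy as np

def summarize(X, y, chunk=1000):
    """Single pass over the rows: n, L = sum of z, Q = sum of z z^T, with z = [x, y]."""
    d = X.shape[1]
    n, L, Q = 0, np.zeros(d + 1), np.zeros((d + 1, d + 1))
    for start in range(0, X.shape[0], chunk):
        Z = np.column_stack([X[start:start + chunk], y[start:start + chunk]])
        n += Z.shape[0]
        L += Z.sum(axis=0)
        Q += Z.T @ Z
    return n, L, Q

def log_marginal(gamma, n, S, g):
    """Log marginal likelihood of subset gamma, up to an additive constant.

    Assumes a Zellner g-prior on the coefficients, a flat prior on the
    intercept, and an improper 1/sigma^2 prior on the variance.
    S holds the centered cross-products of [X, y].
    """
    idx = np.flatnonzero(gamma)          # gamma is the binary inclusion vector
    r2 = 0.0
    if idx.size > 0:
        Sxx = S[np.ix_(idx, idx)]        # projection onto the selected variables
        Sxy = S[idx, -1]
        r2 = float(Sxy @ np.linalg.solve(Sxx, Sxy) / S[-1, -1])
        r2 = min(max(r2, 0.0), 1.0)      # guard against round-off
    return (0.5 * (n - 1 - idx.size) * np.log1p(g)
            - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2)))

def gibbs_select(n, L, Q, iters=200, g=None, seed=0):
    """Gibbs sampler over inclusion indicators; never touches the raw data."""
    rng = np.random.default_rng(seed)
    d = Q.shape[0] - 1
    g = float(n) if g is None else g     # assumption: unit-information choice g = n
    S = Q - np.outer(L, L) / n           # centered cross-products, computed once
    gamma = np.zeros(d, dtype=np.uint8)
    cache = {}                           # hash table: subset -> log marginal

    def score(v):
        key = v.tobytes()
        if key not in cache:
            cache[key] = log_marginal(v, n, S, g)
        return cache[key]

    counts = np.zeros(d)
    for _ in range(iters):
        for j in range(d):
            gamma[j] = 1
            l1 = score(gamma)            # variable j selected
            gamma[j] = 0
            l0 = score(gamma)            # variable j discarded
            # Uniform prior over subsets assumed, so only the likelihoods matter.
            p1 = 1.0 / (1.0 + np.exp(np.clip(l0 - l1, -700.0, 700.0)))
            gamma[j] = 1 if rng.random() < p1 else 0
        counts += gamma
    return counts / iters, cache         # inclusion frequencies (no burn-in removed)
```

Calling `summarize(X, y)` once and then `gibbs_select(n, L, Q)` returns per-variable inclusion frequencies. Every subset score is derived from `n`, `L`, and `Q` alone, which is what makes a single scan of the dataset sufficient in this sketch, and the dictionary keyed by the binary vector plays the role of the hash table of previously visited subsets.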

Bibliographic details

Source
ACM Transactions on Knowledge Discovery from Data | 2014, Issue 1 | pp. 3.1-3.14 | 14 pages
Authors
CARLOS ORDONEZ; CARLOS GARCIA-ALVARADO; VEERABHADARAN BALADANDAYUTHAPANI
Author affiliations
University of Houston; University of Houston; UT MD Anderson Cancer Center
Original format: PDF
Language: English
Keywords
Sufficient statistics; variable selection; on-line algorithm; MCMC; Gibbs sampler

Similar literature

1. BayesSUR: An R Package for High-Dimensional Multivariate Bayesian Variable and Covariance Selection in Linear Regression [J]. Zhi Zhao, Marco Banterle, Leonardo Bottolo. Journal of Statistical Software. 2021, Issue 11
2. A novel Bayesian approach for variable selection in linear regression models [J]. Computational Statistics & Data Analysis. 2020
3. Bayesian quantile regression and variable selection for partial linear single-index model: Using free knot spline [J]. Yu Yang, Zou Zhihong, Wang Shanshan. Communications in Statistics. 2019, Issue 3a5
4. Bayesian Variable Selection for Multi-response Linear Regression [C]. Wan-Ping Chen, Ying Nian Wu, Ray-Bin Chen. Technologies and Applications of Artificial Intelligence. 2014
5. An Information Based Optimal Subdata Selection Algorithm for Big Data Linear Regression and a Suitable Variable Selection Algorithm [D]. Zheng, Yi. 2017
6. Query Large Scale Microarray Compendium Datasets Using a Model-Based Bayesian Approach with Variable Selection [O]. Ming Hu, Zhaohui S. Qin. 2009
7. Variable Selection in a Bayesian Linear Regression Model via Generalized Bayesian Information Criterion [O]. KABE Satoshi, KANAZAWA Yuichiro. 2014
