Journal: ACM Transactions on Knowledge Discovery from Data

Bayesian Variable Selection in Linear Regression in One Pass for Large Datasets

Abstract

Bayesian models are generally computed with Markov chain Monte Carlo (MCMC) methods. The main disadvantage of MCMC methods is the large number of iterations they need to sample the posterior distributions of model parameters, especially for large datasets. Variable selection, meanwhile, remains a challenging problem because of its combinatorial search space, and Bayesian models are a promising solution to it. In this work, we study how to accelerate Bayesian model computation for variable selection in linear regression. We propose a fast algorithm based on the Gibbs sampler, a widely used MCMC method, that incorporates several optimizations. We use a Zellner prior for the regression coefficients, an improper prior on the variance, and a conjugate Gaussian prior distribution, which together enable dataset summarization in one pass by exploiting an augmented set of sufficient statistics. Thereafter, the algorithm iterates in main memory. Sufficient statistics are indexed with a sparse binary vector to efficiently compute matrix projections based on the selected variables. The probabilities of discovered variable subsets, for both selecting and discarding each variable, are stored in a hash table for fast retrieval in future iterations. We study how to integrate our algorithm into a Database Management System (DBMS), exploiting aggregate User-Defined Functions for parallel data summarization and stored procedures to manipulate matrices with arrays. An experimental evaluation with real datasets measures accuracy and time performance, comparing our DBMS-based algorithm with the R package. Our algorithm is shown to produce accurate results, scale linearly with dataset size, and run orders of magnitude faster than the R package.
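
The pipeline described in the abstract (one-pass summarization, subset indexing with a binary vector, and a hash table of subset probabilities) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `summarize`, `solve`, `log_marginal`, and `gibbs` are hypothetical. The score used is the standard closed-form marginal likelihood for Zellner's g-prior with an improper prior on the intercept and variance, which may differ from the paper's exact formulation. A uniform prior over variable subsets is also an assumption, and plain Python lists stand in for the DBMS arrays and aggregate UDFs.

```python
import math
import random


def summarize(rows):
    """One pass over rows (x_1, ..., x_d, y): accumulate G = Z^T Z with z = (1, x, y).
    G[0][0] is n, G[0][j] holds sums, and the rest holds cross-products."""
    G = None
    for row in rows:
        z = [1.0] + [float(v) for v in row]
        if G is None:
            G = [[0.0] * len(z) for _ in z]
        for a, za in enumerate(z):
            for b, zb in enumerate(z):
                G[a][b] += za * zb
    return G


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting
    (assumes a is non-singular)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def log_marginal(G, gamma, g, cache):
    """Log marginal likelihood (up to a constant) of the subset flagged by the
    binary vector gamma, computed from G alone and memoized in a hash table."""
    key = tuple(gamma)
    if key not in cache:
        n, y = G[0][0], len(G) - 1
        idx = [j + 1 for j, on in enumerate(gamma) if on]  # selected rows/columns of G

        def cov(a, b):  # centered cross-product of columns a and b
            return G[a][b] - G[0][a] * G[0][b] / n

        r2 = 0.0
        if idx:
            sxy = [cov(a, y) for a in idx]
            beta = solve([[cov(a, b) for b in idx] for a in idx], sxy)
            r2 = sum(bi * si for bi, si in zip(beta, sxy)) / cov(y, y)
        cache[key] = (0.5 * (n - 1 - len(idx)) * math.log1p(g)
                      - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2)))
    return cache[key]


def gibbs(G, iterations, g, seed=0):
    """Gibbs sampler over inclusion indicators; returns the fraction of sweeps
    in which each variable was included, plus the cache."""
    rng = random.Random(seed)
    d = len(G) - 2
    gamma, counts, cache = [0] * d, [0] * d, {}
    for _ in range(iterations):
        for j in range(d):
            gamma[j] = 1
            l1 = log_marginal(G, gamma, g, cache)  # score with variable j selected
            gamma[j] = 0
            l0 = log_marginal(G, gamma, g, cache)  # score with variable j discarded
            diff = l0 - l1
            p1 = 0.0 if diff > 700 else 1.0 / (1.0 + math.exp(diff))
            gamma[j] = 1 if rng.random() < p1 else 0
        for j in range(d):
            counts[j] += gamma[j]
    return [c / iterations for c in counts], cache


if __name__ == "__main__":
    data_rng = random.Random(42)
    rows = []
    for _ in range(500):
        x1, x2 = data_rng.gauss(0, 1), data_rng.gauss(0, 1)
        rows.append((x1, x2, 2.0 * x1 + data_rng.gauss(0, 1)))  # y depends on x1 only
    G = summarize(rows)
    probs, cache = gibbs(G, iterations=100, g=float(len(rows)))
    print("inclusion frequencies:", probs, "cached subsets:", len(cache))
```

After `summarize` returns, the sampler never touches the raw rows again: every subset score is derived from the small matrix `G`, and a subset that has been scored once is looked up rather than recomputed. Setting `g` to the number of rows is one common choice for the g-prior and is used here only for illustration.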
