首页> 外文期刊>Statistics and computing >Learning Bayesian networks from big data with greedy search: computational complexity and efficient implementation
【24h】

Learning Bayesian networks from big data with greedy search: computational complexity and efficient implementation

机译:使用贪婪搜索的大数据学习贝叶斯网络:计算复杂性和高效实施

获取原文
获取原文并翻译 | 示例

摘要

Learning the structure of Bayesian networks from data is known to be a computationally challenging, NP-hard problem. The literature has long investigated how to perform structure learning from data containing large numbers of variables, following a general interest in high-dimensional applications ("small n, large p") in systems biology and genetics. More recently, data sets with large numbers of observations (the so-called "big data") have become increasingly common; and these data sets are not necessarily high-dimensional, sometimes having only a few tens of variables depending on the application. We revisit the computational complexity of Bayesian network structure learning in this setting, showing that the common choice of measuring it with the number of estimated local distributions leads to unrealistic time complexity estimates for the most common class of score-based algorithms, greedy search. We then derive more accurate expressions under common distributional assumptions. These expressions suggest that the speed of Bayesian network learning can be improved by taking advantage of the availability of closed-form estimators for local distributions with few parents. Furthermore, we find that using predictive instead of in-sample goodness-of-fit scores improves speed; and we confirm that it improves the accuracy of network reconstruction as well, as previously observed by Chickering and Heckerman (Stat Comput 10: 55-62, 2000). We demonstrate these results on large real-world environmental and epidemiological data; and on reference data sets available from public repositories.
机译:了解从数据的贝叶斯网络的结构是一个计算上的挑战,np难题。在系统生物学和遗传学中的一般兴趣之后,文献已经长期调查了如何从包含大量变量的数据进行结构学习。在系统生物学和遗传学中的“小型n,大p”)。最近,具有大量观察的数据集(所谓的“大数据”)变得越来越普遍;这些数据集不一定是高维的,有时根据应用程序仅具有几十个变量。我们重新审视贝叶斯网络结构在此设置中学习的计算复杂性,显示使用估计的本地分布的数量测量的常见选择导致了对最常见的基于算法,​​贪婪搜索的不切实际的时间复杂性估计。然后我们在共同的分布假设下获得更准确的表达。这些表达表明,通过利用少数家长的局部分布的封闭形式估计的可用性,可以提高贝叶斯网络学习的速度。此外,我们发现使用预测而不是样本的拟合良好分数提高了速度;我们确认它也提高了网络重建的准确性,如先前通过巧克力和Heckerman观察到的(统计计算机10:55-62,2000)。我们展示了大型现实世界环境和流行病学数据的结果;以及公共存储库可获得的参考数据集。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号