Statistics and Computing

Learning Bayesian networks from big data with greedy search: computational complexity and efficient implementation

Abstract

Learning the structure of Bayesian networks from data is known to be a computationally challenging, NP-hard problem. The literature has long investigated how to perform structure learning from data containing large numbers of variables, following a general interest in high-dimensional applications ("small n, large p") in systems biology and genetics. More recently, data sets with large numbers of observations (the so-called "big data") have become increasingly common; and these data sets are not necessarily high-dimensional, sometimes having only a few tens of variables depending on the application. We revisit the computational complexity of Bayesian network structure learning in this setting, showing that the common choice of measuring it with the number of estimated local distributions leads to unrealistic time complexity estimates for the most common class of score-based algorithms, greedy search. We then derive more accurate expressions under common distributional assumptions. These expressions suggest that the speed of Bayesian network learning can be improved by taking advantage of the availability of closed-form estimators for local distributions with few parents. Furthermore, we find that using predictive instead of in-sample goodness-of-fit scores improves speed; and we confirm that it improves the accuracy of network reconstruction as well, as previously observed by Chickering and Heckerman (Stat Comput 10: 55-62, 2000). We demonstrate these results on large real-world environmental and epidemiological data; and on reference data sets available from public repositories.
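The greedy search referred to above is standard score-based hill climbing over directed acyclic graphs with a decomposable score, so each candidate arc change only requires re-estimating the local distribution of the node whose parent set changes. The sketch below illustrates this for a Gaussian Bayesian network with a BIC local score fitted in closed form by least squares; it is a minimal illustration under assumed conventions (arc additions and deletions only, no arc reversals or tabu list) and not the authors' implementation.

# A minimal sketch, not the authors' implementation: score-based greedy search
# (hill climbing) for a Gaussian Bayesian network with a decomposable BIC score.
# Because the score decomposes over nodes, each candidate arc change only
# requires re-estimating one local distribution, and with few parents that
# estimate is available in closed form via least squares.
import itertools

import numpy as np


def local_bic(data, child, parents):
    # BIC of the linear-Gaussian local distribution of `child` given `parents`,
    # fitted in closed form by ordinary least squares.
    n = data.shape[0]
    y = data[:, child]
    X = np.column_stack([np.ones(n)] + [data[:, j] for j in sorted(parents)])
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = max(resid @ resid / n, 1e-12)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    n_params = X.shape[1] + 1                      # coefficients + variance
    return loglik - 0.5 * n_params * np.log(n)


def creates_cycle(parents, frm, to):
    # Adding the arc frm -> to creates a cycle iff `to` is an ancestor of `frm`.
    stack, seen = [frm], set()
    while stack:
        node = stack.pop()
        if node == to:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False


def hill_climb(data):
    # Greedy search over single-arc additions and deletions (no reversals),
    # always taking the change with the largest score gain.
    p = data.shape[1]
    parents = {i: set() for i in range(p)}
    cache = {i: local_bic(data, i, parents[i]) for i in range(p)}
    while True:
        best_delta, best_move = 0.0, None
        for frm, to in itertools.permutations(range(p), 2):
            if frm in parents[to]:                      # candidate deletion
                new_parents = parents[to] - {frm}
            elif not creates_cycle(parents, frm, to):   # candidate addition
                new_parents = parents[to] | {frm}
            else:
                continue
            delta = local_bic(data, to, new_parents) - cache[to]
            if delta > best_delta:
                best_delta, best_move = delta, (to, new_parents)
        if best_move is None:
            return parents
        to, new_parents = best_move
        parents[to] = new_parents
        cache[to] = local_bic(data, to, new_parents)


if __name__ == "__main__":
    # Illustrative use on simulated data from the chain X0 -> X1 -> X2.
    rng = np.random.default_rng(0)
    x0 = rng.normal(size=2000)
    x1 = 0.8 * x0 + rng.normal(size=2000)
    x2 = -0.5 * x1 + rng.normal(size=2000)
    print(hill_climb(np.column_stack([x0, x1, x2])))

A predictive variant of the score, as discussed in the abstract, would instead fit each local distribution on part of the data and score it by its log-likelihood on the held-out remainder rather than by the in-sample BIC used in this sketch.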
