Home > Foreign-language conferences > 2010 IEEE International Conference on Cluster Computing > Computing Contingency Statistics in Parallel: Design Trade-Offs and Limiting Cases

Computing Contingency Statistics in Parallel: Design Trade-Offs and Limiting Cases



Abstract

Statistical analysis is typically used to reduce the dimensionality of, and infer meaning from, data. A key challenge for any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, from which numerous derived statistics such as joint and marginal probabilities, point-wise mutual information, information entropy, and χ² independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference from moment-based statistics (which we discussed in [1]), where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs we made to implement the computation of contingency tables in parallel. We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speed-up and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.
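To make the map-reduce idea in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It assumes each processor holds a list of (x, y) observations, builds a local contingency table, and that the tables are then merged by adding counts. The function names `local_table`, `merge_tables`, and `derived_statistics` are hypothetical, and the reduce step runs serially here rather than over a real communication layer.

```python
import math
from collections import Counter
from functools import reduce


def local_table(pairs):
    """Map step (assumed): count (x, y) occurrences in one processor's data."""
    return Counter(pairs)


def merge_tables(a, b):
    """Reduce step (assumed): add counts key by key.

    The number of distinct keys is what must be communicated, so it grows
    with the table size instead of staying constant as for moments.
    """
    return a + b


def derived_statistics(table):
    """Derive joint probabilities, PMI, entropy and chi-squared from counts."""
    n = sum(table.values())
    row, col = Counter(), Counter()  # marginal counts of x and of y
    for (x, y), c in table.items():
        row[x] += c
        col[y] += c
    joint = {xy: c / n for xy, c in table.items()}
    # Point-wise mutual information: log( p(x,y) / (p(x) p(y)) )
    pmi = {(x, y): math.log(c * n / (row[x] * col[y]))
           for (x, y), c in table.items()}
    # Information entropy of the joint distribution, in nats
    entropy = -sum(p * math.log(p) for p in joint.values())
    # Pearson chi-squared statistic over all cells, including empty ones
    chi2 = 0.0
    for x in row:
        for y in col:
            expected = row[x] * col[y] / n
            chi2 += (table.get((x, y), 0) - expected) ** 2 / expected
    return {"n": n, "joint": joint, "pmi": pmi,
            "entropy": entropy, "chi2": chi2}


# Usage: two "processors", each with its own share of the observations.
chunks = [
    [("a", "u"), ("a", "u"), ("b", "v")],
    [("b", "v"), ("a", "v"), ("b", "u")],
]
table = reduce(merge_tables, (local_table(c) for c in chunks), Counter())
stats = derived_statistics(table)
```

In this sketch the merged table has one entry per distinct (x, y) pair. When nearly every observation is distinct, the table is about as large as the data itself, which is plausibly the quasi-diffuse limiting case the abstract warns about.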
