首页> 外文会议>IEEE International Conference on e-Science >DOT-K: A distributed online top-K elements algorithm using extreme value statistics
【24h】

DOT-K: A distributed online top-K elements algorithm using extreme value statistics

机译:DOT-K:使用极值统计的分布式在线top-K元素算法

获取原文

摘要

Extremely large (peta-scale) data collections are generally partitioned into millions of containers (disks/volumes/files) which are essentially unmovable due to their aggregate size. They are stored over a large distributed cloud of machines, with computing co-located with the data. Given this data layout, even simple tasks are difficult to perform and naive algorithms can easily become quite expensive. We present a one pass, communications-efficient technique useful for both estimating upper order quantiles and selecting the largest k elements across a highly distributed dataset or stream. Our novel approach draws its foundations from Extreme Value Statistics (EVS) to reason about the statistical relationships between the tail distributions of dataset partitions. The tail of each partition is fitted by the Generalized Pareto Distribution, which captures threshold exceedances. The obtained parameters are communicated to a central coordinator and used to estimate quantiles, or solve for a threshold above which there are approximately k elements. We discuss the computational and bandwidth costs of the algorithm, and demonstrate the accuracy of the method on both a variety of synthetic datasets and a PageRank dataset.
机译:通常将超大型(peta规模)的数据集合划分为数百万个容器(磁盘/卷/文件),这些容器由于其合计大小而基本上无法移动。它们存储在大型分布式计算机云中,并且计算与数据位于同一位置。在这种数据布局的情况下,即使是简单的任务也难以执行,而且幼稚的算法也很容易变得非常昂贵。我们提出一种单程,通信高效的技术,该技术可用于估计高阶分位数并在高度分布的数据集或流中选择最大的k个元素。我们的新颖方法从极值统计(EVS)的基础上推论出数据集分区的尾部分布之间的统计关系。每个分区的尾部由广义帕累托分布拟合,该分布捕获阈值超出量。所获得的参数被传送到中央协调器,并用于估计分位数,或求解一个阈值,在该阈值之上大约有k个元素。我们讨论了该算法的计算和带宽成本,并论证了该方法在各种综合数据集和PageRank数据集上的准确性。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号