首页> 外文期刊>ACM transactions on computer systems >Distributed Hash Sketches: Scalable, Efficient, And Accurate Cardinality Estimation For Distributed Multisets
【24h】

Distributed Hash Sketches: Scalable, Efficient, And Accurate Cardinality Estimation For Distributed Multisets

机译:分布式哈希草图:分布式多集的可扩展,高效且准确的基数估计

获取原文

摘要

Counting items in a distributed system, and estimating the cardinality of multisets in particular, is important for a large variety of applications and a fundamental building block for emerging Internet-scale information systems. Examples of such applications range from optimizing query access plans in peer-to-peer data sharing, to computing the significance (rank/score) of data items in distributed information retrieval. The general formal problem addressed in this article is computing the network-wide distinct number of items with some property (e.g., distinct files with file name containing "spiderman") where each node in the network holds an arbitrary subset, possibly overlapping the subsets of other nodes. The key requirements that a viable approach must satisfy are: (1) scalability towards very large network size, (2) efficiency regarding messaging overhead, (3) load balance of storage and access, (4) accuracy of the cardinality estimation, and (5) simplicity and easy integration in applications. This article contributes the DHS (Distributed Hash Sketches) method for this problem setting: a distributed, scalable, efficient, and accurate multiset cardinality estimator. DHS is based on hash sketches for probabilistic counting, but distributes the bits of each counter across network nodes in a judicious manner based on principles of Distributed Hash Tables, paying careful attention to fast access and aggregation as well as update costs. The article discusses various design choices, exhibiting tunable trade-offs between estimation accuracy, hop-count efficiency, and load distribution fairness. We further contribute a full-fledged, publicly available, open-source implementation of all our methods, and a comprehensive experimental evaluation for various settings.
机译:对分布式系统中的项目进行计数,尤其是估计多集的基数,对于各种各样的应用程序和新兴的Internet规模的信息系统的基本构建块而言非常重要。此类应用程序的示例范围从优化对等数据共享中的查询访问计划到计算分布式信息检索中数据项的重要性(等级/得分)。本文解决的一般形式问题是计算网络范围内具有某些属性的项目的不同数量(例如,文件名包含“ spiderman”的不同文件),其中网络中的每个节点都拥有一个任意子集,可能与其他节点。一种可行的方法必须满足的关键要求是:(1)面向非常大的网络规模的可伸缩性;(2)消息传递开销的效率;(3)存储和访问的负载平衡;(4)基数估计的准确性;以及( 5)简单易用的应用程序集成。本文为解决此问题提供了DHS(分布式哈希草图)方法:一种分布式,可伸缩,高效且准确的多集基数估计器。 DHS基于用于概率计数的哈希草图,但是基于分布式哈希表的原则,明智地在网络节点之间分配每个计数器的位,并特别注意快速访问和聚合以及更新成本。本文讨论了各种设计选择,并在估计精度,跳数效率和负载分配公平性之间展现了可取的折衷方案。我们还将为我们所有方法的成熟,公开可用的开源实现做出贡献,并对各种环境进行全面的实验评估。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号