Distributed Hash Sketches: Scalable, Efficient, And Accurate Cardinality Estimation For Distributed Multisets

N. NTARMOS; P. TRIANTAFILLOU; G. WEIKUM

Abstract

Counting items in a distributed system, and estimating the cardinality of multisets in particular, is important for a large variety of applications and a fundamental building block for emerging Internet-scale information systems. Examples of such applications range from optimizing query access plans in peer-to-peer data sharing, to computing the significance (rank/score) of data items in distributed information retrieval. The general formal problem addressed in this article is computing the network-wide distinct number of items with some property (e.g., distinct files with file name containing "spiderman") where each node in the network holds an arbitrary subset, possibly overlapping the subsets of other nodes. The key requirements that a viable approach must satisfy are: (1) scalability towards very large network size, (2) efficiency regarding messaging overhead, (3) load balance of storage and access, (4) accuracy of the cardinality estimation, and (5) simplicity and easy integration in applications. This article contributes the DHS (Distributed Hash Sketches) method for this problem setting: a distributed, scalable, efficient, and accurate multiset cardinality estimator. DHS is based on hash sketches for probabilistic counting, but distributes the bits of each counter across network nodes in a judicious manner based on principles of Distributed Hash Tables, paying careful attention to fast access and aggregation as well as update costs. The article discusses various design choices, exhibiting tunable trade-offs between estimation accuracy, hop-count efficiency, and load distribution fairness. We further contribute a full-fledged, publicly available, open-source implementation of all our methods, and a comprehensive experimental evaluation for various settings.
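
The abstract says DHS builds on hash sketches for probabilistic counting. As a minimal sketch of that underlying idea only, the Python below implements a classic Flajolet-Martin (PCSA-style) hash sketch and shows why it suits overlapping subsets: duplicates set the same bits, so the bitwise OR of two nodes' sketches equals the sketch of their union. This is not the authors' DHS implementation; it does not model the DHT-based placement of counter bits across nodes. The names `HashSketch` and `hash64`, the choice of SHA-256, and the parameters (64 bitmaps of 32 bits) are assumptions made for illustration.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin bias-correction constant


def hash64(item: str) -> int:
    """Map an item to a pseudo-uniform 64-bit integer (SHA-256 is an assumption)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashSketch:
    """PCSA-style hash sketch: m bitmaps, each `width` bits wide."""

    def __init__(self, num_bitmaps: int = 64, width: int = 32) -> None:
        self.m = num_bitmaps
        self.width = width
        self.bitmaps = [0] * num_bitmaps

    def add(self, item: str) -> None:
        h = hash64(item)
        index = h % self.m                        # which bitmap to update
        rest = (h // self.m) % (1 << self.width)  # remaining hash bits
        if rest == 0:
            return  # no 1 bit within `width`; nothing to record
        position = (rest & -rest).bit_length() - 1  # index of lowest set bit
        self.bitmaps[index] |= 1 << position

    def merge(self, other: "HashSketch") -> "HashSketch":
        """Bitwise OR: the result is the sketch of the union of both inputs."""
        if (self.m, self.width) != (other.m, other.width):
            raise ValueError("sketches must share parameters")
        merged = HashSketch(self.m, self.width)
        merged.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return merged

    def estimate(self) -> float:
        """Estimate the distinct count from the mean lowest-zero-bit position."""
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while r < self.width and (bitmap >> r) & 1:
                r += 1
            total += r
        return self.m / PHI * 2 ** (total / self.m)


if __name__ == "__main__":
    # Two hypothetical nodes holding overlapping sets of file names.
    node_a, node_b = HashSketch(), HashSketch()
    for i in range(15000):
        node_a.add(f"file-{i}")
    for i in range(10000, 20000):
        node_b.add(f"file-{i}")
    print(round(node_a.merge(node_b).estimate()))  # estimate of 20000 distinct items
```

With m bitmaps the standard error of this estimator is roughly 0.78/sqrt(m), so more bitmaps buy accuracy at the cost of more bits to store and collect, which is one instance of the accuracy versus cost trade-off the abstract mentions.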

Bibliographic details

Source
ACM Transactions on Computer Systems | 2009, No. 1 | pp. 49-101 | 53 pages
Authors
N. NTARMOS; P. TRIANTAFILLOU; G. WEIKUM
Author affiliation

R.A. Computer Technology Institute and Computer Engineering and Informatics Department, University of Patras, 26500 Rio, Patras, Greece

Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology
Keywords
distributed estimation; distributed information systems; distributed cardinality estimation; distributed data summary structures; hash sketches; peer-to-peer networks and systems

Similar literature

1. Fixed Interval Nodes Estimation: An accurate and low cost algorithm to estimate the number of nodes in Distributed Hash Tables [J]. Bonnaire X. Information Sciences: An International Journal. 2013
2. A gossip-based approach for Internet-scale cardinality estimation of XPath queries over distributed semistructured data [J]. Vasil Slavov, Praveen Rao. The VLDB Journal. 2014, No. 1
3. Distributed Hashing for Scalable Multicast in Wireless Ad Hoc Networks [J]. Das Saumitra M., Pucha Himabindu, Hu Y. Charlie. IEEE Transactions on Parallel and Distributed Systems. 2008, No. 3
4. A tool for Internet-scale cardinality estimation of XPath queries over distributed semistructured data [C]. Slavov Vasil, Katib Anas, Rao Praveen. IEEE International Conference on Data Engineering. 2014
5. PIER: Internet scale P2P query processing with distributed hash tables [D]. Huebsch, Ryan Jay. 2008
6. The Algorithms of Distributed Learning and Distributed Estimation about Intelligent Wireless Sensor Network [O]. Fuxiao Tan. 2020
7. Towards Intelligent Distributed Data Systems for Scalable Efficient and Accurate Analytics [O]. Peter Triantafillou. 2018
