【24h】

Diamond in the rough

机译:钻石原石

获取原文

摘要

Data items archived in data warehouses or those that arrive online as streams typically have attributes which take values from multiple hierarchies (e.g., time and geographic location; source and destination IP addresses). Providing an aggregate view of such data is important to summarize, visualize, and analyze. We develop the aggregate view based on certain hierarchically organized sets of large-valued regions ("heavy hitters"). Such Hierarchical Heavy Hitters (HHHs) were previously introduced as a crucial aggregation technique in one dimension. In order to analyze the wider range of data warehousing applications and realistic IP data streams, we generalize this problem to multiple dimensions.We identify and study two variants of HHHs for multi-dimensional data, namely the "overlap" and "split" cases, depending on how an aggregate computed for a child node in the multi-dimensional hierarchy is propagated to its parent element(s). For data warehousing applications, we present offline algorithms that take multiple passes over the data and produce the exact HHHs. For data stream applications, we present online algorithms that find approximate HHHs in one pass, with provable accuracy guarantees.We show experimentally, using real and synthetic data, that our proposed online algorithms yield outputs which are very similar (virtually identical, in many cases) to their offline counterparts. The lattice property of the product of hierarchical dimensions ("diamond") is crucially exploited in our online algorithms to track approximate HHHs using only a small, fixed number of statistics per candidate node, regardless of the number of dimensions.
机译:数据仓库中存档的数据项或以流形式联机到达的数据项通常具有从多个层次结构中获取值的属性(例如,时间和地理位置;源IP地址和目标IP地址)。提供此类数据的汇总视图对于汇总,可视化和分析非常重要。我们基于某些大价值区域(“重击手”)的分层组织集合来开发汇总视图。以前,这种分层的重击手(HHH)是一种重要的聚集技术,在一维中被引入。为了分析更广泛的数据仓库应用程序和现实的IP数据流,我们将此问题归纳为多个维度。我们确定并研究了多维数据的HHH的两个变体,即“重叠”和“拆分”情况,具体取决于如何将为多维层次结构中的子节点计算的聚合传播到其父元素。对于数据仓库应用程序,我们提出了离线算法,这些算法对数据进行多次传递并产生准确的HHH。对于数据流应用程序,我们提出了一种在线算法,可以在一遍中找到近似的HHH,并具有可证明的准确性保证。我们通过使用实数和合成数据实验证明了我们提出的在线算法产生的输出非常相似(在许多情况下实际上是相同的) )给他们的线下同行。在我们的在线算法中,至关重要地利用了层次尺寸(“钻石”)的乘积的晶格属性,以使用近似的HHH来跟踪近似HHH,而每个候选节点仅使用少量的固定数量的统计信息,而与尺寸的数量无关。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号