Journal: ACM Transactions on Knowledge Discovery from Data

Tiered Sampling: An Efficient Method for Counting Sparse Motifs in Massive Graph Streams



Abstract

We introduce TIERED SAMPLING, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass over the data and uses a memory of fixed size M, which can be orders of magnitude smaller than the number of edges. Our methods address the challenging task of counting sparse motifs (sub-graph patterns) that have a low probability of appearing in a sample of M edges in the graph, which is the maximum amount of data available to the algorithms in each step. To obtain an unbiased and low-variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, the other layers are reservoir samples of sub-structures of the desired motif. By storing the more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate. While we focus on the design and analysis of algorithms for counting 4-cliques, we present a method that generalizes TIERED SAMPLING to obtain high-quality estimates of the number of occurrences of any sub-graph of interest, while reducing the analysis effort due to specific properties of the pattern of interest. We present a complete analysis and an extensive experimental evaluation of our proposed method using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations of the number of 4- and 5-cliques in large graphs using a very limited amount of memory, significantly outperforming the single edge sample approach for counting sparse motifs in large-scale graphs.
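
The memory partition described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the names `Reservoir`, `tiered_sample`, `m_edges`, and `m_triangles` are hypothetical, the choice of triangles as the stored sub-structure (a natural one for 4-cliques) is an assumption, and the unbiased estimator itself is omitted because the abstract does not specify it. The sketch only shows a base tier of sampled edges feeding a second tier of sampled sub-structures in a single pass, assuming a stream without duplicate edges or self-loops.

```python
import random


class Reservoir:
    """Fixed-capacity uniform sample of a stream (standard reservoir sampling)."""

    def __init__(self, capacity, rng):
        self.capacity = capacity
        self.rng = rng
        self.items = []
        self.seen = 0  # number of items offered so far

    def offer(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / seen,
            # evicting a uniformly chosen stored item.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item


def tiered_sample(edge_stream, m_edges, m_triangles, seed=0):
    """Single pass over the stream; total memory is m_edges + m_triangles (assumed split)."""
    rng = random.Random(seed)
    edge_tier = Reservoir(m_edges, rng)          # base tier: sampled edges
    triangle_tier = Reservoir(m_triangles, rng)  # second tier: sampled triangles

    for u, v in edge_stream:
        # Neighbours of u and v among the currently sampled edges.
        nbrs_u, nbrs_v = set(), set()
        for a, b in edge_tier.items:
            if a == u:
                nbrs_u.add(b)
            elif b == u:
                nbrs_u.add(a)
            if a == v:
                nbrs_v.add(b)
            elif b == v:
                nbrs_v.add(a)
        # Each common neighbour w closes a triangle {u, v, w} with the new edge.
        for w in nbrs_u & nbrs_v:
            triangle_tier.offer(tuple(sorted((u, v, w))))
        edge_tier.offer((u, v))

    return edge_tier, triangle_tier
```

For example, `tiered_sample([(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (3, 4)], 10, 10)` streams the six edges of a 4-clique; because both tiers are larger than the stream, nothing is evicted and the triangle tier ends up holding all four triangles of the clique. With smaller tiers, each tier instead holds a uniform sample of what was offered to it.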
