Journal: Data Mining and Knowledge Discovery

Mining frequent itemsets over distributed data streams by continuously maintaining a global synopsis


Abstract

Mining frequent itemsets over data streams has attracted much research attention in recent years. We previously developed a hash-based approach for mining frequent itemsets over a single data stream. In this paper, we extend that approach to mine global frequent itemsets from a collection of data streams distributed across distinct remote sites. To speed up the mining process, we make the first attempt to address a new problem: continuously maintaining a global synopsis for the union of all the distributed streams. Mining results can then be produced on demand by directly processing the maintained global synopsis. Instead of collecting and processing all the data at a central server, which would leave the computation resources of the remote sites unused, the computation over the data streams is distributed. We propose a distributed computation framework consisting of two communication strategies and one merging operation. The communication strategies are designed around an accuracy guarantee on the mining results, and they determine when and what the remote sites should transmit to the central server (called the coordinator). The merging operation merges the information received from the remote sites into the global synopsis maintained at the coordinator. Together, the strategies and the merging operation achieve the goal of continuously maintaining the global synopsis. Based on the continuously maintained global synopsis, we propose a mining algorithm for finding global frequent itemsets. We also provide correctness guarantees for the communication strategies and the merging operation, and an analysis of the accuracy guarantee of the mining algorithm. Finally, a series of experiments on synthetic datasets and a real dataset shows the effectiveness and efficiency of the distributed computation framework.
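The abstract does not give the details of the synopsis, the two communication strategies, or the merging operation, so the following is only a toy sketch of the general coordinator/remote-site pattern it describes, not the paper's method. It rests on several assumptions: exact itemset counts stand in for the hash-based synopsis, a fixed batch size stands in for the accuracy-driven communication strategies, and merging is plain addition of counts. All names (Coordinator, RemoteSite, batch_size, max_len) are hypothetical.

```python
from collections import Counter
from itertools import combinations


class Coordinator:
    """Holds a toy global synopsis: merged itemset counts from all sites."""

    def __init__(self):
        self.counts = Counter()  # itemset (sorted tuple) -> merged count
        self.total = 0           # number of transactions reported so far

    def merge(self, delta_counts, delta_total):
        # Assumption: merging is simple addition of the per-site deltas.
        self.counts.update(delta_counts)
        self.total += delta_total

    def frequent(self, min_support):
        # Answer on demand from the synopsis alone, without raw data.
        threshold = min_support * self.total
        return {s for s, c in self.counts.items() if c >= threshold}


class RemoteSite:
    """Summarizes its local stream and ships deltas to the coordinator."""

    def __init__(self, coordinator, batch_size=2, max_len=2):
        self.coordinator = coordinator
        self.batch_size = batch_size  # assumption: fixed-size send trigger
        self.max_len = max_len        # longest itemset that is counted
        self.pending = Counter()
        self.pending_total = 0

    def observe(self, transaction):
        items = sorted(set(transaction))
        for k in range(1, self.max_len + 1):
            for itemset in combinations(items, k):
                self.pending[itemset] += 1
        self.pending_total += 1
        if self.pending_total >= self.batch_size:
            self.flush()

    def flush(self):
        # Send only what has accumulated since the last transmission.
        if self.pending_total:
            self.coordinator.merge(self.pending, self.pending_total)
            self.pending = Counter()
            self.pending_total = 0


# Two sites, each seeing its own stream of transactions.
coordinator = Coordinator()
site_a = RemoteSite(coordinator)
site_b = RemoteSite(coordinator)
for t in [["a", "b"], ["a", "c"]]:
    site_a.observe(t)
for t in [["a", "b"], ["b", "c"]]:
    site_b.observe(t)

result = coordinator.frequent(0.5)
```

In this sketch the coordinator never sees raw transactions, only per-site count deltas, which mirrors the abstract's point that queries are answered from the maintained synopsis. A real design would bound memory at each site and choose the send condition so that a stated error bound holds; neither is attempted here.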
