Journal: Data Mining and Knowledge Discovery

A novel hash-based approach for mining frequent itemsets over data streams requiring less memory space

Abstract

Many applications now generate data as continuous streams. Because such streams must be processed as they arrive, and because the knowledge discovered in them can yield substantial benefits, mining over data streams has become an important problem. Many approaches for mining frequent itemsets over data streams have been proposed. These approaches usually consist of two procedures: continuously maintaining synopses of the data stream, and finding frequent itemsets from the synopses. However, most of them assume that the synopses fit in memory, and they ignore the fact that the information about non-frequent itemsets kept in the synopses can significantly degrade memory utilization. In this paper, we consider compressing the information of all itemsets into a fixed-size structure using a hash-based technique. This approach summarizes the whole data stream in a hash table, provides a novel technique for estimating the support counts of non-frequent itemsets, and keeps only the frequent itemsets to speed up the mining process. Memory utilization is thereby optimized. We present the correctness guarantee, error analysis, and parameter setting of the approach, and report a series of experiments showing its effectiveness and efficiency.
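To make the idea of a fixed-size hashed summary concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a single array of counters indexed by a CRC32 hash of the itemset, and it only counts itemsets up to a small size. The class name `HashedItemsetCounter`, its parameters `num_buckets` and `max_itemset_size`, and the choice of hash function are all hypothetical choices made for illustration. Because distinct itemsets can collide in the same bucket, each estimate is an upper bound on the true support count.

```python
import zlib
from itertools import combinations


class HashedItemsetCounter:
    """Illustrative fixed-size summary of itemset counts (hypothetical design)."""

    def __init__(self, num_buckets=1024, max_itemset_size=2):
        self.num_buckets = num_buckets            # memory stays fixed at this size
        self.max_itemset_size = max_itemset_size  # assumption: cap enumerated subsets
        self.buckets = [0] * num_buckets
        self.num_transactions = 0

    def _bucket(self, itemset):
        # Sort the items so that the same itemset always maps to the same bucket.
        key = "\x1f".join(sorted(itemset)).encode("utf-8")
        return zlib.crc32(key) % self.num_buckets

    def add(self, transaction):
        """Fold one transaction (an iterable of string items) into the table."""
        items = sorted(set(transaction))
        self.num_transactions += 1
        for k in range(1, self.max_itemset_size + 1):
            for subset in combinations(items, k):
                self.buckets[self._bucket(subset)] += 1

    def estimate(self, itemset):
        """Return an upper bound on the support count of the itemset."""
        return self.buckets[self._bucket(itemset)]

    def is_possibly_frequent(self, itemset, min_support):
        """True if the estimate reaches the fraction min_support of the stream."""
        return self.estimate(itemset) >= min_support * self.num_transactions


# Usage: feed a small stream and query the estimated supports.
stream = [["a", "b"], ["a", "c"], ["a", "b", "c"]]
counter = HashedItemsetCounter(num_buckets=64)
for transaction in stream:
    counter.add(transaction)
print(counter.estimate({"a"}), counter.estimate({"a", "b"}))
```

In this sketch memory use depends only on `num_buckets`, not on how many distinct itemsets the stream contains. The cost is overestimation from collisions, which is why an error analysis and a careful choice of table size matter for any approach of this kind.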