International Conference on Very Large Data Bases

A Performance Study of Three Disk-based Structures for Indexing and Querying Frequent Itemsets

Abstract

Frequent itemset mining is an important problem in data mining, and extensive effort has been devoted to developing efficient algorithms for it. However, little attention has been paid to managing the large collections of frequent itemsets that these algorithms produce for subsequent analysis and user exploration. In this paper, we study three structures for indexing and querying frequent itemsets: inverted files, signature files, and the CFP-tree. The first two have been widely used for indexing general set-valued data; we modify them to make them more suitable for indexing frequent itemsets. The CFP-tree is specially designed for storing frequent itemsets; we add a pruning technique based on length-2 frequent itemsets to make it more efficient for processing superset queries. We study the performance of the three structures in supporting five types of containment queries: exact match, subset search, superset search, immediate subset search, and immediate superset search. Our results show that no single structure outperforms the others for all five query types on all datasets, although the CFP-tree shows better overall performance than the other two structures.
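
To make the five containment queries concrete, here is a minimal Python sketch over a toy, in-memory collection. It is an illustration only, not the paper's disk-based implementation. The abstract does not define the query semantics precisely, so the sketch assumes that subset and superset search are non-strict (they include an exact match) and that "immediate" means differing from the query by exactly one item. All function names, the toy data, and the simple in-memory inverted file are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy collection of frequent itemsets (not taken from the paper).
ITEMSETS = [frozenset(s) for s in ("a", "b", "c", "ab", "ac", "bc", "abc")]


def exact_match(q):
    """Stored itemsets equal to the query."""
    return [s for s in ITEMSETS if s == q]


def subsets(q):
    """Stored itemsets contained in the query (assumed non-strict)."""
    return [s for s in ITEMSETS if s <= q]


def supersets(q):
    """Stored itemsets containing the query (assumed non-strict)."""
    return [s for s in ITEMSETS if s >= q]


def immediate_subsets(q):
    """Assumption: proper subsets with exactly one item fewer than the query."""
    return [s for s in ITEMSETS if s < q and len(s) == len(q) - 1]


def immediate_supersets(q):
    """Assumption: proper supersets with exactly one item more than the query."""
    return [s for s in ITEMSETS if s > q and len(s) == len(q) + 1]


def build_inverted_file(itemsets):
    """Map each item to the ids of the stored itemsets that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(itemsets):
        for item in s:
            index[item].add(sid)
    return index


def supersets_via_inverted_file(q, itemsets, index):
    """Answer a superset query by intersecting the posting sets of the query items."""
    if not q:
        return list(itemsets)  # every itemset contains the empty query
    postings = [index.get(item, set()) for item in q]
    ids = set.intersection(*postings)
    return [itemsets[i] for i in sorted(ids)]
```

The linear scans only pin down what each query is meant to return under the stated assumptions. The inverted-file function shows why superset search maps naturally onto that structure: a stored itemset contains the query exactly when it appears in the posting set of every query item.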
