International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2007); 16-19 December 2007; Birmingham (GB)

Mining Frequent Itemsets in Large Data Warehouses: A Novel Approach Proposed for Sparse Data Sets

Abstract

Efficient techniques for discovering useful information and valuable knowledge in very large databases and data warehouses have attracted the attention of many researchers in the field of data mining. The well-known Association Rule Mining (ARM) algorithm, Apriori, searches for frequent itemsets (i.e., sets of items with an acceptable support) by scanning the whole database repeatedly to count the frequency of each candidate itemset. Most of the methods proposed to improve the efficiency of Apriori attempt to count the frequency of each itemset without re-scanning the database. However, these methods rarely offer any way to reduce the complexity of the unavoidable enumeration that is inherent in the problem. In this paper, we propose a new algorithm for mining frequent itemsets and association rules. The algorithm computes itemset frequencies efficiently and requires only a single scan of the database: the data is encoded into a compressed form and stored in main memory within a suitable data structure. The algorithm works iteratively, and in each iteration the time required to measure the frequency of an itemset is reduced further (i.e., checking the frequency of n-dimensional candidate itemsets is much faster than checking those of n-1 dimensions). The efficiency of the algorithm is evaluated using artificial and real-life datasets. Experimental results indicate that it is more efficient than existing algorithms.
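
The abstract does not describe the paper's encoding or data structure, so the following is only an illustrative sketch of the general idea it outlines: scan the database once, keep a compact in-memory representation, and count itemset frequencies from that representation instead of re-scanning. It uses a vertical "transaction-id set" layout as a stand-in, which is an assumption and not the authors' method; the names `build_tidsets` and `frequent_itemsets` are hypothetical.

```python
from itertools import combinations


def build_tidsets(transactions):
    """Single pass over the data: map each item to the ids of the
    transactions that contain it (assumed stand-in for the paper's
    compressed in-memory encoding)."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def frequent_itemsets(transactions, min_support):
    """Level-wise search. The support of a k-itemset is the size of the
    intersection of the id sets of two frequent (k-1)-itemsets, so the
    database itself is never read again after build_tidsets."""
    tidsets = build_tidsets(transactions)
    level = {
        frozenset([item]): tids
        for item, tids in tidsets.items()
        if len(tids) >= min_support
    }
    result = {}
    while level:
        result.update({itemset: len(tids) for itemset, tids in level.items()})
        next_level = {}
        for (a, tids_a), (b, tids_b) in combinations(level.items(), 2):
            candidate = a | b
            # Keep only unions that grow the itemset by exactly one item.
            if len(candidate) != len(a) + 1 or candidate in next_level:
                continue
            tids = tids_a & tids_b
            if len(tids) >= min_support:
                next_level[candidate] = tids
        level = next_level
    return result


if __name__ == "__main__":
    baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    for itemset, support in sorted(
        frequent_itemsets(baskets, 3).items(), key=lambda kv: sorted(kv[0])
    ):
        print(sorted(itemset), support)
```

In this layout the id sets can only shrink as itemsets grow, so intersections at higher levels tend to touch less data. That loosely mirrors the abstract's remark that larger candidates are cheaper to check, though the paper may achieve this in a different way.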
