Using and extending itemsets in data mining : query approximation, dense itemsets, and tiles

Abstract

Frequent itemsets are one of the best known concepts in data mining, and there is active research in itemset mining algorithms. An itemset is frequent in a database if its items co-occur in sufficiently many records. This thesis addresses two questions related to frequent itemsets.

The first question is raised by a method for approximating logical queries by an inclusion-exclusion sum truncated to the terms corresponding to the frequent itemsets: how good are the approximations thereby obtained? The answer is twofold: in theory, the worst-case bound for the algorithm is very large, and a construction is given that shows the bound to be tight; but in practice, the approximations tend to be much closer to the correct answer than in the worst case. While some other algorithms based on frequent itemsets yield even better approximations, they are not as widely applicable.

The second question concerns extending the definition of frequent itemsets to relax the requirement of perfect co-occurrence: highly correlated items may form an interesting set, even if they never co-occur in a single record. The problem is to formalize this idea in a way that still admits efficient mining algorithms. Two different approaches are used. First, dense itemsets are defined in a manner similar to the usual frequent itemsets and can be found using a modification of the original itemset mining algorithm. Second, tiles are defined in a different way so as to form a model for the whole data, unlike frequent and dense itemsets. A heuristic algorithm based on spectral properties of the data is given and some of its properties are explored.
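
To make the first question concrete, the following is a minimal Python sketch of a truncated inclusion-exclusion sum. It is not taken from the thesis. It assumes the query is a disjunction ("at least one of these items occurs") and that a term is kept only when its itemset reaches a frequency threshold. The function names, the `min_freq` parameter, and the toy records are all hypothetical.

```python
from itertools import combinations


def frequency(itemset, records):
    """Fraction of records that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= record for record in records) / len(records)


def disjunction_frequency(query_items, records, min_freq=0.0):
    """Inclusion-exclusion estimate of the fraction of records that
    contain at least one of `query_items`.

    Terms whose itemset frequency falls below `min_freq` are dropped,
    which mimics knowing only the frequent itemsets (an assumption
    made for this illustration).
    """
    total = 0.0
    for size in range(1, len(query_items) + 1):
        sign = 1 if size % 2 == 1 else -1
        for subset in combinations(query_items, size):
            f = frequency(subset, records)
            if f >= min_freq:
                total += sign * f
    return total


# Toy database: each record is a set of items (hypothetical data).
records = [
    {"a", "b"},
    {"a", "b", "c"},
    {"a"},
    {"b", "c"},
    {"d"},
]

# With every term kept, the sum is exact: 4 of 5 records match.
exact = disjunction_frequency(["a", "b", "c"], records)

# With threshold 0.5, only {a} and {b} are frequent, so the
# subtracted overlap terms are lost and the estimate overshoots.
approx = disjunction_frequency(["a", "b", "c"], records, min_freq=0.5)

print(round(exact, 3), round(approx, 3))
```

In this toy case the truncated sum exceeds 1, which is not a valid frequency. This shows in miniature how dropping infrequent terms can distort the estimate, although the abstract reports that practical approximations are usually much closer than the worst case.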

Bibliographic record

  • Author

    Seppänen, Jouni K.

  • Author affiliation
  • Year: 2006
  • Total pages
  • Original format: PDF
  • Language: en
  • Chinese Library Classification
