
Extending data mining techniques for frequent pattern discovery: trees, low-entropy sets, and crossmining


Abstract

The idea of frequent pattern discovery is to find frequently occurring events in large databases. Such data mining techniques are useful in various domains. For instance, in recommendation and e-commerce systems, frequently occurring combinations of product purchases are essential for modeling user preferences. In the ecological domain, patterns of frequently occurring groups of species can give insight into species interaction dynamics. Over the past few years, most frequent pattern mining research has concentrated on the efficiency (speed) of mining algorithms. However, it has been argued within the community that, while the efficiency of the mining task is no longer a bottleneck, there is still an urgent need for methods that derive compact yet high-quality results with good application properties. The aim of this thesis is to address this need. The first part of the thesis discusses a new tree pattern class for expressing hierarchies of general and more specific attributes in unstructured binary data. The new pattern class is shown to have advantageous properties and to discover relationships in data that cannot be expressed with the more traditional frequent itemset or association rule patterns alone. The second and third parts of the thesis discuss the use of entropy as a score measure for frequent pattern mining. A new pattern class, low-entropy sets, is defined; it can express more general types of occurrence structure than frequent itemsets. The concept can also be easily applied to tree-type patterns. Furthermore, by applying minimum description length to pattern selection for low-entropy sets, it is shown experimentally that in most cases the collections of selected patterns are much smaller than those obtained with frequent itemsets. The fourth part of the thesis examines the idea of crossmining itemsets, that is, relating itemsets to numerical variables in a database of mixed data types. The problem is formally defined and turns out to be NP-hard, although it can be solved approximately within a constant factor of the optimum. Experiments show that the algorithm finds itemsets that convey structure in both the binary and the numerical part of the data.
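To make the contrast between frequency and entropy as score measures concrete, the following is a minimal sketch and not the thesis's algorithm. It assumes that the entropy of an attribute set is the empirical joint entropy of its columns; the function names `support` and `joint_entropy` and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def support(rows, items):
    """Fraction of rows in which every attribute in `items` equals 1."""
    return sum(all(row[i] for i in items) for row in rows) / len(rows)


def joint_entropy(rows, items):
    """Empirical joint entropy (in bits) of the attributes in `items`.

    Assumption: this is the score a low-entropy set would be judged by.
    """
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


# Toy binary data (hypothetical): attributes 0 and 1 are mutually
# exclusive, attribute 2 varies independently of attribute 0.
rows = [
    (1, 0, 0),
    (1, 0, 1),
    (0, 1, 0),
    (0, 1, 1),
]

print(support(rows, [0, 1]))        # attributes 0 and 1 never co-occur
print(joint_entropy(rows, [0, 1]))  # yet their joint distribution is concentrated
print(joint_entropy(rows, [0, 2]))  # an unrelated pair spreads over all value combinations
```

In this toy data the pair of attributes 0 and 1 has zero support, so a frequent itemset miner would never report it, but its joint entropy is lower than that of the unrelated pair 0 and 2. This is one way to read the abstract's claim that low-entropy sets can express occurrence structure, such as mutual exclusion, that frequent itemsets cannot.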

Bibliographic record

  • Author: Heikinheimo, Hannes
  • Year: 2010
  • Original format: PDF
  • Language: en
