首页> 外文学位 >Graph based methods for pattern mining.
【24h】

Graph based methods for pattern mining.

机译:基于图的模式挖掘方法。

获取原文
获取原文并翻译 | 示例

摘要

Pattern mining is the automatic extraction of novel and potentially useful knowledge from large databases. It plays a major role in many applications, such as intrusion detection, business intelligence, and bio-medical applications. This thesis explores two important pattern mining tasks, namely, frequent pattern mining and anomalous pattern mining. We illustrate the limitations of existing pattern mining algorithms and develop a class of algorithms inspired by graph theoretic principles to overcome these limitations.; First, we demonstrate characteristics of existing frequent pattern mining algorithms that prohibit them from scaling-up to very large databases. We introduce a novel data structure known as a PrefixGraph to compress crucial information about the database in order to facilitate fast enumeration of frequent patterns. An efficient pattern search and pruning algorithm called PGMiner was also developed using principles derived from network flow analysis.; Second, we investigate a class of anomalies that are difficult to distinguish from normal observations, resulting in high false alarm rates in many anomaly detection algorithms. To address this problem, we introduce OutRank, a stochastic graph based algorithm for unsupervised anomaly detection. In this method, a graph is constructed using two approaches: one based on the similarity between objects and the other using shared neighbors of the object. The heart of this approach uses a Markov chain random walk model to determine the anomaly score of an object.; Recent years have also witnessed the proliferation of graph-based data, spurred by advances in application areas such as bioinformatics, chem-informatics, Web 2.0, and sensor networks. In this thesis, we systematically develop an anomaly detection framework to detect anomalies in graph-based data. First, frequent subgraph patterns are extracted from the data. We investigate two methods for utilizing the frequent subgraph patterns in graph anomaly detection. The first approach, gRank, is an extension of OutRank to graph-based data. It uses a substructure-based similarity measure to create a stochastic graph and then performs random walk to determine the anomaly score of a graph. The second approach, called gEntropy, creates features based on the frequent subgraph patterns and builds a probability model for the graphs using the maximum entropy principle. Anomalous graphs are detected based on the probability that such a graph is generated by the model. Finally, we extend the maximum entropy approach to supervised anomaly detection and show that it produces significant improvements over the unsupervised case.; Graph based methods have been applied in several areas such as machine learning and networking. In this thesis, we show how graph based methods can be applied successfully in pattern mining area. More specifically, we show how graph theoretic approaches can be utilized to improve the efficiency and the effectiveness of frequent item- set mining and anomaly detection. With the advances in technology, a huge amount of data is accumulated everyday and database sizes have grown to Terabyte and Petabyte scale. We believe the algorithms developed in this thesis are a step forward towards making pattern mining more effective for such large and complex data sets.
机译:模式挖掘是从大型数据库中自动提取新颖且可能有用的知识。它在许多应用程序中扮演重要角色,例如入侵检测,商业智能和生物医学应用程序。本文探讨了两个重要的模式挖掘任务,即频繁模式挖掘和异常模式挖掘。我们说明了现有模式挖掘算法的局限性,并开发了一种基于图论原理的算法来克服这些局限性。首先,我们演示了现有的频繁模式挖掘算法的特征,这些算法禁止它们扩展到非常大的数据库。我们引入一种称为PrefixGraph的新颖数据结构来压缩有关数据库的关键信息,以便于快速枚举频繁模式。还使用从网络流分析得出的原理开发了一种有效的模式搜索和修剪算法PGMiner。其次,我们研究了一类异常,这些异常很难与正常观察结果区分开,从而导致许多异常检测算法中的误报率很高。为了解决这个问题,我们介绍了OutRank,这是一种基于随机图的无监督异常检测算法。在这种方法中,使用两种方法构造图:一种基于对象之间的相似性,另一种使用对象的共享邻居。这种方法的核心是使用马尔可夫链随机游走模型来确定对象的异常得分。近年来,在生物信息学,化学信息学,Web 2.0和传感器网络等应用领域的发展推动下,基于图形的数据激增。在本文中,我们系统地开发了一个异常检测框架来检测基于图的数据中的异常。首先,从数据中提取频繁的子图模式。我们研究了两种在图异常检测中利用频繁子图模式的方法。第一种方法gRank是OutRank对基于图形的数据的扩展。它使用基于子结构的相似性度量来创建随机图,然后执行随机游动以确定图的异常得分。第二种方法称为gEntropy,它基于频繁的子图模式创建特征,并使用最大熵原理为图建立概率模型。基于模型产生这种图的可能性来检测异常图。最后,我们将最大熵方法扩展到有监督的异常检测,并表明它在无监督的情况下产生了显着的改进。基于图的方法已应用于多个领域,例如机器学习和网络。在本文中,我们展示了基于图的方法如何成功地应用于模式挖掘领域。更具体地说,我们展示了如何利用图论方法来提高频繁项集挖掘和异常检测的效率和有效性。随着技术的进步,每天都会积累大量数据,并且数据库大小已增长到TB和PB。我们认为,本文开发的算法是朝着使模式挖掘对于如此大而复杂的数据集更加有效的方向迈出的一步。

著录项

  • 作者

    Moonesinghe, H. D. K.;

  • 作者单位

    Michigan State University.;

  • 授予单位 Michigan State University.;
  • 学科 Computer Science.
  • 学位 Ph.D.
  • 年度 2007
  • 页码 137 p.
  • 总页数 137
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类 自动化技术、计算机技术;
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号