首页> 外文学位 >Mining, indexing and similarity search in large graph data sets.
【24h】

Mining, indexing and similarity search in large graph data sets.

机译:大型图数据集中的挖掘,索引和相似性搜索。

获取原文
获取原文并翻译 | 示例

摘要

Scalable analytical algorithms and tools for large graph data sets are in great demand across domains from software engineering to computational biology as it is very difficult, if not impossible, for human beings to manually analyze any reasonably large collection of graphs due to their high complexity. In this dissertation, we investigate two long standing fundamental problems: Given a graph data set, what are the hidden structural patterns and how can we find them? and how can we index graphs and perform similarity search in large graph data sets? Graph pattern mining is an expensive computational problem since subgraph isomorphism is NP-complete. Previous solutions generate inevitable overheads since they rely on joining two graphs to form larger candidates. We develop a graph canonical labeling system, gSpan, showing both theoretically and empirically that this kind of join operation is unnecessary. Graph indexing, the second problem addressed in this dissertation, may incur an exponential number of index entries if all of the substructures in a graph database are used for indexing. The solution, gIndex, proposes a novel, frequent and discriminative graph mining approach that leads to the development of a compact but effective graph index structure that is orders of magnitude smaller in size but an order of magnitude faster in performance than traditional approaches.; Besides graph mining and search, this dissertation provides thorough investigation of pattern summarization, pattern-based classification, constraint pattern mining, and graph similarity searching, which could leverage the usage of graph patterns. It also explores several critical applications in bioinformatics, computer systems and software engineering, including gene relevance network analysis for functional annotation, and program flow analysis for automated software bug isolation.; The developed concepts, theories, and systems may significantly deepen the understanding of data mining principles in structural pattern discovery, interpretation and search. The formulation of a general graph information system through this study could provide fundamental supports to graph-intensive applications in multiple domains.
机译:从软件工程到计算生物学,跨领域对大型图数据集的可伸缩分析算法和工具提出了很高的要求,因为人们很难(即使不是不可能)手动分析任何相当大的图集合,因为它们具有很高的复杂性。在本文中,我们研究了两个长期存在的基本问题:给定一个图数据集,隐藏的结构模式是什么?如何找到它们?以及如何在大型图形数据集中索引图形并执行相似性搜索?由于子图同构是NP完全的,因此图形模式挖掘是一个昂贵的计算问题。先前的解决方案会产生不可避免的开销,因为它们依赖于将两个图连接起来以形成更大的候选对象。我们开发了图形规范标注系统gSpan,从理论上和经验上都表明这种联接操作是不必要的。如果将图数据库中的所有子结构都用于索引,则图索引是本文要解决的第二个问题,它可能会导致指数条目的指数增加。 gIndex解决方案提出了一种新颖,频繁且有区别的图挖掘方法,该方法导致了紧凑但有效的图索引结构的发展,该结构的大小比传统方法小几个数量级,但性能却快一个数量级。除了图形挖掘和搜索之外,本文还对模式摘要,基于模式的分类,约束模式挖掘和图形相似性搜索进行了深入研究,可以利用图形模式的使用。它还探索了生物信息学,计算机系统和软件工程中的几个关键应用,包括用于功能注释的基因相关网络分析和用于自动软件错误隔离的程序流分析。发达的概念,理论和系统可能会大大加深对结构模式发现,解释和搜索中数据挖掘原理的理解。通过这项研究,通用图形信息系统的制定可以为多个领域的图形密集型应用程序提供基本支持。

著录项

  • 作者

    Yan, Xifeng.;

  • 作者单位

    University of Illinois at Urbana-Champaign.;

  • 授予单位 University of Illinois at Urbana-Champaign.;
  • 学科 Computer Science.
  • 学位 Ph.D.
  • 年度 2006
  • 页码 157 p.
  • 总页数 157
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类 自动化技术、计算机技术;
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号