首页> 外文学位 >On Graph Perturbation Theory and Algorithms for Scalable Mining of Noisy and Uncertain Graph Data with Knowledge Priors.
【24h】

On Graph Perturbation Theory and Algorithms for Scalable Mining of Noisy and Uncertain Graph Data with Knowledge Priors.

机译:图扰动理论和算法用于有知识先验的噪声和不确定图数据的可伸缩挖掘。

获取原文
获取原文并翻译 | 示例

摘要

The solutions to many interesting questions lie in the ability to extract sets of tightly connected or correlated elements from a mass of relational-type data. For example, a researcher may formulate various questions about social communities within a social network, correlated stocks in market data, or proteins in a protein interaction network that form functional modules. However, such questions are often difficult to answer, as the data may exhibit noise or uncertainty, objects may belong to multiple groups, the data to be analyzed may be very large in scale, and the results of a naive algorithm may not be relevant to the user's interests. In this thesis, we make several novel contributions to solving this problem, including applying the knowledge priors of a domain expert to the algorithmic search process, as well as introducing the concept of Graph Perturbation Theory in order to cope with uncertainty.;In our first component, we tackle the problems of noisy and uncertain data by introducing the concept of Graph Perturbation Theory, which consists of calculating graph properties as the graph undergoes some perturbation, or change. This perturbation may represent some new knowledge, reinterpretation of old data, or exploratory "what if"-type scenario. We develop and apply Graph Perturbation Theory to the problem of enumerating the fully connected subgraphs of a graph. The advantage in modeling our correlated groups as maximal cliques is that cliques can produce overlapping groups, and their fully connected nature ensures a strong positive relationship despite uncertainty in the individual edges present. By enumerating the set of all such maximal cliques of the graph, we can exhaustively describe the correlations present in the graph and avoid relying on heuristic methods. Our application of Graph Perturbation Theory to this problem resulted in speedups of up to 120 times over enumerating the maximal cliques of the perturbed algorithm using a state-of-the-art clique enumeration algorithm.;In our second component, we harness the knowledge priors of domain experts by allowing a user to specify the "items of interest" that the correlated items should be enriched with respect to, allowing us to constrain the search space of the algorithm as well as improve the chance of discovering sets of identifying correlated items relevant to the user's interests. We tackle the problem of incorporating experts' knowledge priors into the search for overlapping groups in noisy and uncertain data by proposing a novel subgraph structure, the mu, gamma-quasi-clique. Our definition, which incorporates the information contained the user's query, is a relaxation of the fully connected clique property, and enables us to better compensate for missing, or false negative, relationships in our network, which would fracture a single large group into multiple, highly overlapping cliques. We evaluate the effectiveness of our approach by demonstrating its scalability to larger graphs and reproduce biological results that establish the practical value of our work.;In our third component, we address the large scale nature of modern datasets by developing parallel strategies for Graph Perturbation Theory algorithms. As Graph Perturbation Theory algorithms utilize various indices to facilitate calculating graph properties of the perturbed graph, we discuss techniques for dealing with the highly data-intensive nature of such algorithms. To establish the effectiveness of our proposed techniques, we implement a parallel version of the perturbed clique enumeration algorithm presented in the first component, developing additional theory that allows us to decouple the action of this algorithm from its previous computation. We present detailed empirical results demonstrating the scalability of our algorithm as well as the effectiveness of the various techniques we propose.
机译:许多有趣问题的解决方案在于从大量关系类型数据中提取紧密连接或相关元素集的能力。例如,研究人员可以提出有关社交网络中的社交社区,市场数据中的相关股票或形成功能模块的蛋白质相互作用网络中的蛋白质的各种问题。但是,此类问题通常很难回答,因为数据可能表现出噪声或不确定性,对象可能属于多个组,要分析的数据可能规模很大,并且朴素算法的结果可能与用户的利益。在本文中,我们为解决该问题做出了一些新颖的贡献,包括将领域专家的知识先验知识应用于算法搜索过程,以及引入图扰动理论的概念以应对不确定性。组件中,我们通过引入图扰动理论的概念来解决嘈杂和不确定数据的问题,该概念包括在图受到某些扰动或变化时计算图属性。这种干扰可能表示一些新知识,对旧数据的重新解释或探索性的“假设”类型的情况。我们开发了图摄动理论并将其应用到枚举图的全连接子图的问题。将我们的相关群体建模为最大群体的优势在于,群体可以产生重叠的群体,并且尽管存在单个边缘的不确定性,但群体的完全联系性质确保了牢固的正向关系。通过枚举图的所有此类最大派系的集合,我们可以详尽地描述图中存在的相关性,并避免依赖于启发式方法。我们将图扰动理论应用到此问题上,从而使用最先进的集团枚举算法对扰动算法的最大集团进行枚举,最多可将速度提高120倍;在第二个组件中,我们利用了先验知识通过允许用户指定应该丰富相关项目的“感兴趣的项目”,从而使我们能够限制算法的搜索空间,并提高发现与相关项目相关的识别集的机会符合用户的利益。我们通过提出一种新颖的子图结构mu(伽玛准古怪),解决了将专家的先验知识纳入在嘈杂和不确定数据中寻找重叠群体的问题。我们的定义结合了用户查询中包含的信息,是对完全连接的集团属性的放宽,使我们能够更好地补偿网络中缺失的或错误的负面关系,这种关系会将一个大的群体分解为多个,高度重叠的集团。我们通过证明其方法可扩展到更大的图并复制生物学结果来确定我们的工作的实用价值来评估该方法的有效性。在第三部分中,我们通过为图扰动理论开发并行策略来解决现代数据集的大规模性质。算法。由于图扰动理论算法利用各种索引来方便计算被扰动图的图属性,因此我们讨论了用于处理此类算法的高度数据密集型特性的技术。为了确定我们提出的技术的有效性,我们实现了第一个组件中出现的扰动集团枚举算法的并行版本,并开发了其他理论,使我们能够将该算法的作用与其先前的计算分离。我们提供了详细的经验结果,这些结果证明了我们算法的可扩展性以及我们提出的各种技术的有效性。

著录项

  • 作者

    Hendrix, William Thomas.;

  • 作者单位

    North Carolina State University.;

  • 授予单位 North Carolina State University.;
  • 学科 Computer Science.
  • 学位 Ph.D.
  • 年度 2010
  • 页码 91 p.
  • 总页数 91
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号