Conference: International Workshop on Database and Expert Systems Applications

Using Neo4j for Mining Protein Graphs: A Case Study


Abstract

Using graph databases is becoming increasingly popular in domains where data can be modeled as a set of connected objects. Graph databases make it possible to query such data with graph-based queries in a relatively simple manner compared with classical relational databases. In this paper, we show how one of the most popular graph databases, Neo4j, can be applied to the bioinformatics problem of protein-protein interface (PPI) identification. The goal of the PPI identification task is, given a protein structure, to identify the amino acids responsible for binding the structure to other proteins. Each protein structure consists of a set of amino acid molecules that can be conceived of as a graph, and a multitude of methods for analyzing such protein graphs have been established. We introduce here a knowledge-based approach that can enhance the quality of these methods by utilizing existing protein structure knowledge stored in the Protein Data Bank (PDB). We show how to transform information about protein complexes from the PDB into Neo4j, where they can be stored as a set of independent protein graphs. The resulting graph database contains about 14 million labeled nodes and 38 million edges. In the PPI identification phase, this database is queried using exact subgraph matching, and the results are aggregated to improve an existing PPI identification method. We show the pros and cons of using Neo4j for such an endeavor with respect to the size of the database and the complexity of the queries, in comparison with a relational database (Microsoft SQL Server). We conclude that using Neo4j is a viable option for specific, rather small, subgraph query types. However, we encountered performance limitations, especially for query graphs with a larger number of edges.
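To make the idea of exact subgraph matching on labeled protein graphs concrete, the following is a minimal Python sketch. It is not the authors' implementation and does not use Neo4j. It assumes nodes carry amino acid labels and edges are undirected contacts, and the function name `find_matches` and the toy graphs are hypothetical, chosen only for illustration. It enumerates every injective, label-preserving mapping of a small query graph into a target graph by backtracking.

```python
def build_adjacency(nodes, edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def find_matches(q_labels, q_edges, t_labels, t_edges):
    """Return all injective mappings (query node -> target node) that
    preserve node labels and map every query edge onto a target edge."""
    q_adj = build_adjacency(q_labels, q_edges)
    t_adj = build_adjacency(t_labels, t_edges)
    order = list(q_labels)  # fixed order in which query nodes are assigned
    matches = []

    def extend(mapping):
        if len(mapping) == len(order):
            matches.append(dict(mapping))
            return
        q = order[len(mapping)]
        for t in t_labels:
            # Candidate must be unused and carry the same label.
            if t in mapping.values() or t_labels[t] != q_labels[q]:
                continue
            # Every already-mapped neighbour of q must be adjacent to t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                extend(mapping)
                del mapping[q]

    extend({})
    return matches


# Hypothetical toy target: a chain ALA - GLY - ALA - SER.
target_labels = {1: "ALA", 2: "GLY", 3: "ALA", 4: "SER"}
target_edges = [(1, 2), (2, 3), (3, 4)]

# Hypothetical query: one ALA residue in contact with one GLY residue.
query_labels = {"a": "ALA", "b": "GLY"}
query_edges = [("a", "b")]

matches = find_matches(query_labels, query_edges, target_labels, target_edges)
print(matches)
```

The search tries every compatible target node for each query node, so its cost grows quickly as the query gains nodes and edges, which is in line with the abstract's observation that larger query graphs are the harder case.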
