Conference: Asia-Pacific Bioinformatics Conference

Towards Comprehensive Structural Motif Mining for Better Fold Annotation in the 'Twilight Zone' of Sequence Dissimilarity

Abstract

Here we report a novel graph database mining method, APGM (APproximate Graph Mining), and demonstrate its application to protein structure pattern identification and structure classification. We present a theoretical framework, offer a practical software implementation that incorporates prior domain knowledge (here, a substitution matrix), and devise an efficient algorithm for identifying approximately matched frequent subgraphs. In doing so, we significantly expand the analytical power of sophisticated data mining algorithms for handling large volumes of complicated, noisy protein structure data. The biological motivation of our study is to recognize common structure patterns in "immunoevasins", proteins that mediate virus evasion of host immune defense. We investigated two immunologically relevant protein domain families: the Immunoglobulin V set and the Immunoglobulin C1 set. We collected proteins from SCOP release 1.69. For each family, we used the PISCES server to create a culled set of proteins whose maximal pairwise sequence identity is below 30%. We combined these proteins and randomly selected proteins to create training and testing data sets. We compared the classification accuracy of our method with that of an exact graph mining method, MGM. For the Immunoglobulin C1 set, classification based on features identified by MGM reaches only 73%, while APGM achieves between 69% and 91%. For the Immunoglobulin V set, the exact-match method cannot mine any meaningful patterns and therefore fails at classification, whereas APGM achieves an accuracy of about 78%. Our experimental study, using both viral and non-viral proteins, demonstrates the efficiency and efficacy of the proposed method. Without loss of generality, choosing appropriate compatibility matrices allows our method to be readily applied in any domain where subgraph labels carry some uncertainty.
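
The idea of approximate matching under a compatibility matrix can be illustrated with a small sketch. The abstract does not give APGM's scoring rule or search strategy, so everything below is an assumption made for illustration: the toy `COMPAT` scores (not a real substitution matrix), the product-of-scores rule, the `threshold` parameter, and the brute-force enumeration of node mappings. It shows only how relaxing exact label equality can raise the support of a pattern in a graph database.

```python
from itertools import permutations

# Hypothetical compatibility scores between residue labels. These numbers are
# invented for illustration and are NOT taken from any real substitution matrix.
COMPAT = {
    ("L", "L"): 1.0, ("I", "I"): 1.0, ("V", "V"): 1.0, ("D", "D"): 1.0,
    ("L", "I"): 0.8, ("L", "V"): 0.6, ("I", "V"): 0.7,
}


def compat(a, b):
    """Symmetric lookup; pairs not listed are treated as incompatible (0.0)."""
    return COMPAT.get((a, b), COMPAT.get((b, a), 0.0))


def make_graph(labels, edges):
    """Store a labeled undirected graph as (labels, set of frozenset edges)."""
    return list(labels), {frozenset(e) for e in edges}


def approx_occurs(pattern, graph, threshold):
    """Return True if `pattern` approximately occurs in `graph`.

    Assumed rule: every pattern edge must map onto a graph edge (topology is
    matched exactly), and the product of label compatibility scores over the
    mapped nodes must be at least `threshold`. Brute force, tiny graphs only.
    """
    p_labels, p_edges = pattern
    g_labels, g_edges = graph
    for image in permutations(range(len(g_labels)), len(p_labels)):
        # Check that each pattern edge is present between the mapped nodes.
        edges_ok = all(
            frozenset(image[u] for u in edge) in g_edges for edge in p_edges
        )
        if not edges_ok:
            continue
        # Multiply the compatibility of each pattern label with its image.
        score = 1.0
        for p_node, g_node in enumerate(image):
            score *= compat(p_labels[p_node], g_labels[g_node])
        if score >= threshold:
            return True
    return False


def support(pattern, database, threshold):
    """Fraction of database graphs that contain an approximate occurrence."""
    hits = sum(approx_occurs(pattern, g, threshold) for g in database)
    return hits / len(database)


# Toy data: a three-node path pattern and a three-graph database.
pattern = make_graph(["L", "V", "D"], [(0, 1), (1, 2)])
database = [
    make_graph(["L", "V", "D"], [(0, 1), (1, 2)]),               # exact copy
    make_graph(["I", "V", "D", "L"], [(0, 1), (1, 2), (2, 3)]),  # I stands in for L
    make_graph(["D", "D", "D"], [(0, 1), (1, 2)]),               # unrelated labels
]

exact_support = support(pattern, database, threshold=1.0)
approx_support = support(pattern, database, threshold=0.75)
```

In this sketch a threshold of 1.0 reduces to exact label matching, because only identical labels score 1.0. Lowering the threshold lets the second graph count as an occurrence, so the pattern is found in two of the three graphs instead of one.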
