Conference: The 7th Asia-Pacific Bioinformatics Conference

Towards Comprehensive Structural Motif Mining for Better Fold Annotation in the 'Twilight Zone' of Sequence Dissimilarity

Abstract

Here we report a novel graph database mining method called APGM (APproximate Graph Mining) and demonstrate its application to protein structure pattern identification and structure classification. We present a theoretical framework, offer a practical software implementation that incorporates prior domain knowledge (here, a substitution matrix), and devise an efficient algorithm to identify approximately matched frequent subgraphs. In doing so, we significantly expand the analytical power of sophisticated data mining algorithms in dealing with large volumes of complicated and noisy protein structure data. The biological motivation of our study is to recognize common structure patterns in "immunoevasins", proteins that mediate virus evasion of host immune defense. We investigated two immunologically relevant protein domain families: the Immunoglobulin V set and the Immunoglobulin C1 set. We collected proteins from SCOP release 1.69. For each family, we used the PISCES server to create a culled set of proteins with maximal pairwise sequence identity below 30%. We combined these proteins with randomly selected proteins to create training and testing data sets. We compared the classification accuracy of our method with that of an exact graph mining method, MGM. For the Immunoglobulin C1 set, classification based on features identified by MGM reaches only 73%, while APGM achieves between 69% and 91%. For the Immunoglobulin V set, the exact match method cannot mine any meaningful patterns and therefore fails in classification, while APGM achieves an accuracy of about 78%. Our experimental study, using both viral and non-viral proteins, demonstrates the efficiency and efficacy of the proposed method. Without loss of generality, choosing appropriate compatibility matrices allows our method to be readily employed in any domain where subgraph labels have some uncertainty.
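
The idea of approximately matched frequent subgraphs can be illustrated with a small sketch. The abstract does not describe APGM's actual algorithm, so the code below is not that algorithm: it is a brute-force toy that assumes graph structure must match exactly while node labels may differ, with an occurrence accepted when the product of label compatibility scores meets a threshold. The `COMPAT` table, its values, the scoring rule, and the function names `compat`, `approx_occurs`, and `support` are all hypothetical choices made for illustration.

```python
from itertools import permutations

# Hypothetical compatibility scores between node labels (e.g. residue types).
# The values are illustrative and not taken from any real substitution matrix.
COMPAT = {
    ("A", "A"): 1.0, ("A", "G"): 0.6,
    ("G", "G"): 1.0, ("L", "L"): 1.0,
    ("L", "I"): 0.8, ("I", "I"): 1.0,
}


def compat(a, b):
    """Symmetric lookup; unlisted pairs count as incompatible (0.0)."""
    return COMPAT.get((a, b), COMPAT.get((b, a), 0.0))


def approx_occurs(pattern, graph, threshold):
    """Return True if `pattern` embeds in `graph` with a label
    compatibility product of at least `threshold`.

    Both arguments are (labels, edges) pairs over undirected graphs.
    The search tries every injective node mapping, so it is only
    suitable for very small patterns.
    """
    p_labels, p_edges = pattern
    g_labels, g_edges = graph
    g_edge_set = {frozenset(e) for e in g_edges}
    for mapping in permutations(range(len(g_labels)), len(p_labels)):
        # Structure is matched exactly: each pattern edge needs a graph edge.
        if any(frozenset((mapping[u], mapping[v])) not in g_edge_set
               for u, v in p_edges):
            continue
        # Labels are matched approximately via the compatibility product.
        score = 1.0
        for i, node in enumerate(mapping):
            score *= compat(p_labels[i], g_labels[node])
        if score >= threshold:
            return True
    return False


def support(pattern, database, threshold):
    """Fraction of graphs in `database` containing an approximate occurrence."""
    hits = sum(approx_occurs(pattern, g, threshold) for g in database)
    return hits / len(database)


pattern = (["A", "L"], [(0, 1)])
database = [
    (["A", "L", "G"], [(0, 1), (1, 2)]),  # contains an exact A-L edge
    (["G", "I"], [(0, 1)]),               # only an approximate match
    (["L", "L"], [(0, 1)]),               # no compatible label for A
]
exact_support = support(pattern, database, 1.0)
approx_support = support(pattern, database, 0.4)
```

With a threshold of 1.0 the toy behaves like exact matching, and only the first graph supports the pattern. Lowering the threshold to 0.4 also admits the second graph, which is the kind of extra support that approximate matching is meant to recover when labels are noisy.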
