Journal: BMC Bioinformatics
A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis

Abstract

Background: Protein remote homology detection and fold recognition are central problems in bioinformatics. Discriminative methods based on support vector machines (SVMs) are currently the most effective and accurate approaches to these problems. A key step in improving SVM-based methods is finding a suitable representation of protein sequences.

Results: This paper presents a novel building block of proteins called the Top-n-gram, which carries evolutionary information extracted from protein sequence frequency profiles. The frequency profiles are calculated from the multiple sequence alignments produced by PSI-BLAST and then converted into Top-n-grams. Each protein sequence is transformed into a fixed-dimension feature vector whose entries are the numbers of occurrences of each Top-n-gram. The training vectors are used to train SVM classifiers, which then classify the test protein sequences. We demonstrate that prediction performance for remote homology detection and fold recognition can be improved by combining Top-n-grams with latent semantic analysis (LSA), an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results than related methods.

Conclusion: The method based on Top-n-grams significantly outperforms methods based on many other building blocks, including N-grams, patterns, motifs and binary profiles. The Top-n-gram is therefore a good building block of protein sequences and could be used in many computational biology tasks, such as sequence alignment, prediction of domain boundaries, design of knowledge-based potentials and prediction of protein binding sites.
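The step from frequency profile to feature vector can be illustrated with a minimal sketch. The abstract does not define a Top-n-gram precisely, so the code assumes it is the n most frequent amino acids at one profile position, ordered by decreasing frequency. The function names, the toy profile, the alphabetical tie-breaking and the choice of vocabulary (all ordered n-tuples of distinct amino acids) are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from itertools import permutations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def top_n_gram(column, n):
    """Return the n most frequent amino acids of one profile column.

    `column` maps amino-acid letters to frequencies; missing letters count
    as 0. Ties are broken alphabetically (an arbitrary, assumed choice).
    """
    ranked = sorted(AMINO_ACIDS, key=lambda aa: (-column.get(aa, 0.0), aa))
    return "".join(ranked[:n])


def feature_vector(profile, n):
    """Count the occurrences of each possible Top-n-gram in a profile.

    Returns the vocabulary and a count vector of the same, fixed length,
    so profiles of different lengths map to vectors of equal dimension.
    """
    vocab = ["".join(p) for p in permutations(AMINO_ACIDS, n)]
    counts = Counter(top_n_gram(col, n) for col in profile)
    return vocab, [counts[gram] for gram in vocab]


# Hypothetical toy profile with three positions (not real PSI-BLAST output).
profile = [
    {"A": 0.6, "G": 0.3, "S": 0.1},
    {"L": 0.5, "I": 0.4, "V": 0.1},
    {"A": 0.7, "G": 0.2, "T": 0.1},
]

vocab, vec = feature_vector(profile, n=2)
print(len(vocab), vec[vocab.index("AG")], vec[vocab.index("LI")])
```

In the pipeline the abstract describes, vectors of this kind would then be reduced with LSA and passed to an SVM; those stages are not shown here.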