...
Journal: Journal of Bioinformatics and Computational Biology

PROFILE-BASED STRING KERNELS FOR REMOTE HOMOLOGY DETECTION AND MOTIF EXTRACTION



Abstract

We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We further examine how to incorporate predicted secondary structure information into the profile kernel to obtain a small but significant performance improvement. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs" — short regions of the original profile that contribute almost all the weight of the SVM classification score — and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results also outperform cluster kernels while providing much better scalability to large datasets.
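The inexact k-mer matching described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that a k-mer belongs to the mutation neighborhood of a profile window when its summed negative log-probability under that window stays below a threshold, here called `sigma`; the abstract does not state this rule. The four-letter `ALPHABET`, the dictionary-per-position profile layout, and all function names are hypothetical choices made for illustration. The depth-first search with early pruning stands in for the "efficient data structure" the abstract mentions, without claiming to reproduce it.

```python
import math
from collections import Counter

# Toy alphabet (assumption); real protein profiles cover 20 amino acids.
ALPHABET = "ACDE"


def neighborhood(profile_window, sigma):
    """List the k-mers whose summed -log probability under the window is < sigma.

    profile_window is a list of k dicts mapping a letter to its probability
    at that position (an assumed layout, not the PSI-BLAST file format).
    """
    results = []

    def extend(prefix, cost, depth):
        if depth == len(profile_window):
            results.append(prefix)
            return
        for letter in ALPHABET:
            p = profile_window[depth].get(letter, 0.0)
            if p <= 0.0:
                continue  # letter is impossible at this position
            new_cost = cost - math.log(p)
            # -log(p) is never negative, so the cost can only grow:
            # a prefix that already reaches sigma can be abandoned.
            if new_cost < sigma:
                extend(prefix + letter, new_cost, depth + 1)

    extend("", 0.0, 0)
    return results


def feature_map(profile, k, sigma):
    """Count, for each k-mer, how many windows contain it in their neighborhood."""
    counts = Counter()
    for start in range(len(profile) - k + 1):
        for kmer in neighborhood(profile[start:start + k], sigma):
            counts[kmer] += 1
    return counts


def profile_kernel(profile_a, profile_b, k, sigma):
    """Kernel value: inner product of the two k-mer count vectors."""
    counts_a = feature_map(profile_a, k, sigma)
    counts_b = feature_map(profile_b, k, sigma)
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


# Two tiny hand-made profiles (illustrative values, not real data).
profile_a = [{"A": 0.9, "C": 0.1}, {"C": 0.5, "D": 0.5}, {"E": 1.0}]
profile_b = [{"A": 1.0}, {"C": 1.0}]
print(profile_kernel(profile_a, profile_b, k=2, sigma=1.0))
```

In this toy setting a smaller `sigma` shrinks each neighborhood toward the most probable k-mers, while a larger one admits more mutations and increases the work per window. Enumeration cost grows with alphabet size and k, which is why pruning matters at the real alphabet size of 20.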
