Journal: BMC Bioinformatics

Protein Function Prediction using Text-based Features extracted from the Biomedical Literature: The CAFA Challenge


Abstract

Background

Advances in sequencing technology over the past decade have resulted in an abundance of sequenced proteins whose function is not yet known. As such, computational systems that can automatically predict and annotate protein function are in demand. Most computational systems use features derived from protein sequence or protein structure to predict function. In an earlier work, we demonstrated the utility of biomedical literature as a source of text features for predicting protein subcellular location. We have also shown that the combination of text-based and sequence-based prediction improves the performance of location predictors. Following up on this work, for the Critical Assessment of Function Annotations (CAFA) Challenge, we developed a text-based system that aims to predict molecular function and biological process (using Gene Ontology terms) for unannotated proteins. In this paper, we present the preliminary work and evaluation that we performed for our system, as part of the CAFA challenge.

Results

We have developed a preliminary system that represents proteins using text-based features and predicts protein function using a k-nearest neighbour classifier (Text-KNN). We selected text features for our classifier by extracting key terms from biomedical abstracts based on their statistical properties. The system was trained and tested using 5-fold cross-validation over a dataset of 36,536 proteins. System performance was measured using the standard measures of precision, recall, F-measure and overall accuracy. The performance of our system was compared to two baseline classifiers: one that assigns function based solely on the prior distribution of protein function (Base-Prior) and one that assigns function based on sequence similarity (Base-Seq). The overall prediction accuracy of Text-KNN, Base-Prior, and Base-Seq for molecular function classes is 62%, 43%, and 58%, while the overall accuracy for biological process classes is 17%, 11%, and 28%, respectively. Results obtained as part of the CAFA evaluation itself on the CAFA dataset are reported as well.

Conclusions

Our evaluation shows that the text-based classifier consistently outperforms the baseline classifier that is based on prior distribution, and typically has comparable performance to the baseline classifier that uses sequence similarity. Moreover, the results suggest that combining text features with other types of features can potentially lead to improved prediction performance. The preliminary results also suggest that while our text-based classifier can be used to predict both the molecular function of a protein and the biological process in which it is involved, the classifier performs significantly better for predicting molecular function than for predicting biological process. A similar trend was observed for other classifiers participating in the CAFA challenge.
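The evaluation setup described in the abstract (a k-nearest neighbour classifier over text features, compared against a prior-distribution baseline under 5-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation. The vocabulary, the toy abstracts, the two function labels, the use of cosine similarity, and k = 3 are all assumptions made for illustration; the paper selects its terms by statistical properties and works on 36,536 proteins.

```python
import math
from collections import Counter

# Hypothetical hand-picked vocabulary; the paper selects terms statistically.
VOCAB = ["kinase", "phosphorylation", "transport", "membrane"]

def text_features(abstract, vocabulary):
    """Binary bag-of-words vector restricted to the chosen vocabulary."""
    tokens = set(abstract.lower().split())
    return [1.0 if term in tokens else 0.0 for term in vocabulary]

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_predict(train, query, k=3):
    """Majority label among the k training vectors most similar to query."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def prior_predict(train, query=None):
    """Prior-style baseline: always the most frequent training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def cross_validate(data, predict, folds=5):
    """Overall accuracy over interleaved folds (item i goes to fold i % folds)."""
    correct = 0
    for f in range(folds):
        test = data[f::folds]
        train = [d for i, d in enumerate(data) if i % folds != f]
        correct += sum(predict(train, vec) == label for vec, label in test)
    return correct / len(data)

# Toy (abstract, label) pairs, invented for this sketch only.
TOY = [
    ("the kinase catalyses phosphorylation of substrates", "kinase activity"),
    ("membrane transport of glucose", "transporter activity"),
    ("kinase domain required for phosphorylation", "kinase activity"),
    ("transport across the membrane", "transporter activity"),
    ("a serine kinase", "kinase activity"),
    ("an integral membrane protein", "transporter activity"),
    ("phosphorylation by this kinase regulates growth", "kinase activity"),
    ("kinase signalling cascade", "kinase activity"),
    ("phosphorylation of the target protein", "kinase activity"),
    ("kinase activity depends on phosphorylation", "kinase activity"),
]
DATA = [(text_features(text, VOCAB), label) for text, label in TOY]

if __name__ == "__main__":
    print("text k-NN accuracy:", cross_validate(DATA, knn_predict))
    print("prior baseline accuracy:", cross_validate(DATA, prior_predict))
```

On this skewed toy set the prior baseline can only ever be right for the majority class, while the text-based classifier can also recover the minority class. That is the kind of gap the Base-Prior comparison is meant to expose; the toy numbers themselves say nothing about the reported results.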
