Protein Function Prediction using Text-based Features extracted from the Biomedical Literature: The CAFA Challenge

Andrew Wong; Hagit Shatkay


Abstract

Background

Advances in sequencing technology over the past decade have resulted in an abundance of sequenced proteins whose function is still unknown. As such, computational systems that can automatically predict and annotate protein function are in demand. Most computational systems use features derived from protein sequence or protein structure to predict function. In an earlier work, we demonstrated the utility of biomedical literature as a source of text features for predicting protein subcellular location. We have also shown that combining text-based and sequence-based prediction improves the performance of location predictors. Following up on this work, for the Critical Assessment of Function Annotations (CAFA) Challenge, we developed a text-based system that aims to predict molecular function and biological process (using Gene Ontology terms) for unannotated proteins. In this paper, we present the preliminary work and evaluation that we performed for our system, as part of the CAFA challenge.

Results

We have developed a preliminary system that represents proteins using text-based features and predicts protein function using a k-nearest neighbour classifier (Text-KNN). We selected text features for our classifier by extracting key terms from biomedical abstracts based on their statistical properties. The system was trained and tested using 5-fold cross-validation over a dataset of 36,536 proteins. System performance was measured using the standard measures of precision, recall, F-measure and overall accuracy. The performance of our system was compared to two baseline classifiers: one that assigns function based solely on the prior distribution of protein function (Base-Prior) and one that assigns function based on sequence similarity (Base-Seq). The overall prediction accuracies of Text-KNN, Base-Prior, and Base-Seq for molecular function classes are 62%, 43%, and 58%, respectively, while the overall accuracies for biological process classes are 17%, 11%, and 28%, respectively. Results obtained as part of the CAFA evaluation itself on the CAFA dataset are reported as well.

Conclusions

Our evaluation shows that the text-based classifier consistently outperforms the baseline classifier that is based on prior distribution, and typically has comparable performance to the baseline classifier that uses sequence similarity. Moreover, the results suggest that combining text features with other types of features can potentially lead to improved prediction performance. The preliminary results also suggest that while our text-based classifier can be used to predict both the molecular function of a protein and the biological process in which it is involved, the classifier performs significantly better for predicting molecular function than for predicting biological process. A similar trend was observed for other classifiers participating in the CAFA challenge.
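To make the general idea of a text-based k-nearest neighbour classifier concrete, the following is a minimal sketch in Python. It is not the authors' Text-KNN implementation: the abstract does not specify the feature weighting, the similarity measure, or the value of k, so the bag-of-words counts, cosine similarity, majority vote, and toy training texts below are all assumptions chosen for illustration. The function names (`tokenize`, `vectorize`, `cosine`, `knn_predict`, `precision_recall_f1`) are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase, split on whitespace, keep purely alphabetic tokens
    # (tokens containing digits or punctuation are dropped).
    return [t for t in text.lower().split() if t.isalpha()]

def vectorize(text):
    # Bag-of-words term counts (an assumed feature representation).
    return Counter(tokenize(text))

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(query_text, training, k=3):
    # Rank training examples by similarity to the query and take a
    # majority vote among the labels of the k most similar ones.
    q = vectorize(query_text)
    ranked = sorted(training,
                    key=lambda item: cosine(q, vectorize(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def precision_recall_f1(true_labels, predicted, target):
    # Per-class precision, recall and F-measure for one target label.
    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy training set (invented for illustration): text associated with
# a protein, paired with a function label.
TRAIN = [
    ("kinase phosphorylates substrate atp binding", "kinase activity"),
    ("protein kinase atp phosphorylation signalling", "kinase activity"),
    ("transcription factor binds dna promoter", "DNA binding"),
    ("dna binding domain regulates transcription", "DNA binding"),
]

if __name__ == "__main__":
    print(knn_predict("novel kinase with atp binding pocket", TRAIN, k=3))
```

In a cross-validation setting such as the 5-fold scheme mentioned in the abstract, `knn_predict` would be run on each held-out fold using the remaining folds as `training`, and `precision_recall_f1` would then be applied per class to the pooled predictions.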


Bibliographic record

Source: BMC Bioinformatics, 2013, Issue 3
Authors: Andrew Wong; Hagit Shatkay
Original format: PDF