UPCLASS: a deep learning-based classifier for UniProtKB entry publications

Douglas Teodoro; Julien Knafou; Nona Naderi; Emilie Pasche; Julien Gobeill; Cecilia N Arighi; Patrick Ruch

Abstract

In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized into categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets, depending on the evidence it provides for each protein. We propose a model that divides the document into parts that contain evidence for the protein annotation and parts that do not. We then use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machines by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant share of the publications, and help curators decide whether a publication is relevant for further curation for a protein accession.

Database URL: https://goldorak.hesge.ch/bioexpclass/upclass/
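The abstract reports both a micro and a macro F1-score. As a minimal sketch of how the two averages differ in a multi-label setting, the following Python computes both from sets of gold and predicted categories. The function names and the toy data are hypothetical and are not taken from the paper; only the category names (function, interaction, expression) come from the abstract.

```python
from typing import List, Set, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: List[Set[str]], pred: List[Set[str]], labels: List[str]
) -> Tuple[float, float]:
    """Return (micro F1, macro F1) for multi-label predictions."""
    # Per-label counts of true positives, false positives, false negatives.
    counts = {label: [0, 0, 0] for label in labels}
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                counts[label][0] += 1
            elif label in p:
                counts[label][1] += 1
            elif label in g:
                counts[label][2] += 1

    # Micro: pool the counts over all labels, then compute one F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)

    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro


# Toy (made-up) example: one entry per publication-accession pair.
labels = ["function", "interaction", "expression"]
gold = [{"function"}, {"function", "interaction"}, {"expression"}, {"function"}]
pred = [{"function"}, {"function"}, {"function"}, {"function"}]

micro, macro = micro_macro_f1(gold, pred, labels)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the classifier only ever predicts the most frequent category, so the micro average stays comparatively high while the macro average drops, because the two rare categories each contribute an F1 of zero. A gap between micro and macro scores, as in the reported 0.72 versus 0.62, is consistent with weaker performance on less frequent categories, although the abstract itself does not state the cause.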
