The identification of chemical–protein interactions described in the literature is an important task with applications in drug design, precision medicine and biotechnology. Manually extracting such relationships from the biomedical literature is costly and often prohibitively time-consuming. The BioCreative VII DrugProt shared task provides a benchmark for methods that automatically extract chemical–protein relations from scientific text. Here we describe our contribution to the shared task and report the results achieved. We frame the task as a relation classification problem, which we approach with pretrained transformer language models. On top of this basic architecture, we experiment with incorporating textual and embedded side information from knowledge bases, as well as additional training data, to improve extraction performance. We perform a comprehensive evaluation of the proposed model and its individual extensions, including an extensive hyperparameter search comprising 2647 different runs. We find that ensembling and choosing the right pretrained language model are crucial for optimal performance, whereas adding additional data and embedded side information does not improve results. Our best model is an ensemble of 10 pretrained transformers augmented with textual descriptions of chemicals taken from the Comparative Toxicogenomics Database. It reaches an F1 score of 79.73% on the hidden DrugProt test set and ranks first among the 107 submitted runs in the official evaluation. Database URL: https://github.com/leonweber/drugprot
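To make the relation classification setup concrete, the sketch below shows one common way to cast a chemical–protein sentence as a classification instance for a pretrained transformer: the candidate chemical and gene/protein mentions are wrapped in marker tokens and the marked sentence is fed to a sequence classification head. The checkpoint name, marker scheme and label subset are illustrative assumptions, not the authors' exact configuration, and the head shown here is untrained.

```python
# Minimal sketch of chemical-protein relation classification with a
# pretrained transformer. Assumptions (not from the paper): the BioBERT
# checkpoint, the [CHEM]/[GENE] marker scheme, and the label subset.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"  # one biomedical PLM; the
# paper ensembles 10 such models (exact checkpoints not stated here).

# Hypothetical subset of DrugProt relation types plus a "no relation" class.
LABELS = ["NONE", "INHIBITOR", "ACTIVATOR", "AGONIST", "ANTAGONIST"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)

# Mark the candidate entity pair in the sentence. Side information, e.g. a
# chemical description from the Comparative Toxicogenomics Database, could
# be appended to the input text in the same way.
text = (
    "[CHEM] Aspirin [/CHEM] irreversibly inhibits [GENE] COX-1 [/GENE] "
    "by acetylating a serine residue."
)

inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits  # untrained head: prediction is random
print(LABELS[logits.argmax(dim=-1).item()])
```

In a real setup the marker strings would be registered as special tokens (tokenizer.add_special_tokens plus model.resize_token_embeddings) and the classification head fine-tuned on the DrugProt training annotations before inference.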