The identification of chemical–protein interactions described in the literature is an important task with applications in drug design, precision medicine and biotechnology. Manually extracting such relationships from the biomedical literature is costly and often prohibitively time-consuming. The BioCreative VII DrugProt shared task provides a benchmark for methods that automatically extract chemical–protein relations from scientific text. Here we describe our contribution to the shared task and report the results achieved. We frame the task as a relation classification problem, which we approach with pretrained transformer language models. On top of this basic architecture, we experiment with incorporating textual and embedded side information from knowledge bases, as well as additional training data, to improve extraction performance. We perform a comprehensive evaluation of the proposed model and its individual extensions, including an extensive hyperparameter search comprising 2647 different runs. We find that ensembling and choosing the right pretrained language model are crucial for optimal performance, whereas adding additional data and embedded side information does not improve results. Our best model is an ensemble of 10 pretrained transformers augmented with textual descriptions of chemicals taken from the Comparative Toxicogenomics Database. It reaches an F1 score of 79.73% on the hidden DrugProt test set and ranks first among the 107 submitted runs in the official evaluation. Database URL: https://github.com/leonweber/drugprot
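To make the relation classification setup concrete, the sketch below shows one common way to cast a chemical–protein sentence as a classification instance for a pretrained transformer: the candidate chemical and gene/protein mentions are wrapped in marker tokens and the marked sentence is fed to a sequence classification head. The checkpoint name, marker scheme and label subset are illustrative assumptions, not the authors' exact configuration, and the head shown here is untrained.

```python
# Minimal sketch of chemical-protein relation classification with a
# pretrained transformer. Assumptions (not from the paper): the BioBERT
# checkpoint, the [CHEM]/[GENE] marker scheme, and the label subset.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"  # one biomedical PLM; the
# paper ensembles 10 such models (exact checkpoints not stated here).

# Hypothetical subset of DrugProt relation types plus a "no relation" class.
LABELS = ["NONE", "INHIBITOR", "ACTIVATOR", "AGONIST", "ANTAGONIST"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)

# Mark the candidate entity pair in the sentence. Side information, e.g. a
# chemical description from the Comparative Toxicogenomics Database, could
# be appended to the input text in the same way.
text = (
    "[CHEM] Aspirin [/CHEM] irreversibly inhibits [GENE] COX-1 [/GENE] "
    "by acetylating a serine residue."
)

inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits  # untrained head: prediction is random
print(LABELS[logits.argmax(dim=-1).item()])
```

In a real setup the marker strings would be registered as special tokens (tokenizer.add_special_tokens plus model.resize_token_embeddings) and the classification head fine-tuned on the DrugProt training annotations before inference.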