Journal: Database

Integrating information retrieval with distant supervision for Gene Ontology annotation


Abstract

This article describes our participation in the Gene Ontology Curation task (GO task) of BioCreative IV, where we took part in both subtasks: (A) identification of GO evidence sentences (GOESs) for relevant genes in full-text articles and (B) prediction of GO terms for relevant genes in full-text articles. For subtask A, we trained a logistic regression model to detect GOESs based on annotations in the training data, supplemented with additional noisy negatives from an external resource. A greedy approach was then applied to associate genes with sentences. For subtask B, we designed two types of systems: (i) search-based systems, which predict GO terms from existing annotations for GOESs at different textual granularities (i.e., full-text articles, abstracts, and sentences) using state-of-the-art information retrieval techniques (i.e., a novel application of the idea of distant supervision) and (ii) a similarity-based system, which assigns GO terms based on the distance between words in sentences and GO terms/synonyms. Our best-performing system for subtask A achieves an F1 score of 0.27 based on exact match and 0.387 allowing relaxed overlap match. Our best-performing system for subtask B, a search-based system, achieves an F1 score of 0.075 based on exact match and 0.301 considering hierarchical matches. Our search-based systems for subtask B significantly outperformed the similarity-based system. Database URL: https://github.comoname2020/Bioc
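
The search-based idea for subtask B can be illustrated with a small sketch: retrieve already-annotated sentences that resemble a query sentence, then let the retrieved sentences vote for their GO terms. The abstract does not specify the retrieval model, so the TF-IDF weighting with cosine similarity used here is an assumption, as are the function names (`build_index`, `predict_go_terms`), the parameters `k` and `top_n`, and the toy sentences. It is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split on any non-alphanumeric character."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def build_index(annotated):
    """Build TF-IDF vectors for (text, go_terms) pairs; returns (vectors, idf)."""
    docs = [Counter(tokenize(text)) for text, _ in annotated]
    df = Counter(term for doc in docs for term in doc)
    n = len(docs)
    # Smoothed IDF so that terms occurring in every document keep a positive weight.
    idf = {term: math.log((n + 1) / (count + 1)) + 1.0 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_go_terms(query, annotated, vectors, idf, k=2, top_n=1):
    """Retrieve the k most similar annotated texts and rank their GO terms by summed score."""
    counts = Counter(tokenize(query))
    q = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)[:k]
    votes = defaultdict(float)
    for score, i in scored:
        for go_term in annotated[i][1]:
            votes[go_term] += score
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [go_term for go_term, score in ranked[:top_n] if score > 0]


# Toy, hand-written examples (NOT taken from the BioCreative IV data).
ANNOTATED = [
    ("The protein induces apoptosis in neuronal cells", {"GO:0006915"}),
    ("Mutants show defects in DNA repair after irradiation", {"GO:0006281"}),
    ("Loss of the gene blocks apoptosis during development", {"GO:0006915"}),
]

VECTORS, IDF = build_index(ANNOTATED)
prediction = predict_go_terms("Overexpression triggers apoptosis in cells", ANNOTATED, VECTORS, IDF)
print(prediction)
```

The same routine could be run over full-text articles, abstracts, or sentences by changing what each entry in `ANNOTATED` holds, which mirrors the different textual granularities mentioned in the abstract.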
