Journal: BMC Bioinformatics

Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation

Abstract

Background: Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate information retrieval from the literature would thus provide a significant boost to manual curation efforts.

Results: We employ the Textpresso category-based information retrieval and extraction system (http://www.textpresso.org), developed by WormBase, to explore how Textpresso might improve the efficiency with which we manually curate C. elegans proteins to the Gene Ontology's Cellular Component Ontology. Using a training set of sentences that describe results of localization experiments in the published literature, we generated three new curation task-specific categories (Cellular Components, Assay Terms, and Verbs) containing words and phrases associated with reports of experimentally determined subcellular localization. We compared the results of manual curation to those of Textpresso queries that searched the full text of articles for sentences containing terms from each of the three new categories plus the name of a previously uncurated C. elegans protein. Textpresso searches identified curatable papers with recall and precision rates of 79.1% and 61.8%, respectively (F-score of 69.5%), when compared to manual curation. Within those documents, Textpresso identified relevant sentences with recall and precision rates of 30.3% and 80.1% (F-score of 44.0%). From returned sentences, curators were able to make 66.2% of all possible experimentally supported GO Cellular Component annotations with 97.3% precision (F-score of 78.8%). Measuring the relative efficiencies of Textpresso-based versus manual curation, we find that Textpresso has the potential to increase curation efficiency by at least 8-fold, and perhaps as much as 15-fold, given differences in individual curatorial speed.

Conclusion: Textpresso is an effective tool for improving the efficiency of manual, experimentally based curation. Incorporating a Textpresso-based Cellular Component curation pipeline at WormBase has allowed us to transition from strictly manual curation of this data type to a more efficient pipeline of computer-assisted validation. Continued development of curation task-specific Textpresso categories will provide an invaluable resource for genomics databases that rely heavily on manual curation.
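The query logic and the reported F-scores can be illustrated with a minimal Python sketch. It is not Textpresso's implementation or API. The three term sets are tiny hypothetical stand-ins for the Cellular Components, Assay Terms, and Verbs categories, the protein name `XYZ-1` and both example sentences are made up, and the F-score is assumed to be the usual harmonic mean of recall and precision.

```python
import re

# Hypothetical, heavily abbreviated stand-ins for the three categories.
CELLULAR_COMPONENTS = {"nucleus", "cytoplasm", "plasma membrane", "mitochondria"}
ASSAY_TERMS = {"gfp", "antibody", "immunostaining", "fusion"}
VERBS = {"localizes", "localized", "expressed", "detected"}


def contains_term(sentence, terms):
    """Return True if any term occurs in the sentence as a whole word or phrase."""
    lowered = sentence.lower()
    return any(
        re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
        for term in terms
    )


def is_candidate(sentence, protein):
    """A sentence qualifies only if it names the protein AND contains
    at least one term from each of the three categories."""
    categories = (CELLULAR_COMPONENTS, ASSAY_TERMS, VERBS)
    return contains_term(sentence, {protein}) and all(
        contains_term(sentence, category) for category in categories
    )


def f_score(recall, precision):
    """Harmonic mean of recall and precision (both given in percent)."""
    return 2 * recall * precision / (recall + precision)


if __name__ == "__main__":
    # Made-up sentences, for illustration only.
    hit = "An XYZ-1::GFP fusion localized to the nucleus."
    miss = "XYZ-1 mutants are uncoordinated."
    print(is_candidate(hit, "XYZ-1"), is_candidate(miss, "XYZ-1"))

    # (recall, precision) pairs as reported in the abstract.
    for recall, precision in [(79.1, 61.8), (30.3, 80.1), (66.2, 97.3)]:
        print(recall, precision, round(f_score(recall, precision), 1))
```

Recomputing from the rounded percentages reproduces the sentence-level and annotation-level F-scores (44.0% and 78.8%). The paper-level value comes out near 69.4% rather than the reported 69.5%, a difference that may simply reflect the authors computing from unrounded figures.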
