Journal: PLoS Biology

Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature


Abstract

We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. 
Textpresso is a useful curation tool as well as a search engine for researchers, and it can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org.
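
The sentence-level markup and category search described in the abstract can be illustrated with a small sketch. This is not Textpresso's implementation: the three-category lexicon, the example sentences, and the function names `split_sentences`, `tag_sentence`, and `search` are all hypothetical, and the sentence splitter is deliberately naive. It only shows the idea of tagging sentences with ontology categories and then querying for a combination of categories and keywords, such as two gene names plus an interaction term.

```python
import re

# Hypothetical toy lexicon (category -> terms). The abstract reports about
# 14,500 entries in 33 categories; these few entries are for illustration.
LEXICON = {
    "gene": ["lin-12", "glp-1", "let-60"],
    "regulation": ["regulates", "represses", "activates"],
    "association": ["interacts with", "binds"],
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_sentence(sentence):
    """Return {category: set of matched lexicon terms} for one sentence."""
    tags = {}
    lowered = sentence.lower()
    for category, terms in LEXICON.items():
        for term in terms:
            # Reject matches embedded in a longer word or hyphenated token.
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            if re.search(pattern, lowered):
                tags.setdefault(category, set()).add(term)
    return tags


def search(sentences, category_counts, keywords=()):
    """Return sentences with at least n distinct terms of each requested
    category and every keyword (case-insensitive substring match)."""
    hits = []
    for sentence in sentences:
        tags = tag_sentence(sentence)
        has_categories = all(
            len(tags.get(category, ())) >= n
            for category, n in category_counts.items()
        )
        has_keywords = all(k.lower() in sentence.lower() for k in keywords)
        if has_categories and has_keywords:
            hits.append(sentence)
    return hits


# Made-up example sentences, not taken from the corpus.
TEXT = (
    "lin-12 interacts with glp-1 during development. "
    "The let-60 mutant shows a vulval phenotype. "
    "glp-1 regulates germline proliferation."
)
SENTENCES = split_sentences(TEXT)

# Semantic query: two distinct gene names plus one association term.
GENE_GENE = search(SENTENCES, {"gene": 2, "association": 1})
```

Here `GENE_GENE` holds only the first sentence, because it is the only one that mentions two lexicon genes together with an association term. A query such as `search(SENTENCES, {"gene": 1}, keywords=["germline"])` combines a category with a plain keyword in the same way.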
