Pacific Symposium on Biocomputing 2004; Jan 6-10, 2004; Hawaii, USA

BIOLOGICAL NOMENCLATURES: A SOURCE OF LEXICAL KNOWLEDGE AND AMBIGUITY

Abstract

There has been increased work in developing automated systems that use natural language processing (NLP) to recognize and extract genomic information from the literature. Recognition and identification of biological entities is a critical step in this process. NLP systems generally rely on nomenclatures and ontological specifications as resources for determining the names of the entities, assigning semantic categories that are consistent with the corresponding ontology, and assigning identifiers that map to well-defined entities within a particular nomenclature. Although nomenclatures and ontologies are valuable for text processing systems, they were developed to aid researchers and are heterogeneous in structure and semantics. A uniform resource that is automatically generated from diverse resources and designed for NLP purposes would be a useful tool for the field and would further database interoperability. This paper presents work towards this goal. We have automatically created lexical resources from four model organism nomenclature systems (mouse, fly, worm, and yeast), and have studied the performance of the resources within an existing NLP system, GENIES. Using nomenclatures is not straightforward because issues concerning ambiguity, synonymy, and name variations are quite challenging. In this paper we focus mainly on ambiguity. We determined that the proportion of ambiguous gene names within the individual nomenclatures, across the four nomenclatures, and with general English ranged from 0%-10.18%, 1.187%-20.30%, and 0%-2.49%, respectively. When actually processing text, we found the rate of ambiguous occurrences (not counting ambiguities stemming from English words) to range from 2.4%-32.9%, depending on the organisms considered.
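The three kinds of ambiguity measured in the abstract (within one nomenclature, across nomenclatures, and against general English) can be illustrated with a minimal Python sketch. This is not the authors' procedure or any part of GENIES: the function names, the lower-casing step, and the toy entries and identifiers below are all assumptions made for illustration, and none of the data comes from a real nomenclature.

```python
from collections import defaultdict


def build_lexicon(pairs):
    """Map each lower-cased name to the set of identifiers it denotes.

    Lower-casing is an assumed normalization, not one taken from the paper.
    """
    lexicon = defaultdict(set)
    for name, identifier in pairs:
        lexicon[name.lower()].add(identifier)
    return dict(lexicon)


def within_ambiguity(lexicon):
    """Share of names that denote more than one identifier in one lexicon."""
    if not lexicon:
        return 0.0
    ambiguous = sum(1 for ids in lexicon.values() if len(ids) > 1)
    return ambiguous / len(lexicon)


def cross_ambiguity(lexicon, others):
    """Share of names in `lexicon` that also occur in any other lexicon."""
    if not lexicon:
        return 0.0
    other_names = set().union(*(other.keys() for other in others))
    shared = sum(1 for name in lexicon if name in other_names)
    return shared / len(lexicon)


def english_ambiguity(lexicon, english_words):
    """Share of names in `lexicon` that are also general English words."""
    if not lexicon:
        return 0.0
    clashes = sum(1 for name in lexicon if name in english_words)
    return clashes / len(lexicon)


# Toy entries (hypothetical names and identifiers, not real gene records).
mouse = build_lexicon([
    ("abc1", "M:1"), ("abc1", "M:2"),  # one name, two identifiers
    ("xyz2", "M:3"),
    ("for", "M:4"),                    # collides with an English word
    ("qrs3", "M:5"),
])
fly = build_lexicon([("abc1", "F:1"), ("for", "F:2"), ("lmn4", "F:3")])
english = {"for", "and", "the"}

print(f"within mouse:     {within_ambiguity(mouse):.2%}")
print(f"mouse vs. fly:    {cross_ambiguity(mouse, [fly]):.2%}")
print(f"mouse vs English: {english_ambiguity(mouse, english):.2%}")
```

These rates are computed over distinct names in a lexicon. The occurrence-level rate reported for processed text is a different quantity, since it counts each mention in a document and so depends on how often the ambiguous names are actually used.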