Journal: BMC Medical Informatics and Decision Making

Automatic extraction of candidate nomenclature terms using the doublet method


Abstract

Background

New terminology continuously enters the biomedical literature. How can curators identify new terms that can be added to existing nomenclatures? The most direct method, and one that has served well, involves reading the current literature. The scholarly curator adds new terms as they are encountered. Present-day scholars are severely challenged by the enormous volume of biomedical literature. Curators of medical nomenclatures need computational assistance if they hope to keep their terminologies current. The purpose of this paper is to describe a method of rapidly extracting new, candidate terms from huge volumes of biomedical text. The resulting lists of terms can be quickly reviewed by curators and added to nomenclatures, if appropriate.

The candidate term extractor uses a variation of the previously described doublet coding method. The algorithm, which operates on virtually any nomenclature, derives from the observation that most terms within a knowledge domain are composed entirely of word combinations found in other terms from the same knowledge domain. Terms can be expressed as sequences of overlapping word doublets that have more specific meaning than the individual words that compose the term. The algorithm parses through text, finding contiguous sequences of word doublets that are known to occur somewhere in the reference nomenclature. When a sequence of matching word doublets is encountered, it is compared with whole terms already included in the nomenclature. If the doublet sequence is not already in the nomenclature, it is extracted as a candidate new term. Candidate new terms can be reviewed by a curator to determine if they should be added to the nomenclature. An implementation of the algorithm is demonstrated, using a corpus of published abstracts obtained through the National Library of Medicine's PubMed query service and using "The developmental lineage classification and taxonomy of neoplasms" as a reference nomenclature.
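The extraction step described above can be illustrated with a minimal Python sketch. This is not the Perl implementation that accompanies the article. The function names (`words`, `build_doublets`, `extract_candidates`), the tokenization rule (lowercase alphabetic tokens only), and the toy nomenclature in the usage note are assumptions made for illustration.

```python
import re

def words(text):
    """Lowercase the text and split it into alphabetic word tokens (an assumed tokenization)."""
    return re.findall(r"[a-z]+", text.lower())

def build_doublets(nomenclature):
    """Collect every adjacent word pair (doublet) that occurs in any reference term."""
    doublets = set()
    for term in nomenclature:
        w = words(term)
        doublets.update(zip(w, w[1:]))
    return doublets

def extract_candidates(text, nomenclature):
    """Return maximal contiguous runs of known doublets that are not already whole terms."""
    known_terms = {" ".join(words(t)) for t in nomenclature}
    doublets = build_doublets(nomenclature)
    w = words(text)
    candidates = []
    run = []  # words of the doublet sequence currently being extended
    for a, b in zip(w, w[1:]):
        if (a, b) in doublets:
            if not run:
                run = [a]      # start a new sequence at the first word of the doublet
            run.append(b)      # overlapping doublets share a word, so only add the second
        else:
            if run:
                candidates.append(" ".join(run))
            run = []
    if run:
        candidates.append(" ".join(run))
    # Keep only sequences that are not already terms in the nomenclature.
    return [c for c in candidates if c not in known_terms]
```

With a hypothetical three-term nomenclature of "squamous cell carcinoma", "basal cell carcinoma of skin", and "carcinoma of lung", the title "A case of squamous cell carcinoma of lung with metastasis" yields the single candidate "squamous cell carcinoma of lung": every adjacent word pair in it occurs in some reference term, but the whole phrase does not.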
Results

A 31+ megabyte corpus of pathology journal abstracts was parsed using the doublet extraction method. This corpus consisted of 4,289 records, each containing a title. The total number of words included in the titles was 50,547. New candidate terms for the nomenclature were automatically extracted from the titles of abstracts in the corpus. Total execution time on a desktop computer with a CPU speed of 2.79 GHz was 2 seconds. The resulting output consisted of 313 new candidate terms, each consisting of concatenated doublets found in the reference nomenclature. Human review of the 313 candidate terms yielded a list of 285 terms approved by a curator. A final automatic removal of duplicate terms yielded a list of 222 new terms (71% of the original 313 extracted candidate terms) that could be added to the reference nomenclature.

Conclusion

The doublet method for automatically extracting candidate nomenclature terms can be used to quickly find new terms from vast amounts of text. The method can be immediately adapted for virtually any text and any nomenclature. An implementation of the algorithm, in the Perl programming language, is provided with this article.
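The final duplicate-handling step and the reported yield can be sketched in a few lines of Python. The abstract does not say how duplicates were identified, so the rule used here (terms that are identical after lowercasing and whitespace normalization) is an assumption, as are the function names `deduplicate` and `yield_percent`.

```python
def deduplicate(terms):
    """Drop repeated terms, keeping the first occurrence of each in its original order.

    Assumption: two terms are duplicates if they match after lowercasing
    and collapsing runs of whitespace.
    """
    seen = set()
    unique = []
    for term in terms:
        key = " ".join(term.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(term)
    return unique

def yield_percent(final_count, extracted_count):
    """Share of extracted candidates that survive review and deduplication, as a whole percent."""
    return round(100 * final_count / extracted_count)
```

Applied to the counts reported in the Results, `yield_percent(222, 313)` gives 71, matching the stated 71%.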
