Automatic extraction of candidate nomenclature terms using the doublet method

Jules J Berman

Abstract

Background: New terminology continuously enters the biomedical literature. How can curators identify new terms that can be added to existing nomenclatures? The most direct method, and one that has served well, is to read the current literature and add new terms as they are encountered. Present-day scholars, however, are severely challenged by the enormous volume of biomedical literature, and curators of medical nomenclatures need computational assistance if they hope to keep their terminologies current. The purpose of this paper is to describe a method of rapidly extracting new candidate terms from huge volumes of biomedical text. The resulting lists of terms can be quickly reviewed by curators and added to nomenclatures, if appropriate.

The candidate term extractor uses a variation of the previously described doublet coding method. The algorithm, which operates on virtually any nomenclature, derives from the observation that most terms within a knowledge domain are composed entirely of word combinations found in other terms from the same knowledge domain. Terms can be expressed as sequences of overlapping word doublets that have more specific meaning than the individual words that compose the term. The algorithm parses through text, finding contiguous sequences of word doublets that are known to occur somewhere in the reference nomenclature. When a sequence of matching word doublets is encountered, it is compared with the whole terms already included in the nomenclature. If the doublet sequence is not already in the nomenclature, it is extracted as a candidate new term, which a curator can then review to decide whether it should be added. An implementation of the algorithm is demonstrated, using a corpus of published abstracts obtained through the National Library of Medicine's PubMed query service and using "The developmental lineage classification and taxonomy of neoplasms" as a reference nomenclature.
Results: A 31+ megabyte corpus of pathology journal abstracts was parsed using the doublet extraction method. This corpus consisted of 4,289 records, each containing a title. The total number of words included in the titles was 50,547. New candidate terms for the nomenclature were automatically extracted from the titles of abstracts in the corpus. Total execution time on a desktop computer with a CPU speed of 2.79 GHz was 2 seconds. The resulting output consisted of 313 new candidate terms, each consisting of concatenated doublets found in the reference nomenclature. Human review of the 313 candidate terms yielded a list of 285 terms approved by a curator. A final automatic removal of duplicate terms yielded a list of 222 new terms (71% of the original 313 extracted candidate terms) that could be added to the reference nomenclature.

Conclusion: The doublet method for automatically extracting candidate nomenclature terms can be used to quickly find new terms from vast amounts of text. The method can be immediately adapted for virtually any text and any nomenclature. An implementation of the algorithm, in the Perl programming language, is provided with this article.
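The extraction step described in the abstract can be sketched roughly as follows. This is a minimal Python illustration, not the Perl implementation that accompanies the article. The function names, the tokenization rule, and the three-term toy nomenclature are all assumptions made for this example. In particular, the sketch ignores sentence boundaries, so a doublet run could span punctuation, which a real implementation might want to prevent.

```python
import re

def words(text):
    """Lowercase a string and split it into alphanumeric word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_doublets(nomenclature):
    """Collect every adjacent word pair that occurs in any nomenclature term."""
    doublets = set()
    for term in nomenclature:
        w = words(term)
        doublets.update(zip(w, w[1:]))
    return doublets

def candidate_terms(text, nomenclature):
    """Return runs of contiguous known doublets that are not already whole terms."""
    known_terms = {" ".join(words(t)) for t in nomenclature}
    doublets = build_doublets(nomenclature)
    candidates = []
    run = []  # words of the current run of matching doublets

    def flush():
        phrase = " ".join(run)
        # Keep the run only if it is new to the nomenclature and not a duplicate.
        if run and phrase not in known_terms and phrase not in candidates:
            candidates.append(phrase)

    w = words(text)
    for a, b in zip(w, w[1:]):
        if (a, b) in doublets:
            if not run:
                run.append(a)  # start a new run with the first word of the pair
            run.append(b)      # extend the run by the overlapping second word
        else:
            flush()
            run = []
    flush()  # handle a run that reaches the end of the text
    return candidates

# Hypothetical toy nomenclature, for illustration only.
TOY_NOMENCLATURE = [
    "squamous cell carcinoma",
    "carcinoma of lung",
    "basal cell adenoma",
]

title = "A case of basal cell carcinoma of lung was reported"
print(candidate_terms(title, TOY_NOMENCLATURE))
# → ['basal cell carcinoma of lung']
```

In this toy run, every adjacent word pair in "basal cell carcinoma of lung" occurs in some nomenclature term, but the phrase as a whole does not, so it is reported as a candidate for curator review. A title consisting only of "squamous cell carcinoma" would yield nothing, because that sequence is already a complete term.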


Bibliographic details

Source: BMC Medical Informatics and Decision Making, 2005, Issue 1
Author: Jules J Berman
Original format: PDF
Classification: Medicine, Health
