An Effective Approach to Biomedical Information Extraction with Limited Training Data.

Abstract

In the current millennium, the extensive use of computers and the internet has caused an exponential increase in information. Few research areas are as important as information extraction, which primarily involves extracting concepts and the relations between them from free text. Limited training data, a lack of lexicons and a lack of relationship patterns are major causes of poor performance in information extraction: the training data cannot possibly contain all concepts and their synonyms, and it contains only limited examples of the relationship patterns between concepts. Creating training data, lexicons and relationship patterns is expensive, especially in the biomedical domain (including clinical notes), because of the depth of domain knowledge required of the curators.

Dictionary-based approaches to concept extraction in this domain are not sufficient to overcome the complexities that arise from the descriptive nature of human language. For example, biomedical text contains relatively more abbreviations than everyday English text, and not all of them appear in lexicons. Sometimes abbreviations act as modifiers within an adjective (e.g. CD4-negative) rather than as nouns, and hence are not usually considered named entities. Many chemical names contain numbers, commas, hyphens and parentheses (e.g. t(3;3)(q21;q26)) and are split apart by most tokenizers. In addition, partial words are used in place of full words (e.g. up- and downregulate), and some of the vocabulary is highly specialized to the domain. Clinical notes contain peculiar drug names, anatomical nomenclature, and other specialized names and phrases that are not standard in everyday English or in published articles (e.g. "l shoulder inj"). State-of-the-art concept extraction systems use machine learning algorithms to overcome some of these challenges, but they need a large annotated corpus for every concept class to be extracted.

A novel natural language processing approach based on distributional semantics is proposed here to minimize this limitation in concept extraction. Distributional semantics is an emerging field built on the notion that the meaning, or semantics, of a piece of text (discourse) depends on the distribution of the elements of that discourse in relation to their surroundings. Distributional information from large unlabeled corpora is used to automatically create lexicons for the concepts to be tagged, clusters of contextually similar words, and thesauri of distributionally similar words. These automatically generated lexical resources are shown here to be more useful than manually created lexicons for extracting concepts from both literature and narratives. Further, machine learning features based on distributional semantics are shown to improve the accuracy of BANNER, and they could be used in other machine learning systems, such as cTakes, to improve their performance.
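To make the distributional-semantics idea above concrete, the following is a minimal sketch, not the pipeline described in the dissertation, of how a lexicon might be expanded from unlabeled text: each word is represented by a vector of its co-occurring context words, and candidate terms are ranked by cosine similarity to a handful of seed terms. The corpus, window size, seed terms and function names are illustrative assumptions.

# A minimal sketch of distributional similarity for lexicon expansion.
# Illustrative only; the corpus, window size, seed terms and function
# names are hypothetical and do not reproduce the dissertation's pipeline.
from collections import Counter, defaultdict
import math

def context_vectors(sentences, window=2):
    """Represent each token by a bag of the words that co-occur with it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def expand_lexicon(seeds, vectors, top_k=10):
    """Rank unseen words by their best similarity to any seed term."""
    scores = {}
    for word, vec in vectors.items():
        if word not in seeds:
            scores[word] = max((cosine(vec, vectors[s]) for s in seeds if s in vectors), default=0.0)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical usage: 'corpus' is a large collection of tokenized, unlabeled
# sentences, and the seeds are a few known gene names.
# candidates = expand_lexicon({"BRCA1", "TP53"}, context_vectors(corpus))

The same context vectors could, under these assumptions, also feed word clusters or a distributional thesaurus, or be turned into features for a tagger such as BANNER.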
In addition, to simplify sentence patterns and facilitate association extraction, a new algorithm using a "shotgun" approach is proposed. The goal of sentence simplification has traditionally been to reduce the grammatical complexity of sentences while retaining their relevant information content and meaning, both to make them more readable for humans and to ease processing by parsers. Sentence simplification is shown here to improve the performance of association extraction systems for both biomedical literature and clinical notes: it improves the accuracy of protein-protein interaction extraction from the literature and also improves relationship extraction from clinical notes (for example, between medical problems, tests and treatments). A toy illustration of this kind of preprocessing is sketched below.

Overall, the two main contributions of this work are the application of sentence simplification to association extraction, as described above, and the use of distributional semantics for concept extraction. The proposed work on concept extraction amalgamates, for the first time, two diverse research areas: distributional semantics and information extraction. This approach offers all the advantages of other semi-supervised machine learning systems and, unlike other proposed semi-supervised approaches, it can be used on top of different basic frameworks and algorithms.
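As a toy illustration of the kind of preprocessing that sentence simplification performs, the sketch below splits a complex sentence at semicolons and coordinating conjunctions so that a downstream association extractor sees shorter candidate sentences. It is not the "shotgun" algorithm proposed in the dissertation, whose details are not given in this abstract; the splitting rules and the example sentence are assumptions made for illustration.

import re

# Toy rule-based sentence simplification: split at semicolons and at
# ", and" / ", but" clause boundaries. Purely illustrative; this is not
# the dissertation's "shotgun" algorithm.
def simplify(sentence):
    parts = re.split(r";\s*|,\s*(?:and|but)\s+", sentence)
    return [p.strip().rstrip(".") + "." for p in parts if p.strip()]

if __name__ == "__main__":
    s = ("Protein A binds protein B in vitro, and its overexpression "
         "upregulates gene C; the interaction is lost when domain D is mutated.")
    for simple in simplify(s):
        print(simple)
    # Each simplified sentence can then be passed to an association
    # extractor, where subject-verb-object patterns are easier to match.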

Bibliographic details

  • Author: Jonnalagadda, Siddhartha
  • Author affiliation: Arizona State University
  • Degree-granting institution: Arizona State University
  • Subjects: Information Science; Artificial Intelligence; Computer Science
  • Degree: Ph.D.
  • Year: 2011
  • Pagination: 155 p.
  • Total pages: 155
  • Format: PDF
  • Language: eng
