Beyond Linear Chain: A Journey through Conditional Random Fields for Information Extraction from Text

Abstract

Natural language, spoken and written, is the most important way for humans to communicate information to each other. In recent decades, natural language processing (NLP) researchers have studied methods aimed at making computers "understand" the information enclosed in human language. Information Extraction (IE) is a field of NLP that studies methods for extracting information from text so that it can be used to populate a structured information repository, such as a relational database. IE is divided into several subtasks, each of which aims to extract different structures from text, such as entities, relations, or more complex structures such as ontologies. In this thesis the term "information extraction" is (somewhat arbitrarily) used to identify only the subtasks that are formulated as sequence labeling tasks. Recently, the main approaches to IE have relied on supervised machine learning, which needs human-labeled data examples in order to train the systems that extract information from yet unseen data. When IE is tackled as a sequence labeling task (as in, e.g., named-entity recognition, concept extraction, and in some cases opinion mining), probabilistic graphical models, and specifically Conditional Random Fields (CRFs), are among the best-performing supervised machine learning methods. In this thesis we investigate two major aspects of information extraction from text via CRFs: the creation of CRF models that outperform the commonly adopted, state-of-the-art "linear-chain" CRFs, and the impact of the quality of training data on the accuracy of CRF systems for IE. In the first part of the thesis we use the capabilities of the CRF framework to create new kinds of CRFs (i.e., two-stage, ensemble, multi-label, hierarchical) that, unlike the commonly adopted linear-chain CRFs, have a customized structure that fits the task taken into consideration.
We exemplify this approach on two different tasks, i.e., IE from medical documents and opinion mining from product reviews. CRFs, like any machine learning-based approach, may suffer if the quality of the training data is low. Therefore, the second part of the thesis is devoted to (1) the study of how the quality of the training data affects the accuracy of a CRF system for IE; and (2) the production of human-annotated training data via semi-supervised active learning (AL).

Bibliographic record

  • Author

    Marcheggiani Diego

  • Year 2014
  • Original format PDF
  • Language en
