
Learning for information extraction: From named entity recognition and disambiguation to relation extraction.


Abstract

Information Extraction, the task of locating textual mentions of specific types of entities and their relationships, aims at representing the information contained in text documents in a structured format that is more amenable to applications in data mining, question answering, or the semantic web. The goal of our research is to design information extraction models that obtain improved performance by exploiting types of evidence that have not been explored in previous approaches. Since designing an extraction system through introspection by a domain expert is a laborious and time consuming process, the focus of this thesis will be on methods that automatically induce an extraction model by training on a dataset of manually labeled examples.

Named Entity Recognition is an information extraction task that is concerned with finding textual mentions of entities that belong to a predefined set of categories. We approach this task as a phrase classification problem, in which candidate phrases from the same document are collectively classified. Global correlations between candidate entities are captured in a model built using the expressive framework of Relational Markov Networks. Additionally, we propose a novel tractable approach to phrase classification for named entity recognition based on a special Junction Tree representation.

Classifying entity mentions into a predefined set of categories achieves only a partial disambiguation of the names. This is further refined in the task of Named Entity Disambiguation, where names need to be linked to their actual denotations. In our research, we use Wikipedia as a repository of named entities and propose a ranking approach to disambiguation that exploits learned correlations between words from the name context and categories from the Wikipedia taxonomy.

Relation Extraction refers to finding relevant relationships between entities mentioned in text documents. Our approaches to this information extraction task differ in the type and the amount of supervision required. We first propose two relation extraction methods that are trained on documents in which sentences are manually annotated for the required relationships. In the first method, the extraction patterns correspond to sequences of words and word classes anchored at two entity names occurring in the same sentence. These are used as implicit features in a generalized subsequence kernel, with weights computed through training of Support Vector Machines. In the second approach, the implicit extraction features are focused on the shortest path between the two entities in the word-word dependency graph of the sentence. Finally, in a significant departure from previous learning approaches to relation extraction, we propose reducing the amount of required supervision to only a handful of pairs of entities known to exhibit or not exhibit the desired relationship. Each pair is associated with a bag of sentences extracted automatically from a very large corpus. We extend the subsequence kernel to handle this weaker form of supervision, and describe a method for weighting features in order to focus on those correlated with the target relation rather than with the individual entities. The resulting Multiple Instance Learning approach offers a competitive alternative to previous relation extraction methods, at a significantly reduced cost in human supervision.
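As an illustration of the second relation extraction approach, the following is a minimal sketch of a kernel over shortest dependency paths. It is not taken from the thesis text: the function names (`shortest_path`, `path_kernel`), the toy dependency edges, and the per-word feature sets are all hypothetical, and the kernel shown (zero for paths of different length, otherwise the product of the number of shared features at each position) is one simple formulation that fits the description above.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search over a dependency graph treated as undirected.

    `edges` is a list of (head, dependent) word pairs (a hypothetical toy parse).
    Returns the list of words on a shortest path, or None if none exists.
    """
    neighbors = {}
    for head, dep in edges:
        neighbors.setdefault(head, []).append(dep)
        neighbors.setdefault(dep, []).append(head)
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbors.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None

def path_kernel(path_a, path_b):
    """Each path is a list of feature sets, one per word on the path.

    Returns 0 when the paths differ in length; otherwise the product of the
    number of features shared at each position.
    """
    if len(path_a) != len(path_b):
        return 0
    score = 1
    for feats_a, feats_b in zip(path_a, path_b):
        score *= len(feats_a & feats_b)
    return score

# Hypothetical features per word: the word itself, a POS tag, an entity type.
features = {
    "protesters": {"protesters", "NNS", "PERSON"},
    "seized": {"seized", "VBD"},
    "stations": {"stations", "NNS", "FACILITY"},
    "troops": {"troops", "NNS", "PERSON"},
    "raided": {"raided", "VBD"},
    "churches": {"churches", "NNS", "FACILITY"},
}

edges_1 = [("seized", "protesters"), ("seized", "stations")]
edges_2 = [("raided", "troops"), ("raided", "churches")]

path_1 = [features[w] for w in shortest_path(edges_1, "protesters", "stations")]
path_2 = [features[w] for w in shortest_path(edges_2, "troops", "churches")]

print(path_kernel(path_1, path_2))  # 2 * 1 * 2 = 4
```

A kernel of this kind never enumerates the feature combinations explicitly, which is what "implicit extraction features" suggests; a kernel matrix computed this way could then be passed to a Support Vector Machine trainer that accepts precomputed kernels.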

Bibliographic record

  • Author

    Bunescu, Razvan Constantin

  • Author affiliation

    The University of Texas at Austin

  • Degree-granting institution The University of Texas at Austin
  • Subject Computer Science
  • Degree Ph.D.
  • Year 2007
  • Pages 168 p.
  • Original format PDF
  • Language of text English (eng)

  • Date indexed 2022-08-17 11:39:23
