Web Relation Extraction with Distant Supervision


Abstract

Finding relevant information about prominent entities quickly is the main reason to use a search engine. However, given the large quantities of information on the World Wide Web, real-time search over billions of Web pages can waste resources and the end user's time. One solution is to store the answers to frequently asked general knowledge queries, such as the albums released by a musical artist, in a more accessible format: a knowledge base. Knowledge bases can be created and maintained automatically using information extraction methods, particularly methods that extract relations between proper names (named entities). Distantly supervised approaches have become popular for this in recent years because they allow relation extractors to be trained without text-bound annotation; instead, known relations from a knowledge base are heuristically aligned with a large textual corpus from an appropriate domain. This thesis focuses on researching distant supervision for the Web domain. A new setting for creating training and testing data for distant supervision from the Web with entity-specific search queries is introduced, and the resulting corpus is published. Methods to recognise noisy training examples, as well as methods to combine extractions based on statistics derived from the background knowledge base, are researched. Using co-reference resolution methods to extract relations from sentences that do not contain a direct mention of the subject of the relation is also investigated. Named entity recognition and classification (NERC) is identified as one bottleneck for distant supervision on Web data, since relation extraction methods rely on it to identify relation arguments. Typically, existing pre-trained tools are used, which fail in diverse genres with non-standard language, such as the Web genre.
The thesis explores what can cause NERC methods to fail in diverse genres and quantifies the different reasons for NERC failure. Finally, a novel method for NERC for relation extraction is proposed, based on the idea of jointly training the named entity classifier and the relation extractor with imitation learning to reduce the reliance on external NERC tools. This thesis improves the state of the art in distant supervision for knowledge base population, and it sheds light on, and proposes solutions for, issues that arise when applying information extraction to domains that have not traditionally been studied.
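
To make the heuristic alignment step concrete, the following is a minimal sketch of the basic distant supervision labelling heuristic as it is commonly described: a sentence is labelled with a relation whenever it mentions both arguments of a known knowledge base triple. It is not the thesis's implementation. The names `Triple` and `distant_label`, the plain substring matching, and the toy knowledge base and sentences are all assumptions made for illustration only.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    """A known fact from a knowledge base (hypothetical structure)."""
    subject: str
    relation: str
    obj: str


def distant_label(sentences: List[str], kb: List[Triple]) -> List[Tuple[str, str, str, str]]:
    """Label a sentence with a relation when it mentions both the
    subject and the object of a knowledge base triple.

    Matching is naive, case-insensitive substring search; a real system
    would rely on named entity recognition instead.
    """
    examples = []
    for sentence in sentences:
        lowered = sentence.lower()
        for triple in kb:
            if triple.subject.lower() in lowered and triple.obj.lower() in lowered:
                examples.append((sentence, triple.subject, triple.obj, triple.relation))
    return examples


# Toy data, invented for this illustration.
kb = [Triple("The Beatles", "album", "Abbey Road")]
sentences = [
    "The Beatles released Abbey Road in 1969.",
    "The Beatles walked across Abbey Road for the cover photo.",
    "Let It Be was recorded earlier.",
]

examples = distant_label(sentences, kb)
for sentence, subject, obj, relation in examples:
    print(f"{relation}({subject}, {obj}) <- {sentence}")
```

The second sentence is labelled even though it refers to the street and not the album, which is the kind of noisy training example the abstract mentions. The third sentence yields no example because it names neither argument of the triple.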

Bibliographic details

  • Author

    Augenstein Isabelle

  • Year 2016
  • Original format PDF
