
Graph-based approaches to resolve entity ambiguity.


Abstract

Information extraction is the task of automatically extracting structured information from unstructured or semi-structured machine-readable documents. One of the challenges of information extraction is to resolve ambiguity between entities, either in a knowledge base or in text documents. There are many variations of this problem, and it is known under different names, such as coreference resolution, entity disambiguation, entity linking, and entity matching. For example, coreference resolution decides whether two expressions refer to the same entity; entity disambiguation determines how to map an entity mention to an appropriate entity in a knowledge base (KB); entity linking focuses on inferring that two entity mentions in one or more documents refer to the same real-world entity even if they do not appear in a KB; and entity matching (also called record deduplication, entity resolution, or reference reconciliation) merges records from databases if they refer to the same object.

Resolving ambiguity and finding proper matches between entities is an important step for many downstream applications, such as data integration, question answering, and relation extraction. The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains, posing a scalability challenge for information extraction systems. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and to answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge.

This dissertation studies various aspects of, and different settings for, resolving ambiguity between entities. A new scalable, domain-independent, graph-based approach utilizing Personalized PageRank is developed for entity matching across large-scale knowledge bases and evaluated on datasets of 110 million and 203 million entities. A new model is proposed for entity disambiguation between a document and a knowledge base; it utilizes a document graph and effectively filters out noise, and the corresponding datasets are released. A competitive result of 91.7% micro-accuracy is achieved on the AIDA benchmark dataset, outperforming the most recent state-of-the-art models. A new technique based on a paraphrase detection model is proposed to recognize name variations for an entity in a document, and the corresponding training and test datasets are made publicly available. A new approach integrating the graph-based entity disambiguation model with this technique is presented for an entity linking task and is evaluated on a dataset for the Text Analysis Conference Entity Discovery and Linking task.
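
The abstract names Personalized PageRank as the basis of the entity matching approach but does not describe the algorithm itself. As a rough illustration only, the following minimal sketch shows generic Personalized PageRank by power iteration on a toy graph. It is not the dissertation's method: the function name `personalized_pagerank`, the damping value `alpha=0.85`, the iteration count, and the toy graph are all assumptions made for this example.

```python
from typing import Dict, List

# Adjacency list: node -> list of neighbor nodes (hypothetical toy format).
Graph = Dict[str, List[str]]


def personalized_pagerank(graph: Graph, seed: str, alpha: float = 0.85,
                          iterations: int = 50) -> Dict[str, float]:
    """Power iteration for PageRank with all teleport mass on one seed node.

    Assumes every neighbor also appears as a key of `graph`.
    """
    rank = {node: (1.0 if node == seed else 0.0) for node in graph}
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, neighbors in graph.items():
            if not neighbors:
                # Dangling node: send its walking mass back to the seed.
                nxt[seed] += alpha * rank[node]
                continue
            share = alpha * rank[node] / len(neighbors)
            for neighbor in neighbors:
                nxt[neighbor] += share
        # With probability 1 - alpha the walk restarts at the seed.
        nxt[seed] += 1.0 - alpha
        rank = nxt
    return rank


# Toy graph (assumed data): two disconnected clusters of related nodes.
toy_graph: Graph = {
    "Michael Jordan (mention)": ["Chicago Bulls", "NBA"],
    "Chicago Bulls": ["Michael Jordan (mention)", "NBA"],
    "NBA": ["Chicago Bulls", "Michael Jordan (mention)"],
    "Machine learning": ["Berkeley"],
    "Berkeley": ["Machine learning"],
}

scores = personalized_pagerank(toy_graph, seed="Michael Jordan (mention)")
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```

Because the random walk always restarts at the seed, nodes that are close to the seed in the graph receive high scores and nodes in the unrelated cluster receive none, which is the intuition behind using such scores to rank candidate matches.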

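Bibliographic record

  • Author

    Pershina, Maria

  • Author affiliation

    New York University

  • Degree-granting institution New York University
  • Discipline Computer science
  • Degree Ph.D.
  • Year 2016
  • Pages 94 p.
  • Total pages 94
  • Original format PDF
  • Language of text English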
