
Entity information extraction using structured and semi-structured resources.


Abstract

Among the tasks in Information Extraction, Entity Linking (EL), also referred to as entity disambiguation or entity resolution, is a new and important problem that has recently attracted the attention of many researchers in the Natural Language Processing (NLP) community. The task involves linking a textual mention of a named entity (such as a person or a movie name) to the appropriate entry in a database (e.g. Wikipedia or IMDB). If the database does not contain the entity, the system should return a NIL (out-of-database) value.

Existing techniques for linking named entities in text mostly focus on Wikipedia as the target catalog of entities. Yet for many types of entities, such as restaurants and cult movies, relational databases exist that contain far more extensive information than Wikipedia. In this dissertation, we introduce a new framework, called Open-Database Entity Linking (Open-DB EL), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. In experiments on two domains, our Open-DB EL strategies outperform a state-of-the-art Wikipedia EL system by over 25% in accuracy.

Existing approaches typically perform EL using a pipeline architecture: they use a Named-Entity Recognition (NER) system to find the boundaries of mentions in text, and an EL system to connect the mentions to entries in structured or semi-structured repositories like Wikipedia. However, the two tasks are tightly coupled, and each type of system can benefit significantly from the kind of information provided by the other. We propose and develop a joint model for NER and EL, called NEREL, that takes a large set of candidate mentions from typical NER systems and a large set of candidate entity links from EL systems, and ranks the candidate mention-entity pairs together to make joint predictions. In NER and EL experiments across three datasets, NEREL significantly outperforms or comes close to the performance of two state-of-the-art NER systems, and it outperforms 6 competing EL systems. On the benchmark MSNBC dataset, NEREL provides a 60% reduction in error over the next-best NER system and a 68% reduction in error over the next-best EL system.

We also extend the idea of using semi-structured resources to a relatively less explored area of entity information extraction. Most previous work on information extraction from text has focused on named-entity recognition, entity linking, and relation extraction. Much less attention has been paid to extracting the temporal scope of relations between named entities; for example, the relation president-Of (John F. Kennedy, USA) is true only in the time frame (January 20, 1961 - November 22, 1963). In this dissertation we present a system for temporal scoping of relational facts, called TSRF, which is trained with distant supervision based on the largest semi-structured resource available: Wikipedia. TSRF employs language models consisting of patterns automatically bootstrapped from sentences collected from Wikipedia pages that contain the main entity of a page and slot-fillers extracted from the infobox tuples. The proposed system achieves state-of-the-art results on 6 out of 7 relations on the benchmark Text Analysis Conference (TAC) 2013 dataset for the task of temporal slot filling (TSF). Overall, the system outperforms the next-best system that participated in the TAC evaluation by 10 points on the TAC-TSF evaluation metric.
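To make the task definition above concrete, the following is a minimal sketch of linking a mention to a database entry or returning NIL. It is not the dissertation's method: the toy catalog, the `link` function, the string-similarity scoring, and the 0.8 threshold are all assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional

# Hypothetical toy catalog (entity id -> canonical name); a real system
# would query an arbitrary relational database instead.
CATALOG: Dict[str, str] = {
    "m1": "The Rocky Horror Picture Show",
    "m2": "Donnie Darko",
    "p1": "John F. Kennedy",
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(mention: str, catalog: Dict[str, str],
         threshold: float = 0.8) -> Optional[str]:
    """Return the id of the best-matching entry, or None to signal NIL.

    The threshold is an illustrative assumption: if no entry is similar
    enough, the mention is treated as out-of-database.
    """
    best_id: Optional[str] = None
    best_score = 0.0
    for entity_id, name in catalog.items():
        score = similarity(mention, name)
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else None
```

Here `link("donnie darko", CATALOG)` resolves to the catalog entry, while a mention with no close match, such as `link("Blade Runner", CATALOG)`, yields `None`, standing in for the NIL value.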

Bibliographic record

  • Author: Sil, Avirup
  • Author affiliation: Temple University
  • Degree-granting institution: Temple University
  • Subject: Computer Science; Information Science
  • Degree: Ph.D.
  • Year: 2014
  • Pages: 109 p.
  • Total pages: 109
  • Original format: PDF
  • Language of text: English
  • Date indexed: 2022-08-17 11:53:47
