Iterative semantic information extraction from unstructured text sources

Abstract

We now generate an enormous amount of data, and most of it is unstructured. Every minute, Internet users post more than 200,000 text documents and write more than 200 million e-mails. We would like to access this data in a structured form, which is why this dissertation deals with information extraction from text sources. Information extraction is a type of information retrieval whose main tasks are named entity recognition, relationship extraction, and coreference resolution. The dissertation consists of four main chapters: three each address a separate information extraction task, and the last combines all three tasks into an iterative method within an end-to-end information extraction system. First, we introduce the task of coreference resolution, whose goal is to merge all mentions that refer to a specific entity. We propose the SkipCor system, which casts the task as a sequence tagging problem for which first-order probabilistic models can be used. To enable the detection of distant coreferent mentions, we propose an innovative transformation into skip-mention sequences and achieve results comparable to or better than other known approaches. We use a similar transformation for relationship extraction, with different tags and with rules that enable the extraction of hierarchical relationships. The proposed solution achieves the best result in a challenge on extracting relationships between genes that form a gene regulation network. Lastly, we present the oldest and most thoroughly researched task, named entity recognition, which deals with tagging one or more words that represent a specific entity type, for example, persons. Here we adapt standard procedures for sequence tagging tasks and rank seventh in the chemical compound and drug name recognition challenge.
We solve all three problems using linear-chain conditional random field models. We combine the tasks in an iterative method that accepts unstructured text as input and returns the extracted entities along with the relationships between them. The output is represented according to a system ontology, which provides better data interoperability. Information extraction for the Slovene language is not yet well researched, which is why we also include a list of English-to-Slovene translations of selected terms.
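
The abstract does not spell out how skip-mention sequences are built, so the following is only an illustrative sketch under one assumption: a sequence with skip distance `k` keeps every `(k+1)`-th mention of a document, so that mentions far apart in the text become neighbours that a first-order sequence tagger can link. The function name `skip_mention_sequences` and the sample mentions are hypothetical and are not taken from the dissertation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def skip_mention_sequences(mentions: Sequence[T], skip: int) -> List[List[T]]:
    """Split an ordered list of mentions into subsequences.

    Assumption (not confirmed by the abstract): with skip distance `skip`,
    each subsequence takes every (skip + 1)-th mention, starting at a
    different offset, so every mention appears in exactly one subsequence.
    """
    if skip < 0:
        raise ValueError("skip must be non-negative")
    step = skip + 1
    # One subsequence per starting offset; never more offsets than mentions.
    offsets = range(min(step, len(mentions)))
    return [list(mentions[offset::step]) for offset in offsets]


# Hypothetical mentions in document order.
mentions = ["John", "Mary", "he", "her", "him"]

for k in range(3):
    # skip = 0 reproduces the original mention sequence.
    print(k, skip_mention_sequences(mentions, k))
```

With skip distance 1, "John", "he", and "him" end up adjacent in one subsequence, which is the situation a linear-chain model can tag directly. How the per-sequence tags are merged into final coreference clusters is not covered by this sketch.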

Bibliographic record

  • Author: Žitnik Slavko
  • Year: 2014
  • Original format: PDF
