
Learning soft domain constraints in a factor graph model for template-based information extraction



Abstract

The ability to accurately extract key information from textual documents is necessary in several downstream applications, e.g., automatic knowledge base population from text, semantic information retrieval, question answering, or text summarization. However, information extraction (IE) systems are far from error-free and in some cases commit errors that seem obvious to a human expert because they violate common sense or domain knowledge. Towards improving the performance of IE systems, we focus on the question of how domain knowledge can be incorporated into IE models to reduce the number of spurious extractions. Starting from the assumption that such domain knowledge cannot be incorporated explicitly and manually by domain experts due to the amount of effort and technical complexities involved, we propose a machine learning approach in which domain constraints are acquired as a byproduct of learning a model that extracts key information in a supervised setting. We frame the task as a template-based information extraction problem in which several dependent slots need to be automatically filled, and propose a factor-graph-based approach to model the joint distribution of slot assignments given a text. Beyond using standard textual features in factors that score the compatibility of slot fillers in relation to the text, we use additional features that are text-independent and capture soft domain constraints. During training, these constraints receive a weight as part of the parameter learning process, indicating how strongly a constraint should be enforced. These domain constraints are thus 'soft' in the sense that they can be violated, but the system learns to penalize solutions that violate them. The soft constraints we introduce come in two flavors. On the one hand, we incorporate information about the mean of numerical attributes and use features that indicate how far a certain value is from the mean. We call these features single slot soft constraints. On the other hand, we model the pairwise compatibility between slot filler assignments independent of the textual context, thus modeling the (domain) compatibility of the slot assignments. We call the latter pairwise slot soft constraints. As the main result of our work, we show that learning pairwise slot soft constraints improves the performance of our extraction model compared to single slot soft constraints by up to 6 points in F-1, leading to an F-1 score of 0.91 for individual template types. Further, the human-readable output format of our model enables the extraction and interpretation of the learned soft constraints. Based on this, we show in an evaluation by domain experts that more than 68% of the learned soft constraints are regarded as plausible.
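
The scoring idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the slot names (`organism`, `weight_g`), all scores and weights, the bucketing of the distance from the mean, and the brute-force search over candidate assignments are hypothetical choices made only for illustration. In the paper the weights are learned during training; here they are set by hand.

```python
from itertools import product


def single_slot_feature(slot, value, means, stds):
    """Text-independent feature: how far a numeric value lies from the slot mean.

    The distance is measured in standard deviations and capped at bucket 3
    (the bucketing scheme is an assumption of this sketch).
    """
    z = abs(value - means[slot]) / stds[slot]
    return f"single:{slot}:dist_bucket={min(int(z), 3)}"


def pairwise_feature(slot_a, val_a, slot_b, val_b):
    """Text-independent feature: co-occurrence of two slot fillers."""
    return f"pair:{slot_a}={val_a}|{slot_b}={val_b}"


def score(assignment, text_scores, weights, means, stds):
    """Unnormalized log-score of one joint slot assignment."""
    total = 0.0
    for slot, value in assignment.items():
        # Textual factor: compatibility of the filler with the text.
        total += text_scores.get((slot, value), 0.0)
        # Single slot soft constraint (numeric slots only).
        if slot in means:
            feat = single_slot_feature(slot, value, means, stds)
            total += weights.get(feat, 0.0)
    slots = sorted(assignment)
    for i, a in enumerate(slots):
        for b in slots[i + 1:]:
            # Pairwise slot soft constraint.
            feat = pairwise_feature(a, assignment[a], b, assignment[b])
            total += weights.get(feat, 0.0)
    return total


def best_assignment(candidates, text_scores, weights, means, stds):
    """Exhaustive search; only feasible for tiny candidate sets."""
    slots = sorted(candidates)
    combos = product(*(candidates[s] for s in slots))
    assignments = (dict(zip(slots, combo)) for combo in combos)
    return max(assignments,
               key=lambda a: score(a, text_scores, weights, means, stds))


# Hypothetical toy data: the text slightly favors an implausible weight.
candidates = {"organism": ["rat", "mouse"], "weight_g": [25.0, 300.0]}
text_scores = {("organism", "rat"): 1.0, ("organism", "mouse"): 0.8,
               ("weight_g", 25.0): 1.0, ("weight_g", 300.0): 0.9}
means, stds = {"weight_g": 250.0}, {"weight_g": 50.0}
weights = {"single:weight_g:dist_bucket=3": -1.0,
           "pair:organism=rat|weight_g=25.0": -1.5,
           "pair:organism=rat|weight_g=300.0": 0.5}

baseline = best_assignment(candidates, text_scores, {}, means, stds)
constrained = best_assignment(candidates, text_scores, weights, means, stds)
print(baseline, constrained)
```

With an empty weight table, only the textual factors count and the implausible weight wins. With the hand-set constraint weights, the assignment that violates the constraints is penalized but not forbidden, which is the sense in which the constraints are soft.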

Bibliographic record

  • Source
    Data & Knowledge Engineering | 2020, Issue 1 | pp. 101764.1-101764.17 | 17 pages
  • Authors

  • Author affiliations

    Bielefeld Univ CITEC Inspirat 1 D-33619 Bielefeld Germany;

    Semalytix GmbH Meller Str 2 D-33613 Bielefeld Germany;

    Heinrich Heine Univ Dusseldorf Univ Klinikum Neurol Klin Moorenstr 5 D-40225 Dusseldorf Germany|Life Sci Ctr Dusseldorf Ctr Neuronal Regenerat Merowingerpl 1a D-40225 Dusseldorf Germany;

    Univ Stuttgart Inst Maschinelle Sprachverarbeitung Pfaffenwaldring 56 D-70569 Stuttgart Germany;

  • Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords

    Template-based information extraction; Slot-filling; Probabilistic graphical models; Learning domain constraints; Database population;

