OpenCeres: When Open Information Extraction Meets the Semi-Structured Web

Abstract

Open Information Extraction (OpenIE), the problem of harvesting triples from natural language text whose predicate relations are not aligned to any pre-defined ontology, has been a popular subject of research for the last decade. However, this research has largely ignored the vast quantity of facts available in semi-structured webpages. In this paper, we define the problem of OpenIE from semi-structured websites to extract such facts, and present an approach for solving it. We also introduce a labeled evaluation dataset to motivate research in this area. Given a semi-structured website and a set of seed facts for some relations existing on its pages, we employ a semi-supervised label propagation technique to automatically create training data for the relations present on the site. We then use this training data to learn a classifier for relation extraction. On our new benchmark dataset, this method obtained a precision of over 70%. A larger-scale extraction experiment on 31 websites in the movie vertical resulted in the extraction of over 2 million triples.
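
The label propagation step mentioned in the abstract can be illustrated with a generic sketch. This is not the paper's algorithm: the graph, the edge weights, the node names such as `page1:Director`, and the relation labels are all hypothetical, chosen only to show how a few seed labels might spread to similar fields on other pages of a site. It assumes a weighted similarity graph over candidate fields is already available.

```python
def propagate_labels(edges, seeds, num_iters=20):
    """Generic iterative label propagation with clamped seed nodes.

    edges: dict mapping node -> {neighbor: weight} (assumed symmetric).
    seeds: dict mapping node -> known relation label.
    Returns a dict mapping every node to its best label, or None if no
    label mass ever reached it.
    """
    labels = sorted(set(seeds.values()))
    nodes = set(edges) | set(seeds)
    for nbrs in edges.values():
        nodes.update(nbrs)

    # Seeds start with a one-hot score; all other nodes start at zero.
    scores = {n: {l: 0.0 for l in labels} for n in nodes}
    for n, l in seeds.items():
        scores[n][l] = 1.0

    for _ in range(num_iters):
        new_scores = {}
        for n in nodes:
            nbrs = edges.get(n, {})
            total = sum(nbrs.values())
            if n in seeds or total == 0:
                # Seeds stay clamped; isolated nodes keep their scores.
                new_scores[n] = scores[n]
                continue
            # Each node takes the weighted average of its neighbors' scores.
            new_scores[n] = {
                l: sum(w * scores[m][l] for m, w in nbrs.items()) / total
                for l in labels
            }
        scores = new_scores

    result = {}
    for n in nodes:
        best = max(labels, key=lambda l: scores[n][l])
        result[n] = best if scores[n][best] > 0 else None
    return result


# Hypothetical similarity graph over fields found on three movie pages.
edges = {
    "page1:Director": {"page2:Director": 1.0, "page3:Directed by": 0.5},
    "page2:Director": {"page1:Director": 1.0, "page3:Directed by": 0.5},
    "page3:Directed by": {"page1:Director": 0.5, "page2:Director": 0.5},
    "page1:Cast": {"page2:Cast": 1.0},
    "page2:Cast": {"page1:Cast": 1.0},
    "page3:Runtime": {},
}
# Hypothetical seed facts known for a single page.
seeds = {"page1:Director": "directed_by", "page1:Cast": "acted_in"}

result = propagate_labels(edges, seeds)
```

In a pipeline like the one the abstract outlines, the propagated labels would then serve as automatically created training data for a relation extraction classifier. A field that no seed reaches, such as the isolated `page3:Runtime` here, stays unlabeled.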