
Improving Knowledge Base Construction from Robust Infobox Extraction




A capable, automatic Question Answering (QA) system can provide more complete and accurate answers using a comprehensive knowledge base (KB). One important approach to constructing a comprehensive knowledge base is to extract information from Wikipedia infobox tables to populate an existing KB. Despite previous successes in the Infobox Extraction (IBE) problem (e.g., DBpedia), three major challenges remain: 1) deterministic extraction patterns used in DBpedia are vulnerable to template changes; 2) over-trusting Wikipedia anchor links can lead to entity disambiguation errors; 3) heuristic-based extraction of unlinkable entities yields low precision, hurting both accuracy and completeness of the final KB. This paper presents a robust approach that tackles all three challenges. We build probabilistic models to predict relations between entity mentions directly from the infobox tables in HTML. The entity mentions are linked to identifiers in an existing KB if possible. The unlinkable ones are also parsed and preserved in the final output. Training data for both the relation extraction and the entity linking models are automatically generated using distant supervision. We demonstrate the empirical effectiveness of the proposed method in both precision and recall compared to a strong IBE baseline, DBpedia, with an absolute improvement of 41.3% in average F1. We also show that our extraction makes the final KB significantly more complete, improving the completeness score of list-value relation types by 61.4%.
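The abstract does not spell out how distant supervision produces training data, so the following is only a minimal sketch of the general idea, not the authors' pipeline. It assumes infobox rows have already been parsed into (subject, label, mention) tuples and that the existing KB exposes triples and a name dictionary; the function `distant_labels`, the identifiers, and the sample data are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject_id, relation, object_id) from the existing KB
Row = Tuple[str, str, str]     # (subject_id, infobox_label, value_mention) from a parsed infobox
Example = Tuple[str, str, str, str, str]  # row fields plus the aligned (relation, object_id)


def distant_labels(rows: List[Row],
                   kb_triples: List[Triple],
                   names: Dict[str, Set[str]]) -> List[Example]:
    """Label infobox rows by aligning them with facts already in the KB.

    A row is labeled with relation r and object o when the KB holds
    (subject, r, o) and the row's mention matches a known name of o.
    Rows with no matching fact are left unlabeled.
    `names` maps an object id to its lowercase surface names (assumed input).
    """
    # Index KB facts by subject for quick lookup.
    by_subject: Dict[str, List[Tuple[str, str]]] = {}
    for subj, rel, obj in kb_triples:
        by_subject.setdefault(subj, []).append((rel, obj))

    examples: List[Example] = []
    for subj, label, mention in rows:
        key = mention.strip().lower()
        for rel, obj in by_subject.get(subj, []):
            if key in names.get(obj, set()):
                examples.append((subj, label, mention, rel, obj))
    return examples


# Hypothetical toy data, for illustration only.
kb = [("Q1", "spouse", "Q2"), ("Q1", "birthplace", "Q3")]
kb_names = {"Q2": {"jane doe"}, "Q3": {"springfield"}}
infobox_rows = [
    ("Q1", "Spouse(s)", "Jane Doe"),
    ("Q1", "Born", "Springfield"),
    ("Q1", "Occupation", "Painter"),  # no KB fact, so it stays unlabeled
]
training_examples = distant_labels(infobox_rows, kb, kb_names)
```

Each resulting example pairs a raw infobox label with a KB relation and a mention with a KB identifier, so in principle the same alignment could supply supervision for both a relation extraction model and an entity linking model. Exact string matching is a simplification chosen here for brevity.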


