Conference: IEEE International Conference on Data Engineering

Canonicalization of Open Knowledge Bases with Side Information from the Source Text


Abstract

Nowadays Open Information Extraction (Open IE) approaches, which extract <noun phrase, relation phrase, noun phrase> triples from unstructured text, contribute to the construction of large Open Knowledge Bases (Open KBs). However, one crucial problem is that the noun phrases and relation phrases in the extracted triples are not well canonicalized, which leads to a large number of redundant and ambiguous facts. For example, both <Barack Obama, was born in, Honolulu> and <President Obama, has birthplace, Honolulu> may be extracted and stored in Open KBs. Recent research proposes to solve this problem by clustering over manually defined feature spaces based on the similarity of the noun phrases and relation phrases. However, the performance of such techniques is limited, since only the information contained in the triples is utilized to measure their similarity. In this paper, we propose to perform canonicalization over Open IE triples by incorporating the side information from the original data sources, including the candidate entities of the noun phrases detected in the source text, the types of the candidate entities and the domain knowledge of the source text. We model the canonicalization problem of noun phrases and relation phrases jointly based on such side information, and demonstrate the effectiveness of our approach through extensive experiments on two real-world datasets.
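
The role of side information can be illustrated with a small sketch. This is not the model proposed in the paper, which canonicalizes noun phrases and relation phrases jointly. It is only a toy clustering of noun phrases under stated assumptions: token-level Jaccard overlap stands in for triple-only similarity, overlap of candidate entities stands in for side information, and the names `canonicalize`, `weight`, `threshold` and the entity ids `E1`, `E2` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def canonicalize(phrases, candidates, threshold=0.5, weight=0.5):
    """Group noun phrases whose combined similarity reaches the threshold.

    phrases:    list of noun-phrase strings taken from Open IE triples.
    candidates: dict mapping a phrase to a set of candidate entity ids
                (hypothetical side information from the source text).
    weight:     share of the score given to side information (assumption).
    """
    parent = {p: p for p in phrases}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for p, q in combinations(phrases, 2):
        token_sim = jaccard(set(p.lower().split()), set(q.lower().split()))
        side_sim = jaccard(candidates.get(p, set()), candidates.get(q, set()))
        score = (1 - weight) * token_sim + weight * side_sim
        if score >= threshold:
            parent[find(p)] = find(q)  # merge the two clusters

    clusters = {}
    for p in phrases:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())


# Illustrative data only; the entity ids are made up for this sketch.
phrases = ["Barack Obama", "President Obama", "Honolulu"]
candidates = {
    "Barack Obama": {"E1"},
    "President Obama": {"E1"},
    "Honolulu": {"E2"},
}

with_side_info = canonicalize(phrases, candidates)
triple_only = canonicalize(phrases, candidates, weight=0.0)
```

With these toy inputs, the shared candidate entity lifts the score for "Barack Obama" and "President Obama" above the threshold, so they fall into one cluster. Setting `weight=0.0` leaves only the token overlap of one shared word out of three, which stays below the threshold and keeps the two phrases apart.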
