IEEE International Conference on Data Engineering

Canonicalization of Open Knowledge Bases with Side Information from the Source Text

Abstract

Nowadays Open Information Extraction (Open IE) approaches, which extract <noun phrase, relation phrase, noun phrase> triples from unstructured text, contribute to the construction of large Open Knowledge Bases (Open KBs). However, one crucial problem is that the noun phrases and relation phrases in the extracted triples are not well canonicalized, which leads to a large number of redundant and ambiguous facts. For example, both <Barack Obama, was born in, Honolulu> and <President Obama, has birthplace, Honolulu> may be extracted and stored in Open KBs. Recent research proposes to solve this problem by clustering over manually defined feature spaces based on the similarity of the noun phrases and relation phrases. However, the performance of such techniques is limited, since only the information contained in the triples is utilized to measure their similarity. In this paper, we propose to perform canonicalization over Open IE triples by incorporating the side information from the original data sources, including the candidate entities of the noun phrases detected in the source text, the types of the candidate entities and the domain knowledge of the source text. We model the canonicalization problem of noun phrases and relation phrases jointly based on such side information, and demonstrate the effectiveness of our approach through extensive experiments on two real-world datasets.
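
To make the idea concrete, the following is a minimal sketch of how candidate-entity side information could change a noun-phrase clustering decision. It is not the joint model proposed in the paper: it handles noun phrases only, and it swaps in a plain weighted similarity with threshold-based union-find clustering. The function names, the `alpha` and `threshold` parameters, and the entity identifiers (`e:obama`, `e:honolulu`) are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def canonicalize(phrases, candidates, threshold=0.5, alpha=0.5):
    """Group noun phrases whose combined similarity reaches `threshold`.

    phrases: list of noun-phrase strings.
    candidates: dict mapping a phrase to a set of candidate entity ids
        (hypothetical side information, e.g. from an entity linker).
    alpha: weight of the token-overlap signal versus the entity signal.
    """
    parent = {p: p for p in phrases}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for p, q in combinations(phrases, 2):
        token_sim = jaccard(set(p.lower().split()), set(q.lower().split()))
        entity_sim = jaccard(candidates.get(p, set()), candidates.get(q, set()))
        if alpha * token_sim + (1 - alpha) * entity_sim >= threshold:
            parent[find(p)] = find(q)  # merge the two clusters

    clusters = {}
    for p in phrases:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())


phrases = ["Barack Obama", "President Obama", "Honolulu"]
# Made-up candidate entity ids, standing in for entity-linker output.
side_info = {
    "Barack Obama": {"e:obama"},
    "President Obama": {"e:obama"},
    "Honolulu": {"e:honolulu"},
}

print(canonicalize(phrases, {}))         # triple-internal signal only
print(canonicalize(phrases, side_info))  # with candidate-entity side information
```

In this toy setting the two Obama phrases share only one of three distinct tokens, so token overlap alone stays below the threshold and they remain separate. Once the shared candidate entity is taken into account, they fall into a single cluster.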
