Journal: Frontiers of computer science in China

EnAli: entity alignment across multiple heterogeneous data sources


Abstract

Entity alignment is the problem of identifying which entities in one data source refer to the same real-world entity as entities in other sources. Identifying entities across heterogeneous data sources is essential to many research fields, such as data cleaning, data integration, information retrieval, and machine learning. The alignment process is not only prohibitively expensive for large data sources, since it involves all tuples from two or more data sources, but also needs to handle heterogeneous entity attributes. In this paper, we propose an unsupervised approach, called EnAli, to match entities across two or more heterogeneous data sources. EnAli employs a generative probabilistic model that incorporates heterogeneous entity attributes through the exponential family and handles missing values, and it uses a locality-sensitive hashing scheme to reduce the number of candidate tuples and speed up the alignment process. EnAli is highly accurate and efficient even without any ground-truth tuples. We illustrate the performance of EnAli on re-identifying entities within the same data source, as well as on aligning entities across three real data sources. Our experimental results show that the proposed approach outperforms the comparable baseline.
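
The abstract does not say which locality-sensitive hashing scheme EnAli uses, so the following is only a minimal sketch of the general idea of LSH-based candidate reduction, assuming MinHash with banding over character trigrams. The function names (shingles, minhash_signature, candidate_pairs) and the parameters (num_hashes, bands) are hypothetical and are not taken from the paper.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lower-cased string."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def _hash(seed, token):
    """Seeded 64-bit hash of a token (stable across runs, unlike built-in hash)."""
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash value under each seeded hash function."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def candidate_pairs(records, num_hashes=32, bands=16):
    """Return record-id pairs whose signatures agree on at least one band.

    Only these pairs would be passed on to a (more expensive) matching model,
    instead of comparing every tuple with every other tuple.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical records from two sources; ids and values are made up for illustration.
records = {
    "a1": "John Smith, Boston",
    "b1": "john smith, boston",
    "c1": "Zhang Wei, Shanghai",
}
pairs = candidate_pairs(records)
```

With more bands (fewer rows per band) the filter becomes more permissive and recall rises at the cost of more candidate pairs; fewer bands make it stricter. This trade-off is a property of the banding technique in general, not a claim about the settings used for EnAli.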
