IEEE International Conference on Data Mining Workshops

Methodology for Large-Scale Entity Resolution without Pairwise Matching


Abstract

Entity Resolution (ER) is the process of determining whether two information system records refer to the same entity, and it is a crucial part of Information Quality research. The ER process becomes far more complex and time-consuming as datasets approach Big Data volumes. Because of the special characteristics of transitive closure in Entity Resolution and the high volume of input data, traditional pairwise-matching ER algorithms cannot solve the problem efficiently. This paper presents a methodology that uses match keys to perform Entity Resolution without pairwise matching. Transitive closure is needed when each input reference can potentially create more than one match key. The paper also introduces a novel distributed, parallel transitive closure algorithm for the Entity Resolution context, together with an optimized version that applies the method to multiple match keys. The implementation of the methodology is built on Hadoop MapReduce for distributed computation.
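To make the match-key idea concrete, here is a minimal single-machine sketch in Python. It is not the paper's distributed algorithm: the abstract does not give the key definitions or the MapReduce steps, so the `match_keys` function, the record fields (`name`, `phone`, `email`), and the use of a union-find structure for the transitive closure are all assumptions chosen for illustration.

```python
from collections import defaultdict


def match_keys(record):
    """Hypothetical key generators; one record may yield several keys."""
    name = record["name"].strip().lower()
    keys = []
    if record.get("phone"):
        keys.append(("name_phone", name, record["phone"]))
    if record.get("email"):
        keys.append(("email", record["email"].lower()))
    return keys


class UnionFind:
    """Disjoint-set structure used here to compute the transitive closure."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the smaller id as the cluster representative.
            self.parent[max(ra, rb)] = min(ra, rb)


def resolve(records):
    """Cluster record ids that are linked, directly or transitively, by a shared key."""
    # Group record ids by match key; no record is compared with another.
    by_key = defaultdict(list)
    for rid, rec in records.items():
        for key in match_keys(rec):
            by_key[key].append(rid)

    uf = UnionFind()
    for rid in records:
        uf.find(rid)  # register records that share no key with anyone
    for ids in by_key.values():
        for other in ids[1:]:
            uf.union(ids[0], other)

    clusters = defaultdict(set)
    for rid in records:
        clusters[uf.find(rid)].add(rid)
    return sorted(sorted(c) for c in clusters.values())


# Made-up sample data: records 2 and 3 share no key with each other,
# but each shares one key with record 1.
records = {
    1: {"name": "Ann Lee", "phone": "555-0100", "email": "ann@example.com"},
    2: {"name": "ann lee", "phone": "555-0100", "email": None},
    3: {"name": "A. Lee", "phone": None, "email": "ANN@example.com"},
    4: {"name": "Bob Ray", "phone": "555-0199", "email": "bob@example.com"},
}
clusters = resolve(records)  # [[1, 2, 3], [4]]
```

Records 2 and 3 end up in the same cluster only because record 1 produces two match keys, which is the situation in which the abstract says transitive closure comes into play. Grouping by key replaces the all-pairs comparison; in a MapReduce setting the grouping step would presumably correspond to the shuffle on the match key, though that mapping is an inference rather than something the abstract states.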
