2017 International Conference on Intelligent Communication and Computational Techniques

Efficient entity resolution using multiple blocking keys for bibliographic dataset


Abstract

Entity resolution (ER) is the task of identifying entities that refer to the same real-world object. The standard entity resolution process compares each entity with all other entities, which is inefficient for large datasets. A significant challenge in ER is to reduce the search space and execution time. The aim of this paper is to provide an efficient entity resolution implementation for massive datasets by combining multiple blocking keys with parallel and distributed computing. With multiple blocking keys, a record can belong to multiple blocks, so the same record pair may be generated multiple times for the matching task. A solution to eliminate these duplicate pairs is proposed. In addition, a character-based similarity measure on sorted tokens is used to compute the similarity between two records in the matching task. An efficient partitioning technique is used to overcome the limitations of skewed datasets, so that matching tasks are evenly distributed among all the reducers. We used a bibliographic dataset in our experiments to show that our approach is less time-consuming and scalable.
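
The abstract does not give implementation details, so the following is only a minimal single-process sketch of the ideas it names, under stated assumptions. The key functions (first three characters of title and author), the "smallest common blocking key" rule for skipping duplicate pairs, the use of Python's difflib.SequenceMatcher as the character-based measure, the 0.8 threshold, and the toy records are all hypothetical choices for illustration and are not taken from the paper. The distributed execution and the skew-aware partitioning across reducers are not shown.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_keys(record):
    """Return the set of blocking keys for a record (assumed key functions)."""
    title = record["title"].lower()
    author = record["author"].lower()
    return {"title:" + title[:3], "author:" + author[:3]}


def sorted_token_similarity(a, b):
    """Character-based similarity computed after sorting the tokens.

    Sorting makes the measure insensitive to word order. SequenceMatcher
    is a stand-in for whichever character-based measure the paper uses.
    """
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def resolve(records, threshold=0.8):
    """Block on multiple keys, compare each candidate pair exactly once."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        for key in blocking_keys(rec):
            blocks[key].append(rid)

    matches = []
    comparisons = 0
    for key, ids in blocks.items():
        for r1, r2 in combinations(sorted(ids), 2):
            # Assumed deduplication rule: a pair that co-occurs in several
            # blocks is compared only in the block with the smallest key.
            common = blocking_keys(records[r1]) & blocking_keys(records[r2])
            if key != min(common):
                continue
            comparisons += 1
            score = sorted_token_similarity(
                records[r1]["title"], records[r2]["title"]
            )
            if score >= threshold:
                matches.append((r1, r2))
    return sorted(matches), comparisons


# Hypothetical toy records. Records 1 and 2 share both blocks, so the
# pair (1, 2) is generated twice but compared only once.
RECORDS = {
    1: {"title": "Efficient entity resolution using blocking", "author": "Smith"},
    2: {"title": "Efficient entity resolution using blocking", "author": "Smith J"},
    3: {"title": "blocking using entity resolution Efficient", "author": "Smith"},
    4: {"title": "Graph databases in practice", "author": "Jones"},
}

if __name__ == "__main__":
    found, n_comparisons = resolve(RECORDS)
    print(found, n_comparisons)
```

In this toy run the four candidate pairs generated by the blocks reduce to three actual comparisons, because the pair (1, 2) appears in both the title block and the author block. In a MapReduce setting the same check could run inside each reducer, since it needs only the two records and the current block key.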
