Journal: Datenbank-Spektrum

An Efficient Blocking Technique for Reference Matching using MapReduce


Abstract

Document clustering has become an increasingly important task in data mining and information retrieval. With growing data volumes, CPU- and memory-efficient clustering algorithms are receiving considerable attention in the research community. To deal with huge amounts of data (e.g., documents from Wikipedia or CiteSeerX, which are several GB in size), distributed clustering techniques have been designed to provide scalable and flexible approaches. We study the problem of document clustering in the area of Entity Matching, where documents from various data sources are matched against each other. More specifically, we focus on a common optimization technique called blocking, which reduces the enormous search space by clustering the data sources into smaller groups and performing comparisons only within each group. In this article, we describe our experiences and findings in applying the MapReduce framework to huge bibliographic data sets in order to provide a flexible, scalable, and easy-to-use blocking technique that reduces the search space for Entity Matching.
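
The blocking idea described in the abstract can be illustrated with a small single-process sketch that mimics the map, shuffle, and reduce steps. This is not the authors' implementation: the blocking key (first three letters of the title), the similarity measure (`difflib.SequenceMatcher`), the 0.8 threshold, and the record layout are all assumptions chosen only to make the example self-contained.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Assumed key for illustration: first three letters of the lowercased title.
    letters = [c for c in record["title"].lower() if c.isalpha()]
    return "".join(letters[:3])


def map_phase(records):
    # Map: emit (blocking key, record) so similar records share a key.
    for record in records:
        yield blocking_key(record), record


def shuffle(pairs):
    # Shuffle: group records by key, as a MapReduce framework would do.
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    return blocks


def reduce_phase(block, threshold=0.8):
    # Reduce: compare records only within one block, not across blocks.
    for a, b in combinations(block, 2):
        score = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
        if score >= threshold:
            yield a["id"], b["id"]


def match(records, threshold=0.8):
    blocks = shuffle(map_phase(records))
    matches = []
    for key in sorted(blocks):
        matches.extend(reduce_phase(blocks[key], threshold))
    return matches


# Hypothetical bibliographic records, used only for this example.
records = [
    {"id": 1, "title": "An Efficient Blocking Technique for Reference Matching using MapReduce"},
    {"id": 2, "title": "An efficient blocking technique for reference matching using Map-Reduce"},
    {"id": 3, "title": "Scalable Document Clustering"},
]
```

Calling `match(records)` compares only records 1 and 2, which share the key `ane`; record 3 falls into its own block and is never compared. Without blocking, three pairwise comparisons would be needed instead of one. A key this coarse can miss true matches whose titles begin differently, which is the usual trade-off of blocking.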
