IEEE International Conference on Data Engineering

A practical and effective sampling selection strategy for large scale deduplication



Abstract

Record deduplication aims at identifying entities that are potentially the same in a data repository. A set of pairs that is manually labeled is generally used to tune the deduplication process, as each dataset has a particular dirtiness pattern. However, producing an informative set of pairs is a very costly task, especially in very large datasets (even for expert users). We propose a new sampling strategy that is able to select a very small and informative set of pairs from large datasets. Our results show that our approach reduces user effort substantially while achieving a competitive or superior matching quality.
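
The abstract does not describe how the sampling strategy works, so the following is only a generic sketch of one way to draw a small set of pairs for labeling, and it should not be read as the authors' method. It assumes candidate pairs are scored with a token-level Jaccard similarity and that a few pairs are drawn at random from each similarity level, so the labeled set covers clear matches, clear non-matches, and the ambiguous middle. The names `jaccard` and `sample_pairs_by_similarity`, and the parameters `n_bins` and `per_bin`, are hypothetical.

```python
import random
from collections import defaultdict


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two record strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def sample_pairs_by_similarity(pairs, n_bins=10, per_bin=2, seed=0):
    """Pick up to `per_bin` pairs from each similarity level.

    Assumption: this is a plain stratified sample over similarity
    bins, used here for illustration only.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for a, b in pairs:
        score = jaccard(a, b)
        # A score of exactly 1.0 falls into the last bin.
        idx = min(int(score * n_bins), n_bins - 1)
        bins[idx].append((a, b, score))

    selected = []
    for idx in sorted(bins):  # lowest similarity level first
        bucket = bins[idx]
        k = min(per_bin, len(bucket))
        selected.extend(rng.sample(bucket, k))
    return selected


candidate_pairs = [
    ("john smith", "john smith"),
    ("john smith", "jon smith"),
    ("acme corp", "globex inc"),
]
to_label = sample_pairs_by_similarity(candidate_pairs, per_bin=1)
```

A user would then label only the pairs in `to_label`, which keeps the manual effort bounded by `n_bins * per_bin` regardless of how many candidate pairs the dataset produces.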
