Journal: IEEE Transactions on Knowledge and Data Engineering
A Practical and Effective Sampling Selection Strategy for Large Scale Deduplication

Abstract

The data deduplication task has attracted a considerable amount of attention from the research community in order to provide effective and efficient solutions. The information provided by the user to tune the deduplication process is usually represented by a set of manually labeled pairs. In very large datasets, producing this kind of labeled set is a daunting task since it requires an expert to select and label a large number of informative pairs. In this article, we propose a two-stage sampling selection strategy (T3S) that selects a reduced set of pairs to tune the deduplication process in large datasets. T3S selects the most representative pairs by following two stages. In the first stage, we propose a strategy to produce balanced subsets of candidate pairs for labeling. In the second stage, an active selection is incrementally invoked to remove the redundant pairs in the subsets created in the first stage in order to produce an even smaller and more informative training set. This training set is effectively used both to identify where the most ambiguous pairs lie and to configure the classification approaches. Our evaluation shows that T3S is able to reduce the labeling effort substantially while achieving a competitive or superior matching quality when compared with state-of-the-art deduplication methods in large datasets.