A practical and effective sampling selection strategy for large scale deduplication


Abstract

Record deduplication aims at identifying entities that are potentially the same in a data repository. A set of pairs that is manually labeled is generally used to tune the deduplication process, as each dataset has a particular dirtiness pattern. However, producing an informative set of pairs is a very costly task, especially in very large datasets (even for expert users). We propose a new sampling strategy that is able to select a very small and informative set of pairs from large datasets. Our results show that our approach reduces user effort substantially while achieving a competitive or superior matching quality.


Bibliographic details

Source: IEEE International Conference on Data Engineering, 2016, 2 pages
Authors: Guilherme Dal Bianco; Renata Galante; Carlos A. Heuser; Marcos Gonçalves; Sergio Canuto
Original format: PDF
Chinese Library Classification: Computer software

Similar literature

1. A Practical and Effective Sampling Selection Strategy for Large Scale Deduplication [J]. Bianco Guilherme Dal, Galante Renata, Goncalves Marcos Andre. Knowledge and Data Engineering, IEEE Transactions on. 2015, No. 9

2. Modelling a two-dimensional spatial distribution of mycotoxin concentration in bulk commodities to design effective and efficient sample selection strategies [J]. M. Rivas Casado, D.J. Parsons, R.M. Weightman. Food Additives & Contaminants. 2009, No. 9

3. Micro-scale quantitation of ten phthalate esters in water samples and cosmetics using capillary liquid chromatography coupled to UV detection: effective strategies to reduce the production of organic waste [J]. Chia-Hsien Feng, Shin-Ruei Jiang. Mikrochimica Acta: An International Journal for Physical and Chemical Methods of Analysis. 2012, No. 1a2

4. A practical and effective sampling selection strategy for large scale deduplication [C]. Guilherme Dal Bianco, Renata Galante, Carlos A. Heuser. IEEE International Conference on Data Engineering. 2016

5. The practical application of Vectar Processed densities in proving the lateral continuity of coal Zones and Samples in the Ellisras Basin, South Africa in support of effective Mineral Resource adjudication [D]. Sullivan, John Hendrey. 2014

6. Site selection in community-based clinical trials for substance use disorders: Strategies for effective site selection [O]. Jennifer Sharpe Potter, Dennis Donovan, Roger D. Weiss

7. Modelling a two-dimensional spatial distribution of mycotoxin concentration in bulk commodities to design effective and efficient sample selection strategies [O]. Rivas Casado Monica, Parsons David J., Magan Naresh. 2009
