Data De-Duplication through Active Learning.


Abstract

Data de-duplication concerns the identification and eventual elimination of records in a dataset that refer to the same entity without necessarily sharing the same attribute values or the same identifying values. Machine Learning techniques have been used to handle data de-duplication, and Active Learning with ensemble learning methods is one such technique. An ensemble learning algorithm creates a set of different models from the same training set. Active Learning then iteratively passes unlabeled pairs of records to these models for labeling as duplicates or non-duplicates, and selects the pairs that cause the most disagreement among the models. The selected pairs are considered to bring the most information gain to the learning process. Active Learning thus continuously teaches a learner to find duplicate instances by providing it with a better training set.

The experimental results show that Active Learning using Query by Bagging performs well on synthetic datasets and requires only a few iterations to generate a good de-duplication function. The size of the dataset does not seem to have much effect on the results. On real-world data, Active Learning using Query by Bagging still performs well, except when the dataset contains a significant amount of noise. However, the learning process on real-world data is not as smooth as on synthetic data. Performance with the Canopy Clustering and Bigram Indexing blocking methods was evaluated, and Bigram Indexing gave the better results.

Active Learning using Query by Boosting performs well on synthetic datasets. It also generates good results on real-world datasets, although the presence of noise in the dataset negatively affects the learning process. Again, the dataset size does not affect performance when Query by Boosting is used. The evaluation of the de-duplication function using Canopy Clustering and Bigram Indexing does not show any significant difference.

We further compare the performance of Query by Bagging with that of Query by Boosting. When the two methods are compared under the two blocking methods, the experiment shows that Query by Boosting yields better results for both Canopy Clustering and Bigram Indexing. The same observation holds when comparing synthetic and real-world data.

This thesis evaluates how Active Learning undertakes the task of data de-duplication when the Query by Bagging and Query by Boosting algorithms are used. During the evaluation, we investigate the performance of Active Learning in various situations. We study the impact of varying the data size as well as the impact of using different blocking methods, which are methods used to reduce the number of potential duplicates for comparison. We also consider the performance of Active Learning when a synthetic dataset is used versus a real-world dataset.
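The selection loop described in the abstract (build a committee of models from the same training set, then query the record pairs the committee disagrees on most) can be illustrated with a minimal Python sketch of Query by Bagging. The abstract does not specify the base learner, the disagreement measure, or the pair representation, so everything here is an assumption made for illustration: each record pair is a tuple of similarity scores, the base learner is a one-feature threshold stump, disagreement is the size of the minority vote, and `oracle` is a hypothetical stand-in for the human labeler.

```python
import random

# Assumption: a record pair is a tuple of similarity scores in [0, 1];
# a label is 1 (duplicate) or 0 (non-duplicate).

def train_stump(data):
    """Fit a one-feature threshold classifier (an illustrative weak learner)."""
    best_errors, best_f, best_t = len(data) + 1, 0, float("inf")
    for f in range(len(data[0][0])):
        # Candidate thresholds: every observed value, plus "never a duplicate".
        thresholds = sorted({x[f] for x, _ in data}) + [float("inf")]
        for t in thresholds:
            errors = sum((x[f] >= t) != bool(y) for x, y in data)
            if errors < best_errors:
                best_errors, best_f, best_t = errors, f, t
    return lambda x: int(x[best_f] >= best_t)


def bagged_committee(data, size, rng):
    """Train `size` models, each on a bootstrap resample of the same training set."""
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(size)]


def disagreement(committee, x):
    """Minority-vote count: 0 for a unanimous committee, larger for an even split."""
    dup_votes = sum(model(x) for model in committee)
    return min(dup_votes, len(committee) - dup_votes)


def query_by_bagging(labeled, pool, oracle, rounds=5, batch=2, size=7, seed=0):
    """Grow the training set by querying the pairs the committee disputes most."""
    rng = random.Random(seed)
    labeled, pool = list(labeled), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        committee = bagged_committee(labeled, size, rng)
        # Most-disputed pairs first; ties keep their original pool order.
        pool.sort(key=lambda x: disagreement(committee, x), reverse=True)
        queried, pool = pool[:batch], pool[batch:]
        labeled.extend((x, oracle(x)) for x in queried)
    return labeled
```

A caller would seed `labeled` with a few hand-labeled pairs, pass the unlabeled candidate pairs (for example, those surviving a blocking step) as `pool`, and train a final model on the returned training set. Query by Boosting would follow the same loop with the committee produced by a boosting algorithm instead of bootstrap resampling.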


Bibliographic record

Author: Muhivuwomunda, Divine
Author affiliation: University of Ottawa (Canada)
Degree-granting institution: University of Ottawa (Canada)
Discipline: Computer Science
Degree: M.C.S.
Year: 2010
Pages: 99
Original format: PDF
Language: English

