Data De-Duplication through Active Learning.

Abstract

Data de-duplication concerns the identification and eventual elimination of records in a dataset that refer to the same entity without necessarily having the same attribute values or the same identifying values. Machine Learning techniques have been used to handle data de-duplication, and Active Learning with ensemble learning methods is one such technique. An ensemble learning algorithm creates a set of different models from the same training set. Active Learning then iteratively passes unlabeled pairs of records to these models for labeling as duplicates or non-duplicates, and selects the pairs that cause the most disagreement among the models. The selected pairs are considered to bring the most information gain to the learning process. Active Learning thus continuously teaches a learner to find duplicate instances by providing it with a better training set.

The experimental results show that Active Learning using Query by Bagging performs well on synthetic datasets and requires only a few iterations to generate a good de-duplication function. The size of the dataset does not seem to have much effect on the results. On real-world data, Active Learning using Query by Bagging still performs well, except when the dataset contains a significant amount of noise. However, the learning process on real-world data is not as smooth as on synthetic data. Performance was evaluated with the Canopy Clustering and Bigram Indexing blocking methods, and Bigram Indexing gave the better results.

Active Learning using Query by Boosting shows good performance on synthetic datasets. It also generates good results on real-world datasets, although noise in the dataset negatively affects the learning process. Again, the dataset size does not affect performance when Query by Boosting is used. The evaluation of the de-duplication function using Canopy Clustering and Bigram Indexing does not show any significant difference between the two.

We further compare the performance of Query by Bagging with that of Query by Boosting. When the two methods are compared under the two blocking methods, the experiments show that Query by Boosting yields better results for both Canopy Clustering and Bigram Indexing. The same observation holds when synthetic and real-world data are considered.

This thesis evaluates how Active Learning performs the task of data de-duplication when the Query by Bagging and Query by Boosting algorithms are used. During the evaluation, we investigate the performance of Active Learning in various situations. We study the impact of varying the data size as well as the impact of using different blocking methods, which reduce the number of candidate record pairs that must be compared. We also consider the performance of Active Learning on a synthetic dataset versus a real-world dataset.
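The selection step described in the abstract (train several models on the same labeled set, then query the record pair they disagree on most) can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes each record pair has already been reduced to a tuple of similarity scores in [0, 1], uses one-feature threshold classifiers (decision stumps) as the base learner, and measures disagreement by the size of the minority vote. The names `train_stump`, `bagging_committee`, `disagreement`, `select_query`, `active_learn`, and `oracle` are all hypothetical.

```python
import random


def train_stump(samples):
    """Fit a decision stump on (features, is_duplicate) samples.

    Assumption: higher similarity means "more likely a duplicate", so the
    stump predicts True when the chosen feature is >= the chosen threshold.
    """
    best = None
    for f in range(len(samples[0][0])):
        for x, _ in samples:
            t = x[f]
            errors = sum((xi[f] >= t) != yi for xi, yi in samples)
            if best is None or errors < best[0]:
                best = (errors, f, t)
    _, f, t = best
    return lambda x: x[f] >= t


def bagging_committee(labeled, n_models, rng):
    """Train n_models stumps, each on a bootstrap resample of the labeled set."""
    committee = []
    for _ in range(n_models):
        boot = [rng.choice(labeled) for _ in labeled]
        committee.append(train_stump(boot))
    return committee


def disagreement(committee, x):
    """Size of the minority vote: 0 means the committee is unanimous."""
    votes = sum(model(x) for model in committee)
    return min(votes, len(committee) - votes)


def select_query(committee, unlabeled):
    """Pick the unlabeled pair on which the committee disagrees most."""
    return max(unlabeled, key=lambda x: disagreement(committee, x))


def active_learn(labeled, unlabeled, oracle, rounds, n_models=5, seed=0):
    """Repeatedly query the most contested pair and add its label."""
    rng = random.Random(seed)
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        committee = bagging_committee(labeled, n_models, rng)
        query = select_query(committee, unlabeled)
        unlabeled.remove(query)
        # The oracle stands in for a human who labels the queried pair.
        labeled.append((query, oracle(query)))
    return bagging_committee(labeled, n_models, rng), labeled
```

In this sketch, `oracle` is any callable that returns `True` for a duplicate pair. A Query by Boosting variant would replace the bootstrap resampling with a boosted ensemble while keeping the same disagreement-based selection.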

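Bibliographic record

  • Author

    Muhivuwomunda, Divine

  • Author affiliation

    University of Ottawa (Canada)

  • Degree-granting institution: University of Ottawa (Canada)
  • Discipline: Computer Science
  • Degree: M.C.S.
  • Year: 2010
  • Pages: 99 p.
  • Total pages: 99
  • Original format: PDF
  • Language of text: English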