Journal: International Journal of Computer Trends and Technology

An Improving Genetic Programming Approach Based Deduplication Using KFINDMR


Abstract

Record deduplication is the task of identifying, in a data repository, records that refer to the same real-world entity or object in spite of misspelled words, typos, different writing styles, or even different schema representations or data types. The existing system provides an Unsupervised Duplication Detection (UDD) method that can be used to identify and remove duplicate records from different data sources. Starting from the non-duplicate set, two cooperating classifiers, a Weighted Component Similarity Summing (WCSS) classifier and a Support Vector Machine (SVM), are used to iteratively identify duplicate records among the non-duplicate records. A genetic programming (GP) approach to record deduplication has also been presented, and this GP-based approach is able to automatically find effective deduplication functions. Because the genetic programming approach is time-consuming, we propose a new algorithm, KFINDMR (KFIND using Most Represented data samples), which finds the most represented data samples in order to improve the accuracy of the classifier. The proposed system calculates the mean value of the most represented data samples as the centroid of the record members, then selects the first most represented data sample that is closest to this mean value, that is, the one at minimum distance from it. The system removes the duplicate data samples and finds an optimized solution for the deduplication of records or data samples.
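
The selection step described in the abstract (compute the mean of a group of samples, then keep the first sample at minimum distance from that mean) can be sketched as follows. This is a minimal illustration, not the authors' KFINDMR implementation: the abstract does not specify the distance measure or how record members are grouped, so Euclidean distance, numeric feature tuples, and the function names `centroid`, `most_representative`, and `select_representatives` are all assumptions made for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(samples):
    """Component-wise mean of equal-length numeric tuples (assumed representation)."""
    n = len(samples)
    return tuple(sum(column) / n for column in zip(*samples))


def most_representative(samples):
    """Return the first sample with minimum distance to the centroid.

    Euclidean distance is an assumption; the abstract does not name a metric.
    min() keeps the earliest sample when several are equally close.
    """
    center = centroid(samples)
    return min(samples, key=lambda s: dist(s, center))


def select_representatives(groups):
    """Pick one representative per group of record members (hypothetical helper)."""
    return [most_representative(members) for members in groups]


if __name__ == "__main__":
    group = [(0, 0), (1, 1), (2, 2), (10, 10)]
    # Centroid is (3.25, 3.25); the closest sample is (2, 2).
    print(most_representative(group))
```

Keeping one representative per group reduces the number of samples a downstream classifier has to process, which is the motivation the abstract gives for the approach.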
