Journal: International Journal on Computer Science and Engineering

Detecting Duplicates and near Duplicates Records in Large Datasets


Abstract

The rapid growth in data volumes and the need to integrate data from heterogeneous sources bring to the fore the challenge of efficiently detecting duplicate records in databases. Because the data sources are autonomous and mutually inconsistent, each may adopt its own conventions, and integrating data from different sources often introduces erroneous redundancy. To ensure high-quality data, the database must validate and filter incoming data from external sources. In this regard, data normalization has become necessary to maintain the quality of the data stored in these databases. The process of identifying record pairs that represent the same entity is commonly known as duplicate record detection, and it is one of the most important tasks in data cleansing. The proposed work suggests an approach to improving the accuracy of duplicate record detection which, when combined with two other concepts, text similarity and edit distance, yields well-filtered data. The implementation trials used data from a Scholarship Portal developed for various organizations. Because the portal has several kinds of reservation/quota mechanisms, there was a pressing need to identify such records as completely as possible while ensuring that genuine students are not debarred from receiving scholarships.
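
To make the two concepts named in the abstract concrete, here is a minimal sketch of how edit distance and a text-similarity score might be combined to flag near-duplicate records. It is an illustration under stated assumptions, not the authors' method. The function names, the whitespace/case normalization step, the length-normalized similarity formula, and the 0.85 threshold are all hypothetical choices made for this example.

```python
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(text.lower().split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


def find_near_duplicates(records, threshold=0.85):
    """Return (i, j, score) for every record pair scoring at or above threshold.

    The threshold is an assumed value; it would need tuning on real data.
    """
    pairs = []
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        score = similarity(r1, r2)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    # Made-up applicant names, used only to exercise the functions.
    applicants = ["Asha Verma", "asha  verma", "Asha Varma", "Rohit Singh"]
    for i, j, score in find_near_duplicates(applicants):
        print(applicants[i], "|", applicants[j], "|", round(score, 2))
```

The pairwise loop compares every record with every other record, so its cost grows quadratically with the number of records. It is meant only to show how the two measures fit together, not as a way to process a large dataset.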
