Journal: International journal on digital libraries

An n-gram-based approach for detecting approximately duplicate database records


Abstract

Detecting and eliminating duplicate records is one of the major tasks for improving data quality. The task, however, is not as trivial as it seems since various errors, such as character insertion, deletion, transposition, substitution, and word switching, are often present in real-world databases. This paper presents an n-gram-based approach for detecting duplicate records in large databases. Using the approach, records are first mapped to numbers based on the n-grams of their field values. The obtained numbers are then clustered, and records within a cluster are taken as potential duplicate records. Finally, record comparisons are performed within clusters to identify true duplicate records. The unique feature of this method is that it does not require preprocessing to correct syntactic or typographical errors in the source data in order to achieve high accuracy. Moreover, sorting the source data file is unnecessary. Only a fixed number of database scans is required. Therefore, compared with previous methods, the algorithm is more time efficient.
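
The three stages described in the abstract (map records to numbers, cluster the numbers, compare records within each cluster) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the abstract does not specify the mapping function, the clustering method, or the comparison measure, so the choices here are assumptions. The hypothetical `record_number` sums character codes over all n-grams, `cluster_numbers` groups numbers separated by at most `max_gap`, and `dice` compares n-gram multisets with the Dice coefficient. All function names, parameters, and the sample records are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def ngrams(text, n=2):
    """Return the character n-grams of a lower-cased field value."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def record_grams(record, n=2):
    """Multiset of n-grams over all field values of a record."""
    grams = Counter()
    for field in record:
        grams.update(ngrams(field, n))
    return grams


def record_number(record, n=2):
    """Map a record to one number.

    Assumed scheme (not from the paper): sum the character codes of every
    n-gram, so a transposition changes the number little or not at all.
    """
    return sum(ord(c) for field in record for g in ngrams(field, n) for c in g)


def cluster_numbers(numbers, max_gap=50):
    """Group record indices whose numbers lie close together.

    Assumed method: order the numbers (not the source file) and start a
    new cluster whenever the gap to the previous number exceeds max_gap.
    """
    order = sorted(range(len(numbers)), key=lambda i: numbers[i])
    clusters, current = [], []
    for i in order:
        if current and numbers[i] - numbers[current[-1]] > max_gap:
            clusters.append(current)
            current = []
        current.append(i)
    if current:
        clusters.append(current)
    return clusters


def dice(a, b):
    """Dice coefficient between two n-gram multisets (Counters)."""
    overlap = sum((a & b).values())
    total = sum(a.values()) + sum(b.values())
    return 2 * overlap / total if total else 1.0


def find_duplicates(records, n=2, max_gap=50, threshold=0.7):
    """Return index pairs judged duplicates, comparing only within clusters."""
    numbers = [record_number(r, n) for r in records]
    grams = [record_grams(r, n) for r in records]
    pairs = []
    for cluster in cluster_numbers(numbers, max_gap):
        for i, j in combinations(sorted(cluster), 2):
            if dice(grams[i], grams[j]) >= threshold:
                pairs.append((i, j))
    return pairs


records = [
    ("john smith", "boston"),
    ("jonh smith", "boston"),            # transposition error
    ("alexandra richardson", "chicago"),
]
print(find_duplicates(records))
```

In this sketch the first two records map to the same number despite the transposed letters, so they land in one cluster and are compared directly, while the third record is never compared against them. The values of `max_gap` and `threshold` are arbitrary and would need tuning on real data.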
