An n-gram-based approach for detecting approximately duplicate database records

Zengping Tian; Hongjun Lu; Wenyun Ji; Aoying Zhou; Zhong Tian

Abstract

Detecting and eliminating duplicate records is one of the major tasks for improving data quality. The task, however, is not as trivial as it seems since various errors, such as character insertion, deletion, transposition, substitution, and word switching, are often present in real-world databases. This paper presents an n-gram-based approach for detecting duplicate records in large databases. Using the approach, records are first mapped to numbers based on the n-grams of their field values. The obtained numbers are then clustered, and records within a cluster are taken as potential duplicate records. Finally, record comparisons are performed within clusters to identify true duplicate records. The unique feature of this method is that it does not require preprocessing to correct syntactic or typographical errors in the source data in order to achieve high accuracy. Moreover, sorting the source data file is unnecessary. Only a fixed number of database scans is required. Therefore, compared with previous methods, the algorithm is more time efficient.
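
The three stages named in the abstract (map records to numbers, cluster the numbers, compare records within each cluster) can be sketched in Python as follows. The abstract does not give the actual mapping, clustering rule, or comparison function, so every concrete choice here is an assumption made for illustration. `record_number` sums the character codes of each field's bigrams, `cluster_numbers` groups numbers separated by at most `max_gap`, and `find_duplicates` confirms pairs with Levenshtein edit distance. Edit distance appears among the paper's keywords, but how the paper uses it is not stated. All function names, parameters, and thresholds are hypothetical. The sketch sorts the derived numbers in memory, not the source data file.

```python
from itertools import combinations


def ngrams(text, n=2):
    """Character n-grams of text, lowercased with whitespace removed."""
    s = "".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def record_number(fields, n=2):
    """Assumed mapping (not the paper's): sum the character codes of
    every n-gram of every field, so small edits shift the number slightly."""
    return sum(ord(c) for f in fields for g in ngrams(f, n) for c in g)


def cluster_numbers(numbers, max_gap):
    """Group record indices whose numbers, in ascending order, are
    separated from their neighbour by at most max_gap (assumed rule)."""
    if not numbers:
        return []
    order = sorted(range(len(numbers)), key=numbers.__getitem__)
    clusters, current = [], [order[0]]
    for prev, idx in zip(order, order[1:]):
        if numbers[idx] - numbers[prev] <= max_gap:
            current.append(idx)
        else:
            clusters.append(current)
            current = [idx]
    clusters.append(current)
    return clusters


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def find_duplicates(records, n=2, max_gap=250, max_dist=2):
    """Return index pairs that share a cluster and whose joined field
    values are within max_dist edits of each other."""
    numbers = [record_number(r, n) for r in records]
    pairs = []
    for cluster in cluster_numbers(numbers, max_gap):
        for i, j in combinations(sorted(cluster), 2):
            a = " ".join(records[i]).lower()
            b = " ".join(records[j]).lower()
            if edit_distance(a, b) <= max_dist:
                pairs.append((i, j))
    return pairs


# Made-up sample data: records 1 and 3 are misspelled copies of record 0.
records = [
    ("John Smith", "Shanghai"),
    ("Jon Smith", "Shanghai"),      # character deletion
    ("Alice Wong", "Beijing"),
    ("John Smiht", "Shanghai"),     # character transposition
    ("Bartholomew Fitzgerald", "San Francisco"),
]
print(find_duplicates(records))  # [(0, 1), (0, 3)]
```

In this sample, an unrelated record can land in the same cluster as the misspelled ones because the assumed mapping is coarse. The within-cluster comparison is what rejects it, which mirrors the abstract's distinction between potential and true duplicates.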

Bibliographic details

Source: International journal on digital libraries, 2002, Issue 4, pp. 325-331 (7 pages)
Authors: Zengping Tian; Hongjun Lu; Wenyun Ji; Aoying Zhou; Zhong Tian
Affiliation: Department of Computer Science, Fudan University, Shanghai, 200433, P.R. China
Original format: PDF
Language: English
Chinese Library Classification: library science and librarianship; computing and computer technology
Keywords: duplicate elimination; n-gram; edit distance; data quality
Date indexed: 2022-08-18 02:09:35

Similar literature

1. Detecting dispersed duplications in high-throughput sequencing data using a database-free approach [J]. Kroon M., Lameijer E. W., Lakenberg N. Bioinformatics, 2016, Issue 4.
2. A Unified Approach to Detect the Record Duplication Using BAT Algorithm and Fuzzy Classifier for Health Informatics [J]. Senthilkumar P., Vanitha N. Suthanthira. Journal of Medical Imaging and Health Informatics, 2015, Issue 6.
3. Detecting Duplicates and near Duplicates Records in Large Datasets [J]. Shailesh Singh, Syed Imtiyaz Hassan. International Journal on Computer Science and Engineering, 2017, Issue 5.
4. Detecting Approximately Duplicate Records in Database [C]. Xingrui Liu, Lijun Xu. International conference on information engineering and applications, 2013.
5. Electronic Documentation Support Tools and Text Duplication in the Electronic Medical Record [D]. Wrenn, Jesse. 2010.
6. Sarcopenia frailty and cachexia patients detected in a multisystem electronic health record database [O]. Ranjani N. Moorthi, Ziyue Liu, Sarah A. El-Azab, 2020.
7. Detecting duplicate bug report using character n-gram-based features [O]. Ashish Sureka, Pankaj Jalote. 2014.
8. Detecting relationships between the interannual variability in climate records and ecological time series using a multivariate statistical approach - four case studies for the North Sea region [R]. Heyen, H. 1998.
