Duplicate Record Detection for Database Cleansing

Abstract

Many organizations collect large amounts of data to support their business and decision-making processes. Data collected from various sources may have quality problems, and these become prominent when databases are integrated: the integrated database inherits the data quality problems present in the source databases. The data in integrated systems must therefore be cleaned to support sound decision making, which makes data cleansing one of the most crucial steps. This research focuses on one of the major issues of data cleansing, "duplicate record detection", which arises when data is collected from various sources. The study provides a comparison among the standard duplicate elimination algorithm (SDE), the sorted neighborhood algorithm (SNA), the duplicate elimination sorted neighborhood algorithm (DE-SNA), and the adaptive duplicate detection algorithm (ADD). A prototype was also developed, which shows that the adaptive duplicate detection algorithm is the optimal solution to the problem of duplicate record detection. For approximate matching of data records, two string matching algorithms (a recursive algorithm with word base and a recursive algorithm with character base) were implemented, and it is concluded that the results are much better with the recursive algorithm with word base.
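The abstract names the sorted neighborhood approach and contrasts word-based with character-based matching without giving details. The following is a minimal sketch of the general sorted neighborhood idea (sort records by a key, then compare only records that fall within a sliding window), not the paper's implementation. The function names, the sorting key, the window size, the threshold, and both similarity measures (Jaccard overlap of word sets, and `difflib.SequenceMatcher` on characters) are assumptions chosen for illustration; the paper's recursive matching algorithms are not reproduced here.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Assumed word-level measure: Jaccard overlap of lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def char_similarity(a: str, b: str) -> float:
    """Assumed character-level measure: difflib's matching-block ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sorted_neighborhood(records, key, window=3, threshold=0.5,
                        similarity=word_similarity):
    """Return index pairs of likely duplicates.

    Records are sorted by `key`; each record is then compared only with
    the next `window - 1` records in sorted order, instead of with every
    other record.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample data, invented for this sketch.
records = [
    "John Smith 12 Main Street",
    "Maria Lopez 5 Oak Avenue",
    "Smith John 12 Main Street",
    "John Smith 12 Main St",
]


def sort_key(record: str) -> str:
    # Hypothetical key: the record's words in alphabetical order.
    return " ".join(sorted(record.lower().split()))


print(sorted_neighborhood(records, sort_key))
# → [(0, 2), (0, 3), (2, 3)]
```

In this toy data, records 0 and 2 differ only in word order, so the word-level measure scores them as identical while the character-level measure scores them lower. That is one situation in which a word-based comparison can help; whether it explains the result reported in the abstract is not stated in the source.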

Bibliographic details

Source: Machine Vision, 2009. ICMV '09 | 2009 | pp. 333-338 | 6 pages
Conference location: Dubai (AE)
Authors: Rehman Mariam; Esichaikul Vatcharapon
Original format: PDF
