Machine Vision, 2009. ICMV '09

Duplicate Record Detection for Database Cleansing


Abstract

Many organizations collect large amounts of data to support their business and decision-making processes. Data collected from various sources may have quality problems, and these problems become prominent when multiple databases are integrated: the integrated database inherits the data quality problems that were present in the source databases. The data in integrated systems must be cleaned before it can support proper decision making, which makes data cleansing one of the most crucial steps. This research focuses on one of the major issues in data cleansing, "duplicate record detection", which arises when data is collected from various sources. The study provides a comparison of the standard duplicate elimination algorithm (SDE), the sorted neighborhood algorithm (SNA), the duplicate elimination sorted neighborhood algorithm (DE-SNA), and the adaptive duplicate detection algorithm (ADD). A prototype was also developed, which shows that the adaptive duplicate detection algorithm is the optimal solution for the problem of duplicate record detection. For approximate matching of data records, two string matching algorithms (a recursive algorithm with word base and a recursive algorithm with character base) were implemented, and it is concluded that the results are much better with the recursive algorithm with word base.
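
As an illustration of the sorted neighborhood idea named in the abstract, the sketch below follows the general textbook form of the method: build a sorting key for each record, sort by that key, and compare each record only with its neighbors inside a small sliding window. It is not the paper's implementation. The function names, the sample records, the window size, and the threshold are all hypothetical, and the word-level Jaccard overlap used for matching is a simple stand-in for approximate matching on words, not the recursive word-base algorithm the authors evaluated.

```python
import re


def word_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word tokens (illustrative stand-in measure)."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def sort_key(record: str) -> str:
    """Hypothetical sorting key: the record's words, lowercased and sorted."""
    return " ".join(sorted(record.lower().split()))


def sorted_neighborhood(records, key=sort_key, window=3, threshold=0.6):
    """Return index pairs (i, j), i < j, of records flagged as likely duplicates.

    Each record is compared only with the next `window - 1` records
    in key-sorted order, instead of with every other record.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if word_similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Made-up sample data: records 0 and 2 describe the same person.
records = [
    "John Smith 12 Main Street Boston",
    "Alice Wong 9 Lake Road Austin",
    "Smith John 12 Main Street Boston",
    "Bob Stone 4 Hill Avenue Denver",
]

duplicates = sorted_neighborhood(records)
print(duplicates)  # [(0, 2)]
```

The window bounds the number of comparisons per record, so the cost is dominated by the sort rather than by comparing all pairs. The trade-off is that true duplicates whose keys sort far apart fall outside the window and are missed, which is why the choice of key matters.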
