Journal: Arabian Journal for Science and Engineering

Improved Duplicate Record Detection Using ASCII Code Q-gram Indexing Technique


Abstract

Duplicate record detection (DRD) aims to reduce duplicate records in databases and so helps preserve data integrity. Its role is to identify records that represent the same entity, whether within a single database or across different databases. A variety of indexing techniques have been proposed to support DRD, and Q-gram indexing is one of the most common. This paper introduces a modification to the Q-gram indexing technique that improves the performance of the duplicate detection process and reduces both the running time and the number of comparisons. To simplify the back-end computations, Q-gram strings are converted into numeric values using their corresponding ASCII codes. Indexing on these numeric values reduces the complexity of Q-gram comparisons and speeds up the DRD process as a whole. Unlike existing approaches, the proposed technique is easier to implement and requires less memory. Two further variations of the proposed technique are introduced to decrease the matching time: the first uses a range for matching, while the second sorts words alphabetically inside blocks. According to the experimental results, the three proposed techniques run much faster than the current Q-gram technique and are almost as accurate, meaning that they can be used for DRD on large databases.
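The idea of indexing on numeric Q-gram values can be illustrated with a minimal Python sketch. It is not the paper's implementation. The abstract does not say how the ASCII codes of a Q-gram are combined into one number, so the sketch assumes they are read as base-256 digits. The function names (`qgrams`, `qgram_code`, `build_index`, `candidate_pairs`), the choice of q = 2, and the `min_shared` threshold are all hypothetical, as are the sample records.

```python
from collections import defaultdict


def qgrams(text: str, q: int = 2) -> list[str]:
    """Return the overlapping substrings of length q in text."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def qgram_code(gram: str) -> int:
    """Map a q-gram to one integer built from its ASCII codes.

    Assumption: codes are treated as base-256 digits; the abstract
    does not state the actual encoding. Non-ASCII input raises an error.
    """
    return int.from_bytes(gram.encode("ascii"), "big")


def build_index(records: dict[int, str], q: int = 2) -> dict[int, set[int]]:
    """Inverted index: numeric q-gram code -> ids of records containing it."""
    index = defaultdict(set)
    for rec_id, value in records.items():
        for gram in qgrams(value.lower(), q):
            index[qgram_code(gram)].add(rec_id)
    return index


def candidate_pairs(index: dict[int, set[int]], min_shared: int = 2) -> set[tuple[int, int]]:
    """Record pairs that share at least min_shared distinct q-gram codes."""
    shared = defaultdict(int)
    for ids in index.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


# Hypothetical sample data: two spellings of one surname and an unrelated one.
records = {1: "smith", 2: "smyth", 3: "jones"}
index = build_index(records)
pairs = candidate_pairs(index)
print(pairs)
```

Here "smith" and "smyth" share the bigrams "sm" and "th", so they come out as the only candidate pair, and the lookups behind that result compare integers rather than strings. Only candidate pairs would then go on to a full record comparison.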
