OPTIMIZED SUBSET PROCESSING FOR DE-DUPLICATION

Abstract

Some embodiments of the present invention include a method for identifying duplicate records from a group of records in a database system. The method includes: generating a cluster of records from the group of records based on one or more keys; splitting the cluster of records into multiple subsets of records when the number of records in the cluster exceeds a threshold, each subset having fewer records than the cluster; causing duplicate sets of records to be identified in each of the subsets, wherein a duplicate set of records includes one or more records and, when it includes two or more records, those records are duplicates of one another; merging all of the duplicate sets of records identified from the multiple subsets to form a first group of duplicate sets of records; and forming a representative set of records by selecting a representative record from each of the duplicate sets in the first group.
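The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the fixed-size chunking used for splitting, the pairwise `is_duplicate` predicate, and the "first record wins" choice of representative are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import defaultdict
from typing import Callable, Hashable

Record = dict  # hypothetical record shape; the abstract does not define one


def cluster_by_key(records: list[Record],
                   key_fn: Callable[[Record], Hashable]) -> list[list[Record]]:
    """Group records into clusters that share the same key."""
    clusters: dict[Hashable, list[Record]] = defaultdict(list)
    for rec in records:
        clusters[key_fn(rec)].append(rec)
    return list(clusters.values())


def split_cluster(cluster: list[Record], threshold: int) -> list[list[Record]]:
    """Split a cluster into smaller subsets only if it exceeds the threshold.

    Assumption: subsets are consecutive chunks of at most `threshold` records.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    if len(cluster) <= threshold:
        return [cluster]
    return [cluster[i:i + threshold] for i in range(0, len(cluster), threshold)]


def find_duplicate_sets(subset: list[Record],
                        is_duplicate: Callable[[Record, Record], bool]
                        ) -> list[list[Record]]:
    """Partition a subset into duplicate sets (a set may hold a single record).

    Assumption: a record joins a set if it matches that set's first record.
    """
    dup_sets: list[list[Record]] = []
    for rec in subset:
        for dup_set in dup_sets:
            if is_duplicate(dup_set[0], rec):
                dup_set.append(rec)
                break
        else:
            dup_sets.append([rec])
    return dup_sets


def dedupe_cluster(cluster: list[Record], threshold: int,
                   is_duplicate: Callable[[Record, Record], bool]
                   ) -> tuple[list[list[Record]], list[Record]]:
    """Return (first group of duplicate sets, representative set) for a cluster."""
    first_group: list[list[Record]] = []
    for subset in split_cluster(cluster, threshold):
        # Merge the duplicate sets found in every subset into one group.
        first_group.extend(find_duplicate_sets(subset, is_duplicate))
    # Assumption: the representative is simply the first record of each set.
    representatives = [dup_set[0] for dup_set in first_group]
    return first_group, representatives


records = [
    {"id": 1, "name": "Ann Lee", "email": "ann@x.com"},
    {"id": 2, "name": "Ann Lee", "email": "ANN@x.com"},
    {"id": 3, "name": "Al Roe", "email": "al@x.com"},
    {"id": 4, "name": "Ann L.", "email": "ann@x.com"},
    {"id": 5, "name": "Bob", "email": "bob@x.com"},
]
same_email = lambda a, b: a["email"].lower() == b["email"].lower()
clusters = cluster_by_key(records, key_fn=lambda r: r["name"][0].lower())
group, reps = dedupe_cluster(clusters[0], threshold=2, is_duplicate=same_email)
print([[r["id"] for r in s] for s in group])  # [[1, 2], [3], [4]]
print([r["id"] for r in reps])                # [1, 3, 4]
```

In this example records 1 and 4 are duplicates that landed in different subsets, so both survive as representatives. The representative set is much smaller than the original cluster, which makes a further comparison over it cheap, but how the representative set is used afterwards is not stated in the abstract.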
