Detecting duplicate records in database

Abstract

The invention concerns the detection of duplicate tuples in a database. Earlier domain-independent approaches to duplicate detection relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. Such prior-art approaches, however, produce large numbers of false positives when they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, duplicate detection is implemented by interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key/foreign-key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high-quality, scalable duplicate detection process.
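The idea of using the table hierarchy as extra evidence can be illustrated with a small sketch. It is not the patented process, whose details the abstract does not give. It assumes a toy two-level hierarchy (city rows carrying a foreign key to a state), and it combines a plain textual similarity with the overlap of the child sets reached through that foreign key. The table contents, the function names, the equal weighting, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical toy dimension table: (city, state) rows, where the state
# column plays the role of a foreign key into a parent "state" table.
CITY_TO_STATE = [
    ("Seattle", "WA"),
    ("Redmond", "WA"),
    ("Seattle", "Washington"),
    ("Redmond", "Washington"),
    ("Richmond", "VA"),
    ("Norfolk", "VA"),
]


def children_of(parent, rows):
    """Return the set of child values that reference `parent`."""
    return {child.lower() for child, p in rows if p == parent}


def text_sim(a, b):
    """Plain textual similarity in [0, 1] (stand-in for edit distance etc.)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def child_overlap(a, b, rows):
    """Jaccard overlap of the child sets of two parent values."""
    ca, cb = children_of(a, rows), children_of(b, rows)
    if not ca or not cb:
        return 0.0
    return len(ca & cb) / len(ca | cb)


def likely_duplicates(a, b, rows, threshold=0.5):
    """Assumed rule: average the two signals and compare to a threshold."""
    score = 0.5 * text_sim(a, b) + 0.5 * child_overlap(a, b, rows)
    return score >= threshold


if __name__ == "__main__":
    for pair in [("WA", "Washington"), ("WA", "VA")]:
        print(pair, likely_duplicates(*pair, CITY_TO_STATE))
```

In this toy data, "WA" and "Washington" look quite different as strings but reference the same cities, while "WA" and "VA" differ by one character yet share no cities. A purely textual measure would tend to rank the second pair as the closer match, which is the kind of false positive the abstract attributes to standard similarity functions.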

Bibliographic data

Publication number: US6961721B2

Publication date: 2005-11-01

Original format: PDF
Applicant/Assignee: SURAJIT CHAUDHURI; VENKATESH GANTI; ROHIT ANANTHAKRISHNA

Application number: US20020186031
Inventors: VENKATESH GANTI; ROHIT ANANTHAKRISHNA; SURAJIT CHAUDHURI

Filing date: 2002-06-28
Classification: G06F17/30; G06F7/00
Country: US
Record added: 2022-08-21 22:20:13
