
Primitive operator for similarity joins in data cleaning


Abstract

A set similarity join system and method are provided. The system can be employed to facilitate data cleaning based on similarities through the identification of “close” tuples (e.g., records and/or rows). “Closeness” can be evaluated using one or more similarity functions chosen to suit the domain and/or application. Thus, the system facilitates generic, domain-independent data cleaning. The system can be employed with a foundational primitive, the set similarity join (SSJoin) operator, which can be used as a building block to implement a broad variety of notions of similarity (e.g., edit similarity, Jaccard similarity, generalized edit similarity, Hamming distance, Soundex, etc.) as well as similarity based on co-occurrences. The SSJoin operator can exploit the observation that set overlap can be used effectively to support a variety of similarity functions. The SSJoin operator compares values based on “sets” associated with (or explicitly constructed for) each one of them.
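
The idea of comparing values through the overlap of their associated sets can be illustrated with a small sketch. The Python code below is not the patented SSJoin operator. It is a minimal illustration under stated assumptions: each value is turned into a set of character 3-grams, candidate pairs are found through a shared-token index, and a pair is kept when its Jaccard similarity meets a threshold. The names `qgrams`, `jaccard`, and `similarity_join`, the choice of 3-grams, and the threshold of 0.5 are all hypothetical and do not come from the source.

```python
from collections import defaultdict


def qgrams(value, q=3):
    """Build the set of lowercase character q-grams for a value.

    Hypothetical tokenizer: the abstract only says that a set is
    associated with (or constructed for) each value.
    """
    text = value.lower()
    if len(text) < q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the overlap divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_join(left, right, threshold=0.5, q=3):
    """Return (left_value, right_value, score) for pairs with score >= threshold.

    Assumption: Jaccard similarity over q-gram sets stands in for the
    similarity function; other functions could be plugged in instead.
    """
    # Index the right side by token so only pairs that share at least
    # one token (non-empty set overlap) are ever compared.
    right_sets = [qgrams(v, q) for v in right]
    index = defaultdict(set)
    for j, tokens in enumerate(right_sets):
        for token in tokens:
            index[token].add(j)

    results = []
    for value in left:
        tokens = qgrams(value, q)
        candidates = set()
        for token in tokens:
            candidates |= index.get(token, set())
        for j in sorted(candidates):
            score = jaccard(tokens, right_sets[j])
            if score >= threshold:
                results.append((value, right[j], score))
    return results


if __name__ == "__main__":
    pairs = similarity_join(
        ["Microsoft Corp", "Oracle"],
        ["Microsoft Corporation", "Orcale Inc"],
    )
    for left_value, right_value, score in pairs:
        print(left_value, "|", right_value, "|", round(score, 3))
```

In this sketch, "Microsoft Corp" and "Microsoft Corporation" share most of their 3-grams and are reported as close, while "Oracle" and "Orcale Inc" share none and are not. A transposition like that one would call for a different similarity function, which is the kind of choice the abstract leaves to the domain or application.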
