
Weighted Set Similarity: Queries and Updates



Abstract

Consider a universe of items, each of which is associated with a weight, and a database consisting of subsets of these items. Given a query set, a weighted set similarity query identifies either (i) all sets in the database whose normalized similarity to the query set is above a pre-specified threshold, or (ii) the sets in the database with the k highest similarity values to the query set. Weighted set similarity queries are useful in applications like data cleaning and integration for finding approximate matches in the presence of typographical mistakes, multiple formatting conventions, transformation errors, etc. We show that this problem has semantic properties that can be exploited to design index structures that support efficient algorithms for answering queries; these algorithms can achieve arbitrarily stronger pruning than the family of Threshold Algorithms. We describe how these index structures can be efficiently updated using lazy propagation in a way that gives strict guarantees on the quality of subsequent query answers. Finally, we illustrate that our proposed ideas work well in practice for real datasets.
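
To make the two query types concrete, the following is a minimal brute-force sketch, not the index-based method the abstract describes. The abstract does not name the normalized similarity measure, so weighted Jaccard (intersection weight divided by union weight) is assumed here purely for illustration. The function names, item weights, and the small `db` of tokenized records are all hypothetical.

```python
import heapq


def weighted_jaccard(a, b, weight):
    """Assumed measure: weight of the intersection over weight of the union."""
    union = sum(weight[x] for x in a | b)
    if union == 0:
        return 0.0
    return sum(weight[x] for x in a & b) / union


def threshold_query(db, query, weight, tau):
    """Query type (i): every set whose similarity to the query is above tau."""
    return [name for name, s in db.items()
            if weighted_jaccard(s, query, weight) > tau]


def top_k_query(db, query, weight, k):
    """Query type (ii): the k sets with the highest similarity to the query."""
    return heapq.nlargest(
        k, db, key=lambda name: weighted_jaccard(db[name], query, weight))


# Hypothetical item weights and database of token sets.
weight = {"main": 1.0, "st": 0.5, "street": 0.5, "maine": 2.0, "park": 1.5}
db = {
    "r1": frozenset({"main", "street"}),
    "r2": frozenset({"maine", "st"}),
    "r3": frozenset({"park", "street"}),
}
query = frozenset({"main", "st"})

above_threshold = threshold_query(db, query, weight, 0.4)
best_two = top_k_query(db, query, weight, 2)
```

This baseline scans every set in the database for each query. The index structures mentioned in the abstract aim to avoid that full scan by pruning sets that cannot qualify.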
