On Indexing Error-Tolerant Set Containment


Abstract

Prior work has identified set-based comparisons as a useful primitive for supporting a wide variety of similarity functions in record matching. Accordingly, various techniques have been proposed to improve the performance of set similarity lookups. However, this body of work focuses almost exclusively on symmetric notions of set similarity. In this paper, we study the indexing problem for the asymmetric Jaccard containment similarity function, an error-tolerant variation of set containment. We enhance this similarity function to also account for string transformations that reflect synonyms, such as "Bob" and "Robert" referring to the same first name. We propose an index structure that builds inverted lists on carefully chosen token-sets, and a lookup algorithm over this index that is sensitive to the output size of the query. Our experiments over real-life data sets show the benefits of our techniques. To our knowledge, this is the first paper that studies the indexing problem for Jaccard containment in the presence of string transformations.
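To make the asymmetry concrete, here is a minimal Python sketch of Jaccard containment and of a synonym-aware variant. It is an illustration under stated assumptions, not the paper's method: tokens are unweighted, the function names `jaccard_containment` and `containment_with_synonyms` are hypothetical, and the synonym-aware score is taken as the best score over all query variants produced by a small hand-written synonym table. The index structure and lookup algorithm from the abstract are not reproduced here.

```python
from itertools import product


def jaccard_containment(query, record):
    """Fraction of distinct query tokens that also occur in record: |q & r| / |q|.

    Assumption: unweighted tokens. An empty query scores 0.0 by convention.
    """
    q, r = set(query), set(record)
    if not q:
        return 0.0
    return len(q & r) / len(q)


def containment_with_synonyms(query, record, synonyms):
    """Best containment over query variants formed by swapping tokens for synonyms.

    `synonyms` is a hypothetical table mapping a token to its alternatives.
    Every combination is enumerated, so this is only practical for short queries.
    """
    choices = [[tok] + list(synonyms.get(tok, [])) for tok in query]
    return max(jaccard_containment(variant, record) for variant in product(*choices))


query = ["bob", "smith", "seattle"]
record = ["robert", "smith", "seattle", "wa"]

plain = jaccard_containment(query, record)      # 2 of 3 query tokens are found
reverse = jaccard_containment(record, query)    # 2 of 4 record tokens are found
with_syn = containment_with_synonyms(query, record, {"bob": ["robert"]})
```

Swapping the arguments changes the score, which is what distinguishes containment from symmetric Jaccard similarity. With the rule "bob" → "robert" applied, the query is fully contained in the record.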
