33rd International Conference on Very Large Data Bases (VLDB 2007)

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams

Abstract

Many applications need to solve the following approximate string matching problem: given a collection of strings, find those similar to a given string, or to the strings in another (possibly the same) collection. Many algorithms for this problem use fixed-length grams, which are substrings of a string used as signatures to identify similar strings. In this paper we develop a novel technique, called VGRAM, to improve the performance of these algorithms. Its main idea is to judiciously choose high-quality grams of variable lengths from a collection of strings to support queries on the collection. We give a full specification of this technique, including how to select high-quality grams from the collection, how to generate variable-length grams for a string based on the preselected grams, and how the similarity of the gram sets of two strings relates to their edit distance. A primary advantage of the technique is that it can be adopted by many approximate string algorithms without substantial modification. We present extensive experiments on real data sets to evaluate the technique and show significant performance improvements on three existing algorithms.
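To make the fixed-length baseline concrete, here is a minimal Python sketch of grams used as signatures. It is not VGRAM itself: it only illustrates the classical count filter for fixed-length grams, which states that two strings within edit distance k share at least max(|s1|, |s2|) - q + 1 - k*q grams of length q (each edit operation can destroy at most q grams). The function names and the example strings are assumptions chosen for illustration and do not come from the paper.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list[str]:
    """Return all overlapping substrings of length q (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def common_gram_count(s1: str, s2: str, q: int) -> int:
    """Size of the multiset intersection of the two gram lists."""
    c1, c2 = Counter(qgrams(s1, q)), Counter(qgrams(s2, q))
    return sum((c1 & c2).values())


def count_filter_bound(s1: str, s2: str, q: int, k: int) -> int:
    """Classical lower bound on shared grams when edit distance <= k."""
    return max(len(s1), len(s2)) - q + 1 - k * q


def edit_distance(s1: str, s2: str) -> int:
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # substitute or match
        prev = curr
    return prev[-1]


def may_match(s1: str, s2: str, q: int, k: int) -> bool:
    """Cheap filter: False means the pair cannot be within distance k."""
    return common_gram_count(s1, s2, q) >= count_filter_bound(s1, s2, q, k)


if __name__ == "__main__":
    a, b = "kitten", "sitting"  # hypothetical example strings
    print(qgrams(a, 2), qgrams(b, 2))
    print(edit_distance(a, b), may_match(a, b, q=2, k=3))
```

A pair that fails `may_match` can be discarded without computing the edit distance; pairs that pass still need verification. The abstract suggests that VGRAM establishes a comparable relationship for variable-length grams, but the exact form of that relationship is given in the paper, not here.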
