Journal: Information Systems

Efficient processing of similarity search under time warping in sequence databases: an index-based approach



Abstract

This paper discusses the efficient processing of similarity search that supports time warping in large sequence databases. Time warping enables sequences with similar patterns to be found even when they are of different lengths. Prior methods for similarity search under time warping could not employ multi-dimensional indexes without risking false dismissal, because the time warping distance does not satisfy the triangle inequality. They have to scan the entire database and thus suffer serious performance degradation in large databases. Another method, which uses a suffix tree and does not assume any particular distance function, also performs poorly because of the large tree size. In this paper, we propose a novel method for similarity search that supports time warping. Our primary goal is to improve search performance in large databases without permitting any false dismissal. To attain this goal, we have devised a new distance function, D_(tw-lb), which consistently underestimates the time warping distance and satisfies the triangle inequality. D_(tw-lb) uses a 4-tuple feature vector that is extracted from each sequence and is invariant to time warping. For efficient processing of similarity search, we employ a multi-dimensional index that uses the 4-tuple feature vector as indexing attributes and D_(tw-lb) as a distance function. We prove that our method does not incur false dismissal. To verify the superiority of our method, we have performed extensive experiments. The results reveal that our method achieves a speedup of up to 43 times on a data set containing real-world S&P 500 stock data sequences, and up to 720 times on data sets containing a very large volume of synthetic data sequences. The performance gain increases (1) as the number of data sequences increases, (2) as the average length of data sequences increases, and (3) as the tolerance in a query decreases. Considering the characteristics of real databases, these tendencies imply that our approach is suitable for practical applications.
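
The abstract does not say which four features make up the feature vector, nor which base distance the time warping distance is built on. The Python sketch below is therefore only an illustration under stated assumptions: the 4-tuple is taken to be (first, last, greatest, smallest) element, the time warping distance uses the maximum element-wise difference along the warping path, and D_(tw-lb) is taken to be the max-norm distance between feature vectors. Under these assumptions the lower-bound property holds, because any warping path aligns the two first elements, the two last elements, and pairs each extreme value with some element of the other sequence. The function names and the linear scan that stands in for the multi-dimensional index are hypothetical.

```python
def features(seq):
    """Assumed 4-tuple: (first, last, greatest, smallest).
    Repeating elements (time warping) leaves all four values unchanged."""
    return (seq[0], seq[-1], max(seq), min(seq))


def d_tw_lb(s, q):
    """Assumed form of D_(tw-lb): max-norm distance between feature vectors.
    As a max-norm on 4-d points it satisfies the triangle inequality."""
    return max(abs(a - b) for a, b in zip(features(s), features(q)))


def d_tw(s, q):
    """Time warping distance with an assumed max-norm base distance:
    the smallest achievable value, over all warping paths, of the
    largest |s[i] - q[j]| on the path. O(len(s) * len(q)) dynamic program."""
    inf = float("inf")
    m = len(q)
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, len(s) + 1):
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - q[j - 1])
            best_prefix = min(prev[j], cur[j - 1], prev[j - 1])
            cur[j] = max(cost, best_prefix)
        prev = cur
    return prev[m]


def range_search(database, query, eps):
    """Filter-and-refine search. The filter step here is a linear scan;
    in an index-based setting it would be a range query on a
    multi-dimensional index over the feature vectors."""
    # Filter: discard sequences whose lower bound already exceeds eps.
    candidates = [s for s in database if d_tw_lb(s, query) <= eps]
    # Refine: confirm the survivors with the exact time warping distance.
    return [s for s in candidates if d_tw(s, query) <= eps]


if __name__ == "__main__":
    db = [[1, 2, 3, 3, 4], [1, 5, 3], [10, 11, 12]]
    print(range_search(db, [1, 2, 2, 3, 4, 4], eps=0.5))
```

Since the filter only discards a sequence when its lower bound exceeds the tolerance, no sequence within the tolerance under the exact distance is lost, which is the no-false-dismissal property the abstract claims for the authors' method. The refine step removes any false alarms that pass the filter.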
