Effective indexing and filtering for similarity search in large biosequence databases

Abstract

We present a multi-dimensional indexing approach for fast sequence similarity search in DNA and protein databases. In particular, we propose effective transformations of subsequences into numerical vector domains and build efficient index structures on the transformed vectors. We then define distance functions in the transformed domain and examine their properties. We experimentally compare the functions on (a) approximation quality for k-Nearest Neighbor (k-NN) queries, (b) pruning ability, and (c) approximation quality for ε-range queries. Results for k-NN queries, which we present here, show that our proposed distances FD2 and WD2 (i.e., the Frequency and Wavelet Distance functions for 2-grams) perform significantly better than the others. We then develop effective index structures, based on R-trees and scalar quantization, on top of the transformed vectors and distance functions. Promising results from experiments on real biosequence data sets are presented.
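To make the idea of transforming subsequences into a numerical vector domain concrete, the following minimal sketch maps a DNA string to a 16-dimensional vector of overlapping 2-gram counts and ranks candidate windows by distance in that domain. The abstract does not define FD2 or WD2, so the L1 distance used here is only an assumed stand-in, and the names `two_gram_vector`, `l1_distance`, and `knn` are hypothetical. The linear scan also stands in for the R-tree and scalar-quantization indexes the paper builds.

```python
from itertools import product


def two_gram_vector(seq, alphabet="ACGT"):
    """Map a sequence to counts of its overlapping 2-grams.

    The vector has len(alphabet) ** 2 dimensions, ordered lexicographically
    by 2-gram (AA, AC, AG, AT, CA, ...). 2-grams containing symbols outside
    the alphabet are skipped.
    """
    grams = ["".join(p) for p in product(alphabet, repeat=2)]
    index = {g: i for i, g in enumerate(grams)}
    vec = [0] * len(grams)
    for i in range(len(seq) - 1):
        gram = seq[i:i + 2]
        if gram in index:
            vec[index[gram]] += 1
    return vec


def l1_distance(u, v):
    """L1 distance between two count vectors.

    Assumption: this is a simple placeholder, not the paper's FD2 or WD2.
    """
    return sum(abs(a - b) for a, b in zip(u, v))


def knn(query, windows, k):
    """Return the k windows closest to the query in the transformed domain.

    A linear scan for illustration only; an index would avoid visiting
    every window.
    """
    qv = two_gram_vector(query)
    ranked = sorted(windows, key=lambda w: l1_distance(qv, two_gram_vector(w)))
    return ranked[:k]
```

For example, `knn("ACGT", ["ACGA", "TTTT", "ACGT"], 1)` ranks the exact match first, because its 2-gram vector is identical to the query's. Distances in the transformed domain only approximate sequence similarity, which is why the abstract evaluates approximation quality and pruning ability separately.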