Journal: Distributed and Parallel Databases

High-dimensional similarity searches using query driven dynamic quantization and distributed indexing

Abstract

The concept of similarity is the basis for many data exploration and data mining tasks. Nearest neighbor (NN) queries identify the most similar items or, in terms of distance, the points closest to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to distinguish significantly between the closest and furthest points, because a few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e., functions that only consider dimensions close to the query, quantize each dimension independently and compute similarity only for the dimensions where the query and the point fall into the same bin. These quantizations are query-agnostic, and a query-dependent quantization has the potential to improve accuracy. In this work we propose a query-dependent equi-depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time: localized scores are generated for the fraction p of points closest to the query, while a constant penalty is applied to the rest of the points. QED not only improves the quality of the distance metric, but also improves query-time performance by filtering out irrelevant data. We propose a distributed indexing and query algorithm to compute QED efficiently. Our experimental results show improvements in classification accuracy, as well as query performance up to one order of magnitude faster than Manhattan-based sequential-scan NN queries on datasets with hundreds of dimensions. Furthermore, similarity searches with QED show linear or better scalability in relation to the number of dimensions and the number of compute nodes.
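
The per-dimension scoring described in the abstract can be illustrated with a small single-machine sketch. The abstract does not give the exact score or penalty, so the choices here are assumptions: the localized score is the absolute per-dimension difference, and the constant penalty is the largest difference inside the query's bin. The function names `qed_distances` and `qed_knn` are hypothetical, and the distributed index is not modeled.

```python
import math

def qed_distances(points, query, p=0.25):
    """Sum per-dimension localized scores; a smaller total means more similar.

    Assumptions (not taken from the paper): the score is the absolute
    difference, and the penalty is the radius of the query's bin.
    """
    n = len(points)
    keep = max(1, math.ceil(p * n))  # bin depth: points kept per dimension
    totals = [0.0] * n
    for j in range(len(query)):
        # Distance to the query along dimension j only.
        diffs = [abs(pt[j] - query[j]) for pt in points]
        # Query-dependent equi-depth bin: the `keep` nearest points in this dimension.
        in_bin = sorted(range(n), key=lambda i: diffs[i])[:keep]
        penalty = diffs[in_bin[-1]]  # assumed constant penalty for points outside the bin
        inside = set(in_bin)
        for i in range(n):
            totals[i] += diffs[i] if i in inside else penalty
    return totals

def qed_knn(points, query, k=1, p=0.25):
    """Return the indices of the k points with the smallest QED total."""
    totals = qed_distances(points, query, p)
    return sorted(range(len(points)), key=lambda i: totals[i])[:k]

# Toy data: point 0 is close to the query in three dimensions but has one outlier value.
points = [
    [1.0, 1.0, 1.0, 50.0],
    [3.0, 3.0, 3.0, 3.0],
    [9.0, 9.0, 9.0, 0.0],
    [6.0, 6.0, 6.0, 6.0],
]
query = [0.0, 0.0, 0.0, 0.0]

manhattan = [sum(abs(a - b) for a, b in zip(pt, query)) for pt in points]
print(manhattan)                             # [53.0, 12.0, 27.0, 24.0]
print(qed_distances(points, query, p=0.5))   # [6.0, 12.0, 9.0, 12.0]
print(qed_knn(points, query, k=1, p=0.5))    # [0]
```

In this toy data the single outlier dimension pushes point 0 to the bottom of the Manhattan ranking, while the capped per-dimension contribution ranks it first. This sketch sorts every dimension at query time, so it does not reflect the filtering or indexing that the abstract credits for the speedup.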