首页> 美国卫生研究院文献>other >Distributed query-aware quantization for high-dimensional similarity searches
【2h】

Distributed query-aware quantization for high-dimensional similarity searches

机译:用于高维相似性搜索的分布式查询感知量化

代理获取
本网站仅为用户提供外文OA文献查询和代理获取服务,本网站没有原文。下单后我们将采用程序或人工为您竭诚获取高质量的原文,但由于OA文献来源多样且变更频繁,仍可能出现获取不到、文献不完整或与标题不符等情况,如果获取不到我们将提供退款服务。请知悉。

摘要

The concept of similarity is used as the basis for many data exploration and data mining tasks. Nearest Neighbor (NN) queries identify the most similar items, or in terms of distance the closest points to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to significantly distinguish between the closest and furthest points, as few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e. functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the points fall into the same bin. These quantizations are query-agnostic. There is potential to improve accuracy when a query-dependent quantization is used.In this paper we propose a Query dependent Equi-Depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time and localized scores are generated for the closest p fraction of the points while a constant penalty is applied for the rest of the points. QED not only improves the quality of the distance metric, but also improves query time performance by filtering out non relevant data. We propose a distributed indexing and query algorithm to efficiently compute QED. Our experimental results show improvements in classification accuracy as well as query performance up to one order of magnitude faster than Manhattan-based sequential scan NN queries over datasets with hundreds of dimensions.
机译:相似性的概念被用作许多数据探索和数据挖掘任务的基础。最近邻居(NN)查询标识最相似的项,或者根据距离来确定最接近点到查询点的距离。传统上,相似性是使用多维特征向量之间的距离函数来表征的。但是,当数据是高维数据时,传统的距离函数无法显着地区分最近的点和最远的点,因为很少有不同的维度主导距离函数。局部相似度函数,即仅考虑维度接近查询的函数,独立地量化每个维度,并且仅针对查询和点落入同一单元格的维度计算相似度。这些量化与查询无关。当使用依赖于查询的量化时,有可能提高准确性。本文提出了一种基于查询的等速(QED)动态量化方法,以改进高维相似性搜索。在查询时对每个维度进行量化,并针对最接近的p个分数点生成局部分数,同时对其余的分数应用恒定的惩罚。 QED不仅可以提高距离度量的质量,还可以通过过滤掉无关数据来提高查询时间性能。我们提出了一种分布式索引和查询算法来有效地计算QED。我们的实验结果表明,与基于曼哈顿的数百维数据集上的顺序扫描NN查询相比,分类准确性和查询性能提高了一个数量级。

著录项

  • 期刊名称 other
  • 作者单位
  • 年(卷),期 -1(2018),-1
  • 年度 -1
  • 页码 373–384
  • 总页数 39
  • 原文格式 PDF
  • 正文语种
  • 中图分类
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
代理获取

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号