...
首页> 外文期刊>Data & Knowledge Engineering >An approximate algorithm for top-k closest pairs join query in large high dimensional data
【24h】

An approximate algorithm for top-k closest pairs join query in large high dimensional data

机译:大型高维数据中前k个最接近对的联接查询的一种近似算法

获取原文
获取原文并翻译 | 示例
   

获取外文期刊封面封底 >>

       

摘要

In this paper we present a novel approximate algorithm to calculate the top-k closest pairs join query of two large and high dimensional data sets. The algorithm has worst case time complexity O(d~2nk) and space complexity O(nd) and guarantees a solution within a O(d~(1+1/t)) factor of the exact one, where t ∈ {1,2,..., ∞} denotes the Minkowski metrics L_t of interest and d the dimensionality. It makes use of the concept of space filling curve to establish an order between the points of the space and performs at most d + 1 sorts and scans of the two data sets. During a scan, each point from one data set is compared with its closest points, according to the space filling curve order, in the other data set and points whose contribution to the solution has already been analyzed are detected and eliminated. Experimental results on real and synthetic data sets show that our algorithm behaves as an exact algorithm in low dimensional spaces; it is able to prune the entire (or a considerable fraction of the) data set even for high dimensions if certain separation conditions are satisfied; in any case it returns a solution within a small error to the exact one.
机译:在本文中,我们提出了一种新颖的近似算法,用于计算两个高维数据集的前k个最接近的对联接查询。该算法具有最坏情况的时间复杂度O(d〜2nk)和空间复杂度O(nd),并保证在正整数的O(d〜(1 + 1 / t))因子内进行求解,其中t∈{1, 2,...,∞}表示感兴趣的Minkowski度量L_t,d表示维度。它利用空间填充曲线的概念来建立空间点之间的顺序,并最多对两个数据集执行d +1排序和扫描。在扫描期间,根据空间填充曲线顺序,将一个数据集中的每个点与其最接近的点进行比较,然后在另一数据集中检测并消除其对解决方案的影响已被分析的点。对真实和合成数据集的实验结果表明,我们的算法在低维空间中的行为与精确算法相同。如果满足某些分离条件,即使对于高维,它也可以修剪整个(或相当一部分)数据集;无论如何,它都会在很小的误差范围内将解决方案返回到准确的解决方案。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号