Journal: IEEE Transactions on Knowledge and Data Engineering

Similarity Search for Dynamic Data Streams



Abstract

Nearest neighbor search systems are an integral part of many online applications, including but not limited to pattern recognition, plagiarism detection, and recommender systems. As data sets grow ever larger, scalability has become an important issue. Many of the most space- and time-efficient algorithms are based on locality-sensitive hashing. Here, we view the data set as an n × |U| matrix in which each row corresponds to one of n users and the columns correspond to items drawn from a universe U. The de facto standard approach to quickly answering nearest neighbor queries on such a data set is usually a form of min-hashing. Not only is min-hashing very fast, but it is also space efficient and can be implemented in many computational models aimed at dealing with large data sets, such as MapReduce and streaming. However, a significant drawback is that min-hashing and related methods can only handle insertions to user profiles and tend to perform poorly when items may be removed. We initiate the study of scalable locality-sensitive hashing (LSH) for fully dynamic data streams. Specifically, using the Jaccard index as the similarity measure, we design (1) a collaborative filtering mechanism maintainable in dynamic data streams and (2) a sketching algorithm for similarity estimation. Our algorithms have little running-time overhead compared to previous LSH approaches in the insertion-only case, and drastically outperform previous algorithms when deletions occur.
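
For orientation, the following is a minimal sketch of classical insertion-only min-hashing for Jaccard estimation. It illustrates the baseline the abstract refers to, not the dynamic-stream algorithms of the paper. The signature length `K`, the salted BLAKE2b hash family, the function names, and the example profiles are all assumptions made for illustration.

```python
import hashlib

K = 256  # number of hash functions (assumed; trades accuracy for space)

def h(i: int, item: str) -> int:
    """i-th hash function, simulated by salting BLAKE2b with the index i."""
    digest = hashlib.blake2b(item.encode(), digest_size=8,
                             salt=i.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def empty_signature() -> list:
    return [float("inf")] * K

def insert(sig: list, item: str) -> None:
    """Streaming update: each slot keeps the minimum hash value seen so far."""
    for i in range(K):
        sig[i] = min(sig[i], h(i, item))

def signature(items) -> list:
    sig = empty_signature()
    for item in items:
        insert(sig, item)
    return sig

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of slots on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / K

def exact_jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# Two hypothetical user profiles sharing 40 of 120 distinct items.
alice = {f"item{j}" for j in range(0, 80)}
bob = {f"item{j}" for j in range(40, 120)}
exact = exact_jaccard(alice, bob)
approx = estimate_jaccard(signature(alice), signature(bob))
print(f"exact={exact:.3f} approx={approx:.3f}")
```

Note that `insert` only ever lowers a slot. Once an item that attained a minimum is removed from a profile, the signature alone no longer contains enough information to recover the new minimum, which is the limitation under deletions that the abstract describes.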
