Similarity Search for Dynamic Data Streams

Bury Marc; Schwiegelshohn Chris; Sorella Mara

Abstract

Nearest neighbor searching systems are an integral part of many online applications, including but not limited to pattern recognition, plagiarism detection, and recommender systems. As data sets grow ever larger, scalability has become an important issue. Many of the most space- and time-efficient algorithms are based on locality-sensitive hashing. Here, we view the data set as an n × |U| matrix in which each row corresponds to one of n users and the columns correspond to items drawn from a universe U. The de facto standard approach to quickly answering nearest neighbor queries on such a data set is usually a form of min-hashing. Not only is min-hashing very fast, but it is also space efficient and can be implemented in many computational models aimed at dealing with large data sets, such as MapReduce and streaming. However, a significant drawback is that min-hashing and related methods can only handle insertions to user profiles and tend to perform poorly when items may be removed. We initiate the study of scalable locality-sensitive hashing (LSH) for fully dynamic data streams. Specifically, using the Jaccard index as the similarity measure, we design (1) a collaborative filtering mechanism maintainable in dynamic data streams and (2) a sketching algorithm for similarity estimation. Our algorithms have little running-time overhead compared to previous LSH approaches in the insertion-only case, and drastically outperform previous algorithms when deletions occur.
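To make the abstract's starting point concrete, the following is a minimal sketch of classical insertion-only min-hashing for estimating the Jaccard index |A ∩ B| / |A ∪ B| of two user profiles. It is not the paper's dynamic algorithm, which the abstract does not spell out. All names (`hash_value`, `minhash_signature`, `InsertOnlyMinHash`) and the choice of SHA-256 as a stand-in for a random hash family are assumptions made for illustration.

```python
import hashlib


def hash_value(seed, item):
    """Deterministic 64-bit hash of `item` under the hash function numbered `seed`.

    Assumption: truncated SHA-256 stands in for a random hash family.
    """
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(hash_value(i, x) for x in items) for i in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions; estimates the Jaccard index."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard index of two non-empty sets."""
    return len(a & b) / len(a | b)


class InsertOnlyMinHash:
    """Streaming min-hash sketch that stores one running minimum per hash function."""

    def __init__(self, num_hashes):
        self.minima = [None] * num_hashes

    def insert(self, item):
        for i in range(len(self.minima)):
            h = hash_value(i, item)
            if self.minima[i] is None or h < self.minima[i]:
                self.minima[i] = h

    # No delete() is offered: once the item attaining a minimum is removed,
    # the sketch holds no record of the next-smallest hash value.


if __name__ == "__main__":
    alice = set(range(0, 100))   # hypothetical item ids for one user
    bob = set(range(50, 150))    # overlaps with alice on 50 of 150 items
    print("exact:   ", exact_jaccard(alice, bob))
    print("estimate:", estimate_jaccard(minhash_signature(alice, 256),
                                        minhash_signature(bob, 256)))
```

The sketch stores only the running minimum per hash function, which is what makes it compact and also what makes deletions a problem: removing the item that attains a minimum leaves a stale value that cannot be repaired without rereading the whole profile.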

Bibliographic record

Source
IEEE Transactions on Knowledge and Data Engineering, 2020, no. 11, pp. 2241-2253 (13 pages)
Authors
Bury Marc; Schwiegelshohn Chris; Sorella Mara
Author affiliations

TU Dortmund, D-44227 Dortmund, Germany;

Sapienza Univ Rome, I-00185 Rome, Italy;

Sapienza Univ Rome, I-00185 Rome, Italy

Original format: PDF
Language: English
Keywords
Heuristic algorithms; Hash functions; Approximation algorithms; Indexes; Measurement; Knowledge engineering; Data engineering; Dynamic streaming; locality-sensitive hashing; nearest neighbor searching
