An Efficient Document Indexing-Based Similarity Search in Large Datasets

Abstract

In this paper, we propose a novel MapReduce-based approach for efficient similarity search in big data. Specifically, we address the drawbacks of using an inverted index for similarity search with MapReduce and propose a simple yet efficient redundancy-free MapReduce scheme, which not only improves on the baseline inverted index-based procedures but also adapts to various similarity measures and types of similarity search. Additionally, we present further strategies that help eliminate unnecessary data and computations. Finally, we conduct extensive empirical evaluations on real massive datasets with the Hadoop framework on a cluster of commodity machines; the promising results show how beneficial the proposed methods are when dealing with big data.
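
For orientation, the kind of baseline the abstract refers to (inverted index-based similarity search in a MapReduce style) can be sketched as below. This is only an illustrative, single-process sketch and not the authors' redundancy-free scheme: the function names (`map_build_index`, `shuffle`, `map_pairs`, `similar_pairs`), the whitespace tokenizer, the raw term-frequency weights, and the choice of cosine similarity are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def map_build_index(doc_id, text):
    """Map step 1 (hypothetical): emit (term, (doc_id, term_frequency))."""
    counts = defaultdict(int)
    for term in text.lower().split():
        counts[term] += 1
    for term, tf in counts.items():
        yield term, (doc_id, tf)


def shuffle(pairs):
    """Group values by key, standing in for the MapReduce shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def map_pairs(postings):
    """Map step 2 (hypothetical): one posting list -> partial dot products."""
    for (d1, tf1), (d2, tf2) in combinations(sorted(postings), 2):
        yield (d1, d2), tf1 * tf2


def similar_pairs(docs, threshold):
    """Return {(doc_a, doc_b): cosine} for pairs at or above the threshold."""
    # Job 1: build the inverted index (term -> posting list).
    index = shuffle(
        pair for doc_id, text in docs.items()
        for pair in map_build_index(doc_id, text)
    )
    # Job 2: sum the partial products of every pair that shares a term.
    partial = shuffle(
        pair for postings in index.values() for pair in map_pairs(postings)
    )
    # Vector norms, needed to turn dot products into cosine similarities.
    norms = defaultdict(float)
    for postings in index.values():
        for doc_id, tf in postings:
            norms[doc_id] += tf * tf
    result = {}
    for (d1, d2), products in partial.items():
        cosine = sum(products) / (sqrt(norms[d1]) * sqrt(norms[d2]))
        if cosine >= threshold:
            result[(d1, d2)] = cosine
    return result
```

In this sketch, only document pairs that share at least one term are ever compared, which is the usual motivation for an inverted index; the cost is that every shared term emits its own intermediate record for the pair.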

Bibliographic details

Source
International Conference on Future Data and Security Engineering | 2015 | pp. 16-31 | 16 pages
Authors
Trong Nhan Phan; Markus Jaeger; Stefan Nadschlaeger; Josef Kueng; Tran Khanh Dang;
Original format: PDF
Keywords
Similarity search; Efficiency; MapReduce; Large datasets; Clustering; Filtering; Redundancy-free capability; Document indexing

Similar literature

1. Dimensionality Reduction for Efficient Document Similarity Detection in Big Datasets [J]. Niyigena Papias, Zuping Zhang, Oad Ammar. Journal of Computational and Theoretical Nanoscience, 2017, No. 6
2. Efficient top-k similarity document search utilizing distributed file systems and cosine similarity [J]. Alewiwi Mahmoud, Orencik Cengiz, Savas Erkay. Cluster Computing, 2016, No. 1
3. A fast and scalable similarity search in high-dimensional image datasets [J]. Youssef Hanyf, Hassan Silkan. International Journal of Computer Applications in Technology, 2019, No. 1
4. An Efficient Document Indexing-Based Similarity Search in Large Datasets [C]. Trong Nhan Phan, Markus Jaeger, Stefan Nadschlaeger, International Conference on Future Data and Security Engineering, 2015
5. Hashing Based Similarity Search over Massive Datasets [D]. Li, Jinfeng. 2018
6. GEMINI: a computationally-efficient search engine for large gene expression datasets [O]. Timothy DeFreitas, Hachem Saddiki, Patrick Flaherty. 2016
7. Efficient Pairwise Document Similarity Computation in Big Datasets [O]. Papias Niyigena, Zhang Zuping, Weiqi Li, 2015
8. Efficient Video Similarity Measurement and Search [R]. Cheung, S. 2002
