ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce

Abstract

Similarity join is an important operation in data mining, with a diverse range of real-world applications. This thesis proposes three efficient MapReduce algorithms for performing similarity joins between multisets. Filtering techniques for similarity joins minimize the number of entity pairs that are joined, and are therefore vital to the efficiency of the algorithm. Multisets represent real-world data better than sets because they account for the frequency of each element. Prior serial algorithms incorporate filtering techniques only for sets, not multisets, while prior MapReduce algorithms either incorporate no filtering technique or incorporate prefix filtering inefficiently, with poor scalability.

This work extends the prefix, size, positional and suffix filters to multisets, and also achieves the challenging task of incorporating them efficiently in the shared-nothing MapReduce model. Applying the filters in a strategic sequence minimizes the pairs generated and joined, resulting in I/O, network and computational efficiency.

In the SSS algorithm, prefix, size and positional filtering are incorporated in the MapReduce framework. The pairs that survive filtering are joined in the third stage, the Similarity Join Stage, using a Multiset File generated in the second stage. We also developed a technique that improves the scalability of the algorithm when needed, as a contingency.

In the ESSJ algorithm, all four filtering techniques (prefix, size, positional and suffix) are incorporated in the MapReduce framework. It is designed with a seamless and scalable Similarity Join Stage, in which the similarity joins are performed without depending on a file.

In the EASE algorithm, all four filtering techniques (prefix, size, positional and suffix) are likewise incorporated in the MapReduce framework. However, it is designed as a hybrid algorithm that exploits the strategies of both SSS and ESSJ for performing the joins. Some multiset pairs are joined using the Multiset File, as in SSS, and others are joined without it, as in ESSJ. The algorithm thus gains the benefits of both strategies.

The SSS and ESSJ algorithms were developed using Hadoop and tested on real-world Twitter data. For both SSS and ESSJ, experimental results demonstrate performance gains of over 70% in comparison to the competing state-of-the-art algorithm.
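To illustrate how prefix and size filtering can prune candidate pairs for multisets, the following is a minimal single-machine sketch in Python. It is not the SSS, ESSJ or EASE algorithm from the thesis and involves no MapReduce stages. It assumes multiset Jaccard similarity (sum of minimum counts over sum of maximum counts), which the abstract does not name, and it applies the standard set-based prefix filter by expanding each element with multiplicity k into k distinct tokens. The function names `expand`, `jaccard` and `similarity_join` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import ceil


def expand(ms):
    """Turn a multiset (Counter) into a sorted list of (element, copy) tokens.

    An element with multiplicity k becomes k distinct tokens, so multiset
    overlap equals plain set overlap on the expanded tokens.
    """
    return sorted((e, i) for e, n in ms.items() for i in range(1, n + 1))


def jaccard(a, b):
    """Multiset Jaccard: sum of min counts divided by sum of max counts."""
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 1.0


def similarity_join(records, t):
    """Return {(id_x, id_y): similarity} for all pairs with similarity >= t."""
    tokens = {rid: expand(ms) for rid, ms in records.items()}

    # Prefix filter: index only the first n - ceil(t*n) + 1 tokens of each
    # record. Two records with Jaccard >= t must share a prefix token.
    # The small epsilon guards against float rounding making ceil too large.
    index = defaultdict(set)
    for rid, toks in tokens.items():
        n = len(toks)
        prefix_len = n - ceil(t * n - 1e-9) + 1
        for tok in toks[:prefix_len]:
            index[tok].add(rid)

    candidates = set()
    for rids in index.values():
        candidates.update(combinations(sorted(rids), 2))

    result = {}
    for x, y in candidates:
        nx, ny = len(tokens[x]), len(tokens[y])
        # Size filter: Jaccard can never exceed min(size) / max(size).
        if min(nx, ny) < t * max(nx, ny) - 1e-9:
            continue
        s = jaccard(records[x], records[y])
        if s >= t:
            result[(x, y)] = s
    return result


if __name__ == "__main__":
    records = {
        "r1": Counter({"x": 2, "y": 1}),
        "r2": Counter({"x": 2, "y": 1, "z": 1}),
        "r3": Counter({"q": 3}),
    }
    print(similarity_join(records, 0.7))
```

In this sketch only the pair sharing a prefix token is ever verified; the third record is never compared against the other two. Positional and suffix filtering, which the thesis also extends to multisets, are omitted here.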


Bibliographic record

Author: Lakshminarayanan, Mahalakshmi
Author affiliation: The University of Toledo
Degree-granting institution: The University of Toledo
Subject: Computer science; Engineering
Degree: M.E.
Year: 2013
Pages: 103
Original format: PDF
Language: English (eng)

Similar literature

1. An efficient MapReduce algorithm for similarity join in metric spaces [J]. Liu Wen, Shen Yanming, Wang Peng. Journal of Supercomputing. 2016, Issue 3
2. Efficient Similarity Join Based on Earth Mover's Distance Using MapReduce [J]. Xu Jia, Lei Bin, Gu Yu. IEEE Transactions on Knowledge and Data Engineering. 2015, Issue 8
3. Efficient and Scalable Graph Similarity Joins in MapReduce [J]. Yifan Chen, Xiang Zhao, Chuan Xiao. Scientific World Journal. 2014, Issue 3
4. MLS-Join: An Efficient MapReduce-Based Algorithm for String Similarity Self-joins with Edit Distance Constraint [C]. Decai Sun, Xiaoxia Wang. International Conference on Cloud Computing and Security. 2018
5. Efficient Algorithms for Frequent Path Finding and Similarity Join in Big Multidimensional Data [D]. Luo, Wuman. 2012
6. Efficient and Scalable Graph Similarity Joins in MapReduce [O]. Yifan Chen, Xiang Zhao, Chuan Xiao
7. MELODY-JOIN: Efficient Earth Mover's Distance Similarity Joins Using MapReduce [O]. Jin Huang, Rui Zhang, Rajkumar Buyya. 2014
