ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce


Abstract

Similarity join is an important operation in data mining, with a diverse range of real-world applications. This thesis proposes three efficient MapReduce algorithms for performing similarity joins between multisets. Filtering techniques for similarity joins minimize the number of entity pairs that are joined, and are therefore vital to the efficiency of the algorithm. Multisets represent real-world data better than sets because they account for the frequency of their elements. Prior serial algorithms incorporate filtering techniques only for sets, not multisets, while prior MapReduce algorithms either incorporate no filtering technique or incorporate prefix filtering inefficiently, with poor scalability.

This work extends the filtering techniques, namely the prefix, size, positional and suffix filters, to multisets, and also achieves the challenging task of incorporating them efficiently in the shared-nothing MapReduce model. Applying the filtering techniques in a strategic sequence minimizes the pairs generated and joined, resulting in I/O, network and computational efficiency. In the SSS algorithm, prefix, size and positional filtering are incorporated in the MapReduce framework. The pairs that survive filtering are joined in the third stage, the Similarity Join Stage, using a Multiset File generated in the second stage. We also developed a technique to enhance the scalability of the algorithm as a contingency measure.

In the ESSJ algorithm, all the filtering techniques, namely prefix, size, positional and suffix filtering, are incorporated in the MapReduce framework. It is designed with a seamless and scalable Similarity Join Stage, in which the similarity joins are performed without depending on a file.

In the EASE algorithm, all the filtering techniques, namely prefix, size, positional and suffix filtering, are likewise incorporated in the MapReduce framework. However, it is tailored as a hybrid algorithm that exploits the strategies of both SSS and ESSJ for performing the joins. Some multiset pairs are joined using the Multiset File, as in SSS, and others are joined without it, as in ESSJ. The algorithm thereby gains the benefits of both strategies.

The SSS and ESSJ algorithms were developed using Hadoop and tested on real-world Twitter data. For both SSS and ESSJ, experimental results demonstrate performance gains of over 70% in comparison to the competing state-of-the-art algorithm.
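To make the idea of filtering before joining concrete, the following is a minimal single-machine sketch of a size filter applied to multisets. It is not the thesis's MapReduce implementation. The abstract does not name the similarity measure, so the sketch assumes the generalized (multiset) Jaccard similarity, and the function names `multiset_jaccard`, `passes_size_filter` and `similarity_join` are hypothetical.

```python
from collections import Counter


def multiset_jaccard(x: Counter, y: Counter) -> float:
    """Generalized Jaccard: sum of min counts over sum of max counts.

    Assumption: the abstract does not specify the measure; this is one
    common choice for multisets.
    """
    keys = x.keys() | y.keys()
    inter = sum(min(x[k], y[k]) for k in keys)
    union = sum(max(x[k], y[k]) for k in keys)
    return inter / union if union else 0.0


def passes_size_filter(x: Counter, y: Counter, threshold: float) -> bool:
    """Cheap necessary condition for multiset_jaccard(x, y) >= threshold.

    The similarity can never exceed min(|x|, |y|) / max(|x|, |y|), where
    |x| is the multiset cardinality (sum of counts), so pairs whose sizes
    differ too much can be skipped without computing the similarity.
    """
    small, large = sorted((sum(x.values()), sum(y.values())))
    return large > 0 and small >= threshold * large


def similarity_join(multisets: dict, threshold: float) -> list:
    """Naive all-pairs join that verifies only pairs surviving the filter."""
    ids = sorted(multisets)
    results = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if not passes_size_filter(multisets[a], multisets[b], threshold):
                continue  # pruned: cannot reach the threshold
            sim = multiset_jaccard(multisets[a], multisets[b])
            if sim >= threshold:
                results.append((a, b, sim))
    return results


if __name__ == "__main__":
    data = {
        "a": Counter("aab"),
        "b": Counter("aabb"),
        "c": Counter("z" * 10),
    }
    print(similarity_join(data, 0.7))
```

The size filter is only a necessary condition, so every surviving pair is still verified with the full similarity computation. The prefix, positional and suffix filters named in the abstract prune further candidates in the same spirit but are not shown here.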

Bibliographic details

  • Author affiliation: The University of Toledo
  • Degree-granting institution: The University of Toledo
  • Discipline: Computer science; Engineering
  • Degree: M.E.
  • Year: 2013
  • Pages: 103
  • Original format: PDF
  • Language of text: English
