Conference: IEEE International Conference on Fuzzy Systems

Optimization for Large-Scale Fuzzy Joins Using Fuzzy Filters in MapReduce


Abstract

A fuzzy join, or similarity join, is one of the most useful data processing and analysis operations for Big Data in general. It combines pairs of tuples whose distance is lower than or equal to a given threshold ε. The fuzzy join is used in many practical applications, but it is extremely costly in time and space, and may not even be executable on large-scale datasets. Although some studies have improved its performance by applying filters, an effective fuzzy filter for the join has not yet been proposed. In this paper, we therefore extend our previous work by proposing a novel fuzzy filter to optimize fuzzy joins. This filter is a compact, probabilistic data structure that supports very fast similarity queries by maintaining a bit matrix, with a small false positive rate and a zero false negative rate. We show that our proposal is more efficient than others because it eliminates redundant data, reduces computation cost, and avoids duplicate output.
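
The abstract does not specify how the filter is built, so the following is only a minimal single-machine sketch of the general idea, not the authors' structure and not a MapReduce job. It assumes integer join keys, absolute difference as the distance, and a Bloom-style bit matrix (one row per hash function); the names FuzzyBitFilter, might_match, and fuzzy_join are hypothetical. Inserting every value within ε of each key guarantees zero false negatives, while hash collisions can still produce false positives, which the exact distance check removes.

```python
import hashlib


class FuzzyBitFilter:
    """Illustrative Bloom-style filter over integer keys (assumed design)."""

    def __init__(self, eps, num_rows=3, num_bits=1024):
        self.eps = eps
        self.num_rows = num_rows
        self.num_bits = num_bits
        # Bit matrix: one row per hash function (one byte per bit for simplicity).
        self.rows = [bytearray(num_bits) for _ in range(num_rows)]

    def _positions(self, value):
        # Derive one column index per row from a salted SHA-256 digest.
        for row in range(self.num_rows):
            digest = hashlib.sha256(f"{row}:{value}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        # Set bits for every value within eps of key, so that any query
        # within eps of an inserted key is guaranteed to find all bits set.
        for value in range(key - self.eps, key + self.eps + 1):
            for row, pos in self._positions(value):
                self.rows[row][pos] = 1

    def might_match(self, key):
        # False means "certainly no partner within eps"; True means "maybe".
        return all(self.rows[row][pos] for row, pos in self._positions(key))


def fuzzy_join(left, right, eps):
    """Return sorted pairs (r, s) with |r - s| <= eps, pre-filtering left."""
    filt = FuzzyBitFilter(eps)
    for s in right:
        filt.add(s)
    # Discard left tuples that cannot join before the pairwise comparison.
    candidates = [r for r in left if filt.might_match(r)]
    return sorted(
        (r, s) for r in candidates for s in right if abs(r - s) <= eps
    )


if __name__ == "__main__":
    print(fuzzy_join([1, 10, 50], [2, 11, 100], eps=1))
```

In this sketch the filter only prunes the left input; the final distance check keeps the output exact even when the filter reports a false positive. Inserting 2ε + 1 values per key is practical only for small ε on discrete keys, which is one reason a purpose-built fuzzy filter such as the one the paper proposes is of interest.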
