International Conference on Cloud and Autonomic Computing

Efficient Collaborative Approximation in MapReduce Without Missing Rare Keys



Abstract

Recent proposals extend MapReduce, a widely used Big Data processing framework, with sampling to improve performance by producing approximate results with statistical error bounds. However, because these systems perform global uniform sampling across the entire key space of the input data, they may completely miss rare keys, which may be unacceptable in some applications. Well-known stratified sampling avoids missing rare keys by obtaining the same number of samples for each key, and it also achieves good performance by sampling popular keys infrequently and rare keys more often. While online stratified sampling has been done in centralized settings, a straightforward extension to MapReduce's distributed setting cannot easily leverage the number of per-key samples seen globally by all the Mappers to reduce each Mapper's future sampling rate. Because a typical MapReduce job has hundreds of Mappers, such feedback can drastically reduce oversampling and improve performance. We present MaDSOS (MapReduce with Distributed Stratified Online Sampling), which makes two contributions: (1) instead of requiring a fixed n samples per key and the resulting arbitrary sampling rates, we propose a telescoping algorithm that uses fixed sampling rates of the form 1/2^k and accepts between n and 2n samples per key; (2) we propose a collaborative feedback scheme, enabled by this specific form of sampling rates and the leniency in the sample counts, to efficiently cut the sampling rates, and thus oversampling, once the desired number of samples has been seen globally. On our MapReduce benchmarks, MaDSOS improves performance by 59% over Hadoop while guaranteeing never to miss rare keys, and achieves 2.5% per-key error, compared to 100% worst-case error under global sampling at a fixed rate for all keys.
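The telescoping idea in contribution (1) can be illustrated with a small single-node sketch. This is a hypothetical, count-only illustration under our own assumptions, not the paper's distributed implementation: the class and method names are ours, and the global feedback of contribution (2) is reduced here to a local counter. Each key starts at rate 1/2^0 = 1; once a key accumulates 2n samples, its rate is halved and the retained samples are subsampled by fair coin flips, so every key keeps roughly between n and 2n samples while rare keys with fewer than 2n occurrences are never dropped.

```python
import random
from collections import defaultdict

class TelescopingSampler:
    """Hypothetical sketch of stratified online sampling with
    power-of-two ("telescoping") rates; tracks only per-key counts."""

    def __init__(self, n, seed=0):
        self.n = n                      # target sample count per key
        self.k = defaultdict(int)       # per-key exponent: rate = 1/2^k
        self.count = defaultdict(int)   # retained samples per key
        self.rng = random.Random(seed)

    def offer(self, key):
        """Return True if this occurrence of `key` is sampled."""
        if self.rng.random() >= 0.5 ** self.k[key]:
            return False                # rejected at the current rate
        self.count[key] += 1
        if self.count[key] >= 2 * self.n:
            # Telescope: halve the rate and subsample what was kept,
            # leaving an expected n retained samples for this key.
            self.k[key] += 1
            self.count[key] = sum(
                1 for _ in range(self.count[key])
                if self.rng.random() < 0.5)
        return True
```

A popular key thus settles at a rate of roughly n divided by its frequency, while a rare key stays at rate 1; in the paper's distributed setting, the count that triggers the rate cut would come from the global feedback across all Mappers rather than a local counter.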


