Recent proposals extend MapReduce, a widely-used Big Data processing framework, with sampling to improve performance by producing approximate results with statistical error bounds. However, because these systems perform global uniform sampling across the entire key space of input data, they may completely miss rare keys which may be unacceptable in some applications. Well-known stratified sampling avoids missing rare keys by obtaining the same number of samples for each key which also achieves good performance by sampling popular keys infrequently and rare keys more often. While online stratified sampling has been done in centralized settings, straightforward extension to MapReduce's distributed setting cannot easily leverage the number of per-key samples seen globally by all the Mappers to reduce the sampling rate of each Mapper in the future. Because there are hundreds of Mappers in a typical MapReduce job, such feedback can drastically reduce oversampling and improve performance. We present MaDSOS (MapReduce with Distributed Stratified Online Sampling) which makes two contributions: (1) Instead of a fixed n per-key samples and the resultant sampling rates, we propose a telescoping algorithm that uses fixed sampling rates of the form 1/2~k and, between n and 2n samples. (2) We propose a collaborative feedback scheme, that is enabled by the specific form of sampling rates and the leniency in the sample counts, to efficiently cut the sampling rates, and thus oversampling, once the desired number of samples have been seen globally. For our MapReduce benchmarks, MaDSOS improves performance by 59% over Hadoop while guaranteeing never to miss rare keys and achieves 2.5% per-key error compared to 100% worst-case error under global sampling at a fixed rate for all the keys.
展开▼