We consider the problem of sampling $n$ numbers from the range $\{1,\ldots,N\}$ without replacement on modern architectures. The main result is a simple divide-and-conquer scheme that makes sequential algorithms more cache efficient and leads to a parallel algorithm running in expected time $\mathcal{O}\left(n/p+\log p\right)$ on $p$ processors. The amount of communication between the processors is very small and independent of the sample size. We also discuss modifications needed for load balancing, reservoir sampling, online sampling, sampling with replacement, Bernoulli sampling, and vectorization on SIMD units or GPUs.
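The abstract does not spell out the divide-and-conquer scheme, so the following is only a minimal sketch of one way it could look. It assumes the range is split in half and the number of samples falling into the left half is drawn from a hypergeometric distribution, after which the two halves are sampled independently. The names `hypergeometric`, `sample_range`, and the cutoff `base` are hypothetical, and the hypergeometric draw is simulated naively in linear time for clarity; a serious implementation would use a much faster generator.

```python
import random


def hypergeometric(rng, good, bad, draws):
    """Count 'good' items among `draws` items drawn without replacement.

    Naive linear-time simulation, used here only for illustration.
    """
    k = 0
    total = good + bad
    for _ in range(draws):
        # The next item is 'good' with probability good / total.
        if rng.random() * total < good:
            k += 1
            good -= 1
        total -= 1
    return k


def sample_range(rng, lo, hi, n, base=64):
    """Return n distinct integers from {lo, ..., hi} in increasing order.

    `base` (assumed >= 1) is a hypothetical cutoff below which a plain
    sequential sampler is used.
    """
    size = hi - lo + 1
    if n > size:
        raise ValueError("cannot draw more samples than the range holds")
    if n <= base:
        # Small subproblem: fall back to a sequential sampler.
        return sorted(rng.sample(range(lo, hi + 1), n))
    left_size = size // 2
    mid = lo + left_size - 1
    # Decide how many of the n samples land in the left half.
    n_left = hypergeometric(rng, left_size, size - left_size, n)
    # The two halves are now independent subproblems.
    left = sample_range(rng, lo, mid, n_left, base)
    right = sample_range(rng, mid + 1, hi, n - n_left, base)
    return left + right


if __name__ == "__main__":
    rng = random.Random(42)
    print(sample_range(rng, 1, 1_000_000, 10))
```

Because the two recursive calls share no state beyond the random number generator, each half touches a smaller piece of the range, and in a parallel setting the halves could in principle be handed to different processors.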