...
首页> 外文期刊>Theoretical computer science >Boosting distinct random sampling for basic counting on the union of distributed streams
【24h】

Boosting distinct random sampling for basic counting on the union of distributed streams

机译:促进独特的随机采样,以基本统计分布式流的并集

获取原文
获取原文并翻译 | 示例

摘要

We revisit the classic basic counting problem in the distributed streaming model. In the solution for maintaining an (epsilon,delta)-estimate, we make the following new contributions: (1) For a bit stream of size n, where each bit has a probability at least gamma to be 1, we exponentially reduced the average total processing time from the best prior work's Theta(n log(1/delta)) to O ((1/(gamma epsilon(2)))(log(2) n) log(1/delta)), thus providing the first sublinear-time streaming algorithm for this problem. (2) In addition to an overall much faster processing speed, our method provides a new tradeoff that a lower accuracy demand (a larger value for epsilon) promises a faster processing speed, whereas the best prior work's processing speed is Theta(n log(1/delta)) in any case and for any epsilon. (3) The worst-case total time cost of our method matches the best prior work's Theta(n log(1/delta)), which is necessary but rarely occurs in our method. (4) The space usage overhead in our method is a lower order term compared with the best prior work's space usage and occurs only O (logn) times during the stream processing and is too negligible to be detected by the OS in practice. We further validate these theoretical results with experiments on both real-world and synthetic data, showing that our method is faster than the best prior work by a factor of several to several hundreds depending on the stream size and accuracy demands, without any detectable space usage overhead. Our method is based on a faster sampling technique that we design for boosting the sampling procedure in the best prior work and we believe this technique can be of other independent interest. (C) 2015 Elsevier B.V. All rights reserved.
机译:我们将重新讨论分布式流模型中的经典基本计数问题。在维持(epsilon,delta)估计的解决方案中,我们做出了以下新贡献:(1)对于大小为n的比特流,其中每个比特的伽玛系数至少为1,我们按指数方式减小了平均值从最佳先前工作的Theta(n log(1 / delta))到O((1 /(γepsilon(2)))(log(2)n)log(1 / delta))的总处理时间,从而提供了解决此问题的第一个亚线性时间流算法。 (2)除了总体上要快得多的处理速度外,我们的方法还提供了一个新的权衡,即较低的精度要求(较大的epsilon值)将保证更快的处理速度,而最佳的先前工作的处理速度为Theta(n log( 1 / delta))在任何情况下以及任何epsilon。 (3)我们方法的最坏情况下的总时间成本与先前工作的最佳Theta(n log(1 / delta))相匹配,这是必要的,但在我们的方法中很少发生。 (4)与最好的先前工作的空间使用相比,我们的方法中的空间使用开销是一个较低阶的术语,并且在流处理期间仅发生O(登录)次,并且在实践中几乎无法被OS检测到。我们通过对真实数据和合成数据的实验进一步验证了这些理论结果,表明我们的方法比最佳的现有工作要快几到几百倍,具体取决于流的大小和精度要求,而没有任何可检测的空间使用情况高架。我们的方法基于一种更快的采样技术,我们设计该技术是为了在最好的现有工作中增强采样过程,并且我们认为该技术可能会引起其他人们的关注。 (C)2015 Elsevier B.V.保留所有权利。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号