【24h】

A performance study of the chain sampling algorithm

机译:链采样算法的性能研究

获取原文
获取原文并翻译 | 示例

摘要

On-line data stream analysis is an important challenge today because of the always-increasing rates of the streams issued from multiple heterogeneous sources, in many application domains. To reduce the amount of the data stream, several sampling methods were designed by the data stream research community. We focus in this paper, on the chain sampling algorithm proposed by Babcock et al. The aim of this algorithm is to select randomly and at any time, a given fixed proportion from the most recent items of the stream contained in the last sliding window. This algorithm is well adapted to the stream context, as only one pass over the data is performed. Moreover it uses a small memory, as it does not store all the items of the current sliding window. We show in this paper that the chain sampling algorithm suffers from some collision or redundancy problems. The collision occurs when the same item is selected as a sample more than once during the execution of the algorithm. We propose two approaches to overcome this weakness and improve the chain sampling algorithm. The first one is called “inverting the selection for a high sampling rate” and the second one is inspired from the “divide to conquer strategy”. Different experimentations are performed to show the efficiency of these two improvements, in particular their impact on the execution time of the algorithm.
机译:由于在许多应用领域中从多个异构源发出的数据流的比率一直在不断增长,因此在线数据流分析是当今的一项重要挑战。为了减少数据流的数量,数据流研究社区设计了几种采样方法。在本文中,我们将重点放在Babcock等人提出的链采样算法上。该算法的目的是随时随地从最后一个滑动窗口中包含的流的最新项中随机选择给定的固定比例。该算法非常适合流上下文,因为仅执行一次数据传递。此外,由于它不存储当前滑动窗口的所有项目,因此它使用的内存很小。我们在本文中证明链采样算法存在一些冲突或冗余问题。在算法执行过程中多次选择同一项目作为样本时发生冲突。我们提出了两种方法来克服此缺点并改进链采样算法。第一个称为“以高采样率反转选择”,第二个则受“分而治之”的启发。进行不同的实验以显示这两个改进的效率,尤其是它们对算法执行时间的影响。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号