
Pre-filtering based summarization for data partitioning in distributed stream processing


Abstract

Load balancing among the processing elements (PEs) of a distributed stream processing system (DSPS) is a key issue in the presence of data skew. Existing data partitioning schemes for DSPS suffer from poor scalability and system inefficiency. Non-key-based partitioning strategies incur prohibitively high memory overhead for stateful operations with a large number of keys and high data parallelism, while key-based schemes introduce load imbalance for highly skewed data. Predicting the nature of stream data in advance can help reduce the load imbalance among the PEs of a DSPS. For this purpose, heavy hitter algorithms approximate the hot items of streaming data. However, existing designs suffer from unsatisfactory prediction accuracy. In this work, we propose an efficient algorithm to filter hot items in a stream of incoming data. The proposed scheme dynamically monitors the items of a stream and greatly improves the accuracy of estimation by keeping the actual key-value pair for the frequent items. On the one hand, to ensure better load balancing for skewed data streams, the detected hot keys are directed randomly to more than two PEs chosen from the limited set of workers. On the other hand, for less frequent keys, the proposed scheme applies the principle of the power of two choices to distribute load. We conduct extensive experiments on both real-world and synthetic data sets. The results show that the proposed pre-filtering approach significantly outperforms existing designs in terms of prediction accuracy. The results also show that our design achieves a more balanced load than the existing designs.
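The two routing rules in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the pre-filter, so the sketch substitutes the well-known Space-Saving heavy hitter counter, and the class name `HotKeyPartitioner`, the parameters `hot_fraction`, `hot_choices`, and `capacity`, and the MD5-based hashing are all assumptions made for illustration.

```python
import hashlib
import random


def _hash(key, seed):
    """Deterministic hash of a key under a given seed (illustrative choice)."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


class HotKeyPartitioner:
    """Hypothetical partitioner: hot keys are spread over several workers,
    other keys use the power of two choices."""

    def __init__(self, num_workers, hot_fraction=0.1, hot_choices=4,
                 capacity=100, seed=0):
        self.num_workers = num_workers
        self.hot_fraction = hot_fraction      # assumed hotness threshold
        self.hot_choices = min(hot_choices, num_workers)
        self.capacity = capacity              # max number of tracked keys
        self.counts = {}                      # Space-Saving counters
        self.total = 0                        # items seen so far
        self.load = [0] * num_workers         # items sent to each worker
        self.rng = random.Random(seed)

    def _observe(self, key):
        """Update the bounded frequency summary (Space-Saving)."""
        self.total += 1
        if key in self.counts:
            self.counts[key] += 1
        elif len(self.counts) < self.capacity:
            self.counts[key] = 1
        else:
            # Evict the smallest counter; the new key inherits its count + 1.
            victim = min(self.counts, key=self.counts.get)
            self.counts[key] = self.counts.pop(victim) + 1

    def is_hot(self, key):
        return self.counts.get(key, 0) > self.hot_fraction * self.total

    def route(self, key):
        """Return the worker index chosen for this item."""
        self._observe(key)
        if self.is_hot(key):
            # Hot key: pick at random among more than two candidate workers.
            candidates = [_hash(key, i) % self.num_workers
                          for i in range(self.hot_choices)]
            worker = self.rng.choice(candidates)
        else:
            # Other keys: the less loaded of two hashed candidates.
            a = _hash(key, 0) % self.num_workers
            b = _hash(key, 1) % self.num_workers
            worker = a if self.load[a] <= self.load[b] else b
        self.load[worker] += 1
        return worker


partitioner = HotKeyPartitioner(num_workers=8)
workers = []
for i in range(1000):
    workers.append(partitioner.route("hot"))      # one heavily repeated key
    workers.append(partitioner.route(f"k{i}"))    # many one-off keys
```

In this sketch every key looks hot during the first few items because the threshold is a fraction of a very small total; a real system would likely add a warm-up period. Spreading a stateful key over several workers also means its partial results have to be merged downstream, which is the memory and aggregation cost the abstract attributes to non-key-based schemes.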

Bibliographic record

  • Source
    Concurrency and computation: practice and experience | 2021, Issue 20 | e6338.1-e6338.25 | 25 pages
  • Author affiliations

    Huazhong Univ Sci & Technol Sch Comp Sci & Technol Natl Engn Res Ctr Big Data Technol & Syst Wuhan Peoples R China;

    Huazhong Univ Sci & Technol Sch Comp Sci & Technol Serv Comp Technol & Syst Lab Wuhan Peoples R China|Huazhong Univ Sci & Technol Sch Comp Sci & Technol Cluster & Grid Comp Lab Wuhan Peoples R China;

    Huazhong Univ Sci & Technol Sch Comp Sci & Technol Cluster & Grid Comp Lab Wuhan Peoples R China;

  • Indexing information
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification
  • Keywords

    distributed stream processing; load balancing; pre-filtering;
