Pre-filtering based summarization for data partitioning in distributed stream processing

Aslam Adeel; Chen Hanhua; Jin Hai


Abstract

Load balancing among the processing elements (PEs) of a distributed stream processing system (DSPS) is a key issue in the presence of data skew. Existing data partitioning schemes for DSPSs suffer from poor scalability and system inefficiency. Non-key-based partitioning strategies incur prohibitively high memory overhead for stateful operations with a large number of keys and high data parallelism, while key-based schemes introduce load imbalance for highly skewed data. Predicting the nature of stream data in advance can help reduce the load imbalance among the PEs of a DSPS. For this purpose, heavy hitter algorithms approximate the hot items of streaming data. However, existing designs suffer from unsatisfactory prediction accuracy. In this work, we propose an efficient algorithm to filter hot items in a stream of incoming data. The proposed scheme dynamically monitors the items of a stream and greatly improves the accuracy of estimation by keeping the actual key-value pair for the frequent items. On the one hand, to ensure better load balancing for skewed data streams, the detected hot keys are directed to more than two PEs chosen at random from the limited set of workers. On the other hand, for less frequent keys, the proposed scheme applies the power-of-two-choices principle to distribute load. We conduct extensive experiments on both real-world and synthetic data sets. The results show that the proposed pre-filtering approach significantly outperforms existing designs in terms of prediction accuracy. The results also show that our design achieves a more balanced load than existing designs.
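The routing idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not specify the pre-filtering data structure, so the sketch substitutes exact per-key counts, and the names `SketchPartitioner`, `candidates`, `hot_fraction`, `hot_choices`, and `warmup` are all hypothetical. It also picks candidate workers with seeded hashes rather than true randomness, and assumes that "load" means the number of tuples sent to a worker. Keys judged hot get more than two candidate workers; all other keys get two (power of two choices), and each tuple goes to the least loaded candidate.

```python
import hashlib
from collections import Counter


def candidates(key, num_workers, d):
    """Return up to d distinct candidate workers for key, using seeded hashes."""
    chosen = []
    seed = 0
    while len(chosen) < min(d, num_workers):
        digest = hashlib.md5(f"{seed}:{key}".encode()).digest()
        worker = int.from_bytes(digest[:8], "big") % num_workers
        if worker not in chosen:
            chosen.append(worker)
        seed += 1
    return chosen


class SketchPartitioner:
    """Illustrative partitioner; all parameters are assumptions, not from the paper."""

    def __init__(self, num_workers, hot_fraction=0.05, hot_choices=4, warmup=100):
        self.num_workers = num_workers
        self.hot_fraction = hot_fraction  # assumed share of the stream that makes a key hot
        self.hot_choices = hot_choices    # assumed number of candidates for hot keys (> 2)
        self.warmup = warmup              # no key counts as hot before this many tuples
        self.counts = Counter()           # exact counts, a stand-in for the paper's summary
        self.loads = [0] * num_workers    # tuples routed to each worker so far
        self.seen = 0

    def is_hot(self, key):
        if self.seen < self.warmup:
            return False
        return self.counts[key] > self.hot_fraction * self.seen

    def route(self, key):
        """Update the counts, then send the tuple to the least loaded candidate."""
        self.seen += 1
        self.counts[key] += 1
        d = self.hot_choices if self.is_hot(key) else 2
        options = candidates(key, self.num_workers, d)
        worker = min(options, key=lambda w: self.loads[w])
        self.loads[worker] += 1
        return worker
```

Calling `SketchPartitioner(num_workers=8).route(key)` for each incoming tuple returns a worker index. A cold key only ever reaches its two hashed candidates, which bounds the state replicated for a stateful operator, while a hot key is spread over `hot_choices` workers. Exact counting uses memory proportional to the number of distinct keys, which is the cost a compact summary structure would aim to avoid.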


Bibliographic details

Source
Concurrency and Computation: Practice and Experience | 2021, Issue 20 | e6338.1-e6338.25 | 25 pages
Authors
Aslam Adeel; Chen Hanhua; Jin Hai
Affiliations

Huazhong Univ Sci & Technol Sch Comp Sci & Technol Natl Engn Res Ctr Big Data Technol & Syst Wuhan Peoples R China;

Huazhong Univ Sci & Technol Sch Comp Sci & Technol Serv Comp Technol & Syst Lab Wuhan Peoples R China|Huazhong Univ Sci & Technol Sch Comp Sci & Technol Cluster & Grid Comp Lab Wuhan Peoples R China;

Huazhong Univ Sci & Technol Sch Comp Sci & Technol Cluster & Grid Comp Lab Wuhan Peoples R China;

Original format: PDF
Language: English
Keywords
distributed stream processing; load balancing; pre-filtering
