Octopus: Based on Congestion-aware Scheduling on Geo-distributed Big Data Analytics Cluster



Abstract

In recent years, big data analytics frameworks have sprung up rapidly. Meanwhile, it has become routine for large volumes of data to be generated, stored, and processed across geographically distributed data centers. Network congestion generated by data transfers between networks becomes a major bottleneck to the overall performance of the system in a geo-distributed environment. Many existing methods handle network congestion only after it occurs, which does not solve the problem fundamentally. In this paper, we focus on the problem of predicting and avoiding network congestion in advance for jobs on Apache Spark in a geo-distributed environment, in terms of their job completion times. We formulate this problem as a runtime minimization problem, which is challenging to solve in practice because the setting spans different data centers. To address these challenges, we propose a model based on congestion-aware scheduling. In the model, we exploit SDN (Software-Defined Networking) to detect in advance the size of the data flows from different data centers and then analyze the data characteristics. This predicts which flows can generate network congestion, so that we can draft two schemes for different flows. In addition, when we detect network congestion, we choose a path with greater bandwidth for the congested flow. The approach can minimize network congestion, promote network utilization, and improve system performance in a geo-distributed environment. As a highlight of this paper, we design and implement our proposed solution as a job scheduler based on Apache Spark, a modern data processing framework.
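The abstract does not give the prediction rule or the two schemes, so the following is only a minimal sketch of the general idea under stated assumptions: a flow's size is known in advance, a flow is treated as congestion-prone when its estimated transfer time on the default path exceeds a deadline, and such a flow is moved to the candidate path with the greatest bandwidth. The names `Flow`, `Path`, `is_congestion_prone`, `choose_path`, and the `deadline_s` threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Path:
    """A candidate network path between two data centers (hypothetical model)."""
    name: str
    bandwidth_mbps: float  # available bandwidth in megabits per second


@dataclass(frozen=True)
class Flow:
    """A data flow whose size is assumed to be known before it starts."""
    flow_id: str
    size_mb: float  # size in megabytes, e.g. as reported by an SDN controller


def is_congestion_prone(flow: Flow, path: Path, deadline_s: float) -> bool:
    """Assumed rule: the flow is congestion-prone on this path if its
    estimated transfer time exceeds the given deadline."""
    transfer_time_s = flow.size_mb * 8 / path.bandwidth_mbps
    return transfer_time_s > deadline_s


def choose_path(flow: Flow, default: Path, alternatives: Sequence[Path],
                deadline_s: float) -> Path:
    """Keep the default path for ordinary flows; otherwise pick the
    candidate path with the greatest bandwidth."""
    if not is_congestion_prone(flow, default, deadline_s):
        return default
    return max([default, *alternatives], key=lambda p: p.bandwidth_mbps)
```

For example, with a 100 Mbps default path, a 10 s deadline, and alternatives of 400 Mbps and 1000 Mbps, a 50 MB flow stays on the default path (estimated 4 s), while a 500 MB flow (estimated 40 s) is moved to the 1000 Mbps path.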


Bibliographic information

Source
International Conference on Systems and Informatics | 2018 | pp. 490-495 | 6 pages
Authors
Haizhou Du; Keke Zhang; Zhenchen Yang
Original format
PDF
Keywords
Bandwidth; Task analysis; Big Data; Sparks; Switches; Data centers


