CHOPPER: Optimizing Data Partitioning for In-memory Data Analytics Frameworks

Abstract

The performance of in-memory data analytics frameworks such as Spark is significantly affected by how data is partitioned, because the partitioning effectively determines task granularity and parallelism. Moreover, different phases of a workload execution can have different optimal partitions. However, in current implementations, the tuning knobs that control partitioning are either configured statically or require a cumbersome programmatic process to effect changes at runtime. In this paper, we propose CHOPPER, a system that automatically determines the optimal number of partitions for each phase of a workload and dynamically changes the partition scheme during workload execution. CHOPPER monitors task execution and DAG scheduling information to determine the optimal level of parallelism. CHOPPER repartitions data as needed to ensure efficient task granularity, avoid data skew, and reduce shuffle traffic. Thus, CHOPPER allows users to write applications without having to hand-tune for optimal parallelism. Experimental results show that CHOPPER improves workload performance by up to 35.2% compared to a standard Spark setup.
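
The link between partition count, task granularity, and parallelism described in the abstract can be illustrated with a small toy model. The sketch below is not CHOPPER's algorithm and does not use Spark. It assumes integer keys placed by modulo, one task per partition, and a hypothetical cost model (a fixed per-task overhead plus a per-record cost, with tasks greedily assigned to the least-loaded core). All function names and cost constants are illustrative assumptions.

```python
import heapq


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition when integer keys
    are placed by key modulo num_partitions (a simple hash partitioner)."""
    sizes = [0] * num_partitions
    for key in keys:
        sizes[key % num_partitions] += 1
    return sizes


def estimated_stage_time(sizes, num_cores, per_record_cost=1.0,
                         per_task_overhead=50.0):
    """Toy cost model (an assumption, not taken from the paper):
    each partition is one task costing overhead + records * per_record_cost,
    and each task goes to the currently least-loaded core.
    The stage finishes when the busiest core finishes."""
    loads = [0.0] * num_cores
    heapq.heapify(loads)
    for size in sizes:
        least_loaded = heapq.heappop(loads)
        heapq.heappush(loads, least_loaded + per_task_overhead
                       + size * per_record_cost)
    return max(loads)


def best_partition_count(keys, num_cores, candidates):
    """Return the candidate partition count with the lowest estimated time."""
    return min(
        candidates,
        key=lambda p: estimated_stage_time(partition_sizes(keys, p), num_cores),
    )


if __name__ == "__main__":
    keys = list(range(8000))
    for p in (1, 2, 8, 64):
        t = estimated_stage_time(partition_sizes(keys, p), num_cores=8)
        print(f"{p:>3} partitions -> estimated stage time {t}")
    print("best:", best_partition_count(keys, 8, [1, 2, 8, 64]))
```

Under this model, too few partitions leave cores idle, while too many let the per-task overhead dominate. This is the trade-off that a fixed, statically configured partition count cannot track when different phases of a workload have different optima.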

Bibliographic record

Source
International Conference on Cluster Computing | 2016 | xxviii 537 p. | 10 pages
Authors
Arnab Kumar Paul; Wenjie Zhuang; Luna Xu; Min Li; M. Mustafa Rafique; Ali R. Butt
Original format
PDF
Chinese Library Classification
TP302-532
Keywords
Sparks; Choppers (circuits); Parallel processing; Data analysis; Tuning; Job shop scheduling

Similar literature

1. ClimateSpark: An in-memory distributed computing framework for big climate data analytics [J]. Hu Fei, Yang Chaowei, Schnase John L. Computers & Geosciences, 2018, Issue JUNa.
2. Optimizing the Analytical Value of Oncology-Related Data Based on an In-Memory Analysis Layer: Development and Assessment of the Munich Online Comprehensive Cancer Analysis Platform [J]. Daniel Nasseh, Sophie Schneiderbauer, Michael Lange. Journal of Medical Internet Research, 2020, Issue 4.
3. Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework [J]. Tanveer Ahmad, Nauman Ahmed, Zaid Al-Ars. BMC Genomics, 2020, Issue S10.
4. CHOPPER: Optimizing Data Partitioning for In-memory Data Analytics Frameworks [C]. Arnab Kumar Paul, Wenjie Zhuang, Luna Xu. IEEE International Conference on Cluster Computing, 2016.
5. A study of an in-memory database system for real-time analytics on semi-structured data streams [D]. Lu, Alan Wen Jun. 2015.
6. Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework [O]. Tanveer Ahmad, Nauman Ahmed, Zaid Al-Ars. 2020.
7. CCA: Cost-Capacity-Aware Caching for In-Memory Data Analytics Frameworks [O]. Seongsoo Park, Minseop Jeong, Hwansoo Han. 2021.
