IEEE International Conference on Cloud Computing

Intermediate Data Caching Optimization for Multi-Stage and Parallel Big Data Frameworks

Abstract

In the era of big data and cloud computing, large amounts of data are generated by user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such data processing at scale. Specifically, Spark leverages distributed memory to cache intermediate results, represented as Resilient Distributed Datasets (RDDs). This gives Spark an advantage over other parallel frameworks when implementing iterative machine learning and data mining algorithms, by avoiding repeated computation or hard-disk accesses to retrieve RDDs. By default, caching decisions are left to the programmer's discretion, and the LRU policy is used to evict RDDs when the cache is full. However, when the objective is to minimize total work, LRU is woefully inadequate, leading to arbitrarily suboptimal caching decisions. In this paper, we design an algorithm for multi-stage big data processing platforms to adaptively determine and cache the most valuable intermediate datasets that can be reused in the future. Our solution automates the decision of which RDDs to cache: this amounts to identifying the nodes in a directed acyclic graph (DAG) of computations whose outputs should persist in memory. Our experimental results show that the proposed cache optimization solution improves the performance of machine learning applications on Spark, decreasing the total work to recompute RDDs by 12%.
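For context, the caching decision that the paper automates is normally made by hand in Spark code. The following is a minimal PySpark sketch (the input path, application name, and iteration count are illustrative assumptions, not taken from the paper): an intermediate RDD reused across iterations is explicitly persisted so that Spark does not recompute its lineage on every action, and is evicted under the default LRU policy only when memory fills.

from pyspark import SparkContext, StorageLevel

# Minimal sketch of the manual caching decision described above; the input
# path, app name, and iteration count are illustrative assumptions.
sc = SparkContext(appName="manual-rdd-caching-sketch")

raw = sc.textFile("hdfs:///data/points.csv")
points = raw.map(lambda line: [float(x) for x in line.split(",")])

# Without this explicit hint, every iteration below would recompute the
# RDD's lineage (re-read and re-parse the input). With it, Spark keeps the
# parsed RDD in distributed memory and evicts it under LRU when memory fills.
points.persist(StorageLevel.MEMORY_ONLY)

for _ in range(10):
    # Each action launches a job that reuses the same intermediate RDD.
    total = points.map(lambda p: p[0]).reduce(lambda a, b: a + b)

points.unpersist()
sc.stop()

The paper's contribution is to make this persist/unpersist choice automatically, by analyzing the DAG of computations to decide which intermediate outputs are worth keeping in memory, rather than relying on the programmer's judgment and LRU eviction.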
