IEEE International Conference on Cloud Computing

Intermediate Data Caching Optimization for Multi-Stage and Parallel Big Data Frameworks

Abstract

In the era of big data and cloud computing, large amounts of data are generated by user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such data processing at scale. Specifically, Spark leverages distributed memory to cache intermediate results, represented as Resilient Distributed Datasets (RDDs). This gives Spark an advantage over other parallel frameworks when implementing iterative machine learning and data mining algorithms, by avoiding repeated computation or hard-disk accesses to retrieve RDDs. By default, caching decisions are left to the programmer's discretion, and the LRU policy is used to evict RDDs when the cache is full. However, when the objective is to minimize total work, LRU is woefully inadequate, leading to arbitrarily suboptimal caching decisions. In this paper, we design an algorithm for multi-stage big data processing platforms to adaptively determine and cache the most valuable intermediate datasets that can be reused in the future. Our solution automates the decision of which RDDs to cache: this amounts to identifying the nodes in a directed acyclic graph (DAG) of computations whose outputs should persist in memory. Our experimental results show that the proposed cache optimization solution improves the performance of machine learning applications on Spark, decreasing the total work to recompute RDDs by 12%.
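For context, the caching decision that the paper automates is normally made by hand in Spark code. The following is a minimal PySpark sketch (the input path, application name, and iteration count are illustrative assumptions, not taken from the paper): an intermediate RDD reused across iterations is explicitly persisted so that Spark does not recompute its lineage on every action, and is evicted under the default LRU policy only when memory fills.

from pyspark import SparkContext, StorageLevel

# Minimal sketch of the manual caching decision described above; the input
# path, app name, and iteration count are illustrative assumptions.
sc = SparkContext(appName="manual-rdd-caching-sketch")

raw = sc.textFile("hdfs:///data/points.csv")
points = raw.map(lambda line: [float(x) for x in line.split(",")])

# Without this explicit hint, every iteration below would recompute the
# RDD's lineage (re-read and re-parse the input). With it, Spark keeps the
# parsed RDD in distributed memory and evicts it under LRU when memory fills.
points.persist(StorageLevel.MEMORY_ONLY)

for _ in range(10):
    # Each action launches a job that reuses the same intermediate RDD.
    total = points.map(lambda p: p[0]).reduce(lambda a, b: a + b)

points.unpersist()
sc.stop()

The paper's contribution is to make this persist/unpersist choice automatically, by analyzing the DAG of computations to decide which intermediate outputs are worth keeping in memory, rather than relying on the programmer's judgment and LRU eviction.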
