
Lifetime-Based Memory Management for Distributed Data Processing Systems

Abstract

In-memory caching of intermediate data and eager combining of data in shuffle buffers have been shown to be very effective at minimizing re-computation and I/O costs in distributed data processing systems such as Spark and Flink. However, it has also been widely reported that these techniques create a large number of long-living data objects in the heap, which can quickly saturate the garbage collector, especially when handling a large dataset, and hence limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetime of the data objects and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca, a concrete implementation of our proposal on top of Spark, which transparently decomposes objects and groups those with similar lifetimes into byte arrays, releasing their space all at once when their lifetimes come to an end. An extensive experimental study using both synthetic and real datasets shows that, compared to Spark, Deca is able to 1) reduce the garbage collection time by up to 99.9%, 2) achieve up to 22.7x speedup in execution time in cases without data spilling and 41.6x speedup in cases with data spilling, and 3) consume up to 46.6% less memory.
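The idea of decomposing objects into byte arrays and releasing them together can be illustrated with a minimal Java sketch. This is not Deca's actual code or API: the class `PointPage`, its methods, and the fixed two-field record layout are hypothetical names chosen for illustration. The sketch assumes every record consists of two `double` fields and that all records in one page share the same lifetime, so the garbage collector sees a single backing array instead of one heap object per record.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not Deca's real implementation.
// Stores (x, y) records as raw bytes in one buffer so that the heap
// holds one backing array rather than one object per record.
final class PointPage {
    // Each record is two doubles: 16 bytes.
    private static final int RECORD_BYTES = 2 * Double.BYTES;

    private ByteBuffer page;
    private int count = 0;

    PointPage(int capacity) {
        // One allocation covers every record that will share this lifetime.
        page = ByteBuffer.allocate(capacity * RECORD_BYTES);
    }

    // Appends a record; throws BufferOverflowException if the page is full.
    void append(double x, double y) {
        page.putDouble(x).putDouble(y);
        count++;
    }

    // Field accessors read directly from the bytes; no object is created.
    double x(int i) {
        return page.getDouble(i * RECORD_BYTES);
    }

    double y(int i) {
        return page.getDouble(i * RECORD_BYTES + Double.BYTES);
    }

    int size() {
        return count;
    }

    // Example computation over the packed records.
    double sumX() {
        double sum = 0.0;
        for (int i = 0; i < count; i++) {
            sum += x(i);
        }
        return sum;
    }

    // End of the shared lifetime: dropping the single reference makes
    // the space of all records reclaimable at once.
    void release() {
        page = null;
        count = 0;
    }

    boolean isReleased() {
        return page == null;
    }
}
```

A caller would create a page with `new PointPage(n)`, call `append` for each record, read fields through `x(i)` and `y(i)`, and call `release()` once the cached or shuffled data is no longer needed. How lifetimes are inferred from user-defined functions and data types is the analysis the abstract describes and is not modeled here.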