Conference: International Conference on Very Large Data Bases

Lifetime-Based Memory Management for Distributed Data Processing Systems

Abstract

In-memory caching of intermediate data and eager combining of data in shuffle buffers have been shown to be very effective in minimizing the re-computation and I/O cost in distributed data processing systems like Spark and Flink. However, it has also been widely reported that these techniques create a large number of long-lived data objects in the heap, which may quickly saturate the garbage collector, especially when handling a large dataset, and hence limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework, which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetime of the data objects, and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca, a concrete implementation of our proposal on top of Spark, which transparently decomposes objects, groups those with similar lifetimes into byte arrays, and releases their space all at once when their lifetimes come to an end. An extensive experimental study using both synthetic and real datasets shows that, compared to Spark, Deca is able to 1) reduce the garbage collection time by up to 99.9%, 2) achieve up to 22.7x speedup in execution time in cases without data spilling and 41.6x speedup in cases with data spilling, and 3) consume up to 46.6% less memory.
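The grouping idea described in the abstract can be illustrated with a minimal sketch. This is not Deca's implementation, which operates on JVM objects inside Spark; it is a Python analogy that only shows the layout principle. The class name `LifetimeRegion`, its methods, and the fixed `(int64, float64)` record layout are all hypothetical choices made for this example.

```python
import struct

# Hypothetical fixed-size record layout: (key: int64, value: float64).
RECORD = struct.Struct("<qd")


class LifetimeRegion:
    """Holds records that share one lifetime in a single byte buffer.

    Illustrative only: instead of one heap object per record, the fields
    are packed into one bytearray, so the collector tracks a single
    object, and all records are dropped together by release().
    """

    def __init__(self, capacity):
        self._buf = bytearray(capacity * RECORD.size)
        self._count = 0

    def append(self, key, value):
        if self._buf is None:
            raise RuntimeError("region already released")
        offset = self._count * RECORD.size
        if offset + RECORD.size > len(self._buf):
            raise MemoryError("region full")
        # Decompose the record into raw bytes at the next free slot.
        RECORD.pack_into(self._buf, offset, key, value)
        self._count += 1

    def __len__(self):
        return self._count

    def records(self):
        if self._buf is None:
            raise RuntimeError("region already released")
        for i in range(self._count):
            yield RECORD.unpack_from(self._buf, i * RECORD.size)

    def release(self):
        # End of the shared lifetime: drop the whole buffer at once.
        self._buf = None
        self._count = 0


if __name__ == "__main__":
    region = LifetimeRegion(capacity=4)
    region.append(1, 0.5)
    region.append(2, 1.5)
    print(sum(value for _, value in region.records()))
    region.release()
```

The point of the design is that the number of objects visible to the garbage collector stays constant regardless of how many records are cached. Determining which records actually share a lifetime is the part the paper attributes to automatic analysis of user-defined functions and data types, and it is not modeled here.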
