Lifetime-Based Memory Management for Distributed Data Processing Systems


Abstract

In-memory caching of intermediate data and eager combining of data in shuffle buffers have been shown to be very effective in minimizing the re-computation and I/O cost in distributed data processing systems like Spark and Flink. However, it has also been widely reported that these techniques create a large number of long-living data objects in the heap, which may quickly saturate the garbage collector, especially when handling a large dataset, and hence limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework, which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetime of the data objects, and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca, a concrete implementation of our proposal on top of Spark, which transparently decomposes and groups objects with similar lifetimes into byte arrays and releases their space altogether when their lifetimes come to an end. An extensive experimental study using both synthetic and real datasets shows that, compared to Spark, Deca is able to 1) reduce the garbage collection time by up to 99.9%, 2) achieve up to 22.7x speedup in execution time in cases without data spilling and 41.6x speedup in cases with data spilling, and 3) consume up to 46.6% less memory.
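The idea of grouping objects with similar lifetimes into byte arrays can be illustrated with a minimal sketch. This is not Deca's implementation (Deca runs on the JVM inside Spark); the `LifetimePage` class, the fixed `(int64, float64)` record layout, and the method names are all hypothetical, chosen only to show fields being written into one shared buffer and that buffer being released in a single step.

```python
import struct

# Hypothetical record layout (an assumption): int64 key + float64 value, 16 bytes.
RECORD = struct.Struct("<qd")


class LifetimePage:
    """Packs fixed-size records that share one lifetime into a single bytearray."""

    def __init__(self, capacity):
        # One contiguous buffer instead of one heap object per record.
        self.buf = bytearray(capacity * RECORD.size)
        self.count = 0

    def append(self, key, value):
        # Write the fields straight into the buffer; no per-record object is kept.
        RECORD.pack_into(self.buf, self.count * RECORD.size, key, value)
        self.count += 1

    def get(self, i):
        if not 0 <= i < self.count:
            raise IndexError(i)
        # Decode the record's fields on demand from its fixed offset.
        return RECORD.unpack_from(self.buf, i * RECORD.size)

    def release(self):
        # End of the shared lifetime: drop the whole buffer at once.
        self.buf = None
        self.count = 0


page = LifetimePage(capacity=3)
for k, v in [(1, 0.5), (2, 1.5), (3, 2.5)]:
    page.append(k, v)
total = sum(page.get(i)[1] for i in range(page.count))
page.release()
```

In this sketch the memory manager only ever sees one buffer per group of records, so the work of reclaiming the group does not grow with the number of records it holds.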


Bibliographic details

Source
International Conference on Very Large Data Bases | 2016 | pp. 936-947 | 12 pages
Authors
Lu Lu; Xuanhua Shi; Yongluan Zhou; Xiong Zhang; Hai Jin; Cheng Pei; Ligang He; Yuanzhen Geng
Original format: PDF

