Deca: A Garbage Collection Optimizer for In-Memory Data Processing

Shi Xuanhua; Ke Zhixiang; Zhou Yongluan; Jin Hai; Lu Lu; Zhang Xiong; He Ligang; Hu Zhenyu; Wang Fei

Abstract

In-memory caching of intermediate data and active combining of data in shuffle buffers have been shown to be very effective in minimizing recomputation and I/O cost in big data processing systems such as Spark and Flink. However, it has also been widely reported that these techniques create large numbers of long-lived data objects in the heap. These objects may quickly saturate the garbage collector, especially when handling a large dataset, and hence limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework, which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetime of the data objects and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca, a concrete implementation of our proposal on top of Spark, which transparently decomposes and groups objects with similar lifetimes into byte arrays and releases their space all at once when their lifetimes come to an end. When systems are processing very large data, Deca also provides field-oriented memory pages to ensure high compression efficiency. Extensive experimental studies using both synthetic and real datasets show that, compared to Spark, Deca is able to (1) reduce the garbage collection time by up to 99.9%, (2) reduce the memory consumption by up to 46.6% and the storage space by 23.4%, (3) achieve 1.2x to 22.7x speedup in terms of execution time in cases without data spilling and 16x to 41.6x speedup in cases with data spilling, and (4) provide performance similar to that of domain-specific systems.
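
The idea of grouping same-lifetime objects into a byte array can be illustrated with a minimal sketch. This is not Deca's code (Deca works on Spark and the JVM); the `LifetimePage` class, its method names, and the two-double record layout are hypothetical, chosen only to show how many small records can live in one buffer that is released in a single step rather than as many individually tracked heap objects.

```python
import struct

# Hypothetical record layout (an assumption for illustration):
# two little-endian 8-byte doubles per record, e.g. a 2-D point.
RECORD = struct.Struct("<dd")


class LifetimePage:
    """Packs records that share a lifetime into one bytearray."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._count = 0
        # One allocation for all records instead of one object per record.
        self._buf = bytearray(capacity * RECORD.size)

    def append(self, x, y):
        if self._buf is None:
            raise ValueError("page already released")
        if self._count == self._capacity:
            raise OverflowError("page is full")
        # Write the record's fields directly into the shared buffer.
        RECORD.pack_into(self._buf, self._count * RECORD.size, x, y)
        self._count += 1

    def get(self, i):
        if self._buf is None:
            raise ValueError("page already released")
        if not 0 <= i < self._count:
            raise IndexError(i)
        # Decode the fields on demand; no per-record object is kept alive.
        return RECORD.unpack_from(self._buf, i * RECORD.size)

    def __len__(self):
        return self._count

    def release(self):
        # All records end their lifetime together: drop the single buffer.
        self._buf = None
        self._count = 0


page = LifetimePage(capacity=3)
page.append(1.0, 2.0)
page.append(3.5, -4.0)
```

In this sketch the collector only ever sees one buffer per page, however many records it holds, which is the effect the abstract attributes to lifetime-based grouping.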

Bibliographic details

Source
ACM Transactions on Computer Systems | 2018, Issue 1 | pp. 3.1-3.47 | 47 pages
Authors
Shi Xuanhua; Ke Zhixiang; Zhou Yongluan; Jin Hai; Lu Lu; Zhang Xiong; He Ligang; Hu Zhenyu; Wang Fei
Author affiliations

Huazhong Univ Sci & Technol, Serv Comp Technol & Syst Lab, Big Data Technol & Syst Lab, Sch Comp Sci & Technol, 1037 Luoyu Rd, Wuhan 430074, Hubei, Peoples R China;

Huazhong Univ Sci & Technol, Serv Comp Technol & Syst Lab, Big Data Technol & Syst Lab, Sch Comp Sci & Technol, 1037 Luoyu Rd, Wuhan 430074, Hubei, Peoples R China;

Univ Copenhagen, Dept Comp Sci, DK-2100 Copenhagen, Denmark;

Huazhong Univ Sci & Technol, Serv Comp Technol & Syst Lab, Big Data Technol & Syst Lab, Sch Comp Sci & Technol, 1037 Luoyu Rd, Wuhan 430074, Hubei, Peoples R China;

Alibaba Grp, Hangzhou, Zhejiang, Peoples R China;

Huazhong Univ Sci & Technol, Wuhan, Hubei, Peoples R China;

Univ Warwick, Dept Comp Sci, Coventry CV4 7AL, W Midlands, England;

Huazhong Univ Sci & Technol, Wuhan, Hubei, Peoples R China;

Huazhong Univ Sci & Technol, Wuhan, Hubei, Peoples R China;

Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
Original format: PDF
Language: English
Keywords
Data processing system; distributed system; garbage collection; in-memory; memory management;
