
Deca: A Garbage Collection Optimizer for In-Memory Data Processing



Abstract

In-memory caching of intermediate data and active combining of data in shuffle buffers have been shown to be very effective in minimizing the recomputation and I/O cost in big data processing systems such as Spark and Flink. However, it has also been widely reported that these techniques create a large number of long-lived data objects in the heap. These objects may quickly saturate the garbage collector, especially when handling a large dataset, and hence limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework, which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetime of the data objects and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca, a concrete implementation of our proposal on top of Spark, which transparently decomposes and groups objects with similar lifetimes into byte arrays and releases their space all at once when their lifetimes come to an end. When systems are processing very large data, Deca also provides field-oriented memory pages to ensure high compression efficiency. Extensive experimental studies using both synthetic and real datasets show that, compared to Spark, Deca is able to (1) reduce the garbage collection time by up to 99.9%, (2) reduce the memory consumption by up to 46.6% and the storage space by 23.4%, (3) achieve 1.2x to 22.7x speedup in terms of execution time in cases without data spilling and 16x to 41.6x speedup in cases with data spilling, and (4) provide similar performance compared to domain-specific systems.
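The core idea in the abstract, packing many objects with a shared lifetime into one byte array and freeing them together, can be illustrated with a minimal sketch. This is not Deca's implementation (Deca runs on the JVM on top of Spark). The record layout, the `LifetimePage` class, and all of its method names are hypothetical and chosen only for illustration. The sketch assumes fixed-size records made of one 4-byte integer and one 8-byte float.

```python
import struct

# Hypothetical fixed-size record layout: a 4-byte int key followed by an
# 8-byte float value, little-endian with no padding (12 bytes per record).
RECORD = struct.Struct("<id")


class LifetimePage:
    """Holds many records in one bytearray so that they share one lifetime.

    Illustrative only: instead of one heap object per record, the fields are
    written into a single buffer, and the whole group is dropped at once.
    """

    def __init__(self, capacity):
        self._capacity = capacity
        self._count = 0
        self._buf = bytearray(capacity * RECORD.size)

    def __len__(self):
        return self._count

    def append(self, key, value):
        """Decompose a (key, value) record into raw bytes inside the page."""
        if self._buf is None:
            raise ValueError("page already released")
        if self._count == self._capacity:
            raise OverflowError("page is full")
        RECORD.pack_into(self._buf, self._count * RECORD.size, key, value)
        self._count += 1

    def get(self, index):
        """Rebuild the record stored at the given position."""
        if self._buf is None:
            raise ValueError("page already released")
        if not 0 <= index < self._count:
            raise IndexError(index)
        return RECORD.unpack_from(self._buf, index * RECORD.size)

    def release(self):
        """End the shared lifetime: drop the single buffer for all records."""
        self._buf = None
        self._count = 0


page = LifetimePage(capacity=3)
page.append(1, 0.5)
page.append(2, 1.5)
print(page.get(1))  # (2, 1.5)
page.release()      # all records become reclaimable together
```

In this sketch the memory manager only has to track one buffer per group rather than one object per record, which is the property the abstract credits for the reduced garbage collection overhead.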

Bibliographic details

  • Source
    ACM Transactions on Computer Systems | 2018, Issue 1 | 3.1-3.47 | 47 pages
  • Author affiliations

    Huazhong Univ Sci & Technol, Serv Comp Technol & Syst Lab, Big Data Technol & Syst Lab, Sch Comp Sci & Technol, 1037 Luoyu Rd, Wuhan 430074, Hubei, Peoples R China;

    Huazhong Univ Sci & Technol, Serv Comp Technol & Syst Lab, Big Data Technol & Syst Lab, Sch Comp Sci & Technol, 1037 Luoyu Rd, Wuhan 430074, Hubei, Peoples R China;

    Univ Copenhagen, Dept Comp Sci, DK-2100 Copenhagen, Denmark;

    Huazhong Univ Sci & Technol, Serv Comp Technol & Syst Lab, Big Data Technol & Syst Lab, Sch Comp Sci & Technol, 1037 Luoyu Rd, Wuhan 430074, Hubei, Peoples R China;

    Alibaba Grp, Hangzhou, Zhejiang, Peoples R China;

    Huazhong Univ Sci & Technol, Wuhan, Hubei, Peoples R China;

    Univ Warwick, Dept Comp Sci, Coventry CV4 7AL, W Midlands, England;

    Huazhong Univ Sci & Technol, Wuhan, Hubei, Peoples R China;

    Huazhong Univ Sci & Technol, Wuhan, Hubei, Peoples R China;

  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Language of text: English
  • Keywords

    Data processing system; distributed system; garbage collection; in-memory; memory management;

