International Conference on Very Large Data Bases

Understanding Insights into the Basic Structure and Essential Issues of Table Placement Methods in Clusters

Abstract

A table placement method is a critical component of big data analytics on distributed systems. It determines how the data values in a two-dimensional table are organized and stored in the underlying cluster. Several table placement methods have been proposed and implemented for Hadoop computing environments. However, no comprehensive and systematic study has been done to understand, compare, and evaluate them. It is therefore highly desirable to gain insight into the basic structure and essential issues of table placement methods in the context of big data processing infrastructures. In this paper, we present such a study. The basic structure of a table placement method consists of three core operations: row reordering, table partitioning, and data packing. All existing placement methods are formed from these core operations, with variations determined by three key factors: (1) the size of a horizontal logical subset of a table (the row group size), (2) the function that maps columns to column groups, and (3) the function that packs the columns or column groups of a row group into physical blocks. We have designed and implemented a benchmarking tool to show how variations in each factor affect the I/O performance of reading a table stored by a table placement method. Based on our results, we suggest actions for optimizing table reading performance. Results from large-scale experiments confirm that our findings also hold for production workloads. Finally, we present ORC File as a case study to show the effectiveness of our findings and suggested actions.
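
The three core operations and three key factors named in the abstract can be illustrated with a small in-memory sketch. This is not the paper's benchmarking tool or any real file format: the function name `place_table`, its parameters, and the dict-based "block" are hypothetical, the row group size is counted in rows rather than bytes for simplicity, and the packing function is assumed to be "one block per column group".

```python
from typing import Callable, Dict, List, Optional, Sequence

Row = Dict[str, object]
Block = Dict[str, List[object]]  # hypothetical block: column name -> column values


def place_table(
    rows: Sequence[Row],
    row_group_size: int,
    column_groups: List[List[str]],
    sort_key: Optional[Callable[[Row], object]] = None,
) -> List[List[Block]]:
    """Toy placement: reorder rows, partition the table, pack into blocks."""
    # Core operation 1, row reordering (optional): a plain sort stands in here.
    ordered = sorted(rows, key=sort_key) if sort_key else list(rows)

    layout: List[List[Block]] = []
    # Core operation 2, table partitioning.
    # Factor 1: the row group size (assumed here to be a row count).
    for start in range(0, len(ordered), row_group_size):
        row_group = ordered[start:start + row_group_size]
        blocks: List[Block] = []
        # Factor 2: the mapping from columns to column groups.
        for cols in column_groups:
            # Core operation 3, data packing.
            # Factor 3 (assumed): each column group of a row group
            # becomes one block, stored column by column.
            blocks.append({c: [r[c] for r in row_group] for c in cols})
        layout.append(blocks)
    return layout


if __name__ == "__main__":
    rows = [
        {"id": 3, "name": "c", "score": 0.3},
        {"id": 1, "name": "a", "score": 0.1},
        {"id": 2, "name": "b", "score": 0.2},
    ]
    layout = place_table(rows, 2, [["id"], ["name", "score"]],
                         sort_key=lambda r: r["id"])
    for i, blocks in enumerate(layout):
        print("row group", i, blocks)
```

In this sketch, a query that reads only `id` touches one block per row group. Changing `row_group_size` or `column_groups` changes how many blocks such a read must visit, which is the kind of variation the abstract says the benchmarking tool measures.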
