A survey of data partitioning and sampling methods to support big data analysis


Abstract

Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.
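
The three classical horizontal partitioning schemes and reservoir sampling named in the abstract can be illustrated with a minimal in-memory sketch. This is not code from the surveyed paper: the function names, the use of CRC32 as the hash function, the round-robin dealing in `random_partition`, and the fixed seeds are all assumptions made for illustration, and a real cluster would apply these ideas to distributed blocks rather than Python lists.

```python
import random
from bisect import bisect_right
from zlib import crc32


def range_partition(records, key, boundaries):
    """Assign each record to a block by comparing its key with sorted boundaries."""
    blocks = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        blocks[bisect_right(boundaries, key(rec))].append(rec)
    return blocks


def hash_partition(records, key, n_blocks):
    """Assign each record to a block by a deterministic hash of its key."""
    blocks = [[] for _ in range(n_blocks)]
    for rec in records:
        # CRC32 is an illustrative choice; any stable hash would do.
        h = crc32(repr(key(rec)).encode("utf-8"))
        blocks[h % n_blocks].append(rec)
    return blocks


def random_partition(records, n_blocks, seed=0):
    """Shuffle the records, then deal them round-robin into blocks."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_blocks] for i in range(n_blocks)]


def reservoir_sample(stream, k, seed=0):
    """Algorithm R: uniform sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item  # replace with probability k / (i + 1)
    return reservoir


if __name__ == "__main__":
    data = list(range(100))
    print([len(b) for b in range_partition(data, lambda x: x, [25, 50, 75])])
    print([len(b) for b in hash_partition(data, lambda x: x, 4)])
    print([len(b) for b in random_partition(data, 4)])
    print(reservoir_sample(iter(data), 5))
```

In this sketch, range and hash partitioning group records by key value, so a single block is generally not representative of the whole data set, whereas each block produced by `random_partition` is a random subset of the records. That contrast is one way to read the abstract's remark that block-level sampling depends on how the blocks were generated.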

Bibliographic information

  • Source
    Big Data Mining and Analytics | 2020, Issue 2 | pp. 85-101 | 17 pages
  • Author affiliations

    National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China, and Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China (listed five times in the source record)

  • Indexing information
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification
  • Keywords

    Big Data; Distributed databases; Computational modeling; Data models; Computer architecture; Sampling methods;

  • Date indexed: 2022-08-18 22:10:35
