RRPlib: A spark library for representing HDFS blocks as a set of random sample data blocks


Abstract

Analyzing big data is a challenging problem in cluster computing systems, especially when the data volume exceeds the available computing resources. Sampling is the favored solution for such problems: it summarizes or reduces the amount of data while taking into account the statistical characteristics of the data distribution. However, the traditional method of sampling massive data record by record is computationally expensive, because it requires a full scan of the whole data set. If instead the massive data is partitioned into a set of data blocks, each of which is itself a random sample data block, then selecting some blocks as a sample (or as several different samples) is computationally cheaper. The main purpose of the software described in this paper is to represent HDFS blocks as a set of random sample data blocks that are also stored in HDFS. Our empirical results show that the performance of the partitioning operation is acceptable in real applications, especially since this operation is performed only once, so that analysis of terabyte-scale data becomes more natural. (C) 2019 Elsevier B.V. All rights reserved.
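The idea in the abstract can be illustrated with a minimal in-memory sketch. It is not the RRPlib API and does not use Spark or HDFS; the function names `to_random_sample_blocks` and `sample_blocks` are hypothetical, and the redistribution scheme (shuffle each input block, then deal its records evenly across the output blocks) is an assumption about how such a partition could be built, since the abstract does not spell out the algorithm.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def to_random_sample_blocks(
    blocks: Sequence[Sequence[T]], num_out: int, seed: int = 0
) -> List[List[T]]:
    """Redistribute records so every output block draws from every input block.

    Hypothetical helper: each input block is shuffled locally, then its
    records are dealt across `num_out` output blocks. This is the one-time,
    full-scan partitioning step.
    """
    rng = random.Random(seed)
    out: List[List[T]] = [[] for _ in range(num_out)]
    for block in blocks:
        shuffled = list(block)
        rng.shuffle(shuffled)  # randomize record order inside this block
        for j in range(num_out):
            # every num_out-th record of the shuffled block goes to output j
            out[j].extend(shuffled[j::num_out])
    return out


def sample_blocks(rsp_blocks: Sequence[List[T]], k: int, seed: int = 0) -> List[List[T]]:
    """Pick k whole blocks as a sample; no record-level scan is needed."""
    rng = random.Random(seed)
    return rng.sample(list(rsp_blocks), k)


# Usage: four "HDFS-like" blocks whose contents are ordered, hence not random samples.
original = [list(range(i * 100, (i + 1) * 100)) for i in range(4)]
rsp = to_random_sample_blocks(original, num_out=5, seed=42)
sample = sample_blocks(rsp, k=2, seed=7)
print([len(b) for b in rsp], len(sample))
```

The cost structure matches the abstract's argument: the expensive pass over all records happens once in `to_random_sample_blocks`, after which drawing a sample only requires choosing block identifiers.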

Bibliographic details

  • Source
    Science of Computer Programming | 2019, Issue 1 | 102301.1-102301.7 | 7 pages
  • Author affiliations

    Shenzhen Univ Coll Comp Sci & Software Engn Big Data Inst Shenzhen 518060 Guangdong Peoples R China|Shenzhen Univ Natl Engn Lab Big Data Syst Comp Technol Shenzhen 518060 Guangdong Peoples R China|Higher Inst Engn & Technol Kafrelsheikh Kafrelsheikh Egypt;

    Shenzhen Univ Coll Comp Sci & Software Engn Big Data Inst Shenzhen 518060 Guangdong Peoples R China|Shenzhen Univ Natl Engn Lab Big Data Syst Comp Technol Shenzhen 518060 Guangdong Peoples R China;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Original format: PDF
  • Language: English
  • Keywords

    HDFS; Random sample; Data partitioning; Distributed systems;

