Conference: International Conference on Cloud Computing

A Two-Stage Data Processing Algorithm to Generate Random Sample Partitions for Big Data Analysis


Abstract

To enable the individual data block files of a distributed big data set to be used as random samples for big data analysis, this paper proposes a two-stage data processing (TSDP) algorithm that converts a big data set into a random sample partition (RSP) representation. The RSP representation ensures that each individual data block is a random sample of the big data set and can therefore be used to estimate its statistical properties. The first stage sequentially chunks the big data set into non-overlapping subsets and distributes these subsets as data block files to the nodes of a cluster. The second stage takes a random sample from each subset without replacement to form a new subset, which is saved as an RSP data block file. This sampling step is repeated until all data records in all subsets are used up, and the resulting set of RSP data block files forms an RSP of the big data set. It is formally proved that the expectation of the sample distribution function (s.d.f.) of each RSP data block equals the s.d.f. of the big data set; therefore, each RSP data block is a random sample of the big data set. An implementation of the TSDP algorithm on Apache Spark and HDFS is presented. Performance evaluations on terabyte data sets show the efficiency of this algorithm in converting HDFS big data files into HDFS RSP big data files. We also show an example that uses only a small number of RSP data blocks to build ensemble models that perform better than the single model built from the entire data set.
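
The two stages described in the abstract can be illustrated with a minimal in-memory sketch. This is not the paper's Spark/HDFS implementation: the function name `tsdp_partition`, its parameters, and the choice to shuffle each chunk once and slice it (which is equivalent to repeatedly drawing samples without replacement until the chunk is exhausted) are assumptions made here for illustration only.

```python
import random


def tsdp_partition(records, num_chunks, num_rsp_blocks, seed=0):
    """Hypothetical sketch of a two-stage partition into RSP-style blocks.

    records        -- list of data records (kept in memory for illustration)
    num_chunks     -- number of sequential, non-overlapping subsets (stage 1)
    num_rsp_blocks -- number of output blocks (stage 2)
    """
    rng = random.Random(seed)
    n = len(records)

    # Stage 1: sequentially chunk the data into non-overlapping subsets.
    chunk_size = -(-n // num_chunks)  # ceiling division
    chunks = [records[i:i + chunk_size] for i in range(0, n, chunk_size)]

    # Stage 2: from every chunk, draw one sample without replacement per
    # output block. Shuffling the chunk and taking strided slices draws
    # all of these samples at once and uses every record exactly once.
    rsp_blocks = [[] for _ in range(num_rsp_blocks)]
    for chunk in chunks:
        shuffled = list(chunk)
        rng.shuffle(shuffled)
        for k in range(num_rsp_blocks):
            rsp_blocks[k].extend(shuffled[k::num_rsp_blocks])
    return rsp_blocks


if __name__ == "__main__":
    # Sorted input: a sequential chunk alone would be a biased sample,
    # while each output block draws from every chunk.
    data = list(range(1000))
    blocks = tsdp_partition(data, num_chunks=10, num_rsp_blocks=5)
    print("global mean:", sum(data) / len(data))
    for i, block in enumerate(blocks):
        print("block", i, "size", len(block), "mean", sum(block) / len(block))
```

Because each output block receives records from every stage-1 chunk, its mean should land near the global mean even when the input is sorted, which is the property the abstract relies on when it treats a single block as a sample of the whole data set.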
