A Two-Stage Data Processing Algorithm to Generate Random Sample Partitions for Big Data Analysis

Abstract

To enable the individual data block files of a distributed big data set to be used as random samples for big data analysis, this paper proposes a two-stage data processing (TSDP) algorithm that converts a big data set into a random sample partition (RSP) representation. The RSP representation ensures that each individual data block is a random sample of the big data and can therefore be used to estimate the statistical properties of the big data. The first stage of the algorithm sequentially chunks the big data set into non-overlapping subsets and distributes these subsets as data block files to the nodes of a cluster. The second stage takes a random sample from each subset without replacement to form a new subset, which is saved as an RSP data block file. The random sampling step is repeated until all data records in all subsets are used up and a new set of RSP data block files has been created to form an RSP of the big data. It is formally proved that the expectation of the sample distribution function (s.d.f.) of each RSP data block equals the s.d.f. of the big data set; therefore, each RSP data block is a random sample of the big data set. An implementation of the TSDP algorithm on Apache Spark and HDFS is presented. Performance evaluations on terabyte data sets show the efficiency of this algorithm in converting HDFS big data files into HDFS RSP big data files. We also show an example that uses only a small number of RSP data blocks to build ensemble models which perform better than the single model built from the entire data set.
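The two stages described in the abstract can be illustrated with a minimal in-memory sketch. This is not the paper's Spark/HDFS implementation; the function name tsdp_partition and the parameters num_chunks, num_rsp_blocks and seed are hypothetical names chosen for illustration, and the sketch assumes the whole data set fits in a Python list.

```python
import random


def tsdp_partition(records, num_chunks, num_rsp_blocks, seed=0):
    """In-memory sketch of the two-stage idea (hypothetical helper, not the
    authors' Spark/HDFS code). Returns a list of RSP-style blocks."""
    rng = random.Random(seed)
    n = len(records)

    # Stage 1: sequentially chunk the data into non-overlapping subsets.
    chunk_size = max(1, -(-n // num_chunks))  # ceiling division, at least 1
    chunks = [records[i:i + chunk_size] for i in range(0, n, chunk_size)]

    # Stage 2: from every chunk, draw records without replacement and append
    # them to each new block, until every record has been used exactly once.
    rsp_blocks = [[] for _ in range(num_rsp_blocks)]
    for chunk in chunks:
        shuffled = list(chunk)
        rng.shuffle(shuffled)  # random permutation of this chunk
        for k in range(num_rsp_blocks):
            # A strided slice of a random permutation is a random sample
            # drawn without replacement; the slices are disjoint.
            rsp_blocks[k].extend(shuffled[k::num_rsp_blocks])
    return rsp_blocks


if __name__ == "__main__":
    # Sorted input is the worst case for using raw sequential chunks as samples.
    data = list(range(1000))
    blocks = tsdp_partition(data, num_chunks=10, num_rsp_blocks=5, seed=42)
    overall_mean = sum(data) / len(data)
    for idx, block in enumerate(blocks):
        print(idx, len(block), sum(block) / len(block), overall_mean)
```

Because every new block receives records from every original chunk, the mean of a single block should land close to the mean of the full data set even when the input is sorted, which is the property the abstract relies on when it uses one block to estimate statistics of the whole.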

Bibliographic details

Source
International Conference on Cloud Computing | 2018 | 420p | 18 pages in total
Authors
Chenghao Wei; Salman Salloum; Tamer Z. Emara; Xiaoliang Zhang; Joshua Zhexue Huang; Yulin He
Original format: PDF
Chinese Library Classification: TP393-53
Keywords
Big data analysis; Random sample partition; RSP HDFS; Apache Spark
