2016 IEEE International Conferences on Big Data and Cloud Computing, Social Computing and Networking, Sustainable Computing and Communication

Visualization and Adaptive Subsetting of Earth Science Data in HDFS: A Novel Data Analysis Strategy with Hadoop and Spark



Abstract

Data analytics is becoming increasingly important in big data applications. Adaptively subsetting large volumes of data to extract events of interest, such as hurricane or thunderstorm centers, and then statistically analyzing and visualizing the subset, is an effective way to analyze ever-growing data. This is particularly crucial for analyzing Earth Science data, such as extreme weather. The Hadoop ecosystem (i.e., HDFS, MapReduce, Hive) provides a cost-efficient big data management environment and is being explored for analyzing big Earth Science data. Our study investigates the potential of a MapReduce-like paradigm to perform statistical calculations, and uses the calculated results to subset and visualize data in a scalable and efficient way. RHadoop and SparkR are deployed to enable R to access and process data in parallel with Hadoop and Spark, respectively. The standard R libraries and tools are used to create and manipulate images. Statistical calculations, such as maximum and average variable values, are carried out with R or SQL. We have developed a strategy that performs querying and visualization within a single phase, and thus significantly improves overall performance in a scalable way. The technical challenges and limitations of both the Hadoop and Spark platforms for R are also discussed.
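The paper's pipeline runs on RHadoop and SparkR; as a language-agnostic illustration only, the core idea of computing global statistics in a MapReduce-like pass and then using them to adaptively subset the data can be sketched in plain Python (the sample records, variable name, and threshold rule below are hypothetical, not taken from the paper):

```python
from functools import reduce

# Hypothetical sample of gridded records: (lat, lon, wind_speed).
records = [
    (25.0, -80.0, 12.0), (25.5, -80.5, 45.0),
    (26.0, -81.0, 63.0), (26.5, -81.5, 38.0),
    (30.0, -85.0, 8.0),  (31.0, -86.0, 5.0),
]

# "Map": project out the variable of interest; "reduce": global statistics.
values = [w for _, _, w in records]
v_max = reduce(max, values)
v_avg = sum(values) / len(values)

# Adaptive subsetting: keep only cells near the detected extreme
# (e.g., a hurricane center), using a threshold derived from the
# statistics just computed, rather than a fixed cutoff.
threshold = (v_max + v_avg) / 2
subset = [r for r in records if r[2] >= threshold]
```

In the paper's setting, the statistics and the subsetting filter would run in parallel over HDFS blocks, and the resulting subset would feed directly into R's plotting tools within the same phase.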


