Journal: International Journal of Digital Earth

A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data



Abstract

Earth observations and model simulations are generating big multidimensional array-based raster data. However, it is difficult to efficiently query these big raster data due to the inconsistency among the geospatial raster data model, distributed physical data storage model, and the data pipeline in distributed computing frameworks. To efficiently process big geospatial data, this paper proposes a three-layer hierarchical indexing strategy to optimize Apache Spark with Hadoop Distributed File System (HDFS) from the following aspects: (1) improve I/O efficiency by adopting the chunking data structure; (2) keep the workload balance and high data locality by building the global index (k-d tree); (3) enable Spark and HDFS to natively support geospatial raster data formats (e.g., HDF4, NetCDF4, GeoTiff) by building the local index (hash table); (4) index the in-memory data to further improve geospatial data queries; (5) develop a data repartition strategy to tune the query parallelism while keeping high data locality. The above strategies are implemented by developing the customized RDDs, and evaluated by comparing the performance with that of Spark SQL and SciSpark. The proposed indexing strategy can be applied to other distributed frameworks or cloud-based computing systems to natively support big geospatial data query with high efficiency.
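
As a rough illustration of aspects (2) and (3) only, the sketch below pairs a global k-d tree over chunk positions with a local hash table that maps each chunk to a byte range. It is not the authors' implementation, which the abstract says is built on customized Spark RDDs. The `Chunk` class, the file paths, the variable name, the uniform 10 x 10 chunk shape, and the byte offsets are all hypothetical, and the single dictionary stands in for what would more plausibly be one local index per file.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col) in grid cells


@dataclass(frozen=True)
class Chunk:
    """Hypothetical descriptor of one chunk of a raster variable."""
    file: str       # made-up path of the file holding the chunk
    variable: str   # made-up variable name
    corner: Cell    # lowest (row, col) covered by the chunk


@dataclass
class KDNode:
    chunk: Chunk
    axis: int
    left: Optional["KDNode"]
    right: Optional["KDNode"]


def build_kdtree(chunks: List[Chunk], depth: int = 0) -> Optional[KDNode]:
    """Global index: a 2-d tree keyed on chunk corners, split at the median."""
    if not chunks:
        return None
    axis = depth % 2
    ordered = sorted(chunks, key=lambda c: c.corner[axis])
    mid = len(ordered) // 2
    return KDNode(ordered[mid], axis,
                  build_kdtree(ordered[:mid], depth + 1),
                  build_kdtree(ordered[mid + 1:], depth + 1))


def corners_in_box(node: Optional[KDNode], lo: Cell, hi: Cell,
                   out: List[Chunk]) -> List[Chunk]:
    """Collect chunks whose corner lies inside the inclusive box [lo, hi]."""
    if node is None:
        return out
    corner = node.chunk.corner
    if all(lo[i] <= corner[i] <= hi[i] for i in range(2)):
        out.append(node.chunk)
    # Descend only into subtrees that can still contain matching corners.
    if lo[node.axis] <= corner[node.axis]:
        corners_in_box(node.left, lo, hi, out)
    if corner[node.axis] <= hi[node.axis]:
        corners_in_box(node.right, lo, hi, out)
    return out


def chunks_intersecting(root: Optional[KDNode], lo: Cell, hi: Cell,
                        chunk_shape: Cell) -> List[Chunk]:
    """Chunks overlapping the inclusive cell box [lo, hi].

    Assumes every chunk has the same shape, so a chunk overlaps the box
    exactly when its corner lies in the box grown downward by shape - 1.
    """
    grown_lo = (lo[0] - (chunk_shape[0] - 1), lo[1] - (chunk_shape[1] - 1))
    return corners_in_box(root, grown_lo, hi, [])


# Hypothetical 40 x 40 grid cut into sixteen 10 x 10 chunks spread over two files.
SHAPE = (10, 10)
chunks = [Chunk(f"/data/tile_{r // 20}.nc", "t2m", (r, c))
          for r in range(0, 40, 10) for c in range(0, 40, 10)]

global_index = build_kdtree(chunks)

# Local index: hash table from chunk to (byte offset, byte length).
# 800 bytes = 10 * 10 float64 values; the offsets are invented for the example.
local_index: Dict[Chunk, Tuple[int, int]] = {
    ch: (i * 800, 800) for i, ch in enumerate(chunks)
}

hits = chunks_intersecting(global_index, (5, 5), (15, 25), SHAPE)
for ch in sorted(hits, key=lambda c: c.corner):
    offset, length = local_index[ch]
    print(ch.file, ch.corner, offset, length)
```

In this toy setup the tree prunes the search to the chunks that overlap the query box, and the hash table then gives the byte range to read for each one, so only the needed parts of a file would be touched. How the paper actually assigns chunks to nodes or balances workload is not shown here.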

Bibliographic record

  • Source
  • Author affiliations

    NSF Spatiotemporal Innovation Center and Dept. of Geography and GeoInformation Sciences George Mason University Fairfax VA USA;

    NSF Spatiotemporal Innovation Center and Dept. of Geography and GeoInformation Sciences George Mason University Fairfax VA USA;

    NSF Spatiotemporal Innovation Center and Dept. of Geography and GeoInformation Sciences George Mason University Fairfax VA USA;

    NSF Spatiotemporal Innovation Center and Dept. of Geography and GeoInformation Sciences George Mason University Fairfax VA USA;

    NSF Spatiotemporal Innovation Center and Dept. of Geography and GeoInformation Sciences George Mason University Fairfax VA USA;

    Office of Computational and Information Sciences and Technology NASA Goddard Space Flight Center Greenbelt MD USA;

    NASA Center for Climate Simulation Goddard Space Flight Center Greenbelt MD USA;

    Earth Science Division NASA Headquarters Washington DC USA;

  • Indexing information
  • Original format: PDF
  • Language: English (eng)
  • Chinese Library Classification: Geophysics;
  • Keywords

    Big data; hierarchical indexing; multi-dimensional; Apache Spark; HDFS; distributed computing; GIS;

