A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data

Fei Hu; Chaowei Yang; Yongyao Jiang; Yun Li; Weiwei Song; Daniel Q. Duffy; John L. Schnase; Tsengdar Lee

Abstract

Earth observations and model simulations are generating big multidimensional array-based raster data. However, it is difficult to efficiently query these big raster data due to the inconsistency among the geospatial raster data model, distributed physical data storage model, and the data pipeline in distributed computing frameworks. To efficiently process big geospatial data, this paper proposes a three-layer hierarchical indexing strategy to optimize Apache Spark with Hadoop Distributed File System (HDFS) from the following aspects: (1) improve I/O efficiency by adopting the chunking data structure; (2) keep the workload balance and high data locality by building the global index (k-d tree); (3) enable Spark and HDFS to natively support geospatial raster data formats (e.g., HDF4, NetCDF4, GeoTiff) by building the local index (hash table); (4) index the in-memory data to further improve geospatial data queries; (5) develop a data repartition strategy to tune the query parallelism while keeping high data locality. The above strategies are implemented by developing the customized RDDs, and evaluated by comparing the performance with that of Spark SQL and SciSpark. The proposed indexing strategy can be applied to other distributed frameworks or cloud-based computing systems to natively support big geospatial data query with high efficiency.
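
The two index layers named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Chunk` record, the file path, the byte offsets, and the helper names below are hypothetical, and plain Python lists stand in for Spark partitions. The local index is a hash table from (variable, chunk corner) to the chunk's location in a file, and the global index is approximated by k-d-style median splits that group chunks into equally sized, spatially compact sets. Data locality on HDFS nodes is not modelled here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Key = Tuple[str, Tuple[int, int]]  # (variable name, chunk corner)


@dataclass(frozen=True)
class Chunk:
    """Hypothetical record describing one chunk of a 2-D raster variable."""
    variable: str
    corner: Tuple[int, int]   # (row, col) of the chunk's first cell
    shape: Tuple[int, int]    # (rows, cols) covered by the chunk
    path: str                 # file that stores the chunk (assumed)
    offset: int               # byte offset inside the file (assumed)
    length: int               # byte length of the chunk (assumed)


def build_local_index(chunks: List[Chunk]) -> Dict[Key, Chunk]:
    """Local index: hash table keyed by (variable, chunk corner)."""
    return {(c.variable, c.corner): c for c in chunks}


def kd_partition(chunks: List[Chunk], depth: int, axis: int = 0) -> List[List[Chunk]]:
    """Global index stand-in: split at the median corner, alternating axes.

    Returns up to 2**depth groups of near-equal size.
    """
    if depth == 0 or len(chunks) <= 1:
        return [chunks]
    ordered = sorted(chunks, key=lambda c: c.corner[axis])
    mid = len(ordered) // 2
    nxt = (axis + 1) % 2
    return (kd_partition(ordered[:mid], depth - 1, nxt)
            + kd_partition(ordered[mid:], depth - 1, nxt))


def query(index: Dict[Key, Chunk], variable: str, chunk_shape: Tuple[int, int],
          rows: Tuple[int, int], cols: Tuple[int, int]) -> List[Chunk]:
    """Return the chunks intersecting the half-open box rows x cols.

    Assumes a regular chunk grid, so candidate corners are computed
    directly and resolved with hash lookups instead of a scan.
    """
    ch, cw = chunk_shape
    hits = []
    for r in range((rows[0] // ch) * ch, rows[1], ch):
        for c in range((cols[0] // cw) * cw, cols[1], cw):
            chunk = index.get((variable, (r, c)))
            if chunk is not None:
                hits.append(chunk)
    return hits


# Example: a 40 x 40 variable "T" stored as sixteen 10 x 10 chunks.
SHAPE = (10, 10)
chunks: List[Chunk] = []
for r in range(0, 40, 10):
    for c in range(0, 40, 10):
        chunks.append(Chunk("T", (r, c), SHAPE, "hdfs:///data/sample.nc",
                            len(chunks) * 800, 800))

index = build_local_index(chunks)
groups = kd_partition(chunks, depth=2)          # four balanced groups
hits = query(index, "T", SHAPE, rows=(5, 25), cols=(0, 10))
```

In this toy setup each group holds one quadrant of the grid, which is the balance property the abstract attributes to the global index; the query touches only the chunks whose corners fall inside the requested box.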

Bibliographic details

Source: International Journal of Digital Earth, 2020, Issue 3, 19 pages
Authors: Fei Hu; Chaowei Yang; Yongyao Jiang; Yun Li; Weiwei Song; Daniel Q. Duffy; John L. Schnase; Tsengdar Lee
Author affiliations:

NSF Spatiotemporal Innovation Center and Dept. of Geography and GeoInformation Sciences, George Mason University, Fairfax, VA, USA;

NSF Spatiotemporal Innovation Center and Dept. of Geography and GeoInformation Sciences, George Mason University, Fairfax, VA, USA;

NSF Spatiotemporal Innovation Center and Dept. of Geography and GeoInformation Sciences, George Mason University, Fairfax, VA, USA;

NSF Spatiotemporal Innovation Center and Dept. of Geography and GeoInformation Sciences, George Mason University, Fairfax, VA, USA;

NSF Spatiotemporal Innovation Center and Dept. of Geography and GeoInformation Sciences, George Mason University, Fairfax, VA, USA;

Office of Computational and Information Sciences and Technology, NASA Goddard Space Flight Center, Greenbelt, MD, USA;

NASA Center for Climate Simulation, Goddard Space Flight Center, Greenbelt, MD, USA;

Earth Science Division, NASA Headquarters, Washington, DC, USA;

Original format: PDF
Language: English
Chinese Library Classification: Geophysics
Keywords:
Big data; hierarchical indexing; multi-dimensional; Apache Spark; HDFS; distributed computing; GIS;
