Managing large multidimensional array hydrologic datasets: a case study comparing NetCDF and SciDB

Abstract

Management of large hydrologic datasets, including storage, structuring, indexing and querying, is one of the crucial challenges in the era of big data. This research originates from a specific data query problem: extracting time series at specific locations takes a long time when a large multidimensional dataset is stored in the non-chunked NetCDF classic or 64-bit offset format. The essence of this issue lies in the contiguous storage structure adopted by NetCDF. In this research, NetCDF file-based solutions and a multidimensional (MD) array database management system (DBMS) that applies a chunked storage structure are benchmarked to determine the best solution for storing and querying large hydrologic datasets. To achieve this, experts were consulted to establish the benchmark sets. To guarantee a fair benchmark test environment, the HydroNET-4 system was used, and adapters for NetCDF files and SciDB were developed to manage and query the data. In the final benchmark tests, the effect of data storage configurations such as chunk size and compression on query performance is also explored. Results indicate that SciDB arrays with small chunk sizes show favorable performance. However, with the current implementation of SciDB, large numbers of small chunks heavily overload main memory, which constrains SciDB's scalability. Compression in SciDB has either a negative effect or no effect on query performance, while it causes significant query degradation in the NetCDF-4 solution. The research illustrates that, for big hydrologic array data management, the properly chunked NetCDF-4 solution without compression is in general more efficient than the SciDB DBMS. Thus, in the current big data environment, traditionally adopted file-based hydroinformatic solutions can still be applicable after proper updating.
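
The storage-layout argument in the abstract can be illustrated with a small back-of-the-envelope sketch. It counts how many separate contiguous reads a point time series needs in a non-chunked, row-major (time, y, x) array, and how many chunks the same query touches under two chunked layouts. The array shape, the chunk shapes and the function names are hypothetical choices for illustration only; they are not taken from the paper's benchmark, and the model ignores caching, compression and file headers.

```python
from math import ceil, prod


def contiguous_runs(start, stop, shape):
    """Separate contiguous runs needed to read the half-open box
    [start, stop) from a row-major, non-chunked array."""
    extents = [b - a for a, b in zip(start, stop)]
    k = len(shape) - 1
    # Trailing dimensions that are read in full merge into one longer run.
    while k > 0 and extents[k] == shape[k]:
        k -= 1
    return prod(extents[:k])


def chunks_touched(start, stop, chunk_shape):
    """Chunks of a regular chunk grid intersected by the box [start, stop)."""
    return prod(
        (b - 1) // c - a // c + 1
        for a, b, c in zip(start, stop, chunk_shape)
    )


def total_chunks(shape, chunk_shape):
    """Chunks needed to cover the whole array (what the system must index)."""
    return prod(ceil(n / c) for n, c in zip(shape, chunk_shape))


# Hypothetical dataset: one year of hourly values on a 1000 x 1000 grid.
SHAPE = (8760, 1000, 1000)
# Query: the full time series at the single grid cell (y=500, x=500).
START, STOP = (0, 500, 500), (8760, 501, 501)

print("non-chunked reads:", contiguous_runs(START, STOP, SHAPE))
for chunk_shape in [(876, 10, 10), (10, 10, 10)]:
    print(
        chunk_shape,
        "chunks touched:", chunks_touched(START, STOP, chunk_shape),
        "chunks in array:", total_chunks(SHAPE, chunk_shape),
    )
```

Under these assumptions the non-chunked layout needs one small, widely separated read per time step, while chunks that are long in time let the same query touch only a handful of chunks. Shrinking the chunks further multiplies the number of chunks the system has to track, which is consistent with the memory overhead the abstract reports for SciDB with many small chunks.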

Bibliographic record

Source
International Conference on Hydroinformatics | 2016 | 2 v. (1447 p.) | 8 pages
Authors
Haicheng Liu; Peter van Oosterom; Chengfang Hu
Original format: PDF
Chinese Library Classification: Engineering hydrology
Keywords
benchmark; NetCDF; SciDB; chunked storage structure; hydrologic dataset

Similar literature

1. Managing large multidimensional hydrologic datasets: A case study comparing NetCDF and SciDB [J]. Liu Haicheng, van Oosterom Peter, Tijssen Theo. Journal of Hydroinformatics. 2018, No. 5a6
2. Open and scalable analytics of large Earth observation datasets: From scenes to multidimensional arrays using SciDB and GDAL [J]. Appel Marius, Lahn Florian, Buytaert Wouter. ISPRS Journal of Photogrammetry and Remote Sensing. 2018, No. APRa
3. Evaluation of lossless and lossy algorithms for the compression of scientific datasets in netCDF-4 or HDF5 files [J]. Delaunay Xavier, Courtois Aurélie, Gouillon Flavien. Geoscientific Model Development Discussions. 2019, No. 9
4. Managing large multidimensional array hydrologic datasets: a case study comparing NetCDF and SciDB [C]. Haicheng Liu, Peter van Oosterom, Chengfang Hu. International Conference on Hydroinformatics. 2016
5. Managing large multidimensional datasets inside a database system [D]. Chakrabarti, Kaushik. 2001
6. A pair of datasets for microRNA expression profiling to examine the use of careful study design for assigning arrays to samples [O]. Li-Xuan Qin, Huei-Chung Huang, Liliana Villafania. 2018
7. Managing Large Multidimensional Array Hydrologic Datasets: A Case Study Comparing NetCDF and SciDB [O]. Liu, H., van Oosterom, P.J.M., Hu, C. 2016
