International Conference on Hydroinformatics

Managing large multidimensional array hydrologic datasets: a case study comparing NetCDF and SciDB

Abstract

Managing large hydrologic datasets, including their storage, structuring, indexing and querying, is one of the crucial challenges in the era of big data. This research originates from a specific data query problem: extracting time series at specific locations takes a long time when a large multidimensional dataset is stored in the non-chunked NetCDF classic or 64-bit offset format. The root of the issue is the contiguous storage structure adopted by NetCDF. In this research, NetCDF file-based solutions and a multidimensional (MD) array database management system (DBMS) that applies a chunked storage structure are benchmarked to determine the best solution for storing and querying large hydrologic datasets. To this end, experts were consulted to establish the benchmark sets. To guarantee a fair benchmark environment, the HydroNET-4 system was used, and adapters for NetCDF files and SciDB were developed to manage and query the data. The final benchmark tests also explore the effect of data storage configurations, such as chunk size and compression, on query performance. The results indicate that SciDB arrays with small chunk sizes show favorable performance. However, with the current implementation of SciDB, large numbers of small chunks heavily overload main memory, which constrains SciDB's scalability. Compression in SciDB has either a negative effect or no effect on query performance, whereas it causes significant query degradation in the NetCDF-4 solution. The research illustrates that, for big hydrologic array data management, the properly chunked NetCDF-4 solution without compression is in general more efficient than the SciDB DBMS. Thus, in the current big data environment, traditionally adopted file-based hydroinformatic solutions can remain applicable after proper updating.
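The storage-layout issue described in the abstract can be illustrated with a rough back-of-the-envelope sketch. The code below is not the paper's benchmark; it only counts how many storage blocks a single-location time series query would touch, assuming a row-major (time, y, x) array. The grid dimensions, chunk shapes and function names are hypothetical and chosen purely for illustration.

```python
from math import ceil


def contiguous_stride_bytes(shape, itemsize):
    """Byte distance between consecutive time steps of one grid cell
    in a contiguous, row-major (time, y, x) array."""
    _, ny, nx = shape
    return ny * nx * itemsize


def chunks_touched(shape, chunk_shape):
    """Number of chunks read to extract the full time series of one cell.

    A single (y, x) location falls in exactly one chunk column, so only
    the number of chunks along the time axis matters.
    """
    nt, _, _ = shape
    ct, _, _ = chunk_shape
    return ceil(nt / ct)


def values_read(shape, chunk_shape):
    """Values pulled from storage, assuming whole chunks are always read
    and every chunk is stored at full size (an upper bound at the edges)."""
    ct, cy, cx = chunk_shape
    return chunks_touched(shape, chunk_shape) * ct * cy * cx


if __name__ == "__main__":
    # Hypothetical dataset: one year of hourly 500 x 500 grids, 4-byte floats.
    shape = (8760, 500, 500)

    print("contiguous stride (bytes):", contiguous_stride_bytes(shape, 4))

    # Two hypothetical chunk shapes: one elongated along time,
    # one holding a full spatial grid per time step.
    for chunk_shape in [(365, 10, 10), (1, 500, 500)]:
        print(
            chunk_shape,
            "chunks:", chunks_touched(shape, chunk_shape),
            "values read:", values_read(shape, chunk_shape),
        )
```

Under these assumptions, a chunk shape elongated along the time axis needs far fewer reads for a point time series than a layout that stores one full spatial grid per time step, which is consistent with the query problem the abstract attributes to contiguous storage. The sketch ignores caching, compression and the memory overhead of tracking many small chunks, all of which the abstract reports as relevant in practice.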
