Conference: IEEE International Conference on Cluster Computing

Fast data analysis with integrated statistical metadata in scientific datasets

Abstract

Scientific datasets and libraries, such as HDF5, ADIOS, and NetCDF, are widely used in data-intensive applications. These libraries provide specialized file formats and I/O functions for efficient access to large datasets. Recent studies have begun to use indexing, subsetting, and data reorganization to manage increasingly large datasets. In this work, we present Fast Analysis with Statistical Metadata (FASM), an approach that improves data analysis performance by subsetting the data and integrating a small amount of statistics into the original datasets. The added statistical information describes the shape of the data and its distribution, so the original I/O libraries can use this statistical metadata to perform fast queries and analyses. Because the subsetting scheme affects both the access pattern and the I/O performance, we compare different subsetting schemes with respect to three dominant factors: shape, concurrency, and locality. The added statistical metadata slightly increases the original data size, and we evaluate this cost and the resulting trade-off as well. This work is the first study to use statistical metadata with various subsetting schemes to perform fast queries and analyses on large datasets. FASM is currently evaluated with PnetCDF on Lustre file systems, but it can also be implemented with other scientific libraries. FASM can potentially lead to a new dataset design and have an impact on big data analysis.
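The core idea of answering queries from per-subset statistics can be illustrated with a minimal sketch. It is not the FASM implementation: the abstract does not specify the metadata layout, so the `BlockStats` record, the fixed-size one-dimensional blocks, and the function names `build_stats`, `count_above`, and `mean_from_stats` are all assumptions made for illustration, and plain Python lists stand in for a PnetCDF variable.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class BlockStats:
    """Hypothetical per-subset statistics (the real FASM layout may differ)."""
    start: int      # index of the first element in the block
    stop: int       # index one past the last element in the block
    minimum: float
    maximum: float
    total: float    # sum of the block's values


def build_stats(data: Sequence[float], block_size: int) -> List[BlockStats]:
    """Split data into fixed-size blocks and summarize each block once."""
    stats = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        stats.append(
            BlockStats(start, start + len(block), min(block), max(block), sum(block))
        )
    return stats


def count_above(
    data: Sequence[float], stats: List[BlockStats], threshold: float
) -> Tuple[int, int]:
    """Count values greater than threshold.

    Raw data is scanned only for blocks whose statistics cannot decide
    the answer. Returns (matching_count, number_of_blocks_scanned).
    """
    count = 0
    blocks_scanned = 0
    for s in stats:
        if s.maximum <= threshold:
            continue  # no value in this block can match: skip it entirely
        if s.minimum > threshold:
            count += s.stop - s.start  # every value matches: no scan needed
            continue
        blocks_scanned += 1  # undecided block: fall back to the raw data
        count += sum(1 for v in data[s.start:s.stop] if v > threshold)
    return count, blocks_scanned


def mean_from_stats(stats: List[BlockStats]) -> float:
    """Compute the global mean from the metadata alone, without raw data."""
    n = sum(s.stop - s.start for s in stats)
    return sum(s.total for s in stats) / n


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0, 5.0, 20.0, 6.0, 7.0]
    metadata = build_stats(values, block_size=4)
    print(count_above(values, metadata, threshold=8.0))
    print(mean_from_stats(metadata))
```

In this toy run only one of the three blocks has to be scanned for the threshold query, and the mean needs no raw data at all. The cost is one small record per block, which mirrors the size overhead the abstract mentions. Smaller blocks allow more skipping but add more metadata, which is one way to read the trade-off the authors evaluate.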