Fast data analysis with integrated statistical metadata in scientific datasets

Abstract

Scientific data libraries and formats such as HDF5, ADIOS, and NetCDF are widely used in data-intensive applications. These libraries provide specialized file formats and I/O functions for efficient access to large datasets. Recent studies have begun to use indexing, subsetting, and data reorganization to manage increasingly large datasets. In this work, we present Fast Analysis with Statistical Metadata (FASM), an approach that boosts data analysis performance by subsetting the data and integrating a small amount of statistics into the original datasets. The added statistical information describes the shape of the data and its distribution, so the original I/O libraries can use this statistical metadata to perform fast queries and analyses. Different subsetting schemes affect the access pattern and the I/O performance. We present a comparison study of subsetting schemes that focuses on three dominant factors: shape, concurrency, and locality. The added statistical metadata slightly increases the original data size, and we evaluate this cost and the resulting trade-off as well. This work is the first study that uses statistical metadata with various subsetting schemes to perform fast queries and analyses on large datasets. The proposed FASM approach is currently evaluated with PnetCDF on Lustre file systems, but it can also be implemented with other scientific libraries. FASM can potentially lead to a new dataset design and can have an impact on big data analysis.
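The idea of answering queries from per-subset statistics can be illustrated with a minimal sketch. This is not the authors' implementation and does not use PnetCDF: it assumes a one-dimensional in-memory array, fixed-size subsets, and a hypothetical `BlockStats` record holding the minimum, maximum, and sum of each subset. The names `build_stats`, `count_in_range`, and `global_mean` are likewise illustrative only.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class BlockStats:
    """Hypothetical per-subset statistics record (not the paper's format)."""
    start: int       # index of the first element in the subset
    stop: int        # one past the last element in the subset
    minimum: float
    maximum: float
    total: float     # sum of the values, so a mean needs no data access


def build_stats(data: Sequence[float], block_size: int) -> List[BlockStats]:
    """Split a 1-D array into fixed-size subsets and summarize each one."""
    stats = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        stats.append(BlockStats(start, start + len(block),
                                min(block), max(block), sum(block)))
    return stats


def count_in_range(data: Sequence[float], stats: List[BlockStats],
                   lo: float, hi: float) -> Tuple[int, int]:
    """Count values in [lo, hi]; return (count, number of subsets read)."""
    count = 0
    blocks_read = 0
    for s in stats:
        if s.maximum < lo or s.minimum > hi:
            continue  # no value in this subset can match: skip it
        if lo <= s.minimum and s.maximum <= hi:
            count += s.stop - s.start  # every value matches: no read needed
            continue
        # The metadata cannot decide, so this subset has to be scanned.
        blocks_read += 1
        count += sum(1 for v in data[s.start:s.stop] if lo <= v <= hi)
    return count, blocks_read


def global_mean(stats: List[BlockStats]) -> float:
    """Compute the mean of a non-empty dataset from the metadata alone."""
    n = sum(s.stop - s.start for s in stats)
    return sum(s.total for s in stats) / n
```

In this sketch a range query touches only the subsets whose minimum and maximum straddle a query bound, and an aggregate such as the mean is answered from the metadata without reading the data at all. How much is saved depends on how the subsets are shaped and how the values are distributed, which is the kind of trade-off the abstract says the study evaluates.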

Bibliographic record

Source
IEEE International Conference on Cluster Computing | 2013 | pp. 1-8 | 8 pages
Authors
Liu Jialin; Chen Yong
Original format
PDF
Keywords
FASM; big data; data-intensive computing; high performance computing; statistical techniques; storage systems
