Journal: Cluster Computing

Effective and efficient data sampling using bitmap indices


Abstract

With the growing computational capabilities of parallel machines, scientific simulations are being performed at finer spatial and temporal scales, leading to a data explosion. The growing dataset sizes make it extremely hard to store, manage, disseminate, analyze, and visualize these datasets, especially because the memory capacity of parallel machines, memory access speeds, and disk bandwidths are not increasing at the same rate as computing power. Sampling can be an effective technique for addressing these challenges, but it is important to ensure that dataset characteristics are preserved and that the loss of accuracy stays within acceptable levels. In this paper, we address the data explosion problem by developing a novel sampling approach and implementing it in a flexible system that supports server-side sampling and data subsetting. We observe that, to allow subsetting over scientific datasets, data repositories are likely to use an indexing technique. Among these techniques, bitmap indexing can not only effectively support subsetting over scientific datasets but can also help create samples that preserve both the value and spatial distributions of those datasets. We have developed algorithms for using bitmap indices to sample datasets. We have also shown how a small amount of additional metadata stored with the bitvectors can help assess the loss of accuracy at a particular subsampling level. Other properties of this approach include: (1) sampling can be flexibly applied to a subset of the original dataset, which may be specified using a value-based and/or a dimension-based subsetting predicate, and (2) no data reorganization is needed once bitmap indices have been generated. We have extensively evaluated our method with different types of datasets and applications and demonstrated the effectiveness of our approach.
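
The abstract does not spell out the sampling algorithms, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a binned bitmap index (one bitvector per value bin, stored here as a Python integer), draws the same fraction from every bin so that the value distribution is roughly preserved, and spreads the picks evenly over each bin's positions so that the spatial distribution is roughly preserved. The function names, the `subset_mask` parameter, and the chunked selection rule are all illustrative assumptions.

```python
import bisect
import random


def build_bitmap_index(values, bin_edges):
    """Return one bitvector per value bin; bit i is set if values[i] is in the bin.

    Bitvectors are plain Python ints used as bitsets (an assumption made for
    brevity; real systems typically use compressed bitvectors).
    """
    nbins = len(bin_edges) - 1
    bitvectors = [0] * nbins
    for pos, v in enumerate(values):
        if v < bin_edges[0] or v > bin_edges[-1]:
            raise ValueError(f"value {v} lies outside the bin range")
        # The last bin is closed on the right so the maximum edge is included.
        b = min(bisect.bisect_right(bin_edges, v) - 1, nbins - 1)
        bitvectors[b] |= 1 << pos
    return bitvectors


def set_positions(bitvector):
    """List the positions of the set bits in ascending order."""
    positions = []
    pos = 0
    while bitvector:
        if bitvector & 1:
            positions.append(pos)
        bitvector >>= 1
        pos += 1
    return positions


def sample_from_bitmaps(bitvectors, fraction, subset_mask=None, seed=0):
    """Sample element positions using only the bitmap index.

    Each bin contributes about `fraction` of its elements (value distribution).
    Within a bin, the ordered positions are cut into equal chunks and one
    position is drawn per chunk (spatial spread). `subset_mask` is a
    hypothetical way to express a dimension-based subset: only positions
    whose bit is set in the mask are eligible.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    sampled = []
    for bv in bitvectors:
        if subset_mask is not None:
            bv &= subset_mask
        positions = set_positions(bv)
        n = len(positions)
        if n == 0:
            continue
        k = max(1, round(fraction * n))
        for i in range(k):
            lo, hi = i * n // k, (i + 1) * n // k
            sampled.append(positions[rng.randrange(lo, hi)])
    return sorted(sampled)
```

A value-based subset can be expressed in this sketch by passing only the bitvectors of the bins of interest, for example `sample_from_bitmaps(index[2:5], 0.1)`. In both cases the original data array is never touched or reorganized; only the index is read.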
