Effective and efficient data sampling using bitmap indices

Su Yu; Agrawal Gagan; Woodring Jonathan; Myers Kary; Wendelberger Joanne; Ahrens James

Abstract

With the growing computational capabilities of parallel machines, scientific simulations are being performed at finer spatial and temporal scales, leading to a data explosion. The growing dataset sizes make it extremely hard to store, manage, disseminate, analyze, and visualize these datasets, especially because the memory capacity of parallel machines, memory access speeds, and disk bandwidths are not increasing at the same rate as computing power. Sampling can be an effective technique for addressing these challenges, but it is extremely important to ensure that dataset characteristics are preserved and that the loss of accuracy stays within acceptable levels. In this paper, we address the data explosion problem by developing a novel sampling approach and implementing it in a flexible system that supports server-side sampling and data subsetting. We observe that, to allow subsetting over scientific datasets, data repositories are likely to use an indexing technique. Among these techniques, bitmap indexing can not only effectively support subsetting over scientific datasets but can also help create samples that preserve both the value and spatial distributions of those datasets. We have developed algorithms that use bitmap indices to sample datasets. We also show how a small amount of additional metadata stored with the bitvectors can help assess the loss of accuracy at a particular subsampling level. Other properties of this approach include: (1) sampling can be flexibly applied to a subset of the original dataset, which may be specified using a value-based and/or a dimension-based subsetting predicate, and (2) no data reorganization is needed once the bitmap indices have been generated. We have extensively evaluated our method with different types of datasets and applications and demonstrated its effectiveness.
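The idea of sampling directly from a bitmap index can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the authors' algorithm: the function names (`build_bitmap_index`, `sample_bitvector`, `sample_index`, `popcount`), the use of Python integers as uncompressed bitvectors, and the one-dimensional position order standing in for spatial layout are all hypothetical choices made for the example. Each value bin gets one bitvector, and the same fraction of set bits is drawn from every bin, one pick per contiguous stratum of positions, so the per-bin counts (value distribution) and the spread across positions (a stand-in for spatial distribution) are both roughly preserved.

```python
import random
from bisect import bisect_right


def build_bitmap_index(values, bin_edges):
    """One bitvector (a Python int) per value bin; bit i is set if values[i] is in that bin.

    Bins are: v < edges[0], edges[0] <= v < edges[1], ..., v >= edges[-1].
    """
    bitvectors = [0] * (len(bin_edges) + 1)
    for i, v in enumerate(values):
        bitvectors[bisect_right(bin_edges, v)] |= 1 << i
    return bitvectors


def popcount(bv):
    """Number of set bits; this per-bin count is the only metadata the sketch keeps."""
    return bin(bv).count("1")


def set_bits(bv):
    """Yield the positions of the set bits in increasing order."""
    while bv:
        low = bv & -bv              # isolate the lowest set bit
        yield low.bit_length() - 1
        bv ^= low


def sample_bitvector(bv, fraction, rng):
    """Keep about `fraction` of the set bits (0 < fraction <= 1).

    The ordered positions are cut into k contiguous strata and one position
    is drawn from each, which spreads the sample across the position range.
    """
    positions = list(set_bits(bv))
    n = len(positions)
    if n == 0:
        return 0
    k = max(1, round(n * fraction))
    sampled = 0
    for j in range(k):
        lo, hi = j * n // k, (j + 1) * n // k
        sampled |= 1 << positions[rng.randrange(lo, hi)]
    return sampled


def sample_index(bitvectors, fraction, seed=0, bins=None, position_mask=None):
    """Sample every bin at the same rate, optionally restricted to a subset.

    bins          -- value-based subsetting: set of bin numbers to keep (assumed API)
    position_mask -- dimension-based subsetting: bitvector of allowed positions (assumed API)
    """
    rng = random.Random(seed)
    result = []
    for b, bv in enumerate(bitvectors):
        if bins is not None and b not in bins:
            result.append(0)
            continue
        if position_mask is not None:
            bv &= position_mask
        result.append(sample_bitvector(bv, fraction, rng))
    return result


# Toy data: 1000 points with values 0..9, indexed into three bins.
values = [i % 10 for i in range(1000)]
index = build_bitmap_index(values, bin_edges=[3, 7])
sample = sample_index(index, fraction=0.1, seed=1)

# Compare per-bin counts before and after sampling.
print([popcount(bv) for bv in index])
print([popcount(bv) for bv in sample])
```

Because the sample is itself a set of bitvectors over the original positions, no data is moved or reorganized; the sampled records are fetched by position afterwards. Comparing the per-bin counts of the index and of the sample gives a rough check on how well the value distribution survived, which is one simple reading of the "additional metadata" mentioned in the abstract.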


Bibliographic record

Source: Cluster Computing, 2014, Issue 4, 20 pages
Original format: PDF
Language: English
Chinese Library Classification: Molecular biology
Keywords: Big data; Bitmap indexing; Data sampling; Multi-resolution; Parallel processing
