Large experiments and high-performance computer models generate many petabytes of data. Cloud computing systems can supply the computing power needed to analyze this data by harnessing many distributed computers, but the key challenge in using such a distributed system effectively is data management: storing, indexing, searching, accessing, and transferring data. Most analysis tasks compute on a subset of a large collection of data records, namely the records that satisfy user-specified constraints on attribute (variable) values. This subsetting step is extremely important because it reduces the network traffic to and from the cloud facilities. However, the selected records often span many different data files, and extracting the values from these files can be time-consuming, especially when the number of files is large. This work addresses the challenge of working with a large number of files. Using a set of astronomical data sets as an example, we apply an efficient database indexing technique called FastBit to significantly speed up subsetting and thereby optimize network usage. Overall, we aim to provide scientists with transparent and highly efficient attribute-based data access through a web-based Astronomy Data Analysis Portal. We discuss the system design and the options for managing an extremely large number of files while minimizing network usage and latency.
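To make the subsetting idea concrete, the following is a minimal sketch of attribute-based selection over many files using a toy binned bitmap index. It is not FastBit's API and not the portal's implementation: the file names, the attributes `mag` and `z`, the bin edges, and all function names are assumptions chosen for illustration, and the query is restricted to ranges that align with bin edges. FastBit itself is a bitmap-index library; here each bitmap is simply a Python integer used as a bitset.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical bin edges for two illustrative attributes.
MAG_EDGES = [10.0, 15.0, 20.0, 25.0]
Z_EDGES = [0.0, 0.5, 1.0, 2.0]


def build_bitmaps(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Return one bitmap per bin; bit i is set if values[i] falls in that bin."""
    bitmaps = [0] * (len(edges) - 1)
    for row, value in enumerate(values):
        for b in range(len(edges) - 1):
            if edges[b] <= value < edges[b + 1]:
                bitmaps[b] |= 1 << row
                break
    return bitmaps


def union_bins(bitmaps: Sequence[int], first: int, last: int) -> int:
    """OR together the bitmaps of bins first..last (inclusive)."""
    mask = 0
    for b in range(first, last + 1):
        mask |= bitmaps[b]
    return mask


def bit_positions(mask: int) -> List[int]:
    """List the row numbers whose bits are set in mask."""
    rows, row = [], 0
    while mask:
        if mask & 1:
            rows.append(row)
        mask >>= 1
        row += 1
    return rows


# Toy stand-in for many data files, each holding a few records.
files: Dict[str, Dict[str, List[float]]] = {
    "tile_a": {"mag": [12.1, 18.4, 22.0], "z": [0.1, 0.7, 1.5]},
    "tile_b": {"mag": [11.0, 13.5], "z": [0.2, 0.3]},
    "tile_c": {"mag": [16.0, 19.9, 24.0, 17.2], "z": [0.6, 0.9, 0.8, 0.2]},
}

# Build the index once, ahead of any query.
index = {
    name: {
        "mag": build_bitmaps(cols["mag"], MAG_EDGES),
        "z": build_bitmaps(cols["z"], Z_EDGES),
    }
    for name, cols in files.items()
}


def subset(mag_bins: Tuple[int, int], z_bins: Tuple[int, int]) -> Dict[str, List[int]]:
    """Answer the query from the index alone; files with no hits are never opened."""
    hits = {}
    for name, idx in index.items():
        mask = union_bins(idx["mag"], *mag_bins) & union_bins(idx["z"], *z_bins)
        if mask:
            hits[name] = bit_positions(mask)
    return hits


# Query: 15 <= mag < 20 AND 0.5 <= z < 1.0 (bin 1 of each attribute).
result = subset((1, 1), (1, 1))
```

In this sketch the query is resolved with bitwise AND and OR on the index, so only the files and row positions that actually match need to be read and transferred. A production bitmap index would also compress the bitmaps and handle ranges that cut through a bin, which this toy version does not attempt.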