
Supporting Data-Intensive Scientific Computing on Bandwidth and Space Constrained Environments.


Abstract

Scientific applications, simulations, and instruments generate massive amounts of data. This data not only contributes to existing scientific areas, but also gives rise to new sciences. However, managing and analyzing such large-scale data are both challenging processes. In this context, we require tools, methods, and technologies such as reduction-based processing structures, cloud computing and storage, and efficient parallel compression methods.

In this dissertation, we first focus on parallel and scalable processing of data stored in S3, a cloud storage resource, using compute instances in Amazon Web Services (AWS). We develop MATE-EC2, which allows data processing to be specified using a variant of the Map-Reduce paradigm. We show various optimizations, including data organization, job scheduling, and data retrieval strategies, that can be leveraged based on the performance characteristics of cloud storage resources. Furthermore, we investigate the efficiency of our middleware in both homogeneous and heterogeneous environments.

Next, we improve our middleware so that users can perform transparent processing on data that is distributed among local and cloud resources. With this work, we maximize the utilization of geographically distributed resources. We evaluate our system's overhead, scalability, and performance with varying data distributions.

The users of data-intensive applications have different requirements for hybrid cloud settings. Two of the most important are the execution time of the application and the resulting cost on the cloud. Our third contribution is a time and cost model for data-intensive applications that run on hybrid cloud environments. The proposed model lets our middleware adapt to performance changes and dynamically allocate the necessary resources from its environments.
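The reduction-based processing structure mentioned above can be illustrated with a small sketch. This is a hypothetical, simplified stand-in for the generalized-reduction (Map-Reduce variant) idea, not MATE-EC2's actual interface: each data chunk is folded into a local reduction object, and local objects are then merged into a global result.

```python
from functools import reduce

# Illustrative sketch of generalized reduction: each chunk is accumulated
# into a local "reduction object", and local objects are merged globally.
# All names here are hypothetical, not part of MATE-EC2's API.

def local_reduction(chunk):
    """Accumulate one chunk into a local reduction object (here: sum, count)."""
    return (sum(chunk), len(chunk))

def global_combine(a, b):
    """Merge two reduction objects; must be associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])

def process(chunks):
    locals_ = [local_reduction(c) for c in chunks]  # parallelizable per chunk
    total, count = reduce(global_combine, locals_)
    return total / count                            # e.g., a global mean

# In the cloud setting, chunks would be byte ranges retrieved from S3;
# here they are plain in-memory lists.
chunks = [[1, 2, 3], [4, 5], [6]]
print(process(chunks))  # mean of 1..6 = 3.5
```

Because `global_combine` is associative and commutative, local reductions can proceed independently on any number of compute instances before the final merge.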
Therefore, applications can meet user-specified constraints.

Fourth, we investigate compression approaches for scientific datasets and build a compression system. The proposed system focuses on the implementation and application of domain-specific compression algorithms. We port our compression system into the aforementioned middleware and implement different compression algorithms. Our framework enables our middleware to maximize the bandwidth utilization of data-intensive applications while minimizing storage requirements.

Although compression can help minimize the input and output overhead of data-intensive applications, using compression during parallel operations is not trivial. Specifically, the inability to determine compressed data chunk sizes in advance complicates parallel write operations. In our final work, we develop different methods for enabling compression during parallel input and output operations. We then port our proposed methods into PnetCDF, a widely used scientific data management library, and show how transparent compression can be supported during parallel output operations. The proposed system lets an existing parallel simulation program begin outputting and storing data in compressed form. Similarly, data analysis applications can transparently access compressed data through our system.
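One common way to handle unknown compressed chunk sizes during a parallel write, sketched here as an illustration rather than the dissertation's actual PnetCDF implementation, is a two-phase scheme: each writer compresses its chunk locally, the compressed sizes are exchanged, and an exclusive prefix sum of those sizes gives each writer its byte offset in the output file. The function names below are hypothetical.

```python
import zlib
from itertools import accumulate

# Sketch: parallel writers cannot know their file offsets until compression
# finishes, because compressed sizes vary per chunk. A two-phase plan:
#   (1) compress each chunk locally,
#   (2) collect the compressed sizes,
#   (3) exclusive prefix sum of sizes -> per-writer byte offset.
# In real MPI code, steps 2-3 would use MPI_Allgather / MPI_Exscan; here the
# "ranks" are simulated serially in one process.

def plan_compressed_writes(chunks, level=6):
    compressed = [zlib.compress(c, level) for c in chunks]  # step 1, per rank
    sizes = [len(c) for c in compressed]                    # step 2
    # step 3: offset of writer i = sum of sizes[0..i-1]
    offsets = [0] + list(accumulate(sizes))[:-1]
    return compressed, offsets

def write_file(chunks):
    compressed, offsets = plan_compressed_writes(chunks)
    out = bytearray(sum(len(c) for c in compressed))
    for data, off in zip(compressed, offsets):  # independent offset-addressed writes
        out[off:off + len(data)] = data
    return bytes(out), offsets

blob, offsets = write_file([b"a" * 1000, b"b" * 500, b"c" * 2000])
# Given its offset, each chunk can be located and decompressed independently.
assert zlib.decompress(blob[offsets[1]:offsets[2]]) == b"b" * 500
```

Once the offsets are known, every writer can issue its write independently and without further coordination, which is what makes the operation parallel-friendly.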

Bibliographic details

  • Author

    Bicer, Tekin.

  • Affiliation

    The Ohio State University.

  • Awarding institution: The Ohio State University.
  • Subject: Computer science.
  • Degree: Ph.D.
  • Year: 2014
  • Pages: 181 p.
  • Total pages: 181
  • Format: PDF
  • Language: eng
  • CLC classification:
  • Keywords:
