
Autonomic management of data streaming and in-transit processing for data intensive scientific workflows.

Abstract

High-performance computing plays an important role in science and engineering, enabling highly accurate simulations that provide insights into complex physical phenomena. A key challenge is managing the enormous data volumes and high data rates associated with these applications while keeping the impact on simulation execution minimal. Furthermore, these applications are based on seamless interactions and coupling between multiple, potentially distributed computational, data, and information services. This requires addressing the natural mismatches in the ways data is represented in different workflow components and on a variety of machines, and being able to "outsource" the required data manipulation and transformation operations to less expensive commodity resources "in-transit". Satisfying these requirements is challenging, especially in large-scale and highly dynamic in-transit environments with shared computing and communication resources, with resource heterogeneity in terms of capability, capacity, and cost, and where application behaviors, needs, and performance are highly variable.

In this research we address these requirements by developing a data streaming and in-transit data manipulation framework that provides both the mechanisms and the management strategies for large-scale, wide-area, data-intensive scientific and engineering workflows. The main objectives of this research are: (1) developing an end-to-end QoS management framework for data-intensive applications that provides robust underlying support for asynchronous, high-throughput, low-latency data streaming, and (2) effectively and opportunistically utilizing in-transit resources for data processing, to resolve data mismatches between application entities executing in scientific workflows.

In this thesis, we address the problem at two levels. The first, the application level, deals with satisfying QoS goals at the end-points. Specifically, it ensures that the data is delivered in a timely manner, with no loss at the source or destination, and with minimal storage requirements at the end-points. The solution couples model-based limited look-ahead controllers (LLC) with rule-based managers to satisfy data streaming requirements under various operating conditions. The second, the in-transit level, focuses on opportunistically scheduling in-transit computations and data transfers on the in-transit overlay resources, taking into account the higher-level QoS goals of the source and the sink. Additionally, in-transit level management is coupled with application-level management at the end-points to manage the QoS of grid workflows.

This research is driven by the requirements of the Fusion Simulation Project (FSP), which forms the basis of a predictive plasma edge simulation capability to support next-generation burning plasma experiments such as the International Thermonuclear Experimental Reactor (ITER). These scientific workflows require in-transit data manipulation and streaming in a wide-area environment.

The self-managing data streaming service developed using this approach for the FSP workflow limits streaming overheads on the executing simulation to about 2% of the simulation execution time and reduces buffer occupancy at the source, thereby preventing data loss. Additionally, experiments with self-managing data streaming and in-transit processing demonstrate that adaptive processing using this service during network congestion decreases average idle time per data block from 25% to 1%, thereby increasing utilization at critical times. Furthermore, coupling end-point and in-transit level management during congestion reduces average buffer occupancy at in-transit nodes from 80% to 60.8%, thereby reducing load and potential data loss.
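As a rough illustration of the limited look-ahead control idea named in the abstract, the sketch below picks a streaming rate by exhaustively scoring every rate sequence over a short forecast horizon and applying only the first action. It is not the thesis's controller: the buffer model, the cost weights, the function names (`llc_choose_rate`, `rule_based_action`), and the "spill to disk" fallback rule are all assumptions made for this example.

```python
from itertools import product


def llc_choose_rate(buffer_mb, inflow_forecast, bandwidth_forecast, rates,
                    capacity_mb, occupancy_weight=1.0, rate_weight=0.1,
                    overflow_penalty=1000.0):
    """Return the first action of the cheapest rate sequence over the horizon.

    All parameters and the cost function are hypothetical. The horizon is
    the length of the forecasts; each step moves min(rate, bandwidth,
    available data) out of the buffer.
    """
    horizon = len(inflow_forecast)
    best_cost, best_first = float("inf"), rates[0]
    # Exhaustive search over all candidate rate sequences (finite control set).
    for seq in product(rates, repeat=horizon):
        level, cost = buffer_mb, 0.0
        for step, rate in enumerate(seq):
            available = level + inflow_forecast[step]
            sent = min(rate, bandwidth_forecast[step], available)
            level = available - sent
            if level > capacity_mb:
                # Data beyond capacity would be lost: penalize heavily.
                cost += overflow_penalty * (level - capacity_mb)
                level = capacity_mb
            # Penalize buffer occupancy and the overhead of a high send rate.
            cost += occupancy_weight * level + rate_weight * rate
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first


def rule_based_action(buffer_mb, capacity_mb, high_water=0.9):
    """Hypothetical fallback rule: spill to local disk when nearly full."""
    if buffer_mb >= high_water * capacity_mb:
        return "spill_to_disk"
    return "stream"


if __name__ == "__main__":
    rate = llc_choose_rate(buffer_mb=50, inflow_forecast=[10, 10],
                           bandwidth_forecast=[20, 20], rates=(0, 10, 20),
                           capacity_mb=100)
    print(rate, rule_based_action(50, 100))
```

Only the first action of the best sequence is applied; the search is repeated at the next step with updated forecasts. In this sketch the rule-based check is a separate fallback for conditions the model does not cover.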

Bibliographic record

  • Author

    Bhat, Viraj.

  • Author affiliation

    Rutgers The State University of New Jersey - New Brunswick.

  • Degree-granting institution: Rutgers The State University of New Jersey - New Brunswick.
  • Subject: Engineering Electronics and Electrical.
  • Degree: Ph.D.
  • Year: 2008
  • Pagination: 163 p.
  • Total pages: 163
  • Original format: PDF
  • Language of text: eng
  • CLC classification:
  • Keywords:
