
Data intensive scientific compute model for multicore clusters.



Abstract

Data intensive computing holds the promise of major scientific breakthroughs and discoveries through the exploration and mining of the massive data sets becoming available to the science community. This expectation has led to tremendous growth in data intensive scientific applications. However, such applications still face severe challenges in accessing, managing, and analyzing petabytes of data. In particular, the workflow systems that support them are not efficient when dealing with thousands or more complex tasks within jobs that operate across large, high performance multicore clusters with very large amounts of streaming data. Scheduling, it turns out, is an integral workflow component, both in executing the thousands or more tasks within a data intensive scientific application and in managing the access and flow of many jobs to the available resource environment. Recently, MapReduce systems such as Hadoop have proven successful for many business data intensive problems. However, many limitations remain in applying MapReduce systems to data intensive scientific problems, mainly because they do not support characteristics of science such as data formats, specialized data analytic tools (e.g., math libraries), accuracy requirements, and interfaces with non-MapReduce components.

This thesis addresses some of these limitations by proposing a MapReduce workflow model and its runtime system, built on Hadoop, for orchestrating MapReduce jobs in data intensive scientific workflows. A heuristic-based scheduling algorithm is proposed within the workflow system to manage the execution of data intensive scientific applications. The thesis develops a hybrid scheduling algorithm that is based on runtime dynamic priorities, uses proportional resource sharing techniques to reduce delays for variable length concurrent tasks, and takes advantage of data locality.
As a result, a new scheduling policy, Balanced Closer to Finish First (BCFF), is proposed as a solution to several scheduling problems in MapReduce environments. The algorithm is implemented as a new plug-in scheduler for the Hadoop 1.0.1 framework. Evaluations of the workflow system on a climate data processing and analysis application (a multi-terabyte dataset) show that it is feasible and significantly outperforms a traditional parallel processing approach. The scientific results of the application provide a new source for monitoring global climate change over the decade 2002-2011.
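The abstract does not give BCFF's exact scoring formula, but the three ingredients it names (runtime dynamic priorities, proportional resource sharing, and data locality) can be combined in a toy slot-assignment rule. The sketch below is a hypothetical illustration, not the thesis's implementation: the weights, the `Job` fields, and the `pick_job` helper are all invented for exposition.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    total_tasks: int
    done_tasks: int = 0
    running_tasks: int = 0   # tasks currently occupying cluster slots

def bcff_score(job, node, data_hosts, fair_share):
    """Score a job for the next free slot on `node` (weights are illustrative)."""
    # "Closer to finish first": favour jobs with a larger completed fraction.
    progress = job.done_tasks / job.total_tasks
    # "Balanced": favour jobs running below their proportional fair share.
    balance = fair_share - job.running_tasks
    # Data locality: bonus when the job has input data stored on this node.
    locality = 1.0 if node in data_hosts else 0.0
    return progress + 0.5 * balance + 0.25 * locality

def pick_job(jobs, node, locality_map, total_slots):
    """Return the job that should receive the free slot on `node`, or None."""
    fair_share = total_slots / len(jobs)
    runnable = [j for j in jobs
                if j.done_tasks + j.running_tasks < j.total_tasks]
    if not runnable:
        return None
    return max(runnable,
               key=lambda j: bcff_score(j, node,
                                        locality_map.get(j.name, set()),
                                        fair_share))

# Example: the nearly finished job wins the free slot on node-3 even
# though the other job has data local to that node.
jobs = [Job("climate-agg", total_tasks=10, done_tasks=8, running_tasks=1),
        Job("climate-etl", total_tasks=10, done_tasks=2, running_tasks=1)]
chosen = pick_job(jobs, "node-3", {"climate-etl": {"node-3"}}, total_slots=4)
```

In a real Hadoop 1.x plug-in, this kind of scoring would live inside a `TaskScheduler` subclass that is consulted on every TaskTracker heartbeat; the relative weights would determine how strongly completion progress is traded off against fairness and locality.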

Record details

  • Author

    Nguyen, Phuong Thi Thu.

  • Author's affiliation

    University of Maryland, Baltimore County.

  • Degree-granting institution University of Maryland, Baltimore County.
  • Subject Computer Science.
  • Degree Ph.D.
  • Year 2012
  • Pagination 165 p.
  • Total pages 165
  • Format PDF
  • Language eng
