
Data intensive scientific compute model for multicore clusters.



Abstract

Data intensive computing holds the promise of major scientific breakthroughs and discoveries through the exploration and mining of the massive data sets becoming available to the science community. This expectation has led to tremendous growth in data intensive scientific applications. However, such applications still face severe challenges in accessing, managing, and analyzing petabytes of data. In particular, the workflow systems that support them are not efficient when dealing with thousands or more complex tasks within jobs that operate across large, high performance multicore clusters with very large amounts of streaming data. Scheduling, it turns out, is an integral workflow component, both in executing the thousands or more tasks within a data intensive scientific application and in managing the access and flow of many jobs to the available resource environment. Recently, MapReduce systems such as Hadoop have proven successful for many business data intensive problems. However, many limitations remain in applying MapReduce systems to data intensive scientific problems, mainly because they do not support characteristics of science such as data formats, specialized data analytic tools (e.g., math libraries), accuracy requirements, and interfaces with non-MapReduce components.

This thesis addresses some of these limitations by proposing a MapReduce workflow model and its runtime system, built on Hadoop, for orchestrating MapReduce jobs in data intensive scientific workflows. A heuristic-based scheduling algorithm is proposed within the workflow system to manage the execution of data intensive scientific applications. The thesis develops a hybrid scheduling algorithm that is based on runtime dynamic priorities, uses proportional resource sharing techniques to reduce delays for variable length concurrent tasks, and takes advantage of data locality.
As a result, a new scheduling policy, Balanced Closer to Finish First (BCFF), is proposed as a solution to several scheduling problems in MapReduce environments. The algorithm is implemented as a new plug-in scheduler for the Hadoop 1.0.1 framework. Evaluations of the workflow system on a climate data processing and analysis application (a multi-terabyte dataset) show that it is feasible and significantly outperforms a traditional parallel processing approach. The scientific results of the application provide a new source for monitoring global climate change over the decade 2002-2011.
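The abstract does not give BCFF's exact scoring formula, but the three ingredients it names (runtime dynamic priorities, proportional resource sharing, and data locality) can be combined in a toy slot-assignment rule. The sketch below is a hypothetical illustration, not the thesis's implementation: the weights, the `Job` fields, and the `pick_job` helper are all invented for exposition.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    total_tasks: int
    done_tasks: int = 0
    running_tasks: int = 0   # tasks currently occupying cluster slots

def bcff_score(job, node, data_hosts, fair_share):
    """Score a job for the next free slot on `node` (weights are illustrative)."""
    # "Closer to finish first": favour jobs with a larger completed fraction.
    progress = job.done_tasks / job.total_tasks
    # "Balanced": favour jobs running below their proportional fair share.
    balance = fair_share - job.running_tasks
    # Data locality: bonus when the job has input data stored on this node.
    locality = 1.0 if node in data_hosts else 0.0
    return progress + 0.5 * balance + 0.25 * locality

def pick_job(jobs, node, locality_map, total_slots):
    """Return the job that should receive the free slot on `node`, or None."""
    fair_share = total_slots / len(jobs)
    runnable = [j for j in jobs
                if j.done_tasks + j.running_tasks < j.total_tasks]
    if not runnable:
        return None
    return max(runnable,
               key=lambda j: bcff_score(j, node,
                                        locality_map.get(j.name, set()),
                                        fair_share))

# Example: the nearly finished job wins the free slot on node-3 even
# though the other job has data local to that node.
jobs = [Job("climate-agg", total_tasks=10, done_tasks=8, running_tasks=1),
        Job("climate-etl", total_tasks=10, done_tasks=2, running_tasks=1)]
chosen = pick_job(jobs, "node-3", {"climate-etl": {"node-3"}}, total_slots=4)
```

In a real Hadoop 1.x plug-in, this kind of scoring would live inside a `TaskScheduler` subclass that is consulted on every TaskTracker heartbeat; the relative weights would determine how strongly completion progress is traded off against fairness and locality.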

Record details

  • Author

    Nguyen, Phuong Thi Thu.

  • Author's affiliation

    University of Maryland, Baltimore County.

  • Degree-granting institution University of Maryland, Baltimore County.
  • Subject Computer Science.
  • Degree Ph.D.
  • Year 2012
  • Pagination 165 p.
  • Total pages 165
  • Format PDF
  • Language eng
