Source: International Journal of Grid and High Performance Computing

Flexible MapReduce Workflows for Cloud Data Analytics

Abstract

Data analytics applications handle large data sets that pass through multiple processing phases, some of which can execute in parallel on clusters, grids, or clouds. Such applications can benefit from the MapReduce model, which only requires the end user to define the application algorithms for input data processing and the map and reduce functions. However, this requires installing and configuring specific frameworks such as Apache Hadoop or Elastic MapReduce in the Amazon cloud. To provide more flexibility in defining and adjusting application configurations, and in specifying the composition and orchestration of the application phases, the authors describe an approach for supporting MapReduce stages as sub-workflows in the AWARD (Autonomic Workflow Activities Reconfigurable and Dynamic) framework. The authors discuss how a text mining application is represented as a complex workflow with multiple phases, in which individual workflow nodes support MapReduce computations. Access to intermediate data produced during the MapReduce computations is supported by a data sharing abstraction. The authors describe two implementations of this abstraction, one based on a shared tuple space and another based on an in-memory distributed key/value store. They also describe the implementation of the framework, a set of developed tools, and their experiments executing the text mining algorithm on multiple Amazon EC2 (Elastic Compute Cloud) instances, and they report the speed-up and size-up results obtained with up to 20 EC2 instances and for corpus sizes of up to 97 million words.
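To make the idea of a MapReduce stage exchanging intermediate data through a shared store more concrete, the following is a minimal single-process sketch in Python. It is not the AWARD framework's API, and it does not reproduce the authors' text mining algorithm. The `InMemoryStore` class, its `put`/`keys`/`take` methods, and the `run_stage` function are all hypothetical names chosen for illustration, and a plain word count stands in for the text mining step.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class InMemoryStore:
    """Hypothetical stand-in for a data sharing abstraction.

    A real deployment would use a shared tuple space or a distributed
    key/value store; here a local dictionary plays that role.
    """

    def __init__(self) -> None:
        self._data: Dict[str, List[int]] = defaultdict(list)

    def put(self, key: str, value: int) -> None:
        # Append an intermediate value under its key.
        self._data[key].append(value)

    def keys(self) -> List[str]:
        # Return a sorted snapshot of the keys currently stored.
        return sorted(self._data)

    def take(self, key: str) -> List[int]:
        # Remove and return all values stored under the key.
        return self._data.pop(key, [])


def map_words(document: str) -> Iterable[Tuple[str, int]]:
    """Map function: emit (word, 1) for every word in the document."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce function: sum the partial counts for one word."""
    return word, sum(counts)


def run_stage(documents: List[str], store: InMemoryStore) -> Dict[str, int]:
    """Run one map phase and one reduce phase over the shared store."""
    # Map phase: write intermediate pairs into the store.
    for document in documents:
        for key, value in map_words(document):
            store.put(key, value)
    # Reduce phase: read the grouped values back, one key at a time.
    return dict(reduce_counts(key, store.take(key)) for key in store.keys())


if __name__ == "__main__":
    corpus = ["the cloud runs the workflow", "the workflow runs"]
    print(run_stage(corpus, InMemoryStore()))
```

In this sketch the map and reduce phases communicate only through the store, so swapping `InMemoryStore` for a different backend with the same three methods would leave `run_stage` unchanged. That is the kind of decoupling a data sharing abstraction is meant to provide, although the abstract does not specify the actual interface.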
