Conference: 2015 IEEE International Conference on Smart City

D3-MapReduce: Towards MapReduce for Distributed and Dynamic Data Sets

Abstract

Since its introduction by Google in 2004, MapReduce has become the programming model of choice for processing large data sets. Although MapReduce was originally developed for use by web enterprises in large data centers, it has attracted considerable attention from the scientific community for its applicability to large-scale parallel data analysis (including geographic, high-energy physics, genomics, etc.). So far, MapReduce has mostly been designed for batch processing of bulk data. The ambition of D3-MapReduce is to extend the MapReduce programming model and propose an efficient implementation of this model to: i) cope with distributed data sets, i.e. data sets that span multiple distributed infrastructures or are stored on networks of loosely connected devices; ii) cope with dynamic data sets, i.e. data sets that change over time or that may be incomplete or only partially available. In this paper, we draw the path towards this ambitious goal. Our approach leverages the Data Life Cycle as a key concept to provide MapReduce for distributed and dynamic data sets on heterogeneous and distributed infrastructures. First, we report on our attempts at implementing the MapReduce programming model for Hybrid Distributed Computing Infrastructures (Hybrid DCIs). We present the architecture of the prototype, which is based on BitDew, a middleware for large-scale data management, and Active Data, a programming model for data life cycle management. Second, we outline the methodological challenges and present our approaches based on simulation and emulation on the Grid'5000 experimental testbed. We conduct performance evaluations and compare our prototype with Hadoop, the industry reference MapReduce implementation. We present our work in progress on dynamic data sets, which has led us to implement an incremental MapReduce framework. Finally, we discuss our achievements and outline the challenges that remain to be addressed before obtaining a complete D3-MapReduce environment.
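To make the notion of incremental MapReduce over a dynamic data set more concrete, the following is a minimal single-process sketch in Python. It is not the authors' BitDew or Active Data prototype, and it does not reflect their design. The word-count job, the class name `IncrementalWordCount`, and the chunk identifiers are all assumptions introduced purely for illustration. The idea shown is that per-chunk results are cached, so when one chunk changes only that chunk is mapped again and the global result is adjusted.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def map_words(chunk: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one data chunk."""
    for word in chunk.split():
        yield word.lower(), 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce phase: sum the values emitted for each key."""
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


class IncrementalWordCount:
    """Hypothetical helper (not from the paper): caches per-chunk counts so
    that only chunks whose content changed are processed again."""

    def __init__(self) -> None:
        self._per_chunk: Dict[str, Counter] = {}
        self.totals: Counter = Counter()

    def update(self, chunk_id: str, content: str) -> None:
        """Add a new chunk, or replace an existing chunk with new content."""
        old = self._per_chunk.get(chunk_id, Counter())
        new = reduce_counts(map_words(content))
        self.totals.subtract(old)    # retract the stale contribution
        self.totals.update(new)      # add the fresh contribution
        self.totals = +self.totals   # unary plus drops keys with count <= 0
        self._per_chunk[chunk_id] = new

    def remove(self, chunk_id: str) -> None:
        """Retract a chunk that is no longer available."""
        old = self._per_chunk.pop(chunk_id, Counter())
        self.totals.subtract(old)
        self.totals = +self.totals


if __name__ == "__main__":
    job = IncrementalWordCount()
    job.update("a", "map reduce map")
    job.update("b", "reduce data")
    job.update("a", "map data")  # chunk "a" changed; chunk "b" is not re-mapped
    print(dict(job.totals))
```

A real system would distribute the map and reduce tasks across nodes and would need to detect data changes itself; here the caller reports changes explicitly through `update` and `remove`.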
