D3-MapReduce: Towards MapReduce for Distributed and Dynamic Data Sets


Abstract

Since Google introduced it in 2004, MapReduce has become the programming model of choice for processing large data sets. Although MapReduce was originally developed for web enterprises running large data centers, it has attracted considerable attention from the scientific community for its applicability to large-scale parallel data analysis (in geography, high-energy physics, genomics, etc.). So far, MapReduce has mostly been designed for batch processing of bulk data. The ambition of D3-MapReduce is to extend the MapReduce programming model and propose an efficient implementation of this model that can: i) cope with distributed data sets, i.e. data sets that span multiple distributed infrastructures or are stored on networks of loosely connected devices; ii) cope with dynamic data sets, i.e. data sets that change over time or that may be incomplete or only partially available. In this paper, we chart the path towards this ambitious goal. Our approach leverages the Data Life Cycle as a key concept to provide MapReduce for distributed and dynamic data sets on heterogeneous and distributed infrastructures. We first report on our attempts at implementing the MapReduce programming model for Hybrid Distributed Computing Infrastructures (Hybrid DCIs). We present the architecture of the prototype, which is based on BitDew, a middleware for large-scale data management, and Active Data, a programming model for data life cycle management. Second, we outline the methodological challenges and present our approaches based on simulation and emulation on the Grid'5000 experimental testbed. We conduct performance evaluations and compare our prototype with Hadoop, the industry reference MapReduce implementation. We present our work in progress on dynamic data sets, which has led us to implement an incremental MapReduce framework. Finally, we discuss our achievements and outline the challenges that remain to be addressed before obtaining a complete D3-MapReduce environment.
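The abstract mentions an incremental MapReduce framework for data sets that change over time but gives no implementation details. As a rough illustration of the general idea only, and not of the authors' prototype or of the BitDew or Active Data APIs, the sketch below keeps one partial word-count result per data chunk, so that when a chunk is added, changed, or removed only that chunk is re-mapped. All names here (`IncrementalWordCount`, `upsert`, `remove`, the chunk ids) are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def map_words(chunk: str) -> List[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce step: sum the values emitted for each key."""
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


class IncrementalWordCount:
    """Hypothetical helper: caches one partial result per chunk id."""

    def __init__(self) -> None:
        self._partials: Dict[str, Counter] = {}
        self._totals: Counter = Counter()

    def upsert(self, chunk_id: str, text: str) -> None:
        """Add a new chunk or replace a changed one, re-mapping only that chunk."""
        self.remove(chunk_id)
        partial = reduce_counts(map_words(text))
        self._partials[chunk_id] = partial
        self._totals.update(partial)

    def remove(self, chunk_id: str) -> None:
        """Withdraw a chunk's contribution from the global result, if present."""
        old = self._partials.pop(chunk_id, None)
        if old is not None:
            self._totals.subtract(old)
            # Unary plus keeps only keys whose count is still positive.
            self._totals = +self._totals

    def result(self) -> Dict[str, int]:
        """Return the current global word counts."""
        return dict(self._totals)
```

A caller would invoke `upsert` each time a chunk arrives or changes and read `result()` at any point. The trade-off in this sketch is extra memory for the cached partial results in exchange for not reprocessing unchanged chunks.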


Bibliographic details

Source
2015 IEEE International Conference on Smart City | 2015 | pp. 637-642 | 6 pages
Conference location: Chengdu (CN)
Authors
Haiwu He; Anthony Simonet; Julio Anjos; Jose-Francisco Saray; Gilles Fedak; Bing Tang; Lu Lu; Xuanhua Shi; Hai Jin; Mircea Moca; Gheorghe Cosmin Silaghi; Asma Ben Cheikh; Heithem Abbes
Original format: PDF
Language: English
Keywords
Data management; Hybrid Computing Infrastructure; Incremental Processing; MapReduce


Similar literature


1. Hengam: A MapReduce-Based Distributed Data Warehouse for Big Data [J]. Mohammadhossein Barkhordari, Mahdi Niamanesh. International Journal of Artificial Life Research. 2018, No. 1

2. Efficient Distributed Density Peaks for Clustering Large Data Sets in MapReduce [J]. Yanfeng Zhang, Shimin Chen, Ge Yu. IEEE Transactions on Knowledge and Data Engineering. 2016, No. 12

3. Cross-MapReduce: Data transfer reduction in geo-distributed MapReduce [J]. Saeed Mirpour Marzuni, Abdorreza Savadi, Adel N. Toosi. Future Generation Computer Systems. 2021, Feb.

4. D3-MapReduce: Towards MapReduce for Distributed and Dynamic Data Sets [C]. Haiwu He, Anthony Simonet, Julio Anjos, Jose-Francisco Saray. IEEE International Conference on Smart City. 2015

5. Enabling scalable data analysis for large computational structural biology datasets on large distributed memory systems supported by the MapReduce paradigm [D]. Zhang, Boyu. 2015

6. Knowledge and Theme Discovery across Very Large Biological Data Sets Using Distributed Queries: A Prototype Combining Unstructured and Structured Data [O]. Uma S. Mudunuri, Mohamad Khouja, Stephen Repetski

7. D3-MapReduce: Towards MapReduce for Distributed and Dynamic Data Sets [O]. He Haiwu, Simonet Anthony, Anjos Julio. 2015

