
Improving Hadoop Performance by Using Metadata of Related Jobs in Text Datasets Via Enhancing MapReduce Workflow


Abstract

Cloud computing provides users with a range of data-processing services. One of its main concepts is BigData and BigData analysis. BigData refers to data that is complex, unstructured, or very large. Hadoop is a tool, or environment, for processing BigData in parallel. The idea behind Hadoop is that, rather than sending data to the servers for processing, it divides a job into small tasks and sends them to the servers. These servers hold the data, process the tasks, and send the results back to the master node. Hadoop has some limitations that could be addressed to achieve higher performance when executing jobs. These limitations stem mostly from data locality in the cluster, job and task scheduling, CPU execution time, or resource allocation. Data locality and efficient resource allocation remain challenges for the MapReduce platform in cloud computing. We propose an enhanced Hadoop architecture that reduces the computation cost associated with BigData analysis. At the same time, the proposed architecture addresses the issue of resource allocation in native Hadoop. It provides an efficient distributed clustering approach for dedicated cloud computing environments. The enhanced Hadoop architecture leverages the NameNode's ability to assign jobs to the TaskTrackers (DataNodes) within the cluster. With added control features, the NameNode can intelligently direct and assign tasks to the DataNodes that contain the required data. Our focus is on extracting features and building a metadata table that carries information about the existence and location of the data blocks in the cluster. This enables the NameNode to direct jobs to specific DataNodes without going through all the data sets in the cluster. Note that the newly built lookup table is an addition to the metadata table that already exists in native Hadoop.
Our work concerns processing real text in text data sets, which may be human-readable, such as books, or not, such as DNA data sets. To test the performance of the proposed architecture, we perform DNA sequence matching and alignment of various short genome sequences. Compared with native Hadoop, the enhanced Hadoop reduced CPU time, the number of read operations, input data size, and several other factors.
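The lookup-table idea in the abstract can be illustrated with a small Python sketch. It is not the author's implementation and does not use any Hadoop API: the class `FeatureLookupTable`, the function `plan_job`, the block and DataNode names, and the short DNA strings are all hypothetical. It also assumes that the table is filled in by earlier, related jobs, and that a job falls back to scanning every block when its feature has not been seen before.

```python
from collections import defaultdict


class FeatureLookupTable:
    """Hypothetical table: feature -> {block_id: set of DataNode names}."""

    def __init__(self):
        self._table = defaultdict(dict)

    def record(self, feature, block_id, datanodes):
        """Remember that `feature` occurs in `block_id`, stored on `datanodes`."""
        self._table[feature][block_id] = set(datanodes)

    def locate(self, feature):
        """Return {block_id: datanodes} for a known feature, or None if unseen."""
        if feature not in self._table:
            return None
        return dict(self._table[feature])


def plan_job(feature, all_blocks, lookup):
    """Choose the blocks a job has to read.

    `all_blocks` maps every block of the data set to the DataNodes holding it.
    Assumption: with no recorded metadata, the job scans the whole data set.
    """
    known = lookup.locate(feature)
    return all_blocks if known is None else known


# Made-up cluster layout and sequences, for illustration only.
all_blocks = {
    "blk_1": {"dn1", "dn2"},
    "blk_2": {"dn2", "dn3"},
    "blk_3": {"dn1", "dn3"},
}
lookup = FeatureLookupTable()

# An earlier job found the sequence "ACGTAC" only in blk_2.
lookup.record("ACGTAC", "blk_2", all_blocks["blk_2"])

related_plan = plan_job("ACGTAC", all_blocks, lookup)    # targeted read
unrelated_plan = plan_job("TTGACA", all_blocks, lookup)  # full scan
print(sorted(related_plan), sorted(unrelated_plan))
```

In this toy setting the related job reads one block instead of three, which mirrors the kind of reduction in read operations and input size that the abstract reports, without reproducing any measured result.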

Bibliographic record

  • Author

    Alshammari Hamoud H.;

  • Author affiliation
  • Year 2016
  • Total pages
  • Original format PDF
  • Language of text en_US
  • Chinese Library Classification
