
Distributed frameworks towards building an open data architecture.

Abstract

Data is everywhere. Current advances in digital technology and social media, together with the ease with which application services can interact with a wide variety of systems, are generating tremendous volumes of data. Because of this variety of services, data formats are no longer restricted to structured types such as text; applications also generate unstructured content such as social media posts, videos, and images. Generated data is of no use unless it is stored and analyzed to derive value. Traditional database systems come with limitations on data format and schema, access rates, storage sizes, and so on. Hadoop is an Apache open-source distributed framework that reliably stores huge datasets of differently formatted data on its file system, the Hadoop Distributed File System (HDFS), and processes the data stored on HDFS using the MapReduce programming model.

This thesis is about building a data architecture using Hadoop and its related open-source distributed frameworks to support a data flow pipeline on low-cost commodity hardware. The data flow components are data sourcing, storage management on HDFS, and a data access layer. The study also discusses a use case that exercises the architecture's components: Sqoop, a framework for ingesting structured data from a database into Hadoop, and Flume, used to ingest semi-structured streaming Twitter JSON data onto HDFS for analysis. The data sourced with Sqoop and Flume is analyzed using Hive for SQL-like analytics, and at a higher level of the data access layer, Hadoop is compared with an in-memory computing system, Spark. Significant differences in query execution performance are analyzed between the Hadoop and Spark frameworks. This integration helps ingest huge volumes of streaming JSON data of varied structure and derive better value-based analytics using Hive and Spark.
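As a concrete illustration of the structured-data sourcing step, the sketch below drives a Sqoop import from Python by shelling out to the Sqoop CLI. The JDBC URL, credentials, source table, and HDFS target directory are hypothetical placeholders for illustration, not details taken from the thesis.

```python
# A minimal sketch of the Sqoop ingestion step, invoked from Python.
# All connection details below are illustrative placeholders.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # hypothetical source database
    "--username", "etl_user",                   # hypothetical account
    "--password-file", "/user/etl/.dbpass",     # keeps the secret off the command line
    "--table", "orders",                        # hypothetical source table
    "--target-dir", "/user/sqoop/orders",       # HDFS landing directory
], check=True)                                  # raise if the import job fails
```

Sqoop executes the import as a MapReduce job, so the rows land on HDFS in parallel and are then visible to Hive or Spark.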
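On the semi-structured side, once Flume has landed the raw tweet JSON on HDFS it can be queried with SQL. Below is a minimal PySpark sketch of that data access layer; the HDFS path, view name, and tweet field `user.screen_name` (standard in Twitter's v1.1 streaming payload) are assumptions for illustration.

```python
# A minimal sketch, assuming Flume has written raw tweet JSON under
# /user/flume/tweets on HDFS; the path and field names are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tweet-analytics-sketch")
         .enableHiveSupport()        # lets spark.sql() also reach Hive tables
         .getOrCreate())

# Spark infers a schema from the semi-structured JSON as it reads.
tweets = spark.read.json("hdfs:///user/flume/tweets")

# Expose the DataFrame to SQL, mirroring the Hive-style analytics
# described in the abstract.
tweets.createOrReplaceTempView("tweets")

spark.sql("""
    SELECT user.screen_name AS screen_name, COUNT(*) AS n_tweets
    FROM tweets
    GROUP BY user.screen_name
    ORDER BY n_tweets DESC
    LIMIT 10
""").show()
```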

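The reported query performance differences between Hive on MapReduce and Spark can be approximated with simple wall-clock timing of the same statement in each engine. The sketch below covers only the Spark side and assumes a Hive table named `tweets` already exists; it is far cruder than a real benchmark.

```python
# A rough timing sketch; the table name and query are illustrative and
# assume the Hive metastore already knows a `tweets` table.
import time
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-vs-spark-timing-sketch")
         .enableHiveSupport()
         .getOrCreate())

QUERY = "SELECT COUNT(*) FROM tweets"

start = time.perf_counter()
spark.sql(QUERY).collect()   # .collect() forces full execution, not just planning
print(f"Spark ran {QUERY!r} in {time.perf_counter() - start:.2f} s")
```

Timing the same statement through the Hive CLI (for example `hive -e "SELECT COUNT(*) FROM tweets"`) gives the MapReduce side of the comparison; repeated interactive queries generally favor Spark because it avoids per-job startup and intermediate disk I/O.
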
Record details

  • Author: Venumuddala, Ramu Reddy
  • Affiliation: University of North Texas
  • Degree-granting institution: University of North Texas
  • Subject: Computer science
  • Degree: M.S.
  • Year: 2015
  • Pagination: 58 p.
  • Total pages: 58
  • Format: PDF
  • Language: English
  • Added to database: 2022-08-17 11:52:20
