PoN: Open source solution for real-time data analysis

Abstract

With rapid innovation and a growing Internet population, petabytes of information are generated every second. Processing and analysing data at this scale is tedious, and the volume of real-time data keeps growing. Nearly 80% of the data is unstructured, and analysing unstructured data in real time is very challenging. Traditional business intelligence (BI) tools perform best only against a pre-defined schema, but most real-time data are logs with no defined schema, so queries over these large datasets take a long time. While streaming real-time data, much unwanted information is extracted from the data source, which adds overhead to the system and raises the cost of construction and maintenance. Every second, new data streams about what is going on in the world accumulate in the system. Gathering and processing these data is essential for preparing a vital report. In this paper, we propose Piece of News (PoN), an end-to-end solution that uses appropriate Hadoop components for real-time data analytics. Our aim is to extract health data from general news data so that outbreaks can be predicted in real time. Rather than collecting all the news, we keep only the important news based on a certain threshold, thus reducing cost. We compare historical data with real-time data, which allows prompt action because earlier outbreaks are already known from the historical data. Going one step further, dangerous outbreaks could even be detected before anyone else in the world detects them. In addition to real-time analytics using Hadoop components, we ran queries over the collected news dataset using Hive and Pig. Finally, we present a comparison of their performance.
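The threshold-based filtering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how news importance is scored or what threshold is used, so everything here is an assumption made for illustration: the `HEALTH_TERMS` keyword list, the `health_score` function (fraction of words that are health terms), and the default threshold of 0.05 are all hypothetical, and the sketch runs in plain Python rather than on the Hadoop components the authors used.

```python
from dataclasses import dataclass

# Hypothetical keyword list; the abstract does not state how relevance is scored.
HEALTH_TERMS = {"outbreak", "virus", "infection", "epidemic", "flu", "cholera"}


@dataclass
class NewsItem:
    title: str
    body: str


def health_score(item: NewsItem) -> float:
    """Fraction of words in the item that are health terms (illustrative only)."""
    raw = f"{item.title} {item.body}".split()
    # Strip surrounding punctuation and lowercase each token.
    words = [w.strip(".,;:!?\"'()").lower() for w in raw]
    words = [w for w in words if w]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in HEALTH_TERMS)
    return hits / len(words)


def filter_news(items, threshold=0.05):
    """Keep only items whose score meets the (assumed) threshold."""
    return [item for item in items if health_score(item) >= threshold]


if __name__ == "__main__":
    items = [
        NewsItem("Cholera outbreak reported", "Officials confirm a new infection cluster."),
        NewsItem("Stock markets rally", "Shares rose on strong earnings."),
    ]
    for kept in filter_news(items):
        print(kept.title)
```

Dropping low-scoring items before storage is what would reduce the downstream cost the abstract refers to; a real deployment would apply the same predicate inside the ingestion pipeline rather than on an in-memory list.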
