PoN: Open source solution for real-time data analysis

Abstract

With rapid innovation and a growing Internet population, petabytes of information are being generated every second. Processing and analysing this enormous volume of data is a tedious process nowadays. The amount of real-time data is growing tremendously. Nearly 80% of the data is in unstructured format. Analysing unstructured data in real time is a very challenging task. Existing traditional business intelligence (BI) tools perform best only with a pre-defined schema. Most real-time data are logs and do not have any defined schema. Running queries over these large datasets takes a long time. During streaming of real-time data, much unwanted information is extracted from the data source, causing overhead in the system. This increases the cost of construction and maintenance. Every second, new data streams about what is going on in the world keep accumulating in the system. Gathering and processing these data is an essential skill for preparing a vital report. In this paper, we propose Piece of News (PoN), an end-to-end solution in which we used the appropriate Hadoop components for real-time data analytics. Our aim is to extract health data from ordinary news data so that we can predict outbreaks in real time. Rather than collecting all the news, we filtered only the important news based on a certain threshold, thus reducing the cost. We compared historical data with real-time data, which enables prompt action because outbreaks are already known from the previous data. Going one step further, we can even detect dangerous outbreaks before anyone else in the world. We not only performed real-time analytics using Hadoop components but also ran queries over the collected news dataset using Hive and Pig. Finally, we presented a comparison of their performance.

Bibliographic details

Source
Conference on Digital Information Processing, Data Mining, and Wireless Communications | 2016 | 335 p. | 6 pages in total
Authors
Nikitha Johnsirani Venkatesan; Earl Kim; Dong Ryeol Shin
Original format: PDF
Chinese Library Classification: Information processing
Keywords
Big data; Sparks; Real-time systems; Organizations; Google; Computers; Medical services

Similar literature

1. Concentrations and sources of non-methane hydrocarbons (NMHCs) from 2005 to 2013 in Hong Kong: A multi-year real-time data analysis [J]. Jiamin Ou, Hai Guo, Junyu Zheng. Atmospheric Environment. 2015, Issue feba.
2. Big data analytics using Splunk: deriving operational intelligence from social media, machine data, existing data warehouses, and other real-time streaming sources [J]. Alessandro Berni. Computing Reviews. 2014, Issue 5.
3. Joint analysis of geodetic and earthquake fault-plane solution data to constrain magmatic sources: A case study from Kilauea Volcano [J]. Wauthier Christelle, Roman Diana C., Poland Michael P. Earth and Planetary Science Letters: A Letter Journal Devoted to the Development in Time of the Earth and Planetary System. 2016.
4. PoN: Open source solution for real-time data analysis [C]. Nikitha Johnsirani Venkatesan, Earl Kim, Dong Ryeol Shin. 2016 Third International Conference on Digital Information Processing, Data Mining, and Wireless Communications. 2016.
5. Real-Time Query Systems for Complex Data Sources [D]. Rose, Ian Thomas. 2011.
6. IOCBIO Kinetics: An open-source software solution for analysis of data traces [O]. Marko Vendelin, Martin Laasmaa, Mari Kalda. 2020.
7. Self-Contained Information Resource (SCIR) for Automated Real-time Data Acquisition, Data Archival, Data Analysis, and Data Exploitation of Ground Truth Radiometric Signatures from Scaled Ordnance (SCALO) High Explosive Events [R]. Boye, L., Herther, T., Harris, C. 1999.
