IEEE Transactions on Dependable and Secure Computing

Towards Automated Log Parsing for Large-Scale Log Data Analysis



Abstract

Logs are widely used in system management for dependability assurance because they are often the only data available that record detailed system runtime behaviors in production. Because log volumes are constantly growing, developers (and operators) aim to automate log analysis by applying data mining methods, which require structured input data (e.g., matrices). This has triggered a number of studies on log parsing, which aims to transform free-text log messages into structured events. However, due to the lack of open-source implementations of these log parsers and of benchmarks for performance comparison, developers are unlikely to be aware of the effectiveness and limitations of existing log parsers when applying them in practice. As a result, they often have to reimplement or redesign a parser, which is time-consuming and redundant. In this paper, we first present a characterization study of the current state-of-the-art log parsers and evaluate their efficacy on five real-world datasets with over ten million log messages. We find that, although the overall accuracy of these parsers is high, they are not robust across all datasets. When logs grow to a large scale (e.g., 200 million log messages), which is common in practice, these parsers are not efficient enough to handle such data on a single computer. To address these limitations, we design and implement a parallel log parser (named POP) on top of Spark, a large-scale data processing platform. Comprehensive experiments have been conducted to evaluate POP on both synthetic and real-world datasets. The evaluation results demonstrate the capability of POP in terms of accuracy, efficiency, and effectiveness on subsequent log mining tasks.
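To illustrate the core idea of log parsing described above — collapsing free-text log messages into structured event templates — here is a minimal sketch in Python. This is not the POP algorithm from the paper; it is a simplified illustration that masks common variable fields (IP addresses, hex values, numbers) with a `<*>` wildcard, and the mask patterns and function names are assumptions for demonstration only.

```python
import re

# Simplified masking rules: replace variable tokens with a wildcard so that
# messages produced by the same logging statement collapse to one template.
# (Illustrative only; real parsers use more sophisticated heuristics.)
MASKS = [
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+(:\d+)?\b"), "<*>"),  # IP[:port]
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<*>"),             # hex values
    (re.compile(r"\b\d+\b"), "<*>"),                        # plain numbers
]

def to_template(message: str) -> str:
    """Replace variable fields in a raw log message with <*>."""
    for pattern, repl in MASKS:
        message = pattern.sub(repl, message)
    return message

def parse(messages):
    """Group raw log messages by event template, returning template -> count."""
    events = {}
    for msg in messages:
        tpl = to_template(msg)
        events[tpl] = events.get(tpl, 0) + 1
    return events
```

For example, the two messages `"Connection from 10.0.0.1:5050 closed after 120 seconds"` and `"Connection from 192.168.1.7:8020 closed after 45 seconds"` both reduce to the single template `"Connection from <*> closed after <*> seconds"`. A parallel parser such as POP distributes this kind of template extraction across a cluster rather than a single machine.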


