Journal: SIGKDD Explorations

Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity


Abstract

The application of machine learning for the detection of malicious network traffic has been well researched over the past several decades; it is particularly appealing when the traffic is encrypted because traditional pattern-matching approaches cannot be used. Unfortunately, the promise of machine learning has been slow to materialize in the network security domain. In this paper, we highlight two primary reasons why this is the case: inaccurate ground truth and a highly non-stationary data distribution. To demonstrate and understand the effect that these pitfalls have on popular machine learning algorithms, we design and carry out experiments that show how six common algorithms perform when confronted with real network data. With our experimental results, we identify the situations in which certain classes of algorithms underperform on the task of encrypted malware traffic classification. We offer concrete recommendations for practitioners given the real-world constraints outlined. From an algorithmic perspective, we find that the random forest ensemble method outperformed competing methods. More importantly, feature engineering was decisive; we found that iterating on the initial feature set, and including features suggested by domain experts, had a much greater impact on the performance of the classification system. For example, linear regression using the more expressive feature set easily outperformed the random forest method using a standard network traffic representation on all criteria considered. Our analysis is based on millions of TLS encrypted sessions collected over 12 months from a commercial malware sandbox and two geographically distinct, large enterprise networks.
