Conference: International Conference on Smart Computing

Lightweight Internet Traffic Classification: A Subject-Based Solution with Word Embeddings

Abstract

Internet traffic classification is a relevant and mature research field, yet one of growing importance and with open technical challenges, partly because of the pervasive presence of Internet-connected devices in everyday life. We argue that innovative traffic classification solutions are needed: solutions that are lightweight, that adopt a domain-based approach, and that classify Internet traffic by subject rather than concentrating only on application-level protocol categorization. To this end, this paper proposes an original classification solution that leverages domain name information extracted from IPFIX summaries, DNS logs, and DHCP leases, and that can be applied to any kind of traffic. The proposed solution is based on an extension of Word2vec unsupervised learning techniques running on a specialized Apache Spark cluster. In particular, the learning techniques are used to generate word embeddings from a mixed dataset composed of domain names and natural language corpora, in a lightweight way and with general applicability. The paper also reports lessons learnt from our implementation and deployment experience, which demonstrates that our solution can process 5500 IPFIX summaries per second on an Apache Spark cluster with 1 slave instance in Amazon EC2 at a cost of $3860 per year. Reported experimental results for Precision, Recall, F-Measure, Accuracy, and Cohen's Kappa show the feasibility and effectiveness of the proposal. The experiments show that the words contained in domain names are related to the kind of traffic directed towards them; therefore, using specifically trained word embeddings, we are able to classify domain names into customizable categories. We also show that training word embeddings on larger natural language corpora leads to improvements in precision of up to 180%.
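To make the idea of classifying domain names by subject through word embeddings more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a domain name can be split into words on dots and hyphens, that each word has an embedding vector, and that a category is represented by the average vector of a few seed words. The three-dimensional vectors, the seed words, the category names, and the function names are all hypothetical and hand-picked for illustration; in the paper the embeddings are learned with an extension of Word2vec on Apache Spark.

```python
import math
import re

# Hypothetical toy embeddings (hand-picked, NOT learned). A real system
# would load vectors trained on domain names and natural language text.
TOY_EMBEDDINGS = {
    "news":   (0.9, 0.1, 0.0),
    "times":  (0.8, 0.2, 0.1),
    "daily":  (0.7, 0.1, 0.2),
    "shop":   (0.1, 0.9, 0.0),
    "store":  (0.0, 0.8, 0.1),
    "deals":  (0.1, 0.7, 0.2),
    "video":  (0.0, 0.1, 0.9),
    "stream": (0.1, 0.0, 0.8),
    "tv":     (0.2, 0.1, 0.7),
}

# Hypothetical, customizable categories, each described by seed words.
CATEGORY_SEEDS = {
    "news": ["news", "daily"],
    "shopping": ["shop", "store"],
    "video": ["video", "stream"],
}


def tokenize_domain(domain):
    """Split a domain on dots and hyphens; keep only words with a vector."""
    parts = re.split(r"[.\-]", domain.lower())
    return [p for p in parts if p in TOY_EMBEDDINGS]


def mean_vector(words):
    """Component-wise average of the embeddings of the given words."""
    vectors = [TOY_EMBEDDINGS[w] for w in words]
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_domain(domain):
    """Return the closest category, or None if no word is recognized."""
    tokens = tokenize_domain(domain)
    if not tokens:
        return None
    domain_vector = mean_vector(tokens)
    return max(
        CATEGORY_SEEDS,
        key=lambda cat: cosine(domain_vector, mean_vector(CATEGORY_SEEDS[cat])),
    )


if __name__ == "__main__":
    for name in ["daily-times.example", "deals.store.example", "example.org"]:
        print(name, classify_domain(name))
```

This sketch does not segment concatenated labels (for instance, a single label made of two joined words is not split), and it ignores the IPFIX, DNS, and DHCP processing described in the abstract; both would be needed in a realistic pipeline.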