Lightweight Internet Traffic Classification: A Subject-Based Solution with Word Embeddings


Abstract

Internet traffic classification is a relevant and mature research field, yet it is of growing importance and still has open technical challenges, partly because of the pervasive presence of Internet-connected devices in everyday life. We argue that there is a need for innovative traffic classification solutions that are lightweight, adopt a domain-based approach, and classify Internet traffic by subject rather than concentrating only on application-level protocol categorization. To this end, this paper proposes an original classification solution that leverages domain name information extracted from IPFIX summaries, DNS logs, and DHCP leases, and that can be applied to any kind of traffic. Our proposed solution is based on an extension of Word2vec unsupervised learning techniques running on a specialized Apache Spark cluster. In particular, these learning techniques generate word embeddings from a mixed dataset composed of domain names and natural language corpora, in a lightweight way and with general applicability. The paper also reports lessons learnt from our implementation and deployment experience, which demonstrates that our solution can process 5500 IPFIX summaries per second on an Apache Spark cluster with 1 slave instance in Amazon EC2 at a cost of $3860 per year. The reported experimental results for Precision, Recall, F-Measure, Accuracy, and Cohen's Kappa show the feasibility and effectiveness of the proposal. The experiments prove that the words contained in domain names are related to the kind of traffic directed towards them; therefore, using specifically trained word embeddings, we are able to classify them into customizable categories. We also show that training word embeddings on larger natural language corpora leads to improvements in precision of up to 180%.
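To make the idea of classifying domain names through word embeddings more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a tiny hand-written embedding table (`TOY_EMBEDDINGS`), hypothetical categories with seed words (`CATEGORY_SEEDS`), a naive tokenizer that only splits on non-letter characters, and a nearest-category rule based on cosine similarity. None of these details come from the abstract, which only states that embeddings are trained with an extension of Word2vec on Apache Spark.

```python
import math
import re

# Toy 3-dimensional embeddings, invented for illustration only. In the paper,
# embeddings are learned from domain names and natural language corpora.
TOY_EMBEDDINGS = {
    "news": (0.9, 0.1, 0.0),
    "daily": (0.8, 0.2, 0.1),
    "sport": (0.1, 0.9, 0.0),
    "football": (0.0, 1.0, 0.1),
    "shop": (0.1, 0.0, 0.9),
    "store": (0.0, 0.1, 1.0),
}

# Hypothetical subject categories, each described by a few seed words.
CATEGORY_SEEDS = {
    "news": ["news", "daily"],
    "sports": ["sport", "football"],
    "shopping": ["shop", "store"],
}


def tokenize_domain(domain):
    """Split a domain name on non-letter characters and keep known words."""
    parts = re.split(r"[^a-z]+", domain.lower())
    return [word for word in parts if word in TOY_EMBEDDINGS]


def mean_vector(words):
    """Average the embedding vectors of the given words, component-wise."""
    vectors = [TOY_EMBEDDINGS[word] for word in words]
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_domain(domain):
    """Return the closest category, or None if no word in the domain is known."""
    words = tokenize_domain(domain)
    if not words:
        return None
    domain_vector = mean_vector(words)
    return max(
        CATEGORY_SEEDS,
        key=lambda category: cosine(domain_vector, mean_vector(CATEGORY_SEEDS[category])),
    )


print(classify_domain("daily-news.example.com"))
print(classify_domain("football.example.org"))
print(classify_domain("example.com"))
```

Because the categories are defined only by seed words, they can be changed without retraining the embeddings, which is one plausible reading of the "customizable categories" mentioned in the abstract.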


Bibliographic record

Source
International Conference on Smart Computing | 2016 | pp. 1-8 | 8 pages
Authors
Antonio Murgia; Giacomo Ghidini; Stephen P. Emmons; Paolo Bellavista
Original format: PDF

Similar literature


1. Human trajectory data and internet traffic mining using improved multi-context trajectory embedding service usage classification model [J]. Suryakumar B, Dr. Ramadevi E. International Journal of Engineering & Technology. 2018, No. 4
2. A streaming flow-based technique for traffic classification applied to 12+1 years of Internet traffic [J]. Carela-Espanol Valentin, Barlet-Ros Pere, Bifet Albert. Telecommunication Systems: Modeling, Analysis, Design and Management. 2016, No. 2
3. A lightweight model with spatial-temporal correlation for cellular traffic prediction in Internet of Things [J]. Chien Wei-Che, Huang Yueh-Min. Journal of Supercomputing. 2021, No. 9
4. Lightweight Internet Traffic Classification: A Subject-Based Solution with Word Embeddings [C]. Antonio Murgia, Giacomo Ghidini, Stephen P. Emmons. International Conference on Smart Computing. 2016
5. Internet and Tor Traffic Classification Using Machine Learning [D]. Palsambkar, Siddharth. 2019
6. A Word on Words in Words: How Do Embedded Words Affect Reading? [O]. Joshua Snell, Jonathan Grainger, Mathieu Declerck. 2018
7. Lightweight Internet Traffic Classification based on Packet Level Hidden Markov Models [O]. Naveed Akhtar, Muhammad Kamran. 2017
