International Teletraffic Congress

CLUE: Clustering for Mining Web URLs


Abstract

The Internet has witnessed the proliferation of applications and services that rely on HTTP as their application protocol. Users play games, read emails, watch videos, chat, and access web pages using their PCs, which in turn download tens or hundreds of URLs to fetch all the objects needed to display the requested content. As a result, billions of URLs are observed in the network. When monitoring traffic, it is therefore becoming more and more important to have methodologies and tools that allow one to dig into this data and extract useful information. In this paper, we present CLUE, Clustering for URL Exploration, a methodology that leverages clustering algorithms, i.e., unsupervised techniques developed in the data mining field, to extract knowledge from passive observation of URLs carried by the network. This is a challenging problem given the unstructured format of URLs, which, being strings, call for specialized approaches. Inspired by text-mining algorithms, we introduce the concept of URL-distance and use it to compose clusters of URLs using the well-known DBSCAN algorithm. Experiments on actual datasets show encouraging results: well-separated and consistent clusters emerge and allow us to identify, e.g., malicious traffic, advertising services, and third-party tracking systems. In a nutshell, our clustering algorithm offers the means to gain insight into the data carried by the network, with applications in the security or privacy protection fields.
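
The abstract does not define URL-distance, so the following is only a minimal sketch of the general idea of density-based clustering of URL strings under a string distance. It assumes a length-normalized Levenshtein edit distance as a stand-in for the paper's URL-distance, uses a small pure-Python DBSCAN, and runs on made-up example URLs; the names `url_distance`, `dbscan`, `eps`, and `min_pts`, as well as the chosen threshold values, are illustrative assumptions rather than the authors' implementation.

```python
from typing import Callable, List, Optional, Sequence

NOISE = -1  # label for URLs that do not belong to any dense cluster


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def url_distance(a: str, b: str) -> float:
    """ASSUMPTION: edit distance normalized to [0, 1]; not the paper's metric."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def dbscan(items: Sequence[str], eps: float, min_pts: int,
           dist: Callable[[str, str], float] = url_distance) -> List[int]:
    """Plain DBSCAN over arbitrary items; returns one cluster label per item."""
    n = len(items)
    # Precompute eps-neighborhoods (O(n^2) distance calls; fine for a toy input).
    neighbors = [[j for j in range(n) if dist(items[i], items[j]) <= eps]
                 for i in range(n)]
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # j is a core point: keep expanding
                queue.extend(neighbors[j])
    return [NOISE if lab is None else lab for lab in labels]


# Made-up URLs for illustration only.
urls = [
    "ads.example.com/banner?id=1001",
    "ads.example.com/banner?id=1002",
    "ads.example.com/banner?id=1003",
    "track.example.net/pixel.gif?u=aaaa",
    "track.example.net/pixel.gif?u=aaab",
    "track.example.net/pixel.gif?u=aabb",
    "www.example.org/index.html",
]
labels = dbscan(urls, eps=0.2, min_pts=3)
for url, label in zip(urls, labels):
    print(label, url)
```

In this toy input, the advertising-style and tracking-style URLs differ from their siblings by only one or two characters, so each group forms its own cluster, while the unrelated page URL is left as noise. A plain character-level edit distance ignores URL structure (host, path, query parameters), which is one reason a purpose-built URL-distance may behave differently on real traffic.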
