IEEE International Conference on Intelligence and Security Informatics

Prioritized active learning for malicious URL detection using weighted text-based features

Abstract

Data analytics is increasingly used in cyber-security problems and has proved useful where data volumes and heterogeneity make manual assessment by security experts cumbersome. In practical cyber-security scenarios involving data-driven analytics, obtaining annotated data (i.e. ground-truth labels) is challenging and a known limiting factor for many supervised security analytics tasks. Significant portions of large datasets typically remain unlabelled, as annotation is largely manual and requires a huge amount of expert intervention. In this paper, we propose an effective active learning approach that efficiently addresses this limitation in a practical cyber-security problem, phishing categorization, using a human-machine collaborative approach to design a semi-supervised solution. An initial classifier is learnt on a small amount of annotated data and is then updated iteratively by shortlisting, from the large pool of unlabelled data, only the relevant samples that are most likely to influence classifier performance quickly. Prioritized Active Learning shows significant promise for achieving faster convergence of classification performance in a batch learning framework, thus requiring even less human annotation effort. A useful feature weight update technique combined with active learning shows promising classification performance for categorizing phishing/malicious URLs without requiring a large amount of annotated training samples. In experiments with several collections of PhishMonger's Targeted Brand dataset, the proposed method shows significant improvement over the baseline, by as much as 12%.
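
The iterative loop described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the selection criterion or the feature weight update, so everything here is an assumption rather than the authors' method: uncertainty sampling stands in for the prioritization step, a plain logistic-regression update over URL tokens stands in for the weighted text-based features, and the `oracle` callback is a hypothetical placeholder for the human annotator.

```python
import math
import re


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens (an assumed feature scheme)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def score(weights, url):
    """Return a score in (0, 1); higher means more likely malicious."""
    z = sum(weights.get(t, 0.0) for t in tokenize(url))
    return 1.0 / (1.0 + math.exp(-z))


def train(labeled, epochs=50, lr=0.5):
    """Fit one weight per token with logistic-regression SGD (assumed learner)."""
    weights = {}
    for _ in range(epochs):
        for url, label in labeled:
            err = label - score(weights, url)
            for t in tokenize(url):
                weights[t] = weights.get(t, 0.0) + lr * err
    return weights


def select_batch(weights, pool, batch_size):
    """Uncertainty sampling: shortlist the URLs whose score is closest to 0.5."""
    ranked = sorted(pool, key=lambda u: abs(score(weights, u) - 0.5))
    return ranked[:batch_size]


def active_learning(seed, pool, oracle, rounds=3, batch_size=2):
    """Train on a small seed set, then query the oracle for one batch per round."""
    labeled, pool = list(seed), list(pool)
    weights = train(labeled)
    for _ in range(rounds):
        if not pool:
            break
        batch = select_batch(weights, pool, batch_size)
        labeled += [(u, oracle(u)) for u in batch]   # annotation step (human in the loop)
        pool = [u for u in pool if u not in batch]
        weights = train(labeled)                      # retrain on the enlarged labeled set
    return weights, labeled


if __name__ == "__main__":
    # Made-up URLs on reserved example domains; labels: 1 = malicious, 0 = benign.
    seed = [("http://paypal-login.example/verify", 1),
            ("https://www.example.org/docs", 0)]
    truth = {"http://secure-login.example/update": 1,
             "https://docs.example.org/guide": 0}
    weights, labeled = active_learning(seed, list(truth), truth.get, rounds=1)
    print(len(labeled), "labeled URLs after one round")
```

Only the shortlisted batch is sent for annotation in each round, which is where the saving in expert effort comes from; a different prioritization rule could be swapped into `select_batch` without changing the rest of the loop.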
