Engineering Applications of Artificial Intelligence

Towards benchmark datasets for machine learning based website phishing detection: An experimental study



Abstract

The increasing popularity of the Internet has led to substantial growth in e-commerce. However, such activities face major security challenges, primarily caused by cyber-fraud and identity theft. Therefore, checking the legitimacy of visited web pages is a crucial task for securing customers' identities and preventing phishing attacks. The use of machine learning is widely recognized as a promising solution. The literature is rich with studies that use machine learning techniques for website phishing detection. However, their findings are dataset dependent and far from generalizable. Two main reasons for this unfortunate state are impracticable replication and the absence of appropriate benchmark datasets for the fair evaluation of systems. Moreover, phishing tactics are continuously evolving, and proposed systems do not follow these rapid changes. In this paper, we present a general scheme for building reproducible and extensible datasets for website phishing detection. The aim is to (1) enable the comparison of systems adopting different features, (2) overcome the short-lived nature of phishing websites, and (3) keep track of the evolution of phishing tactics. To experiment with the proposed scheme, we start by adopting a refined categorization of website phishing features; we systematically select a total of 87 commonly recognized features, categorize them, and subject them to relevance and runtime analysis. We use the collected set of features to build a dataset in line with the proposed scheme. Thereafter, we use a conceptual replication approach to check the generality of earlier findings on the built dataset. Specifically, we evaluate the performance of classifiers on individual and combined categories of features, we investigate different combinations of models, and we explore the effects of filter and wrapper methods on the selection of discriminative features. The results show that Random Forest is the most predictive classifier. Features gathered from external services are the most discriminative, whereas features extracted from web page content are less distinguishing. Besides external-service-based features, some web page content features are found unsuitable for runtime detection. The use of hybrid features provided the best accuracy score of 96.61%. Among the feature selection methods investigated, filter-based ranking with incremental removal of less important features improved performance up to 96.83%, outperforming wrapper methods.
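The sketch below illustrates the kind of pipeline the abstract describes: filter-based feature ranking with incremental removal of the least important features, evaluated with a Random Forest classifier. It is not the authors' implementation; the dataset path, the label column name, and the choice of mutual information as the filter criterion are assumptions made for illustration only.

```python
# Minimal sketch (assumptions noted above): filter-based feature ranking with
# incremental removal of the least informative features, scored with a
# Random Forest classifier under cross-validation.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

# Hypothetical dataset: one column per website feature plus a binary "label"
# column (1 = phishing, 0 = legitimate).
df = pd.read_csv("phishing_dataset.csv")
X, y = df.drop(columns=["label"]), df["label"]

# Filter step: rank features by mutual information with the class label,
# from least to most informative.
scores = mutual_info_classif(X, y, random_state=42)
ranked = [feature for _, feature in sorted(zip(scores, X.columns))]

clf = RandomForestClassifier(n_estimators=100, random_state=42)
best_acc, best_features = 0.0, list(X.columns)

# Incrementally drop the least informative features and keep the subset that
# yields the highest cross-validated accuracy.
for k in range(len(ranked) - 1):
    kept = ranked[k:]  # drop the k least informative features
    acc = cross_val_score(clf, X[kept], y, cv=5, scoring="accuracy").mean()
    if acc > best_acc:
        best_acc, best_features = acc, kept

print(f"Best accuracy: {best_acc:.4f} with {len(best_features)} features")
```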
