Addressing the class imbalance problem in Twitter spam detection using ensemble learning

Abstract

In recent years, microblogging sites such as Twitter have become an important and popular source of real-time information and news, and they have inevitably become a prime target for spammers. A series of incidents has shown that the security threats posed by Twitter spam can reach far beyond the social media platform and affect the real world. To mitigate the threat, many recent studies apply machine learning techniques to classify Twitter spam and report promising results. However, most of these studies overlook the class imbalance problem in real-world Twitter data. In this paper, we experimentally demonstrate that the unequal distribution between the spam and non-spam classes has a great impact on the spam detection rate. To address the problem, we propose FOS, a fuzzy-based oversampling method that generates synthetic data samples from limited observed samples based on the idea of fuzzy-based information decomposition. Moreover, we develop an ensemble learning approach that learns more accurate classifiers from imbalanced data in three steps. In the first step, the class distribution of the imbalanced data set is adjusted using various strategies, including random oversampling, random undersampling and FOS. In the second step, a classification model is built on each of the redistributed data sets. In the final step, a majority voting scheme combines the predictions of all the classification models. We evaluate the approach in experiments on real-world Twitter data. The results indicate that the proposed learning approach can significantly improve the spam detection rate on data sets with imbalanced class distribution.
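
The three-step pipeline described in the abstract (resample, train one model per resampled set, combine by majority vote) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify how FOS works, so a simple interpolation-based oversampler stands in for it, and a nearest-centroid classifier stands in for whatever base classifier the paper uses. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def split_by_class(X, y):
    """Group feature vectors by label (1 = spam, 0 = non-spam)."""
    groups = {0: [], 1: []}
    for features, label in zip(X, y):
        groups[label].append(features)
    return groups


def flatten(groups):
    """Turn the per-class groups back into parallel X and y lists."""
    X = [row for c in (0, 1) for row in groups[c]]
    y = [c for c in (0, 1) for _ in groups[c]]
    return X, y


def minority_majority(groups):
    """Return (minority label, majority label) by class size."""
    minority = min(groups, key=lambda c: len(groups[c]))
    return minority, 1 - minority


def random_oversample(X, y, rng):
    """Duplicate random minority samples until both classes are equal."""
    groups = split_by_class(X, y)
    lo, hi = minority_majority(groups)
    deficit = len(groups[hi]) - len(groups[lo])
    groups[lo] = groups[lo] + [rng.choice(groups[lo]) for _ in range(deficit)]
    return flatten(groups)


def random_undersample(X, y, rng):
    """Keep a random subset of the majority class, sized like the minority."""
    groups = split_by_class(X, y)
    lo, hi = minority_majority(groups)
    groups[hi] = rng.sample(groups[hi], len(groups[lo]))
    return flatten(groups)


def interpolation_oversample(X, y, rng):
    """Stand-in for FOS (NOT the paper's method): synthesize minority
    samples by interpolating between two random minority samples."""
    groups = split_by_class(X, y)
    lo, hi = minority_majority(groups)
    synthetic = []
    for _ in range(len(groups[hi]) - len(groups[lo])):
        a, b = rng.choice(groups[lo]), rng.choice(groups[lo])
        t = rng.random()
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    groups[lo] = groups[lo] + synthetic
    return flatten(groups)


def train_centroids(X, y):
    """Toy base classifier: one mean vector per class.
    Assumes both classes are present in the training data."""
    groups = split_by_class(X, y)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in groups.items()}


def predict_one(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: sum((xi - ci) ** 2
                                            for xi, ci in zip(x, centroids[c])))


def fit_ensemble(X, y, seed=0):
    """Steps 1 and 2: resample three ways, train one model per resampled set."""
    rng = random.Random(seed)
    strategies = [random_oversample, random_undersample, interpolation_oversample]
    return [train_centroids(*strategy(X, y, rng)) for strategy in strategies]


def predict_ensemble(models, x):
    """Step 3: majority vote over the individual model predictions."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Toy imbalanced data: 20 non-spam points near the origin, 3 spam points near (5, 5).
X = [[i * 0.1, (i % 5) * 0.1] for i in range(20)] + [[5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
y = [0] * 20 + [1] * 3
models = fit_ensemble(X, y)
print(predict_ensemble(models, [5.0, 5.0]), predict_ensemble(models, [0.5, 0.1]))
```

Using an odd number of models with binary labels avoids tied votes. In practice one would likely swap in an established resampling library and a stronger base classifier; the sketch only shows how the three steps fit together.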

Bibliographic information

  • Source
    Computers & Security | 2017, Issue 8 | pp. 35-49 | 15 pages
  • Author affiliations

    School of Information Technology, Deakin University, Geelong, Australia;

    School of Information Technology, Deakin University, Geelong, Australia;

    School of Information Technology, Deakin University, Geelong, Australia;

    School of Information Technology, Deakin University, Geelong, Australia;

    School of Information Technology, Deakin University, Geelong, Australia;

  • Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification:
  • Keywords

    Online social networks; Twitter; Spam detection; Machine learning; Class imbalance;

