Journal: IEEE Transactions on Computational Social Systems

A Performance Evaluation of Machine Learning-Based Streaming Spam Tweets Detection


Abstract

The popularity of Twitter attracts more and more spammers. Spammers send unwanted tweets to Twitter users to promote websites or services, which are harmful to normal users. To stop spammers, researchers have proposed a number of mechanisms, and recent work focuses on applying machine learning techniques to Twitter spam detection. However, tweets are retrieved as a stream: Twitter provides the Streaming API so that developers and researchers can access public tweets in real time. A performance evaluation of existing machine learning-based streaming spam detection methods has been lacking. In this paper, we bridged the gap by carrying out a performance evaluation from three aspects: data, feature, and model. A large ground-truth dataset of over 600 million public tweets was created using a commercial URL-based security tool. For real-time spam detection, we further extracted 12 lightweight features for tweet representation. Spam detection was then transformed into a binary classification problem in the feature space, which can be solved by conventional machine learning algorithms. We evaluated the impact of several factors on spam detection performance: spam-to-nonspam ratio, feature discretization, training data size, data sampling, time-related data, and machine learning algorithms. The results show that streaming spam tweet detection is still a big challenge, and that a robust detection technique should take into account all three aspects of data, feature, and model.
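
The step of representing each tweet by lightweight features and treating detection as binary classification can be illustrated with a minimal sketch. The abstract does not list the 12 features or name the algorithms evaluated, so everything below is an assumption made for illustration: the six features, the tweet dictionary layout (`text`, `followers`), the toy training data, and the nearest-centroid classifier, which merely stands in for "a conventional machine learning algorithm".

```python
import re
from typing import Dict, List

URL_RE = re.compile(r"https?://\S+")

def extract_features(tweet: Dict) -> List[float]:
    """Map one tweet to a small numeric vector.

    Hypothetical features, NOT the paper's 12; each is cheap to compute
    from a single streamed tweet without extra lookups.
    """
    text = tweet["text"]
    return [
        float(len(text)),                       # characters in the tweet
        float(sum(c.isdigit() for c in text)),  # digits in the tweet
        float(len(URL_RE.findall(text))),       # URLs in the tweet
        float(text.count("#")),                 # '#' characters (hashtags)
        float(text.count("@")),                 # '@' characters (mentions)
        float(tweet.get("followers", 0)),       # followers of the sender
    ]

def centroid(rows: List[List[float]]) -> List[float]:
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train(tweets: List[Dict], labels: List[int]) -> Dict[int, List[float]]:
    """Nearest-centroid 'training': one mean vector per class (1 = spam)."""
    spam = [extract_features(t) for t, y in zip(tweets, labels) if y == 1]
    ham = [extract_features(t) for t, y in zip(tweets, labels) if y == 0]
    return {1: centroid(spam), 0: centroid(ham)}

def predict(model: Dict[int, List[float]], tweet: Dict) -> int:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    x = extract_features(tweet)
    def dist2(c: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: dist2(model[label]))

# Toy, hand-written examples (not data from the paper).
TRAIN_TWEETS = [
    {"text": "Win a free phone now http://a.example/x #free #win", "followers": 3},
    {"text": "Cheap pills http://b.example http://c.example #deal", "followers": 10},
    {"text": "Had a great lunch with @sam today", "followers": 250},
    {"text": "Reading a new book this weekend", "followers": 400},
]
TRAIN_LABELS = [1, 1, 0, 0]

model = train(TRAIN_TWEETS, TRAIN_LABELS)
new_tweet = {"text": "Free gift cards http://d.example #free", "followers": 5}
print("spam" if predict(model, new_tweet) == 1 else "nonspam")
```

In this toy setup the unscaled follower count dominates the distance, which hints at why the evaluation treats choices such as feature discretization as factors in their own right.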
