Pattern Analysis and Applications

A novel ensemble decision tree based on under-sampling and clonal selection for web spam detection



Abstract

Web spamming is currently a serious problem for search engines. It not only degrades the quality of search results by intentionally promoting undesirable web pages to users, but also causes search engines to waste significant computational and storage resources processing useless information. In this paper, we present a novel ensemble classifier for web spam detection, called USCS, which combines under-sampling for data balancing with the clonal selection algorithm for feature selection. The USCS ensemble classifier samples the training data and selects its sub-classifiers automatically. First, the system converts the imbalanced training dataset into several balanced datasets using under-sampling. Second, it automatically selects several optimal feature subsets for each sub-classifier using a customized clonal selection algorithm. Third, it builds several C4.5 decision tree sub-classifiers from these balanced datasets, each restricted to its selected features. Finally, these sub-classifiers are combined into an ensemble decision tree classifier, which is applied to classify the examples in the test data. Experiments on the WEBSPAM-UK2006 dataset show that the proposed USCS ensemble web spam classifier achieves significantly better classification performance than several baseline systems and state-of-the-art approaches.
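
The four steps in the abstract can be illustrated with a rough, standard-library Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: a single-feature threshold rule ("stump") stands in for C4.5, the clonal selection loop is a toy version whose affinity is the stump's training accuracy, each sub-classifier gets one feature subset instead of several, and the sub-classifiers are combined by plain majority vote. All function names (`balanced_subsets`, `clonal_select`, `train_uscs_like`, `predict`) are hypothetical. Labels are assumed to be 1 for spam and 0 for normal.

```python
import random
from collections import Counter

def balanced_subsets(X, y, n_subsets, rng):
    """Step 1: each subset keeps every minority row plus an equally
    sized random draw (without replacement) of majority rows."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    maj_idx = [i for i, label in enumerate(y) if label != minority]
    subsets = []
    for _ in range(n_subsets):
        idx = min_idx + rng.sample(maj_idx, len(min_idx))
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets

def fit_stump(X, y, features):
    """Stand-in base learner (NOT C4.5): the single feature/threshold
    rule with the highest training accuracy on 0/1 labels."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            for hi, lo in ((1, 0), (0, 1)):
                hits = sum((hi if row[f] >= t else lo) == label
                           for row, label in zip(X, y))
                acc = hits / len(y)
                if best is None or acc > best[0]:
                    best = (acc, f, t, hi, lo)
    return best  # (accuracy, feature, threshold, label_if_ge, label_if_lt)

def predict_stump(stump, row):
    _, f, t, hi, lo = stump
    return hi if row[f] >= t else lo

def clonal_select(X, y, n_features, rng, pop=6, gens=5, clones=3):
    """Step 2 (toy version): evolve feature bitmasks ("antibodies").
    The best half is cloned; lower-ranked clones receive more bit flips."""
    def affinity(mask):
        feats = [f for f in range(n_features) if mask[f]]
        return fit_stump(X, y, feats)[0] if feats else 0.0

    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop)]
    for _ in range(gens):
        ranked = sorted(population, key=affinity, reverse=True)
        offspring = []
        for rank, antibody in enumerate(ranked[: pop // 2]):
            for _ in range(clones):
                clone = antibody[:]
                for _ in range(rank + 1):  # rank 0 flips a single bit
                    j = rng.randrange(n_features)
                    clone[j] = not clone[j]
                offspring.append(clone)
        population = sorted(ranked + offspring, key=affinity, reverse=True)[:pop]
    best = population[0]
    # Fall back to all features if the best mask happens to be empty.
    return [f for f in range(n_features) if best[f]] or list(range(n_features))

def train_uscs_like(X, y, n_subsets=5, seed=0):
    """Steps 1-3: one sub-classifier per balanced subset, each fitted
    on the feature subset chosen for that subset."""
    rng = random.Random(seed)
    ensemble = []
    for Xs, ys in balanced_subsets(X, y, n_subsets, rng):
        feats = clonal_select(Xs, ys, len(X[0]), rng)
        ensemble.append(fit_stump(Xs, ys, feats))
    return ensemble

def predict(ensemble, row):
    """Step 4: majority vote over the sub-classifiers."""
    votes = Counter(predict_stump(stump, row) for stump in ensemble)
    return votes.most_common(1)[0][0]
```

Calling `train_uscs_like(X, y)` on a list of numeric feature rows `X` and 0/1 labels `y` returns the list of fitted sub-classifiers, and `predict(ensemble, row)` classifies a single row. Because every balanced subset reuses all minority (spam) examples while drawing a different random sample of majority examples, the sub-classifiers differ mainly in which normal pages and which features they see, which is what gives the vote its diversity in this sketch.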
