Journal: Pattern Analysis and Applications

Term frequency combined hybrid feature selection method for spam filtering



Abstract

Feature selection is an important technique for improving the efficiency and accuracy of spam filtering. Among the numerous methods, document frequency-based feature selection methods ignore term frequency information and thus yield unsatisfactory results. In this paper, a hybrid method (called HBM) that combines document frequency information and term frequency information is proposed. To maintain the category-distinguishing ability of the selected features, an optimal document frequency-based feature selection (called ODFFS) is chosen; terms that are truly discriminative are selected by ODFFS. For the remaining features, term frequency information is considered, and the terms with the highest HBM values are selected. Further, a novel method called feature subset evaluating parameter optimization (FSEPO) is proposed for parameter optimization. Experiments with support vector machine (SVM) and Naïve Bayesian (NB) classifiers are conducted on four corpora: PU1, LingSpam, SpamAssassin and Trec2007. Six feature selection methods are compared with HBM: information gain, chi-square, improved Gini index, multi-class odds ratio, normalized term frequency-based discriminative power measure, and comprehensively measure feature selection. Experimental results show that HBM is significantly superior to the other feature selection methods on the four corpora when SVM and NB are applied, respectively.
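
The two-stage idea described in the abstract (first pick terms by a document frequency-based criterion, then rank the remaining terms using term frequency information) can be illustrated with a minimal Python sketch. The abstract does not give the ODFFS, HBM, or FSEPO formulas, so everything here is an assumption made for illustration: the function name `two_stage_select`, the parameters `k_df` and `k_tf`, and both scoring functions (an absolute difference of per-class document rates, and an absolute difference of per-class relative term frequencies) are stand-ins, not the paper's definitions.

```python
from collections import Counter


def two_stage_select(docs, labels, k_df, k_tf):
    """Illustrative two-stage term selection for a two-class corpus.

    docs   : list of token lists
    labels : list of class labels, one per document (exactly two classes)
    k_df   : number of terms picked by the document-frequency score
    k_tf   : number of further terms picked by the term-frequency score

    NOTE: both scores are hypothetical stand-ins, not the paper's
    ODFFS or HBM formulas.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch assumes exactly two classes")
    a, b = classes

    n_docs = Counter(labels)               # documents per class
    df = {c: Counter() for c in classes}   # documents containing each term
    tf = {c: Counter() for c in classes}   # total occurrences of each term
    for tokens, label in zip(docs, labels):
        counts = Counter(tokens)
        df[label].update(counts.keys())    # each term counted once per document
        tf[label].update(counts)           # each term counted per occurrence

    vocab = set(df[a]) | set(df[b])
    # Guard against empty classes so the divisions below are safe.
    total_tf = {c: max(sum(tf[c].values()), 1) for c in classes}

    def df_score(term):
        # Assumed DF criterion: gap between per-class document rates.
        return abs(df[a][term] / n_docs[a] - df[b][term] / n_docs[b])

    def tf_score(term):
        # Assumed TF criterion: gap between per-class relative frequencies.
        return abs(tf[a][term] / total_tf[a] - tf[b][term] / total_tf[b])

    def top(terms, score, k):
        # Sort by descending score; break ties alphabetically.
        return sorted(terms, key=lambda t: (-score(t), t))[:k]

    stage1 = top(vocab, df_score, k_df)
    stage2 = top(vocab - set(stage1), tf_score, k_tf)
    return stage1, stage2


docs = [
    ["free", "free", "free", "win"],
    ["free", "offer", "offer", "offer"],
    ["meeting", "notes", "offer"],
    ["meeting", "agenda", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]

stage1, stage2 = two_stage_select(docs, labels, k_df=2, k_tf=2)
print(stage1)  # ['free', 'meeting']
print(stage2)  # ['notes', 'offer']
```

In this toy corpus, "offer" appears in one spam and one ham document, so the assumed document-frequency score gives it zero; it is still picked in the second stage because it occurs far more often within the spam messages. This is the kind of information that a purely document frequency-based criterion cannot see.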
