2011 IEEE International Conference on Granular Computing

Rough set and its application in Chinese spam filtering

Abstract

Feature selection plays an important role in text categorization. Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are commonly applied in text categorization, and existing experiments show that IG is one of the most effective. In this paper, we propose a feature selection method based on rough set theory. According to rough set theory, knowledge about a universe of objects may be defined as classifications based on certain properties of the objects; that is, rough set theory assumes that knowledge is an ability to partition objects. We quantify this ability to classify objects and call the resulting amount the knowledge quantity. Building on this quantification, we put forward the notion of "knowledge gain" and propose a knowledge gain feature selection method (the KG method). Spam filtering can be seen as a special case of text classification, so a feature selection method that selects the major features both effectively and efficiently is important for anti-spam filtering. We explore two classifiers (Naive Bayes and SVM), and our experiments on a Chinese spam collection show that KG performs better than IG, especially under extremely aggressive feature reduction. We conclude that the KG feature selection method achieves state-of-the-art performance for spam filtering, especially for Chinese spam emails.
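The abstract names information gain as the baseline and describes knowledge as the ability to partition objects, but it does not give the knowledge quantity or knowledge gain formulas. The sketch below is therefore only an illustration under stated assumptions: `information_gain` is the standard entropy-based score for a binary term-presence feature, while `discernible_pairs` and `separated_cross_class_pairs` are hypothetical pair-counting measures of how well a partition tells objects apart. They are not taken from the paper, and the toy labels and term vectors are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(has_term, labels):
    """Standard IG of a binary term-presence feature w.r.t. the class labels."""
    n = len(labels)
    present = [y for x, y in zip(has_term, labels) if x]
    absent = [y for x, y in zip(has_term, labels) if not x]
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def discernible_pairs(blocks):
    """ASSUMPTION (not the paper's formula): score a partition by the number
    of object pairs that fall into different blocks."""
    sizes = [len(b) for b in blocks]
    total = sum(sizes)
    return (total * total - sum(s * s for s in sizes)) // 2


def separated_cross_class_pairs(has_term, labels):
    """ASSUMPTION (hypothetical gain-like score): count pairs of documents
    with different classes that the term's presence/absence separates."""
    present = Counter(y for x, y in zip(has_term, labels) if x)
    absent = Counter(y for x, y in zip(has_term, labels) if not x)
    return sum(p * a
               for cp, p in present.items()
               for ca, a in absent.items()
               if cp != ca)


# Invented toy data: six emails and two candidate terms.
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
term_a = [1, 1, 1, 0, 0, 0]  # appears only in spam
term_b = [1, 0, 1, 0, 1, 0]  # weakly related to the class

ranking = sorted(
    {"term_a": term_a, "term_b": term_b}.items(),
    key=lambda kv: information_gain(kv[1], labels),
    reverse=True,
)
```

On this toy data both scores prefer `term_a`, which splits the emails exactly along the spam/ham boundary. Whether the two criteria diverge under aggressive feature reduction, as the abstract reports for KG versus IG, depends on the paper's actual definitions and data.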