A two-stage Markov blanket based feature selection algorithm for text classification

Abstract

Designing a good feature selection (FS) algorithm is of utmost importance, especially for text classification (TC), where a large number of features representing terms or words poses serious challenges to the effectiveness and efficiency of classifiers. FS algorithms are divided into two broad categories, namely feature ranking (FR) and feature subset selection (FSS) algorithms. Unlike FSS algorithms, FR algorithms select features that are individually highly relevant to the class or category, without taking feature interactions into account. This makes FR algorithms simpler and computationally more efficient than FSS algorithms, and thus usually the preferred choice for TC. Bi-normal separation (BNS) (Forman, 2003) and information gain (IG) (Yang and Pedersen, 1997) are well-known FR metrics. However, FR algorithms output a set of highly relevant features or terms that may be redundant and can thus degrade a classifier's performance. This paper suggests taking the interactions of words into account in order to eliminate redundant terms. Stand-alone FSS algorithms can be computationally expensive for high-dimensional text data. We therefore suggest a two-stage FS algorithm, which employs an FR metric such as BNS or IG in the first stage and an FSS algorithm such as the Markov blanket filter (MBF) (Koller and Sahami, 1996) in the second stage. Most of the two-stage algorithms proposed in the literature for TC combine feature ranking with feature transformation algorithms such as principal component analysis (PCA). To assess the statistical significance of the results of our two-stage algorithm, we carry out experiments on 10 different splits into training and test sets of each of three data sets (Reuters-21578, TREC, OHSUMED) with naive Bayes and support vector machine classifiers. Our results, based on a paired two-sided t-test, show that the macro F_1 performance of BNS+MBF is statistically significantly better than that of stand-alone BNS in 69% of the total experimental trials. The macro F_1 values of IG are improved in 72% of the trials when MBF is used in the second stage. We also compare our two-stage algorithm against two recently proposed FS algorithms, namely the distinguishing feature selector (DFS) (Uysal and Gunal, 2012) and a two-stage algorithm consisting of IG and PCA (Uguz, 2011). BNS+MBF is found to be significantly better than DFS and IG+PCA in 74% and 78% of the trials, respectively. IG+MBF outperforms DFS and IG+PCA in 93% and 80% of the experimental trials, respectively. Similar results are observed for BNS+MBF and IG+MBF when performance is evaluated in terms of balanced error rate.
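As an illustration of the first (feature ranking) stage only, the following is a minimal sketch of how BNS and IG scores might be computed for a binary class from document counts. It is not the authors' implementation: the function names, the example terms and counts, and the clipping value `eps=0.0005` are assumptions made here, and the second-stage Markov blanket filter is not shown.

```python
from math import log2
from statistics import NormalDist


def bns(tp, fp, pos, neg, eps=0.0005):
    """Bi-normal separation: |F^-1(tpr) - F^-1(fpr)|, F = standard normal CDF.

    tp/fp: positive/negative documents containing the term.
    pos/neg: total positive/negative documents.
    eps clips the rates away from 0 and 1 (assumed value) so that the
    inverse CDF stays finite.
    """
    inv = NormalDist().inv_cdf
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(inv(tpr) - inv(fpr))


def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def info_gain(tp, fp, pos, neg):
    """Information gain of term presence/absence with respect to the class."""
    n = pos + neg
    present = tp + fp
    absent = n - present
    h_class = entropy([pos, neg])
    h_cond = (present / n) * entropy([tp, fp]) \
        + (absent / n) * entropy([pos - tp, neg - fp])
    return h_class - h_cond


def top_k(term_counts, pos, neg, k, metric=bns):
    """Rank terms by the chosen metric and keep the k highest-scoring ones."""
    scores = {t: metric(tp, fp, pos, neg) for t, (tp, fp) in term_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical counts: term -> (positive docs with term, negative docs with term)
counts = {"wheat": (40, 2), "the": (50, 50), "grain": (38, 3)}
selected = top_k(counts, pos=50, neg=50, k=2, metric=bns)
```

In this toy example both metrics rank the uninformative term "the" last, but they score "wheat" and "grain" independently of each other. If the two terms tend to occur in the same documents, a ranking metric alone would keep both, which is the redundancy the second stage is meant to remove.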

Bibliographic details

  • Source
    Neurocomputing | 2015, Issue 1 | pp. 91-104 | 14 pages
  • Author affiliations

    Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan;

    Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan;

    Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan;

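  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification:
  • Keywords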

    Text classification; Two-stage feature selection; Markov blanket discovery;

