IEICE Transactions on Information and Systems

Improved Gini-Index Algorithm to Correct Feature-Selection Bias in Text Classification


Abstract

This paper presents an improved Gini-Index algorithm to correct feature-selection bias in text classification. The Gini-Index has been used as a split measure for choosing the most appropriate splitting attribute in decision trees. Recently, an improved Gini-Index algorithm for feature selection, designed for text categorization and based on Gini-Index theory, was introduced, and it has proved to be better than other methods. However, we found that the Gini-Index still shows a feature-selection bias in text classification, specifically for unbalanced datasets with very large numbers of features. This bias manifests in three ways: 1) the Gini values of low-frequency features are low overall (on the purity measure), irrespective of the distribution of features among classes; 2) for high-frequency features, the Gini values are always relatively high; and 3) for specific features belonging to large classes, the Gini values are relatively lower than those of features belonging to small classes. Therefore, to correct this bias and improve Gini-Index-based feature selection in text classification, we propose an improved Gini-Index (I-GI) algorithm with three reformulated Gini-Index expressions. In the present study, we used global dimensionality reduction (DR) and local DR to measure the goodness of features during feature selection. In experiments with the I-GI algorithm, we obtained unbiased feature values and eliminated many irrelevant general features while retaining many specific features. Furthermore, the overall classification performance improved when we used the local DR method.
Compared with tf*idf, χ², Information Gain, Odds Ratio, and the existing Gini-Index method, respectively, the total averages of classification performance increased by 19.4%, 15.9%, 3.3%, 2.8%, and 2.9% (kNN) and 14%, 9.8%, 9.2%, 3.5%, and 4.3% (SVM) in Micro-F1, and by 20%, 16.9%, 2.8%, 3.6%, and 3.1% (kNN) and 16.3%, 14%, 7.1%, 4.4%, and 6.3% (SVM) in Macro-F1, according to each classifier.
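The paper's three reformulated I-GI expressions are not reproduced in this abstract, but the general shape of Gini-Index feature scoring for text can be sketched with a formulation commonly cited in the text-categorization literature, GI(t) = Σᵢ P(t|Cᵢ)² · P(Cᵢ|t)², which scores a term higher the more it concentrates in one class. The function name, toy data, and this particular expression are illustrative assumptions, not necessarily the exact expressions proposed in the paper:

```python
import numpy as np

def gini_index_scores(X, y):
    """Score terms with a Gini-Index-style purity measure:
        GI(t) = sum_i P(t|C_i)^2 * P(C_i|t)^2
    X: (n_docs, n_terms) binary term-occurrence matrix
    y: (n_docs,) class labels
    Returns one score per term; higher means more class-discriminative.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    term_df = X.sum(axis=0)            # document frequency of each term
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        in_c = X[y == c]
        # P(t|C_i): fraction of class-c documents that contain term t
        p_t_given_c = in_c.sum(axis=0) / max(len(in_c), 1)
        # P(C_i|t): fraction of documents containing t that belong to class c
        p_c_given_t = np.divide(in_c.sum(axis=0), term_df,
                                out=np.zeros_like(term_df), where=term_df > 0)
        scores += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return scores

# Toy data: term 0 occurs only in class 0; term 1 occurs in every document.
X = np.array([[1, 1],
              [1, 1],
              [0, 1],
              [0, 1]])
y = np.array([0, 0, 1, 1])
s = gini_index_scores(X, y)  # term 0 outranks the non-discriminative term 1
```

Under this measure the class-concentrated term scores 1.0 while the evenly spread term scores 0.5, illustrating why high-frequency but non-discriminative features need the kind of correction the paper proposes.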
