IEICE Transactions on Information and Systems

Improved Gini-Index Algorithm to Correct Feature-Selection Bias in Text Classification


Abstract

This paper presents an improved Gini-Index algorithm to correct feature-selection bias in text classification. The Gini-Index has been used as a split measure for choosing the most appropriate splitting attribute in decision trees. Recently, an improved Gini-Index algorithm for feature selection, designed for text categorization and based on Gini-Index theory, was introduced, and it has proved to be better than other methods. However, we found that the Gini-Index still shows a feature-selection bias in text classification, specifically for unbalanced datasets with very large numbers of features. This bias manifests in three ways: 1) the Gini values of low-frequency features are low overall (on the purity measure), irrespective of the distribution of features among classes; 2) for high-frequency features, the Gini values are always relatively high; and 3) for specific features belonging to large classes, the Gini values are relatively lower than those of features belonging to small classes. Therefore, to correct this bias and improve Gini-Index-based feature selection in text classification, we propose an improved Gini-Index (I-GI) algorithm with three reformulated Gini-Index expressions. In the present study, we used global dimensionality reduction (DR) and local DR to measure the goodness of features during feature selection. In experiments with the I-GI algorithm, we obtained unbiased feature values and eliminated many irrelevant general features while retaining many specific features. Furthermore, the overall classification performance improved when we used the local DR method.
Compared with tf*idf, χ², Information Gain, Odds Ratio, and the existing Gini-Index method, respectively, the total averages of classification performance increased by 19.4%, 15.9%, 3.3%, 2.8%, and 2.9% (kNN) and 14%, 9.8%, 9.2%, 3.5%, and 4.3% (SVM) in Micro-F1, and by 20%, 16.9%, 2.8%, 3.6%, and 3.1% (kNN) and 16.3%, 14%, 7.1%, 4.4%, and 6.3% (SVM) in Macro-F1, according to each classifier.
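The paper's three reformulated I-GI expressions are not reproduced in this abstract, but the general shape of Gini-Index feature scoring for text can be sketched with a formulation commonly cited in the text-categorization literature, GI(t) = Σᵢ P(t|Cᵢ)² · P(Cᵢ|t)², which scores a term higher the more it concentrates in one class. The function name, toy data, and this particular expression are illustrative assumptions, not necessarily the exact expressions proposed in the paper:

```python
import numpy as np

def gini_index_scores(X, y):
    """Score terms with a Gini-Index-style purity measure:
        GI(t) = sum_i P(t|C_i)^2 * P(C_i|t)^2
    X: (n_docs, n_terms) binary term-occurrence matrix
    y: (n_docs,) class labels
    Returns one score per term; higher means more class-discriminative.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    term_df = X.sum(axis=0)            # document frequency of each term
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        in_c = X[y == c]
        # P(t|C_i): fraction of class-c documents that contain term t
        p_t_given_c = in_c.sum(axis=0) / max(len(in_c), 1)
        # P(C_i|t): fraction of documents containing t that belong to class c
        p_c_given_t = np.divide(in_c.sum(axis=0), term_df,
                                out=np.zeros_like(term_df), where=term_df > 0)
        scores += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return scores

# Toy data: term 0 occurs only in class 0; term 1 occurs in every document.
X = np.array([[1, 1],
              [1, 1],
              [0, 1],
              [0, 1]])
y = np.array([0, 0, 1, 1])
s = gini_index_scores(X, y)  # term 0 outranks the non-discriminative term 1
```

Under this measure the class-concentrated term scores 1.0 while the evenly spread term scores 0.5, illustrating why high-frequency but non-discriminative features need the kind of correction the paper proposes.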
