International Conference on Information Management and Technology

Comparison of Feature Selection for Imbalance Text Datasets



Abstract

The number of documents available in web formats is increasing rapidly, so automatic document classification is needed to help people organize them. Text classification is one of the common tasks in text mining. To build a model that can classify a document, words are the main source of features. Because a corpus contains so many words, we must be selective about which features are significant with respect to the labels. Feature selection has been introduced to improve the classification task and to reduce the high-dimensional feature space; it has become one of the most familiar solutions to the high-dimensionality problem of document classification. In text classification, selecting good features plays an important role: feature selection can increase both model classification accuracy and computational efficiency. This paper presents an empirical study of the most widely used feature selection methods, Term Frequency (TF), Mutual Information (MI), and Chi-square (χ²), combined with two distinct classifiers, Support Vector Machine (SVM) and Multinomial Naïve Bayes (MNB). The experiments are carried out on commonly used benchmark datasets such as 20-Newsgroups, Reuters, and our own dataset. Because the number of features to keep is a parameter, we test retaining the best 10 to 20 percent of features. Across the six experiments conducted, Chi-square gives the best performance for text classification.
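As a rough illustration of the Chi-square feature selection the abstract describes, the sketch below scores each vocabulary term against a target class using the standard 2×2 contingency-table χ² statistic and keeps a top fraction of terms (the paper tests 10%–20%). This is a minimal toy implementation, not the authors' code; the corpus, function names, and `top_frac` parameter are illustrative assumptions.

```python
def chi2_score(A, B, C, D):
    """Chi-square statistic for a 2x2 term/class contingency table.
    A: docs in class containing the term, B: docs outside class containing it,
    C: docs in class without the term,   D: docs outside class without it."""
    N = A + B + C + D
    denom = (A + B) * (A + C) * (B + D) * (C + D)
    if denom == 0:
        return 0.0
    return N * (A * D - B * C) ** 2 / denom

def select_features(docs, labels, target, top_frac=0.1):
    """Rank vocabulary terms by chi-square against the `target` class
    and keep the top `top_frac` fraction of terms."""
    vocab = sorted({w for d in docs for w in d})
    n_target = sum(1 for y in labels if y == target)
    scores = {}
    for term in vocab:
        A = sum(1 for d, y in zip(docs, labels) if y == target and term in d)
        B = sum(1 for d, y in zip(docs, labels) if y != target and term in d)
        C = n_target - A
        D = (len(labels) - n_target) - B
        scores[term] = chi2_score(A, B, C, D)
    k = max(1, int(len(vocab) * top_frac))
    return sorted(vocab, key=lambda t: scores[t], reverse=True)[:k]

# Toy corpus: each document is a set of tokens (hypothetical data).
docs = [{"ball", "goal", "team"}, {"goal", "match"},
        {"stock", "market"}, {"market", "price", "stock"}]
labels = ["sport", "sport", "finance", "finance"]
selected = select_features(docs, labels, "sport", top_frac=0.3)
print(selected)
```

Terms that occur only in one class (here "goal") get the highest χ² score, which is why the method tends to work well on imbalanced text data: it rewards features strongly associated with a single label.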
