International Conference on Information Management and Technology

Comparison of Feature Selection for Imbalance Text Datasets



Abstract

The number of documents available in web formats is increasing rapidly, so automatic document classification is needed to help people organize them. Text classification is one of the common tasks in text mining. To build a model that can classify a document, words are the main source of features. Because a corpus contains so many words, we must be selective about which features are significant with respect to the labels. Feature selection has been introduced to improve the classification task and to reduce the high-dimensional feature space; it has become one of the most familiar solutions to the high-dimensionality problem of document classification. In text classification, selecting good features plays an important role: feature selection can increase both model classification accuracy and computational efficiency. This paper presents an empirical study of the most widely used feature selection methods, Term Frequency (TF), Mutual Information (MI), and Chi-square (χ²), combined with two distinct classifiers, Support Vector Machine (SVM) and Multinomial Naïve Bayes (MNB). The experiments are carried out on commonly used benchmark datasets such as 20-Newsgroups, Reuters, and our own dataset. Because the number of features to keep is a parameter, we test retaining the best 10 to 20 percent of features. Across the six experiments conducted, Chi-square gives the best performance for text classification.
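As a rough illustration of the Chi-square feature selection the abstract describes, the sketch below scores each vocabulary term against a target class using the standard 2×2 contingency-table χ² statistic and keeps a top fraction of terms (the paper tests 10%–20%). This is a minimal toy implementation, not the authors' code; the corpus, function names, and `top_frac` parameter are illustrative assumptions.

```python
def chi2_score(A, B, C, D):
    """Chi-square statistic for a 2x2 term/class contingency table.
    A: docs in class containing the term, B: docs outside class containing it,
    C: docs in class without the term,   D: docs outside class without it."""
    N = A + B + C + D
    denom = (A + B) * (A + C) * (B + D) * (C + D)
    if denom == 0:
        return 0.0
    return N * (A * D - B * C) ** 2 / denom

def select_features(docs, labels, target, top_frac=0.1):
    """Rank vocabulary terms by chi-square against the `target` class
    and keep the top `top_frac` fraction of terms."""
    vocab = sorted({w for d in docs for w in d})
    n_target = sum(1 for y in labels if y == target)
    scores = {}
    for term in vocab:
        A = sum(1 for d, y in zip(docs, labels) if y == target and term in d)
        B = sum(1 for d, y in zip(docs, labels) if y != target and term in d)
        C = n_target - A
        D = (len(labels) - n_target) - B
        scores[term] = chi2_score(A, B, C, D)
    k = max(1, int(len(vocab) * top_frac))
    return sorted(vocab, key=lambda t: scores[t], reverse=True)[:k]

# Toy corpus: each document is a set of tokens (hypothetical data).
docs = [{"ball", "goal", "team"}, {"goal", "match"},
        {"stock", "market"}, {"market", "price", "stock"}]
labels = ["sport", "sport", "finance", "finance"]
selected = select_features(docs, labels, "sport", top_frac=0.3)
print(selected)
```

Terms that occur only in one class (here "goal") get the highest χ² score, which is why the method tends to work well on imbalanced text data: it rewards features strongly associated with a single label.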
