Journal: Expert Systems with Applications

A novel filter feature selection method using rough set for short text data


Abstract

High dimensionality is an important concern in short text classification because it affects both the computational cost and the accuracy of classifiers. Moreover, short text data, besides being high dimensional, has an incomplete, inconsistent, and sparse structure. Selecting important features that provide a better representation is one solution to the high dimensionality problem. In this study, we developed a novel filter feature selection method, the Proportional Rough Feature Selector (PRFS), which uses rough set theory to make a regional distinction according to the value set of a term, identifying documents that exactly belong to a class and documents that possibly belong to a class. Documents that possibly belong to a class are penalized by multiplying by a coefficient named a. Additionally, the effect of sparsity in the term vector space is calculated with the help of the rough set. PRFS is compared with state-of-the-art filter feature selection methods such as Gini index, information gain, distinguishing feature selector, the recently proposed max-min ratio, and normalized difference measure. The comparison is carried out using various feature sizes on four different short text datasets, with Macro-F1 as the success measure. Experimental results demonstrated that PRFS offers either better or competitive performance with respect to the other feature selection methods in terms of Macro-F1. This study may be a pioneering study in this research field, as it proposes a novel feature selection method for short text classification using rough set theory. (c) 2020 Elsevier Ltd. All rights reserved.
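The abstract does not give the PRFS scoring formula, so the following is only a minimal Python sketch of the rough-set regional distinction it describes, not the authors' method. It assumes documents are grouped by the value they take for a single term. The function names `rough_regions` and `penalized_support`, the toy data, and the way the coefficient `a` is combined with the counts are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def rough_regions(term_values, labels, target):
    """Split documents into those that certainly and possibly belong to `target`.

    Documents sharing the same value for the term are indiscernible and form
    one block. Returns (lower, boundary) as sets of document indices.
    """
    blocks = defaultdict(set)
    for doc_id, value in enumerate(term_values):
        blocks[value].add(doc_id)

    members = {i for i, label in enumerate(labels) if label == target}
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= members:   # whole block lies inside the class
            lower |= block
        if block & members:    # block overlaps the class
            upper |= block
    return lower, upper - lower


def penalized_support(term_values, labels, target, a=0.5):
    """Illustrative score only (assumption): certain documents count fully,
    possible documents are down-weighted by the coefficient `a`."""
    lower, boundary = rough_regions(term_values, labels, target)
    return len(lower) + a * len(boundary)


# Toy corpus: 1 means the term occurs in the document, 0 means it does not.
term_values = [1, 1, 0, 0, 0]
labels = ["sport", "sport", "sport", "news", "news"]
print(rough_regions(term_values, labels, "sport"))
print(penalized_support(term_values, labels, "sport", a=0.5))
```

In this toy corpus the two documents containing the term fall entirely inside the "sport" class, so they form the certain region. The three documents without the term mix both classes, so they land in the possible region and are down-weighted.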
