Conference: ACM KDD International Conference on Knowledge Discovery and Data Mining; KDD 2008

FAST: A ROC-based Feature Selection Metric for Small Samples and Imbalanced Data Classification Problems



Abstract

The class imbalance problem arises in many practical applications of machine learning and data mining, for example information retrieval and filtering, and the detection of credit card fraud. It is widely recognized that this imbalance raises issues that are either nonexistent or less severe in balanced-class cases, and that it often results in suboptimal classifier performance. The effect is even more pronounced when the imbalanced data are also high dimensional. In such cases, feature selection methods are critical to achieving optimal performance. In this paper, we propose a new feature selection method, Feature Assessment by Sliding Thresholds (FAST), which is based on the area under a ROC curve generated by moving the decision boundary of a single-feature classifier, with thresholds placed using an even-bin distribution. FAST is compared to two commonly used feature selection methods, correlation coefficient and RELevance In Estimating Features (RELIEF), for imbalanced data classification. Experimental results on text mining, mass spectrometry, and microarray data sets show that the proposed method outperforms both RELIEF and correlation methods on skewed data sets and is comparable to them on balanced data sets. When a small number of features is preferred, the classification performance of the proposed method is significantly better than that of the correlation and RELIEF-based methods.
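The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `single_feature_auc` and `fast_select`, the default of 10 bins, the use of each bin's mean as its threshold, and the `max(auc, 1 - auc)` step that treats a feature as equally useful in either direction are all assumptions made for illustration. Only the overall recipe (a single-feature classifier, thresholds from equal-population bins, area under the resulting ROC curve) comes from the abstract.

```python
import numpy as np

def single_feature_auc(x, y, n_bins=10):
    """Approximate ROC AUC of one feature using even-bin thresholds."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y).astype(bool)
    n_pos, n_neg = y.sum(), (~y).sum()
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Even-bin placement: split the sorted values into bins that hold
    # roughly the same number of samples. Using the bin mean as the
    # threshold is an assumption of this sketch.
    bins = np.array_split(np.sort(x), n_bins)
    thresholds = np.array([b.mean() for b in bins if b.size > 0])

    # Slide the decision boundary: predict positive when x > threshold.
    tpr = np.array([(x[y] > t).sum() / n_pos for t in thresholds])
    fpr = np.array([(x[~y] > t).sum() / n_neg for t in thresholds])

    # Close the curve at (0, 0) and (1, 1), then order points by FPR
    # (ties broken by TPR) before integrating.
    fpr = np.concatenate(([0.0], fpr, [1.0]))
    tpr = np.concatenate(([0.0], tpr, [1.0]))
    order = np.lexsort((tpr, fpr))
    fpr, tpr = fpr[order], tpr[order]

    # Trapezoidal rule for the area under the curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

    # Assumption: a feature that separates the classes in the opposite
    # direction is just as informative, so score it by 1 - auc.
    return max(auc, 1.0 - auc)

def fast_select(X, y, k, n_bins=10):
    """Return indices of the k highest-scoring features and all scores."""
    X = np.asarray(X, dtype=float)
    scores = np.array(
        [single_feature_auc(X[:, j], y, n_bins) for j in range(X.shape[1])]
    )
    top = np.argsort(-scores, kind="stable")[:k]
    return top, scores
```

Calling `fast_select(X, y, k)` on an `(n_samples, n_features)` array with binary labels returns the indices of the `k` features whose single-feature ROC area is largest. Because the thresholds come from equal-population bins rather than from every distinct value, the area is an approximation that becomes coarser as `n_bins` shrinks.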
