FAST: A ROC-based Feature Selection Metric for Small Samples and Imbalanced Data Classification Problems

Abstract

The class imbalance problem is encountered in a large number of practical applications of machine learning and data mining, for example, information retrieval and filtering, and the detection of credit card fraud. It has been widely recognized that this imbalance raises issues that are either nonexistent or less severe in balanced class cases, and that it often results in suboptimal classifier performance. This is even more true when the imbalanced data are also high dimensional. In such cases, feature selection methods are critical to achieving optimal performance. In this paper, we propose a new feature selection method, Feature Assessment by Sliding Thresholds (FAST), which is based on the area under a ROC curve generated by moving the decision boundary of a single feature classifier with thresholds placed using an even-bin distribution. FAST is compared to two commonly used feature selection methods, correlation coefficient and RELevance In Estimating Features (RELIEF), for imbalanced data classification. The experimental results obtained on text mining, mass spectrometry, and microarray data sets showed that the proposed method outperformed both RELIEF and correlation methods on skewed data sets and was comparable on balanced data sets; when a small number of features is preferred, the classification performance of the proposed method was significantly improved compared to correlation and RELIEF-based methods.
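
The abstract describes the scoring idea only in outline, so the following is a minimal sketch rather than the authors' implementation. It assumes that "even-bin" means equal-count bins over the sorted feature values, that one threshold is placed at the mean of each bin, that the single feature classifier predicts the positive class when the value exceeds the threshold, and that a feature which separates the classes in the opposite direction is scored by 1 - AUC. The names `fast_score`, `select_top_k`, and `n_bins` are hypothetical.

```python
import numpy as np


def fast_score(x, y, n_bins=10):
    """Approximate ROC area of one feature via sliding thresholds (sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y).astype(bool)  # True marks the positive class

    # Assumed even-bin placement: equal-count bins over the sorted values,
    # with one threshold at the mean of each non-empty bin.
    bins = np.array_split(np.sort(x), n_bins)
    thresholds = np.array([b.mean() for b in bins if b.size > 0])

    n_pos, n_neg = y.sum(), (~y).sum()
    # Assumed single-feature classifier: predict positive when x > threshold.
    tpr = np.array([(x[y] > t).sum() / n_pos for t in thresholds])
    fpr = np.array([(x[~y] > t).sum() / n_neg for t in thresholds])

    # Add the ROC end points, then order the points by FPR (ties by TPR).
    fpr = np.concatenate(([0.0], fpr, [1.0]))
    tpr = np.concatenate(([0.0], tpr, [1.0]))
    order = np.lexsort((tpr, fpr))
    fpr, tpr = fpr[order], tpr[order]

    # Trapezoidal area under the piecewise-linear ROC curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    # Assumption: fold the score so either direction of separation counts.
    return max(auc, 1.0 - auc)


def select_top_k(X, y, k, n_bins=10):
    """Score every column of X and return the indices of the k best."""
    X = np.asarray(X, dtype=float)
    scores = np.array([fast_score(X[:, j], y, n_bins) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k], scores
```

A call such as `top, scores = select_top_k(X, y, k=20)` would rank the columns of a samples-by-features array `X` against binary labels `y`. Because the thresholds follow the sample distribution instead of an evenly spaced grid, each threshold separates a similar number of samples, which is one plausible reading of why even-bin placement suits small, skewed data sets.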

Bibliographic details

Source
ACMKDD International Conference on Knowledge Discovery and Data Mining; KDD 2008 | 2008 | pp. 106-114 | 9 pages
Authors
Xue-wen Chen; Michael Wasikowski
Original format
PDF
Chinese Library Classification
Information and knowledge dissemination
Keywords
feature selection; imbalanced data classification; ROC

