An Empirical Study on Preprocessing High-dimensional Class-imbalanced Data for Classification


Abstract

Emerging data types bring tremendous challenges to data mining. Many fields now produce large amounts of high-dimensional, class-imbalanced data. Traditional classification methods are not appropriate for such data because they tend to favor accuracy on the majority class, and the curse of dimensionality complicates matters further. Designing a more complex classifier is not easy, and such a classifier may overfit the data. Preprocessing the data before classification is a more direct approach. Because high dimensionality and class imbalance interact, it is necessary to understand how preprocessing methods (feature selection and data sampling) affect the final classification. Previous experiments either gave limited consideration to the choice of datasets or introduced other characteristics that complicated the situation. We apply two types of feature selection (wrapper and filter) and two types of data sampling (oversampling and undersampling) to twelve selected datasets from four fields, with different dimensions and imbalance levels, and test the effects on the performance of the C4.5 classifier. In our setting, the experiments indicate that (1) performing feature selection before sampling is mostly better; (2) among the combinations of feature selection and data sampling, undersampling performs better than oversampling when the dataset is highly imbalanced; (3) when the dataset is less imbalanced, preprocessing may not be necessary; (4) for wrapper-based feature selection, we suggest using a simple search method.
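The ordering described in finding (1), feature selection first and sampling second, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (filter_select, random_undersample, preprocess), the Fisher-like filter score, and the toy data are all hypothetical, and the C4.5 classifier itself is left out.

```python
import random
from statistics import mean, pvariance


def filter_select(X, y, k):
    """Return the indices of the k features with the highest filter score.

    Assumed score (for illustration only): squared difference of the two
    class means divided by the sum of the within-class variances.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = pvariance(pos) + pvariance(neg) + 1e-12  # avoid division by zero
        scores.append((mean(pos) - mean(neg)) ** 2 / spread)
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def random_undersample(X, y, rng):
    """Randomly drop majority-class rows until both classes have equal size."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in kept], [y[i] for i in kept]


def preprocess(X, y, k, seed=0):
    """Feature selection on the full data, then undersampling on the reduced data."""
    rng = random.Random(seed)
    selected = filter_select(X, y, k)
    X_reduced = [[row[j] for j in selected] for row in X]
    X_bal, y_bal = random_undersample(X_reduced, y, rng)
    return X_bal, y_bal, selected


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical toy data: 200 majority rows, 20 minority rows, 50 features;
    # only feature 0 carries class signal.
    y = [0] * 200 + [1] * 20
    X = [[rng.gauss(3.0 * label if j == 0 else 0.0, 1.0) for j in range(50)]
         for label in y]
    X_bal, y_bal, selected = preprocess(X, y, k=5)
    print(len(X_bal), len(X_bal[0]), sum(y_bal), selected)
```

Selecting features before sampling lets the filter score use every majority-class row; reversing the two steps would compute the scores on the smaller, balanced subset instead. The balanced, reduced data returned by preprocess would then be passed to whichever classifier is under study.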


Bibliographic record

Source: IEEE International Symposium on Cyberspace Safety and Security, 2015, 6 pages
Authors: Hua Yin; Keke Gai
Original format: PDF
Chinese Library Classification: TP3-53
Keywords: High-dimensional class-imbalanced data; Classification; Preprocessing; Feature selection; Sampling

Similar literature


1. Research on classification method of high-dimensional class-imbalanced datasets based on SVM [J]. Zhang Chunkai, Zhou Ying, Guo Jianwei. International Journal of Machine Learning and Cybernetics, 2019, No. 7.
2. Dealing with high-dimensional class-imbalanced datasets: Embedded feature selection for SVM classification [J]. Maldonado Sebastian, Lopez Julio. Applied Soft Computing, 2018.
3. An Empirical Study on Preprocessing High-Dimensional Class-Imbalanced Data for Classification [C]. Hua Yin, Keke Gai. 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, 2015 IEEE 12th International Conference on Embedded Software and Systems, 2015.
4. A computational environment for data preprocessing in supervised classification [D]. Rodriguez, Caroline K. 2004.
5. Hellinger distance-based stable sparse feature selection for high-dimensional class-imbalanced data [O]. Guang-Hui Fu, Yuan-Jiao Wu, Min-Jie Zong. 2020.
6. Improved shrunken centroid classifiers for high-dimensional class-imbalanced data [O]. Rok Blagus, Lara Lusa. 2013.
