Conference: IEEE International Symposium on Cyberspace Safety and Security

An Empirical Study on Preprocessing High-dimensional Class-imbalanced Data for Classification



Abstract

Emerging data types pose major challenges for data mining. High-dimensional, class-imbalanced data are abundant across many fields. Traditional classification methods are ill-suited to such data because they tend to favor accuracy on the majority class, and the curse of dimensionality complicates matters further. Designing a more complex classifier is difficult, and such a classifier may overfit the data. Preprocessing the data before classification is a more direct approach. Because high dimensionality and class imbalance interact, it is necessary to understand how preprocessing methods (feature selection and data sampling) affect the final classification. Previous experiments either gave limited consideration to the choice of datasets or introduced additional characteristics that complicated the picture. We apply two types of feature selection (wrapper and filter) and two types of data sampling (oversampling and undersampling) to twelve selected datasets from four fields, with varying dimensionality and imbalance levels, and measure the effect on the performance of the C4.5 classifier. In our setting, the experiments indicate that (1) performing feature selection before sampling is usually better; (2) among combinations of feature selection and data sampling, undersampling outperforms oversampling when the dataset is highly imbalanced; (3) when the dataset is only mildly imbalanced, preprocessing may not be necessary; and (4) for wrapper-based feature selection, we suggest using a simple search method.
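To make the preprocessing order in finding (1) concrete, the following is a minimal sketch of "feature selection first, then undersampling" on a synthetic dataset. It is illustrative only: the abstract does not name the specific filter criterion or sampling algorithm used, so the mean-gap score in `filter_select`, the random strategy in `random_undersample`, and the toy data are all assumptions, and the C4.5 classification step is omitted.

```python
import random
import statistics
from collections import Counter


def filter_select(X, y, k):
    """Return the (sorted) indices of the k highest-scoring features.

    Score = |mean(class 1) - mean(class 0)| / (overall std + eps).
    This is a simple filter criterion chosen for illustration (an
    assumption, not the paper's method); it expects binary labels 0/1.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        col = [row[j] for row in X]
        pos = [v for v, label in zip(col, y) if label == 1]
        neg = [v for v, label in zip(col, y) if label == 0]
        gap = abs(statistics.mean(pos) - statistics.mean(neg))
        scores.append(gap / (statistics.pstdev(col) + 1e-12))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def random_undersample(X, y, rng):
    """Randomly drop rows so each class keeps as many rows as the rarest class."""
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    target = min(len(idx) for idx in by_class.values())
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 200 rows, 10% minority class, 6 features,
# of which only features 0 and 3 carry class information.
rng = random.Random(0)
y = [1 if i < 20 else 0 for i in range(200)]
X = []
for label in y:
    row = [rng.gauss(0.0, 1.0) for _ in range(6)]
    row[0] += 3.0 * label
    row[3] -= 3.0 * label
    X.append(row)

# Step 1: feature selection on the full (imbalanced) data.
selected = filter_select(X, y, k=2)
X_sel = [[row[j] for j in selected] for row in X]

# Step 2: undersample the majority class on the reduced feature set.
X_bal, y_bal = random_undersample(X_sel, y, rng)

print("selected features:", selected)
print("class counts after undersampling:", Counter(y_bal))
```

Selecting features before sampling lets the filter score use every available majority-class row; reversing the two steps would compute the scores on the much smaller balanced subset. The balanced, reduced data would then be passed to a classifier such as C4.5.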
