Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD 2012)
A Pruning-Based Approach for Searching Precise and Generalized Region for Synthetic Minority Over-Sampling

Abstract

One way to deal with class imbalance is to modify the class distribution of the data. Synthetic over-sampling is a well-known method that modifies the class distribution by generating new synthetic minority data. Synthetic Minority Over-sampling TEchnique (SMOTE) is a state-of-the-art synthetic over-sampling algorithm that generates new synthetic data along the line between minority data points and their selected nearest neighbors. An advantage of SMOTE is that it makes decision regions larger and less specific to the original data. Its drawback, however, is the over-generalization problem, in which synthetic data are generated inside the majority class region. Over-generalization causes non-minority regions to be misclassified as minority class. To overcome the over-generalization problem, we propose an algorithm, called TRIM, that searches for a precise minority region while maintaining its generalization. TRIM iteratively filters out irrelevant majority data from the precise minority region. The output of the algorithm is multiple sets of seed minority data, and each individual set is used to generate new synthetic data. Compared with state-of-the-art over-sampling algorithms, experimental results show significant performance improvement in terms of F-measure and AUC. This suggests that over-generalization has a significant impact on the performance of synthetic over-sampling methods.
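
The SMOTE interpolation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the TRIM algorithm, whose details the abstract does not give; it only shows how a synthetic point is placed on the line between a minority sample and one of its nearest minority neighbors. The function name `smote_sketch` and its parameters (`n_synthetic`, `k`, `seed`) are hypothetical, and the brute-force neighbor search is an assumption made for brevity.

```python
import numpy as np

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Illustrative SMOTE-style interpolation (hypothetical helper, not TRIM).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbors.
    Assumes at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Brute-force pairwise Euclidean distances between minority samples.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor

    # Indices of the k nearest minority neighbors of every sample.
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample and one of its neighbors for each synthetic point.
    base = rng.integers(0, n, size=n_synthetic)
    picked = neighbors[base, rng.integers(0, k, size=n_synthetic)]

    # Interpolate with a random gap in [0, 1) along the connecting line.
    gap = rng.random((n_synthetic, 1))
    return minority[base] + gap * (minority[picked] - minority[base])

# Usage: four minority points on the corners of the unit square.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(corners, n_synthetic=20, k=2, seed=1)
```

Following the abstract's description, a routine of this kind would presumably be applied to each seed minority set separately. Because the interpolation itself never consults majority data, a neighbor chosen across a majority region can place synthetic points inside that region, which is the over-generalization problem the abstract describes.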
