Journal: Expert Systems with Applications

CDBH: A clustering and density-based hybrid approach for imbalanced data classification


Abstract

Imbalanced data set classification is a prevalent problem in machine learning and data mining. In such data sets the classes are unequal in size: one class (the majority or negative class) has far more samples than the other (the minority or positive class). Classical classifiers are ineffective under these conditions because they are biased toward the majority class and ignore the minority class, which is usually the more important one. Preprocessing the data distribution before training the classifier is one of the most effective ways to address this problem. Preprocessing methods balance the data distribution by decreasing the majority class size (under-sampling methods), increasing the minority class size (over-sampling methods), or combining both (hybrid methods). In this paper, we propose a simple and effective hybrid approach based on clustering and the concept of density, called Clustering and Density-Based Hybrid (CDBH). First, the minority class samples are clustered with the well-known k-means algorithm and their densities within each cluster are computed. Denser minority samples are then more likely to be selected for generating new minority samples. To decrease the majority class size, k-means is applied again, this time to the majority class samples, to cluster them and compute their densities as in the previous stage. Finally, denser majority samples have a higher chance of being retained in the training set, and the remaining samples are removed to balance the distribution of samples between the classes. In the experiments, a Support Vector Machine (SVM) is used as the classifier, and the F-measure and AUC criteria are employed for evaluation. The preprocessing methods are also compared in terms of the complexity of the resulting classification model and the over-sampling rate. A comparison of CDBH with other state-of-the-art methods on 44 imbalanced data sets shows the superiority of the proposed CDBH method in terms of the F-measure criterion.
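The two-stage procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how density is defined, how new samples are generated, or what class sizes are targeted. The sketch assumes that density is inversely related to a sample's distance from its cluster centroid, that a new minority sample is an interpolation between a selected sample and its centroid, and that both classes are resampled to the midpoint of their original sizes. The names `kmeans`, `density_weights`, and `cdbh_like_resample` are hypothetical.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns cluster labels and centroids."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

def density_weights(X, k, rng):
    """Selection probabilities that favor samples close to their centroid.

    ASSUMPTION: density is approximated as 1 / (1 + distance to centroid);
    the abstract does not define the density measure.
    """
    labels, centers = kmeans(X, k, rng)
    dist = np.linalg.norm(X - centers[labels], axis=1)
    weights = 1.0 / (1.0 + dist)
    return weights / weights.sum(), labels, centers

def cdbh_like_resample(X_min, X_maj, k=3, seed=0):
    """Over-sample the minority class and under-sample the majority class."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    # ASSUMPTION: both classes end up at the midpoint of their sizes.
    target = (len(X_min) + len(X_maj)) // 2

    # Stage 1: denser minority samples are picked more often as seeds.
    p_min, labels, centers = density_weights(X_min, k, rng)
    n_new = target - len(X_min)
    seeds = rng.choice(len(X_min), size=n_new, p=p_min)
    gap = rng.random((n_new, 1))
    # ASSUMPTION: a synthetic sample lies between a seed and its centroid.
    synthetic = X_min[seeds] + gap * (centers[labels[seeds]] - X_min[seeds])

    # Stage 2: denser majority samples are more likely to be kept.
    p_maj, _, _ = density_weights(X_maj, k, rng)
    keep = rng.choice(len(X_maj), size=target, replace=False, p=p_maj)

    return np.vstack([X_min, synthetic]), X_maj[keep]
```

Calling `cdbh_like_resample(X_min, X_maj)` with a 20-sample minority array and a 100-sample majority array would return two arrays of 60 rows each, which could then be passed to a classifier such as an SVM. The number of clusters `k` and the target size are free parameters in this sketch.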
