Published in: International Journal of Intelligent Information and Database Systems

An Earth mover's distance-based undersampling approach for handling class-imbalanced data


Abstract

Imbalanced datasets typically make accurate prediction difficult, and most real-world data are imbalanced in nature. Traditional classifiers assume a well-balanced class distribution in the training data, but practical datasets are often imbalanced, which confuses a classifier and degrades its ability to learn from them. Data pre-processing approaches address this concern by using either random undersampling or oversampling techniques. In this paper, we introduce Earth mover's distance (EMD) as a similarity measure to find samples that are similar in nature and eliminate them from the dataset as redundant. Earth mover's distance has received a lot of attention in areas such as computer vision, image retrieval, and machine learning. The Earth mover's distance-based undersampling approach provides a data-level solution that eliminates redundant instances among the majority samples without any loss of valuable information. The method is implemented with five conventional classifiers, namely C4.5 decision tree (DT), k-nearest neighbour (k-NN), multilayer perceptron (MLP), support vector machine (SVM), and naive Bayes (NB), and with one ensemble technique, AdaBoost. The proposed method yields superior performance on 21 datasets from the KEEL repository.
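The abstract does not spell out the exact procedure, so the following is only a minimal sketch of the general idea: use an EMD-style distance to spot near-duplicate majority-class samples and drop them. Everything specific here is an assumption for illustration and not the authors' method: the function names `emd_1d` and `undersample_majority`, the `threshold` parameter, the greedy keep-or-drop rule, and the choice to treat each feature vector as a uniform empirical distribution over its values. For two equal-length samples with uniform weights, the one-dimensional EMD reduces to the mean absolute difference of the sorted values.

```python
def emd_1d(u, v):
    """1-D Earth mover's distance between two equal-length samples.

    Each sample is treated as a uniform empirical distribution over its
    values (an assumption of this sketch). In that case the EMD equals the
    mean absolute difference between the sorted values.
    """
    if len(u) != len(v):
        raise ValueError("this sketch only handles equal-length samples")
    if not u:
        raise ValueError("samples must be non-empty")
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


def undersample_majority(majority, threshold):
    """Greedily drop majority samples that are redundant under emd_1d.

    A sample is kept only if its distance to every already-kept sample is
    strictly greater than `threshold` (a hypothetical tuning parameter).
    """
    kept = []
    for row in majority:
        if all(emd_1d(row, other) > threshold for other in kept):
            kept.append(row)
    return kept


if __name__ == "__main__":
    majority = [[0.0, 1.0], [0.0, 1.1], [5.0, 6.0], [1.0, 0.0]]
    reduced = undersample_majority(majority, threshold=0.1)
    print(len(majority), "->", len(reduced), "majority samples")
```

Note that this simplified distance ignores feature order, so `[0.0, 1.0]` and `[1.0, 0.0]` count as identical; a real implementation would need a representation of each sample that suits the dataset, and the minority class would be left untouched.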