Fast-CBUS: A fast clustering-based undersampling method for addressing the class imbalance problem


Abstract

Datasets that have imbalanced class distributions pose a challenge for learning and classification algorithms. Imbalanced datasets exist in many domains, such as fraud detection, sentiment analysis, churn prediction, and intrusion detection in computer networks. To solve the imbalance problem, three main approaches are typically used: data resampling, method adaptation, and cost-sensitive learning. Of these, data resampling, either oversampling the minority class instances or undersampling the majority class instances, is the most used approach. However, in most cases, when implementing these approaches, there is a trade-off between predictive performance and complexity. In this paper we introduce a fast, novel clustering-based undersampling technique for addressing binary-class imbalance problems, which demonstrates high predictive performance, while its time complexity is bounded by the number of minority class instances. During the training phase, the algorithm clusters the minority instances and selects a similar number of majority instances from each cluster. A specific classifier is then trained for each cluster. An unlabeled instance is classified as the majority class if it does not fit into any of the clusters. Otherwise, cluster-specific classifiers are used to return the instance's classification, and the results are weighted by the inverse distance from the clusters. Our evaluation includes several state-of-the-art methods. We plot the Pareto frontier for various datasets, to consider both computational cost and predictive performance measures. Extensive sets of experiments demonstrate that only the suggested method is always found on the frontier. © 2017 Elsevier B.V. All rights reserved.
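
The training and classification steps described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation. The abstract does not specify the clustering algorithm, how majority instances are matched to clusters, what it means for an instance to "fit" a cluster, or which classifier is trained per cluster, so everything below is an assumption made for illustration: a small hand-written k-means, picking the majority instances nearest each cluster center, a radius test for "fit", and a nearest-centroid rule standing in for the cluster-specific classifier. All function names (`kmeans`, `fit`, `predict`) are hypothetical, and the sketch makes no attempt to reproduce the paper's time-complexity bound.

```python
import math
import random


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means (assumed clustering step; the abstract names none)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[j].append(p)
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers, groups


def fit(minority, majority, k=2, seed=0):
    """Cluster the minority class, then pair each cluster with an equal
    number of majority instances (assumption: those nearest its center)."""
    centers, groups = kmeans(minority, k, seed=seed)
    model = []
    for center, group in zip(centers, groups):
        if not group:
            continue
        radius = max(math.dist(p, center) for p in group)
        nearest = sorted(majority, key=lambda p: math.dist(p, center))[:len(group)]
        # Stand-in per-cluster classifier: compare distance to class means.
        model.append({"center": center, "radius": radius,
                      "min_mean": mean(group), "maj_mean": mean(nearest)})
    return model


def predict(model, x):
    """Majority if x fits no cluster; otherwise an inverse-distance-weighted
    vote over the classifiers of the clusters that x fits."""
    votes, total = 0.0, 0.0
    for c in model:
        d = math.dist(x, c["center"])
        if d > c["radius"]:
            continue  # assumption: "fit" means lying within the cluster radius
        w = 1.0 / (d + 1e-9)
        is_min = math.dist(x, c["min_mean"]) < math.dist(x, c["maj_mean"])
        votes += w * (1.0 if is_min else 0.0)
        total += w
    if total == 0.0:
        return "majority"
    return "minority" if votes / total >= 0.5 else "majority"


# Toy data, invented for illustration only.
minority = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
majority = [(3, 3), (3, 4), (4, 3), (4, 4), (6, 6), (6, 7), (7, 6), (7, 7),
            (20, 0), (20, 1), (0, 20), (1, 20)]

model = fit(minority, majority, k=2)
print(predict(model, (0.2, 0.3)))    # inside the first minority cluster
print(predict(model, (5.0, 5.0)))    # fits no cluster
```

On this toy data the first query falls inside a minority cluster and is handed to that cluster's classifier, while the second lies outside every cluster radius and is labeled as majority without consulting any classifier, which is the shortcut the abstract describes.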

Bibliographic details

  • Source
    Neurocomputing | 2017, Issue 21 | pp. 88-102 | 15 pages
  • Author affiliations

    Ben Gurion Univ Negev, Dept Software & Informat Syst Engn, POB 653, IL-8410501 Beer Sheva, Israel;

    Ben Gurion Univ Negev, Dept Software & Informat Syst Engn, POB 653, IL-8410501 Beer Sheva, Israel;

    Ben Gurion Univ Negev, Dept Software & Informat Syst Engn, POB 653, IL-8410501 Beer Sheva, Israel;

    Ben Gurion Univ Negev, Dept Software & Informat Syst Engn, POB 653, IL-8410501 Beer Sheva, Israel;

  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Text language: English
  • Keywords

    Classification; Data mining; Undersampling; Imbalanced data distribution; Data partitioning;
