Published in: Knowledge-Based Systems

Multi-class imbalanced big data classification on Spark

Abstract

Despite more than two decades of progress, learning from imbalanced data is still considered one of the contemporary challenges in machine learning. This has been further complicated by the advent of the big data era, where popular algorithms dedicated to alleviating the impact of class skew are no longer feasible due to the volume of the datasets. Additionally, most existing algorithms focus on binary imbalanced problems, where the majority and minority classes are well defined. Multi-class imbalanced data poses further challenges, as the relationships between classes are much more complex and simple decomposition into a number of binary problems leads to a significant loss of information. In this paper, we propose the first compound framework for dealing with multi-class big data problems, addressing at the same time the existence of multiple classes and high volumes of data. We propose to analyze the instance-level difficulties in each class, leading to an understanding of what causes learning difficulties. We embed this information in popular resampling algorithms, which allows for informative balancing of multiple classes. We propose an efficient implementation of the discussed algorithm on Apache Spark, including a novel version of SMOTE that overcomes the spatial limitations of its predecessor in distributed environments. An extensive experimental study shows that using instance-level information significantly improves learning from multi-class imbalanced big data. Our framework can be downloaded from https://github.com/fsleeman/minority-type-imbalanced. (C) 2020 Elsevier B.V. All rights reserved.
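
The abstract does not spell out how instance-level difficulty is measured or how it feeds into resampling, so the following is only a minimal single-machine sketch of the general idea, not the authors' Spark implementation. It assumes a neighbourhood-based taxonomy that is common in the imbalanced-learning literature (safe, borderline, rare, outlier, decided by how many of an instance's 5 nearest neighbours share its class) and the classic SMOTE interpolation step. The function names `k_nearest`, `instance_type`, and `smote_sample`, the thresholds, and the toy data are all hypothetical.

```python
import math
import random
from collections import Counter


def k_nearest(points, idx, k, candidates=None):
    """Indices of the k points closest to points[idx] (Euclidean), excluding idx."""
    pool = range(len(points)) if candidates is None else candidates
    dists = sorted((math.dist(points[idx], points[j]), j) for j in pool if j != idx)
    return [j for _, j in dists[:k]]


def instance_type(points, labels, idx, k=5):
    """Label an instance by how many of its k neighbours share its class.

    Thresholds (assumed, for k=5): 4-5 safe, 2-3 borderline, 1 rare, 0 outlier.
    """
    same = sum(1 for j in k_nearest(points, idx, k) if labels[j] == labels[idx])
    if same >= 4:
        return "safe"
    if same >= 2:
        return "borderline"
    if same == 1:
        return "rare"
    return "outlier"


def smote_sample(points, labels, idx, k=5, rng=random):
    """Classic SMOTE step: interpolate towards a random same-class neighbour."""
    same_class = [j for j, lab in enumerate(labels) if lab == labels[idx]]
    neighbours = k_nearest(points, idx, k, candidates=same_class)
    target = points[rng.choice(neighbours)]
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(x + gap * (t - x) for x, t in zip(points[idx], target))


# Toy data: two compact classes plus one "b" instance inside the "a" cluster.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5), (10.2, 10.8),
    (0.6, 0.4),
]
labels = ["a"] * 6 + ["b"] * 6 + ["b"]

types = [instance_type(points, labels, i) for i in range(len(points))]
type_counts = Counter(types)

# Oversample only from instances of a chosen type, e.g. safe "a" instances.
safe_a = [i for i, t in enumerate(types) if t == "safe" and labels[i] == "a"]
synthetic = smote_sample(points, labels, safe_a[0], rng=random.Random(0))
```

A type-aware resampler could use these labels to decide which instances to oversample or to leave alone (for example, skipping outliers). The brute-force neighbour search here is quadratic in the number of instances, which is exactly the kind of cost that makes a distributed implementation necessary at big-data scale.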