
Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark



Abstract

Classifying datasets with a skewed class distribution is an important problem in data mining. Evolutionary undersampling of the majority class has proved to be a successful approach to this issue. The task becomes even more difficult when the number of majority class examples is very large: the evolutionary model becomes impractical because of memory and time constraints. Divide-and-conquer approaches based on the MapReduce paradigm have already been proposed to handle this type of problem by dividing the data into multiple subsets. However, in extremely imbalanced cases, these models may suffer from a low density of minority class examples in the subsets considered. To address this problem, we propose a new big data scheme based on Apache Spark for highly imbalanced datasets. We take advantage of Spark's in-memory operations to diminish the effect of the small sample size. The key point of the proposal is the independent management of majority and minority class examples, which allows us to keep a higher number of minority class examples in each subset. In our experiments, we analyze the proposed model on several datasets with up to 17 million instances. The results show the effectiveness of this evolutionary undersampling model for extremely imbalanced big data classification.
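
The difference between splitting all examples together and managing the two classes independently can be illustrated with a small sketch in plain Python (no Spark required). This is not the authors' implementation. It assumes, purely for illustration, that each subset receives a full copy of the minority class, whereas the abstract only states that more minority examples are kept per subset. The function names `naive_split`, `class_aware_split`, and `minority_counts`, and the toy data, are hypothetical.

```python
import random
from typing import List, Tuple

# (features, label); label 1 = minority class, label 0 = majority class
Example = Tuple[List[float], int]


def naive_split(data: List[Example], num_subsets: int, seed: int = 0) -> List[List[Example]]:
    """Shuffle all examples together and deal them into subsets round-robin."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i::num_subsets] for i in range(num_subsets)]


def class_aware_split(data: List[Example], num_subsets: int, seed: int = 0) -> List[List[Example]]:
    """Split only the majority class across subsets.

    Assumption for this sketch: every subset gets a full copy of the
    minority class, so no subset is starved of minority examples.
    """
    rng = random.Random(seed)
    minority = [ex for ex in data if ex[1] == 1]
    majority = [ex for ex in data if ex[1] == 0]
    rng.shuffle(majority)
    return [majority[i::num_subsets] + minority for i in range(num_subsets)]


def minority_counts(subsets: List[List[Example]]) -> List[int]:
    """Count minority examples in each subset."""
    return [sum(1 for _, label in subset if label == 1) for subset in subsets]


# Toy data: 10,000 majority examples and 20 minority examples.
data: List[Example] = (
    [([float(i)], 0) for i in range(10_000)]
    + [([float(i)], 1) for i in range(20)]
)

naive = naive_split(data, num_subsets=50)
aware = class_aware_split(data, num_subsets=50)

print("naive split, minority per subset:", minority_counts(naive)[:10])
print("class-aware split, minority per subset:", minority_counts(aware)[:10])
```

With 20 minority examples spread over 50 subsets, the naive split necessarily leaves most subsets with no minority examples at all, which is the lack of density the abstract describes. In the class-aware split, any undersampling step run on a subset would see all 20 minority examples next to its 200 majority examples.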
