Learning on extremes - size and imbalance - of data.

Abstract

The availability of extremely large datasets has opened avenues for the application of distributed and/or parallel learning. Our approach to learning from such large datasets is to utilize all the training data by learning classifiers on manageable subsets of the data. Using a voting mechanism, the predictions of the individual classifiers can be combined. Our experiments with decision trees and neural networks indicate that, in applications involving massive datasets, the simple approach of creating a committee of classifiers from disjoint partitions results in a fast and accurate classifier. The reduced complexity of producing random disjoint partitions makes the technique attractive for creating classifiers on extremely large datasets. We also propose a distributed version of pasting small votes, and achieve classification accuracy comparable to the sequential version as well as to boosting. Distributed pasting of small votes is scalable and fast, even with very large datasets. We also highlight the interplay between stable or unstable methods of learning classifiers and the diversity of the resulting classifiers.

A dataset is imbalanced if the classification categories are not approximately equally represented. Real-world datasets are usually composed predominantly of “normal” examples with only a small percentage of “abnormal” or “interesting” examples. Often the cost of misclassifying an abnormal (interesting) example as a normal example is much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. We show that a combination of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance than under-sampling the majority class alone. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper, and the Naive Bayes Classifier. We also show that our combination of over-sampling and under-sampling performs better than varying the loss ratio in Ripper or varying the class priors in the Naive Bayes Classifier, the methods that can directly handle skewed class distributions. We evaluate our approaches using ROC (AUC, ROC convex hull) analyses.
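The following is a minimal sketch of the committee-of-classifiers idea described in the first paragraph of the abstract, not the dissertation's implementation: decision trees are trained on disjoint partitions of the training data and combined by an unweighted majority vote. The synthetic dataset, the partition count, and the use of scikit-learn's DecisionTreeClassifier are illustrative assumptions.

```python
# Sketch: train one decision tree per disjoint partition, combine by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_partitions = 8  # illustrative; choose so each partition fits in memory
indices = np.random.RandomState(0).permutation(len(X_train))
committee = []
for part in np.array_split(indices, n_partitions):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train[part], y_train[part])   # each tree sees only its own partition
    committee.append(clf)

# Unweighted majority vote over the committee's predictions.
votes = np.stack([clf.predict(X_test) for clf in committee])
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("committee accuracy:", (majority == y_test).mean())
```

Because each partition is a fraction of the full training set, every member classifier is cheap to build, while the vote over the committee lets all of the training data contribute to the final prediction.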
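The resampling strategy in the second paragraph can be sketched as follows: synthetic minority examples are generated by interpolating between a minority point and one of its nearest minority-class neighbours (the idea behind the SMOTE algorithm), and the majority class is randomly under-sampled. This is a minimal illustration under assumed parameter values, not the dissertation's exact procedure; the helper names and the toy data are hypothetical.

```python
# Sketch: synthetic minority over-sampling plus random majority under-sampling.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def oversample_minority(X_min, n_synthetic, k=5, rng=None):
    """Create n_synthetic points by interpolating toward nearby minority examples."""
    rng = rng or np.random.default_rng(0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh = nn.kneighbors(X_min)           # column 0 is the point itself
    samples = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))           # pick a minority example
        j = neigh[i][rng.integers(1, k + 1)]   # pick one of its k neighbours
        gap = rng.random()                     # interpolate a random distance along the segment
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(samples)

def undersample_majority(X_maj, n_keep, rng=None):
    """Keep a random subset of the majority class."""
    rng = rng or np.random.default_rng(0)
    keep = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[keep]

# Example usage on a toy imbalanced dataset (illustrative only).
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(950, 2))    # "normal" class
X_min = rng.normal(2.0, 1.0, size=(50, 2))     # "abnormal" class
X_min_balanced = np.vstack([X_min, oversample_minority(X_min, 200)])
X_maj_reduced = undersample_majority(X_maj, 400)
```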

Bibliographic record

  • Author: Chawla, Nitesh Vijay
  • Author affiliation: University of South Florida

  • Degree-granting institution: University of South Florida
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2002
  • Pages: 158 p.
  • Total pages: 158
  • Format: PDF
  • Language: English (eng)
  • CLC classification: Automation technology, computer technology
  • Keywords:
