Learning on extremes - size and imbalance - of data.

Abstract

The availability of extremely large datasets has opened avenues for the application of distributed and/or parallel learning. Our approach to learning from such large datasets is to utilize all the training data by learning classifiers on manageable subsets of the data. Using a voting mechanism, the predictions of the individual classifiers can be combined. Our experiments with decision trees and neural networks indicate that, in applications involving massive datasets, the simple approach of creating a committee of classifiers from disjoint partitions results in a fast and accurate classifier. The reduced complexity of producing random disjoint partitions makes the technique attractive for creating classifiers on extremely large datasets. We also propose a distributed version of pasting small votes, and achieve classification accuracy comparable to the sequential version as well as to boosting. Distributed pasting of small votes is scalable and fast, even with very large datasets. We also highlight the interplay between stable or unstable methods of learning classifiers and the diversity of the resulting classifiers.

A dataset is imbalanced if the classification categories are not approximately equally represented. Real-world datasets are usually composed predominantly of “normal” examples with only a small percentage of “abnormal” or “interesting” examples. Often the cost of misclassifying an abnormal (interesting) example as a normal example is much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. We show that a combination of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance than under-sampling the majority class alone. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper, and the Naive Bayes Classifier. We also show that our combination of over-sampling and under-sampling performs better than varying the loss ratio in Ripper or varying the class priors in the Naive Bayes Classifier, the methods that can directly handle skewed class distributions. We evaluate our approaches using ROC (AUC, ROC convex hull) analyses.
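The following is a minimal sketch of the committee-of-classifiers idea described in the first paragraph of the abstract, not the dissertation's implementation: decision trees are trained on disjoint partitions of the training data and combined by an unweighted majority vote. The synthetic dataset, the partition count, and the use of scikit-learn's DecisionTreeClassifier are illustrative assumptions.

```python
# Sketch: train one decision tree per disjoint partition, combine by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_partitions = 8  # illustrative; choose so each partition fits in memory
indices = np.random.RandomState(0).permutation(len(X_train))
committee = []
for part in np.array_split(indices, n_partitions):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train[part], y_train[part])   # each tree sees only its own partition
    committee.append(clf)

# Unweighted majority vote over the committee's predictions.
votes = np.stack([clf.predict(X_test) for clf in committee])
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("committee accuracy:", (majority == y_test).mean())
```

Because each partition is a fraction of the full training set, every member classifier is cheap to build, while the vote over the committee lets all of the training data contribute to the final prediction.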
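The resampling strategy in the second paragraph can be sketched as follows: synthetic minority examples are generated by interpolating between a minority point and one of its nearest minority-class neighbours (the idea behind the SMOTE algorithm), and the majority class is randomly under-sampled. This is a minimal illustration under assumed parameter values, not the dissertation's exact procedure; the helper names and the toy data are hypothetical.

```python
# Sketch: synthetic minority over-sampling plus random majority under-sampling.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def oversample_minority(X_min, n_synthetic, k=5, rng=None):
    """Create n_synthetic points by interpolating toward nearby minority examples."""
    rng = rng or np.random.default_rng(0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh = nn.kneighbors(X_min)           # column 0 is the point itself
    samples = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))           # pick a minority example
        j = neigh[i][rng.integers(1, k + 1)]   # pick one of its k neighbours
        gap = rng.random()                     # interpolate a random distance along the segment
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(samples)

def undersample_majority(X_maj, n_keep, rng=None):
    """Keep a random subset of the majority class."""
    rng = rng or np.random.default_rng(0)
    keep = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[keep]

# Example usage on a toy imbalanced dataset (illustrative only).
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(950, 2))    # "normal" class
X_min = rng.normal(2.0, 1.0, size=(50, 2))     # "abnormal" class
X_min_balanced = np.vstack([X_min, oversample_minority(X_min, 200)])
X_maj_reduced = undersample_majority(X_maj, 400)
```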

Bibliographic record

  • Author: Chawla, Nitesh Vijay
  • Author affiliation: University of South Florida

  • Degree-granting institution: University of South Florida
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2002
  • Pages: 158 p.
  • Total pages: 158
  • Format: PDF
  • Language: English (eng)
  • CLC classification: Automation technology, computer technology
  • Keywords:
