Journal: Information Systems Frontiers

The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data


Abstract

Training predictive models with class-imbalanced data has proven to be a difficult task. The problem is well studied, but the era of big data is producing more extreme levels of imbalance that are increasingly difficult to model. We use three data sets of varying complexity to evaluate data sampling strategies for treating high class imbalance with deep neural networks and big data. Sampling rates are varied to create training distributions with positive class sizes from 0.025% to 90%. The area under the receiver operating characteristic (ROC) curve is used to compare performance, and thresholding is used to maximize class performance. Random over-sampling (ROS) consistently outperforms random under-sampling (RUS) and baseline methods. The majority class proves susceptible to misrepresentation when using RUS, and results suggest that each data set is uniquely sensitive to imbalance and sample size. The hybrid ROS-RUS method maximizes performance and efficiency, and is our preferred method for treating high imbalance within big data problems.
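
The three sampling strategies named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sample_indices`, the `neg_keep` parameter, and the toy label vector are hypothetical, and the hybrid variant assumes that ROS-RUS means first dropping a fraction of the majority class and then duplicating minority samples until the target positive fraction is reached.

```python
import numpy as np

def sample_indices(y, pos_frac, method, neg_keep=0.5, seed=0):
    """Return shuffled row indices whose positive-class share is ~pos_frac.

    y        : 1-D array of binary labels (1 = positive/minority class)
    pos_frac : desired fraction of positives in the sample, in (0, 1)
    method   : "rus", "ros", or "ros-rus"
    neg_keep : (assumption) share of negatives kept by the hybrid method
    """
    if not 0.0 < pos_frac < 1.0:
        raise ValueError("pos_frac must lie strictly between 0 and 1")
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)

    if method == "rus":
        # Random under-sampling: discard negatives without replacement.
        n_neg = int(round(len(pos) * (1 - pos_frac) / pos_frac))
        neg = rng.choice(neg, size=min(n_neg, len(neg)), replace=False)
    elif method in ("ros", "ros-rus"):
        if method == "ros-rus":
            # Hybrid: shrink the majority class first to cut training cost.
            n_neg = int(round(neg_keep * len(neg)))
            neg = rng.choice(neg, size=n_neg, replace=False)
        # Random over-sampling: keep every positive, then add duplicates
        # drawn with replacement until the target share is reached.
        n_pos = int(round(len(neg) * pos_frac / (1 - pos_frac)))
        extra = max(n_pos - len(pos), 0)
        pos = np.concatenate([pos, rng.choice(pos, size=extra, replace=True)])
    else:
        raise ValueError(f"unknown method: {method!r}")

    return rng.permutation(np.concatenate([pos, neg]))

# Toy labels: 10 positives among 1000 rows (1% positive class).
y = np.zeros(1000, dtype=int)
y[:10] = 1

for method in ("rus", "ros", "ros-rus"):
    idx = sample_indices(y, pos_frac=0.5, method=method)
    print(method, len(idx), y[idx].mean())
```

On this toy input, balancing to 50% positives leaves RUS with only 20 rows, ROS with 1980 rows, and the hybrid with 990 rows. That size difference is one way to read the abstract's remarks that RUS can misrepresent the majority class and that the hybrid trades off performance against efficiency.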
