Deep Learning and Data Sampling with Imbalanced Big Data


Abstract

This study evaluates the use of deep learning and data sampling on a class-imbalanced Big Data problem, namely Medicare fraud detection. Medicare offers affordable health insurance to the elderly population and serves more than 15% of the United States population. To increase transparency and help reduce fraud, the Centers for Medicare and Medicaid Services (CMS) has made several data sets publicly available for analysis. Our research group has conducted several studies using CMS data and traditional (non-deep-learning) machine learning algorithms, but challenges associated with severe class imbalance leave room for improvement. These previous studies serve as baselines as we employ deep neural networks with various data-sampling techniques to determine the efficacy of deep learning in addressing class imbalance. Random over-sampling (ROS), random under-sampling (RUS), and combinations of the two (ROS-RUS) are applied to study how varying levels of class imbalance impact model training and performance. Classwise performance is maximized by identifying optimal decision thresholds, and a strong linear relationship between minority class size and optimal threshold is observed. Results show that ROS significantly outperforms RUS, that combining RUS and ROS maximizes both performance and efficiency, with a 4× speedup in training time, and that the default threshold of 0.5 is never optimal when training data is imbalanced. To the best of our knowledge, this is the first study to provide statistical results comparing ROS, RUS, and ROS-RUS deep learning methods across a range of class distributions. Additional contributions include a unique analysis of thresholding as it relates to the minority class size and state-of-the-art performance on the given fraud detection task.
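The three sampling strategies and the threshold search named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `majority_keep` parameter, and the threshold grid are hypothetical, and the geometric mean of the true positive and true negative rates is assumed here as one plausible way to balance classwise performance, since the abstract does not state the exact criterion.

```python
import numpy as np

def _split(y):
    """Return (minority, majority) row indices for binary labels (1 = minority)."""
    return np.flatnonzero(y == 1), np.flatnonzero(y == 0)

def _n_pos_for(n_neg, frac):
    """Number of minority rows needed so that they make up `frac` of the set."""
    return int(round(n_neg * frac / (1.0 - frac)))

def rus(y, frac, rng):
    """Random under-sampling: discard majority rows without replacement."""
    pos, neg = _split(y)
    n_neg = min(len(neg), int(round(len(pos) * (1.0 - frac) / frac)))
    return np.concatenate([pos, rng.choice(neg, size=n_neg, replace=False)])

def ros(y, frac, rng):
    """Random over-sampling: duplicate minority rows with replacement."""
    pos, neg = _split(y)
    n_extra = max(0, _n_pos_for(len(neg), frac) - len(pos))
    return np.concatenate([neg, pos, rng.choice(pos, size=n_extra, replace=True)])

def ros_rus(y, frac, majority_keep, rng):
    """Hybrid: keep a share of the majority class, then over-sample the minority."""
    pos, neg = _split(y)
    kept = rng.choice(neg, size=int(round(len(neg) * majority_keep)), replace=False)
    n_extra = max(0, _n_pos_for(len(kept), frac) - len(pos))
    return np.concatenate([kept, pos, rng.choice(pos, size=n_extra, replace=True)])

def best_threshold(y_true, scores, grid=None):
    """Grid-search the decision threshold that maximizes sqrt(TPR * TNR).

    Assumes y_true contains at least one example of each class.
    """
    grid = np.linspace(0.0, 1.0, 101) if grid is None else grid
    pos = y_true == 1
    best_t, best_g = 0.5, -1.0
    for t in grid:
        pred = scores >= t
        tpr = np.mean(pred[pos])      # recall on the minority class
        tnr = np.mean(~pred[~pos])    # recall on the majority class
        g = float(np.sqrt(tpr * tnr))
        if g > best_g:                # keep the first threshold reaching the maximum
            best_t, best_g = float(t), g
    return best_t, best_g

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = np.array([0] * 1000 + [1] * 10)          # synthetic labels, 1% minority
    for name, idx in [("RUS", rus(y, 0.5, rng)),
                      ("ROS", ros(y, 0.5, rng)),
                      ("ROS-RUS", ros_rus(y, 0.5, 0.5, rng))]:
        print(name, len(idx), y[idx].mean())
```

Each sampler returns row indices, so the same index array can select both features and labels. In this sketch the hybrid reaches the same class ratio as pure ROS with a smaller training set, which is the kind of trade-off the abstract's training-time result refers to. The threshold should be searched on held-out scores, not on the resampled training data.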
