Deep Learning and Data Sampling with Imbalanced Big Data

Abstract

This study evaluates the use of deep learning and data sampling on a class-imbalanced Big Data problem, namely Medicare fraud detection. Medicare offers affordable health insurance to the elderly population and serves more than 15% of the United States population. To increase transparency and help reduce fraud, the Centers for Medicare and Medicaid Services (CMS) has made several data sets publicly available for analysis. Our research group has conducted several studies using CMS data and traditional (non-deep-learning) machine learning algorithms, but challenges associated with severe class imbalance leave room for improvement. These previous studies serve as baselines as we employ deep neural networks with various data-sampling techniques to determine the efficacy of deep learning in addressing class imbalance. Random over-sampling (ROS), random under-sampling (RUS), and combinations of the two (ROS-RUS) are applied to study how varying levels of class imbalance impact model training and performance. Classwise performance is maximized by identifying optimal decision thresholds, and a strong linear relationship between minority class size and optimal threshold is observed. Results show that ROS significantly outperforms RUS, that combining the two (ROS-RUS) maximizes both performance and efficiency, with a 4× speedup in training time, and that the default threshold of 0.5 is never optimal when training data is imbalanced. To the best of our knowledge, this is the first study to provide statistical results comparing ROS, RUS, and ROS-RUS deep learning methods across a range of class distributions. Additional contributions include a unique analysis of thresholding as it relates to the minority class size and state-of-the-art performance on the given fraud detection task.
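The sampling and thresholding steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `ros_rus` and `best_threshold`, their parameters, and the label convention (1 = minority/fraud, 0 = majority) are hypothetical. The abstract also does not say which criterion defines the "optimal" threshold, so the geometric mean of the true positive rate and true negative rate is assumed here purely for illustration.

```python
import numpy as np


def ros_rus(X, y, minority_frac, majority_keep=1.0, seed=0):
    """Hypothetical helper: random under-sampling of the majority class (0)
    followed by random over-sampling of the minority class (1), so that the
    minority makes up about `minority_frac` of the returned training set.

    majority_keep=1.0 gives pure ROS. With majority_keep < 1.0 the majority
    class is first reduced (RUS), which shrinks the final training set.
    """
    if not (0.0 < minority_frac < 1.0 and 0.0 < majority_keep <= 1.0):
        raise ValueError("fractions must lie in (0, 1) and (0, 1]")
    rng = np.random.default_rng(seed)
    neg = np.flatnonzero(y == 0)
    pos = np.flatnonzero(y == 1)

    # RUS: keep a random subset of majority samples, without replacement.
    n_neg = max(1, int(round(majority_keep * len(neg))))
    neg_kept = rng.choice(neg, size=n_neg, replace=False)

    # ROS: keep every original minority sample and add random duplicates
    # (drawn with replacement) until the target class ratio is reached.
    n_pos_target = int(round(minority_frac * n_neg / (1.0 - minority_frac)))
    n_extra = max(0, n_pos_target - len(pos))
    pos_extra = rng.choice(pos, size=n_extra, replace=True)

    idx = np.concatenate([neg_kept, pos, pos_extra])
    rng.shuffle(idx)
    return X[idx], y[idx]


def best_threshold(y_true, scores, grid=None):
    """Hypothetical helper: scan candidate decision thresholds and return the
    one maximizing sqrt(TPR * TNR). The criterion is an assumption."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 1001)
    pos = y_true == 1
    neg = ~pos
    best_t, best_g = 0.5, -1.0
    for t in grid:
        pred = scores >= t           # predict the positive class at or above t
        tpr = np.mean(pred[pos])     # fraction of positives caught
        tnr = np.mean(~pred[neg])    # fraction of negatives correctly passed
        g = np.sqrt(tpr * tnr)
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g
```

In this sketch the thresholds would be chosen on validation scores from a model trained on the resampled data, while the evaluation data keeps its original class distribution. A model trained on imbalanced data tends to output low scores for the minority class, which is why a threshold well below 0.5 can separate the classes better than the default.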

Bibliographic details

Source: International Conference on Information Reuse and Integration for Data Science, 2019, pp. 175-183 (9 pages)
Conference location: Los Angeles (US)
Authors: Justin M. Johnson; Taghi M. Khoshgoftaar
Author affiliation: Florida Atlantic University
Original format: PDF
Keywords: Deep learning; Training; Medical services; Big Data; Biomedical imaging; Neural networks; Data models

Similar literature

1. The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data [J]. Justin M. Johnson, Taghi M. Khoshgoftaar. Information Systems Frontiers. 2020, no. 5
2. HSDLM: A Hybrid Sampling With Deep Learning Method for Imbalanced Data Classification [J]. Hasib Khan Md, Towhid Nurul Akter, Islam Md Rafiqul. International Journal of Cloud Applications and Computing. 2021, no. 4
3. Hybrid geometric sampling and AdaBoost based deep learning approach for data imbalance in E-commerce [J]. Sunita Dhote, Chandan Vichoray, Rupesh Pais. Electronic Commerce Research. 2020, no. 2
4. Deep Learning and Data Sampling with Imbalanced Big Data [C]. Justin M. Johnson, Taghi M. Khoshgoftaar. International Conference on Information Reuse and Integration for Data Science. 2019
5. Deep Learning Based Imbalanced Data Classification and Information Retrieval for Multimedia Big Data [D]. Yan, Yilin. 2018
6. Detecting representative data and generating synthetic samples to improve learning accuracy with imbalanced data sets [O]. Der-Chiang Li, Susan C. Hu, Liang-Sian Lin
7. Data Anonymization Using Imbalanced Data for Deep Learning with Uppersampling and Undersampling [O]. Ayahiko Niimi. 2019
