Journal: Frontiers in Public Health

Over- and Under-sampling Approach for Extremely Imbalanced and Small Minority Data Problem in Health Record Analysis

Abstract

A considerable amount of health record (HR) data has been stored thanks to recent advances in the digitalization of medical systems. However, HR data are not always easy to analyze, particularly when the number of persons with a target disease is very small relative to the population. This situation is called the imbalanced data problem. Over-sampling and under-sampling are two approaches for redressing an imbalance between minority and majority examples, and they can be combined into ensemble algorithms. However, these approaches do not function when the absolute number of minority examples is small, a situation called the extremely imbalanced and small minority (EISM) data problem. The present work proposes a new algorithm, boosting combined with heuristic under-sampling and distribution-based sampling (HUSDOS-Boost), to solve the EISM data problem. To build an artificially balanced dataset from the original imbalanced dataset, HUSDOS-Boost uses under-sampling to eliminate redundant majority examples based on prior boosting results and over-sampling to generate artificial minority examples that follow the minority class distribution. The performance and characteristics of HUSDOS-Boost were evaluated through application to eight imbalanced datasets. In addition, the algorithm was applied to original clinical HR data to detect patients with stomach cancer. The results showed that HUSDOS-Boost outperformed current imbalanced data handling methods, particularly when the data are EISM. Thus, the proposed HUSDOS-Boost is a useful methodology for HR data analysis.
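The combination of under-sampling and distribution-based over-sampling described in the abstract can be illustrated with a rough sketch. This is not the HUSDOS-Boost algorithm: the abstract does not give its details, so the function names, the independent per-feature Gaussian fit for the minority class, and the use of per-example weights as a stand-in for "prior boosting results" are all assumptions made here for illustration.

```python
import random
import statistics


def oversample_minority(minority, n_new, rng):
    """Draw n_new synthetic rows from per-feature Gaussians fitted to the minority rows.

    Assumption: features are treated as independent; the paper's actual
    distribution-based sampling may differ.
    """
    columns = list(zip(*minority))
    means = [statistics.fmean(c) for c in columns]
    stdevs = [statistics.stdev(c) if len(c) > 1 else 0.0 for c in columns]
    return [[rng.gauss(m, s) for m, s in zip(means, stdevs)] for _ in range(n_new)]


def undersample_majority(majority, weights, n_keep, rng):
    """Keep n_keep majority rows without replacement, favouring larger weights.

    Assumption: `weights` are positive scores (for example, example weights
    from an earlier boosting round). Uses the Efraimidis-Spirakis key u ** (1 / w).
    """
    keyed = [(rng.random() ** (1.0 / w), row) for row, w in zip(majority, weights)]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in keyed[:n_keep]]


def balance(majority, minority, weights, n_per_class, seed=0):
    """Return (X, y) with n_per_class majority rows (label 0) and minority rows (label 1)."""
    rng = random.Random(seed)
    kept = undersample_majority(majority, weights, n_per_class, rng)
    n_new = max(0, n_per_class - len(minority))
    synthetic = oversample_minority(minority, n_new, rng)
    X = kept + minority + synthetic
    y = [0] * len(kept) + [1] * (len(minority) + len(synthetic))
    return X, y


if __name__ == "__main__":
    # Toy data: 200 majority rows and only 3 minority rows.
    majority = [[float(i), float(i % 7)] for i in range(200)]
    minority = [[1.0, 2.0], [1.5, 2.5], [0.5, 1.5]]
    weights = [1.0] * len(majority)  # uniform weights: plain random under-sampling
    X, y = balance(majority, minority, weights, n_per_class=20)
    print(len(X), y.count(0), y.count(1))
```

In a boosting setting, a balancing step like this would be repeated each round with updated weights, so that each weak learner sees a different balanced subset. With only a handful of minority rows, the fitted standard deviations are themselves noisy, which is one reason the EISM setting is hard.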