Journal: Health Information Science and Systems

RSMOTE: improving classification performance over imbalanced medical datasets

Abstract

Introduction: Medical diagnosis is a crucial step in patient treatment. However, diagnosis is prone to bias due to imbalanced datasets. To overcome the imbalanced dataset problem, the synthetic minority oversampling technique (SMOTE) was proposed, which generates new synthetic samples at the data level to balance the minority and majority classes. However, the synthetic samples are generated at random, which causes a class mixture problem and thus degrades classification performance and biases diagnosis.

Purpose: To overcome the shortcomings of SMOTE, several modified methods have been proposed that generate synthetic samples along the line segments between selected minority samples. Most of these methods adopt one of two policies for selecting the minority samples used to generate synthetic samples: borderline region sampling or safe region sampling. However, both suffer from the over-generalisation problem. We propose a modified SMOTE-based resampling method called RSMOTE to alleviate the imbalanced medical dataset problem. We provide an in-depth analysis and verify the performance of RSMOTE over imbalanced medical datasets.

Methods: The proposed RSMOTE divides the minority sample domain into four regions (normal, semi-normal, semi-critical, and critical) based on minority sample density analysis. RSMOTE identifies the minority sample regions globally and applies resampling near a specific group of samples.

Results: Our analysis and experiments verify that generating synthetic samples in regions with high minority sample density improves classification performance because of the low risk of class mixture. Unlike some safe region methods, RSMOTE decides the region of minority samples on a global basis, thus removing the over-generalisation problem. Classic and additional evaluation metrics are used to measure the effectiveness of the modified method: Recall, FP Rate, Precision, F-Measure, ROC area, and Average Aggregated Metric. We carried out experiments over various imbalanced medical datasets.

Conclusion: Based on minority sample density analysis, we propose the RSMOTE method, which divides the minority sample domain into four regions. The proposed RSMOTE includes four resampling methods, each of which carries out resampling on a specific region. According to the experimental results, resampling on regions with high minority sample density obtained better results, while resampling on regions with lower minority sample density obtained inferior results. Thus, we conclude that RSMOTE is a more flexible resampling method for imbalanced medical datasets, capable of generating samples in regions with various minority sample densities.
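
The idea of resampling only within a chosen density region can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state how minority sample density is measured, where the region boundaries lie, or which region name corresponds to which density. The code below therefore assumes a k-nearest-neighbour density (the fraction of a minority sample's k nearest neighbours that are also minority), assumes fixed cut points at 0.25, 0.5 and 0.75, and labels the four groups only as levels 0 (lowest density) to 3 (highest). The function names `minority_density`, `density_level` and `smote_interpolate` are hypothetical. The interpolation step is the standard SMOTE construction: a synthetic point is placed at a random position on the line segment between a seed sample and one of its nearest minority neighbours.

```python
import numpy as np


def minority_density(X, y, minority_label, k=5):
    """Return minority indices and, for each, the fraction of its k nearest
    neighbours (among all samples) that are minority. Assumed density measure."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority_label)
    # Distances from every minority sample to every sample.
    d = np.linalg.norm(X[min_idx, None, :] - X[None, :, :], axis=2)
    d[np.arange(len(min_idx)), min_idx] = np.inf  # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]
    return min_idx, (y[nn] == minority_label).mean(axis=1)


def density_level(density):
    """Bin densities into four levels, 0 (lowest) to 3 (highest).
    The cut points are illustrative assumptions, not RSMOTE's thresholds."""
    return np.digitize(density, [0.25, 0.5, 0.75])


def smote_interpolate(X_seed, n_new, k=5, rng=None):
    """Standard SMOTE step: place each synthetic sample on the line segment
    between a random seed and one of its k nearest neighbours among the seeds."""
    X_seed = np.asarray(X_seed, dtype=float)
    if len(X_seed) < 2:
        raise ValueError("need at least two seed samples to interpolate")
    rng = np.random.default_rng(rng)
    k = min(k, len(X_seed) - 1)
    d = np.linalg.norm(X_seed[:, None, :] - X_seed[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_seed), size=n_new)        # seed for each new sample
    neigh = nn[base, rng.integers(0, k, size=n_new)]       # one random neighbour per seed
    gap = rng.random((n_new, 1))                           # position along the segment
    return X_seed[base] + gap * (X_seed[neigh] - X_seed[base])
```

To resample only the highest-density group under these assumptions, compute `idx, dens = minority_density(X, y, 1)`, select `X[idx[density_level(dens) == 3]]` as seeds, and pass them to `smote_interpolate`. Choosing a different level changes which group of minority samples the synthetic points are generated near, which mirrors the abstract's description of four resampling variants, one per region.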
