Journal: BMC Medical Informatics and Decision Making

Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines

Abstract

Background: Data mining techniques such as support vector machines (SVMs) have been used successfully to predict outcomes for complex problems, including in human health. Much health data is imbalanced, with many more controls than positive cases.

Methods: The impact of three balancing methods and one feature selection method is explored, to assess the ability of SVMs to classify imbalanced diagnostic pathology data associated with the laboratory diagnosis of hepatitis B (HBV) and hepatitis C (HCV) infections. Random forests (RFs) for predictor variable selection, and data reshaping to overcome a large imbalance of negative to positive HBV and HCV immunoassay results, are examined. The methodology is illustrated using data from ACT Pathology (Canberra, Australia), consisting of laboratory test records from 18,625 individuals who underwent hepatitis virus testing over the decade from 1997 to 2007.

Results: Overall, HCV immunoassay results were predicted more accurately than HBV immunoassay results from identical routine pathology predictor variables. HBV and HCV negative results were vastly in excess of positive results, so three approaches to handling the negative/positive imbalance were compared. Generating datasets by the Synthetic Minority Oversampling Technique (SMOTE) resulted in significantly more accurate prediction than single downsizing or multiple downsizing (MDS) of the dataset. For downsized datasets, applying an RF for predictor variable selection had a small effect on performance, which varied depending on the virus. For SMOTE, an RF had a negative effect on performance. An analysis of variance of the performance across settings supports these findings. Finally, age and alanine aminotransferase (ALT) assay results, together with sodium for HBV and urea for HCV, were found to have a significant impact on the laboratory diagnosis of HBV or HCV infection using an optimised SVM model.

Conclusions: Laboratories looking to include machine learning via SVMs as part of their decision support need to be aware that the balancing method, predictor variable selection and virus type interact, and that their effect on the laboratory diagnosis of hepatitis virus infection from routine pathology variables differs depending on which combination is being studied. This awareness should lead to careful use of existing machine learning methods, thus improving the quality of laboratory diagnosis.
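
To make the two families of balancing methods named in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It contrasts single downsizing (randomly discarding negatives until the classes match) with a simplified SMOTE-style oversampler (interpolating between a positive case and one of its nearest positive neighbours). The function names, the toy `[age, ALT]` rows, the class sizes and `k=3` are all assumptions chosen for illustration; none of them come from the study.

```python
import random


def downsize(majority, minority, rng):
    """Single downsizing: keep a random majority subset the size of the minority."""
    return rng.sample(majority, len(minority)), list(minority)


def smote(minority, n_new, k, rng):
    """Simplified SMOTE: interpolate between a minority row and a near neighbour."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to `base`.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


rng = random.Random(0)

# Hypothetical toy data: [age, ALT] pairs, NOT taken from the ACT Pathology records.
negatives = [[rng.uniform(20, 60), rng.uniform(10, 40)] for _ in range(50)]
positives = [[rng.uniform(30, 70), rng.uniform(40, 120)] for _ in range(5)]

# Downsizing shrinks the negatives; SMOTE grows the positives instead.
down_neg, down_pos = downsize(negatives, positives, rng)
extra_pos = smote(positives, n_new=len(negatives) - len(positives), k=3, rng=rng)
balanced_pos = positives + extra_pos
```

Downsizing discards most of the negative records, whereas the SMOTE-style route keeps every original record and adds synthetic positives that lie between real ones. In practice a maintained implementation, such as the one in the imbalanced-learn package, would normally be preferred over a hand-rolled version like this.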