JMIR Medical Informatics

Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing



Abstract

Background: The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce.

Objective: This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data.

Methods: A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression trees, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed.

Results: A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier for models trained and tested on real data matches that for models trained on synthetic data and tested on real data in 26% (5/19) of cases for classification and regression tree and parametric synthetic data, and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility.

Conclusions: The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.
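The comparison described in the Methods and Results can be sketched in a few lines of Python. This is only an illustrative bookkeeping sketch, not the authors' code: the function names (`deviations`, `winner`, `share_lower`), the classifier keys, and every accuracy value below are assumptions made up for the example and are not taken from the study. It assumes each classifier has already been trained twice (once on real data, once on synthetic data) and scored on the same held-out real test set.

```python
from typing import Dict, Optional, Set

# Accuracy on held-out REAL data, keyed by classifier name.
Scores = Dict[str, float]


def deviations(real: Scores, synthetic: Scores) -> Scores:
    """Accuracy lost per classifier when training on synthetic instead of real data."""
    return {name: real[name] - synthetic[name] for name in real}


def winner(scores: Scores, exclude: Optional[Set[str]] = None) -> str:
    """Name of the most accurate classifier, optionally ignoring some classifiers."""
    excluded = exclude or set()
    kept = {name: acc for name, acc in scores.items() if name not in excluded}
    return max(kept, key=kept.get)


def share_lower(dev: Scores) -> float:
    """Fraction of classifiers whose synthetic-trained accuracy is lower."""
    return sum(d > 0 for d in dev.values()) / len(dev)


# Made-up accuracies for ONE hypothetical dataset (NOT results from the study).
real_scores = {"sgd": 0.80, "decision_tree": 0.90, "knn": 0.82,
               "random_forest": 0.92, "svm": 0.84}
synthetic_scores = {"sgd": 0.76, "decision_tree": 0.70, "knn": 0.77,
                    "random_forest": 0.74, "svm": 0.79}
TREE_BASED = {"decision_tree", "random_forest"}

dev = deviations(real_scores, synthetic_scores)
same_winner = winner(real_scores) == winner(synthetic_scores)
same_winner_no_trees = (
    winner(real_scores, TREE_BASED) == winner(synthetic_scores, TREE_BASED)
)

print("accuracy deviation per model:", dev)
print("share of models with lower accuracy:", share_lower(dev))
print("same winning classifier:", same_winner)
print("same winning classifier without tree-based models:", same_winner_no_trees)
```

Repeating this per dataset and per synthetic data generator, then counting how often `same_winner` holds across the 19 datasets, would give proportions of the kind reported in the Results (for example 5/19).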

