Partially Synthesised Dataset to Improve Prediction Accuracy (Case Study: Prediction of Heart Diseases)

Abstract

Real-world data sources, such as statistical agencies, library data banks and research institutes, are the major data sources for researchers. Using this type of data has several advantages: it improves the credibility and validity of the experiment and, more importantly, it relates to real-world problems and is typically unbiased. However, this type of data is often unavailable or inaccessible, for three reasons. First, privacy and confidentiality concerns require that the data be protected on legal and ethical grounds. Second, collecting real-world data is costly and time-consuming. Third, the data may not exist, particularly for newly arising research subjects. Therefore, many studies have used fully and/or partially synthesised data instead of real-world data, because it is simple to create, requires relatively little time, and can be generated in sufficient quantity to fit the requirements. In this context, this study introduces the use of partially synthesised data to improve the prediction of heart diseases from risk factors. We propose generating partially synthetic data from agreed principles using a rule-based method, in which an extra risk factor is added to the real-world data. In the conducted experiment, more than 85% of the data was derived from observed values (i.e., real-world data), while the remaining data was synthetically generated using a rule-based method in accordance with World Health Organisation criteria. The analysis revealed an improvement in the variance captured by the first two principal components of the partially synthesised data. A further evaluation was conducted using five popular supervised machine-learning classifiers, in which the partially synthesised data considerably improved the prediction of heart diseases: the majority of classifiers approximately doubled their predictive performance when using the extra risk factor.
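
The idea of adding one rule-based synthetic column to otherwise observed records can be sketched as follows. This is a minimal illustration only, not the authors' method: the column names, the thresholds, the probabilities and the name `synthetic_factor` are all hypothetical placeholders and are not taken from the paper or from World Health Organisation criteria. With six observed columns and one synthetic column, 6/7 (about 85.7%) of the values in this toy table are observed.

```python
import random

# Hypothetical observed records; names and values are illustrative only.
observed = [
    {"age": 63, "systolic_bp": 145, "cholesterol": 233, "smoker": 1, "bmi": 29.1, "diabetes": 0},
    {"age": 41, "systolic_bp": 118, "cholesterol": 180, "smoker": 0, "bmi": 23.4, "diabetes": 0},
    {"age": 57, "systolic_bp": 162, "cholesterol": 270, "smoker": 1, "bmi": 31.0, "diabetes": 1},
]


def placeholder_rule(record, rng):
    """Hypothetical rule: draw a binary factor whose probability rises
    with two observed values. Thresholds are placeholders, not WHO criteria."""
    p = 0.1
    if record["systolic_bp"] >= 140:
        p += 0.3
    if record["age"] >= 55:
        p += 0.2
    return 1 if rng.random() < p else 0


def add_synthetic_column(records, name, rule, seed=0):
    """Return new records with one extra rule-generated column;
    the input records are left unmodified."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [{**r, name: rule(r, rng)} for r in records]


def observed_fraction(records, synthetic_names):
    """Share of all cell values that come from observed columns."""
    total = sum(len(r) for r in records)
    synthetic = sum(1 for r in records for key in r if key in synthetic_names)
    return (total - synthetic) / total


partial = add_synthetic_column(observed, "synthetic_factor", placeholder_rule)
fraction = observed_fraction(partial, {"synthetic_factor"})
print(f"observed share of values: {fraction:.3f}")
```

In a real study the rule would encode the agreed clinical criteria, and the resulting table could then be passed to principal component analysis and to the classifiers being compared.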
