Application of data mining techniques in the prediction of coronary artery disease : use of anaesthesia time-series and patient risk factor data

Abstract

The high morbidity and mortality associated with atherosclerotic coronary vascular disease (CVD) and its complications are being lessened by increased knowledge of risk factors, effective preventative measures and proven therapeutic interventions. However, significant CVD morbidity remains, and sudden cardiac death continues to be a presenting feature for some patients subsequently diagnosed with CVD. Coronary vascular disease is also the leading cause of anaesthesia-related complications. Stress electrocardiography/exercise testing is predictive of the 10-year risk of CVD events, and the cardiovascular variables used to score this test are monitored peri-operatively. Similar physiological time-series datasets are being subjected to data mining methods for the prediction of medical diagnoses and outcomes. This study aims to find predictors of CVD using anaesthesia time-series data and patient risk factor data. Several pre-processing and predictive data mining methods are applied to these data. Physiological time-series data related to anaesthetic procedures are subjected to pre-processing methods for the removal of outliers and the calculation of moving averages, as well as data summarisation and data abstraction methods. Feature selection methods of both wrapper and filter types are applied to the derived physiological time-series variable sets alone and to the same variables combined with risk factor variables. The ability of these methods to identify subsets of highly correlated but non-redundant variables is assessed. The major dataset is derived from the entire anaesthesia population, and subsets of this population are considered to be at increased anaesthesia risk based on their need for more intensive monitoring (invasive haemodynamic monitoring and additional ECG leads).
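
The outlier removal and moving-average steps mentioned above can be sketched as follows. This is only an illustration: the abstract does not say which outlier rule or window length the study used, so the median-absolute-deviation rule, the `threshold` and `window` parameters, and the function names are all assumptions.

```python
from statistics import median


def remove_outliers(values, threshold=3.5):
    """Drop points whose modified z-score exceeds `threshold`.

    Assumption: the modified z-score (based on the median absolute
    deviation, MAD) stands in for whatever rule the study applied.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: nothing can be flagged.
        return list(values)
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]


def moving_average(values, window=3):
    """Simple moving average over consecutive windows of `window` points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical heart-rate trace with one artefact (250 beats per minute).
heart_rate = [72, 74, 73, 250, 75, 74, 76]
cleaned = remove_outliers(heart_rate)
print(cleaned)                      # [72, 74, 73, 75, 74, 76]
print(moving_average(cleaned, 3))   # [73.0, 74.0, 74.0, 75.0]
```

A median-based rule is used here because a single large artefact inflates the mean and standard deviation enough to hide itself; other rules, such as fixed physiological limits, would fit the same slot.
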
Because of the unbalanced class distribution in the data, majority class under-sampling and the Kappa statistic, together with misclassification rate (MR) and area under the ROC curve (AUC), are used to evaluate models generated using different prediction algorithms. The performance of models derived from feature-reduced datasets reveals the filter method, Cfs subset evaluation, to be the most consistently effective, although Consistency-derived subsets tended to give slightly increased accuracy at the cost of markedly increased complexity. The use of MR for model performance evaluation is influenced by class distribution. This influence could be eliminated by considering the AUC or Kappa statistic, as well as by evaluating subsets with an under-sampled majority class. The noise and outlier removal pre-processing methods produced models with MR ranging from 10.69 to 12.62, the lowest value being for data from which both outliers and noise were removed (MR 10.69). For the raw time-series dataset, MR is 12.34. Feature selection reduces MR to between 9.8 and 10.16, with time-segmented summary data (dataset F) giving an MR of 9.8 and raw time-series summary data (dataset A) an MR of 9.92. However, for all datasets based on time-series only, the complexity is high. For most pre-processing methods, Cfs could identify a subset of correlated and non-redundant variables from the time-series-only datasets, but models derived from these subsets are of one leaf only. MR values are consistent with the class distribution in the subset folds evaluated in the n-fold cross-validation method. For models based on Cfs-selected time-series-derived and risk factor (RF) variables, the MR ranges from 8.83 to 10.36, with dataset RF_A (raw time-series data and RF) being 8.85 and dataset RF_F (time-segmented time-series variables and RF) being 9.09.
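
The point that MR tracks class distribution while Kappa does not can be illustrated with a small sketch. The confusion-matrix counts below are invented for illustration and are not the study's data; the function names are hypothetical. A "one leaf" model that always predicts the majority class gets an MR equal to the minority-class share, yet its Kappa is zero.

```python
import random


def misclassification_rate(tp, fp, fn, tn):
    """Fraction of cases classified incorrectly."""
    return (fp + fn) / (tp + fp + fn + tn)


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa: agreement between prediction and truth beyond chance."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Chance agreement from the marginal totals of the confusion matrix.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    if expected == 1:
        return 0.0
    return (observed - expected) / (1 - expected)


def undersample_majority(labels, seed=0):
    """Indices keeping every minority case and an equal-sized random
    sample of majority cases (labels are 0 or 1)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    rng = random.Random(seed)
    return sorted(minority + rng.sample(majority, len(minority)))


# Illustrative counts: 100 diseased and 900 non-diseased cases, with a
# one-leaf model that predicts "no disease" for everyone.
print(misclassification_rate(tp=0, fp=0, fn=100, tn=900))  # 0.1
print(cohen_kappa(tp=0, fp=0, fn=100, tn=900))             # 0.0

labels = [1] * 3 + [0] * 7
balanced = undersample_majority(labels)
print(len(balanced), sum(labels[i] for i in balanced))     # 6 3
```

The 10% error rate in this toy case looks respectable only because 90% of cases belong to one class, which is why a chance-corrected measure or a balanced sample is a useful companion to MR.
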
The models based on counts of outliers and counts of data points outside the normal range (dataset RF_E), and on derived variables from time series transformed using Symbolic Aggregate Approximation (SAX) with associated time-series pattern cluster membership (dataset RF_G), perform least well, with MR of 10.25 and 10.36 respectively. For coronary vascular disease prediction, nearest neighbour (NNge) and the support vector machine based method, SMO, have the highest MR, at 10.1 and 10.28, while logistic regression (LR) and the decision tree (DT) method, J48, have MR of 8.85 and 9.0 respectively. DT rules are the most comprehensible and clinically relevant. The increase in predictive accuracy achieved by adding risk factor variables to models based on time-series variables is significant. The addition of time-series-derived variables to models based on risk factor variables alone is associated with a trend towards improved performance. Data mining of feature-reduced anaesthesia time-series variables together with risk factor variables can produce compact and moderately accurate models able to predict coronary vascular disease. Decision tree analysis of time-series data combined with risk factor variables yields rules which are more accurate than models based on time-series data alone. The limited additional value provided by electrocardiographic variables, compared with the use of risk factors alone, is similar to recent suggestions that exercise electrocardiography (exECG) under standardised conditions has limited additional diagnostic value over risk factor analysis and symptom pattern. The pre-processing used in this study had limited effect when time-series variables and risk factor variables were used together as model input.
In the absence of risk factor input, the use of time-series variables after outlier removal, and of time-series variables based on physiological variable values being outside the accepted normal range, is associated with some improvement in model performance.
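
For readers unfamiliar with the Symbolic Aggregate Approximation (SAX) named in the abstract, the following is a minimal sketch of the general technique: z-normalise the series, average it over equal segments, and map each segment mean to a letter using breakpoints that divide the standard normal distribution into equally probable regions. The segment count, alphabet size and function name are assumptions, and the sketch assumes the series length is a multiple of the segment count; it does not reproduce the study's configuration.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet_size):
    """Convert a numeric series to a SAX word of `n_segments` letters."""
    if len(series) % n_segments != 0:
        raise ValueError("series length must be a multiple of n_segments")

    # 1. Z-normalise (a constant series maps to all zeros).
    mu, sigma = mean(series), pstdev(series)
    z = [0.0 if sigma == 0 else (x - mu) / sigma for x in series]

    # 2. Piecewise aggregate approximation: mean of each equal segment.
    seg = len(series) // n_segments
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(n_segments)]

    # 3. Breakpoints splitting N(0, 1) into equiprobable regions.
    normal = NormalDist()
    breakpoints = [
        normal.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)
    ]

    # 4. Map each segment mean to a letter: 'a' is the lowest region.
    return "".join(chr(ord("a") + bisect_right(breakpoints, v)) for v in paa)


# Hypothetical series that steps from low to high values.
print(sax([1, 1, 2, 2, 8, 8, 9, 9], n_segments=4, alphabet_size=3))  # aacc
```

The resulting words can then be counted or clustered as discrete patterns, which is one way symbolic time-series features could feed a classifier.
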

Bibliographic details

  • Author

    Pitt, Ellen Alexandra

  • Year 2009
  • Original format PDF
  • Language English
