Journal: Studies in Health Technology and Informatics

Tools for Statistical Analysis with Missing Data: Application to a Large Medical Database


Abstract

Missing data is a common feature of large data sets in general and of medical data sets in particular. Depending on the goal of the statistical analysis, various techniques can be used to tackle this problem. Imputation methods substitute plausible or predicted values for the missing values so that the completed data can then be analysed with any chosen data mining procedure. In this work, we study imputation in the context of multivariate data and evaluate a number of methods available in today's standard statistical software packages. Imputation using multivariate classification, multiple imputation and imputation by factorial analysis are compared using simulated data and a large medical database (from the diabetes field) with numerous missing values. Our main result is a control chart for assessing data quality after the imputation process. To this end, we developed an algorithm whose input is a set of parameters describing the underlying data (e.g., covariance matrix, distribution) and whose output is a chart that plots the change in the prediction error with respect to the proportion of missing values. The chart is built by means of an iterative algorithm involving four steps: (1) a sample of simulated data is drawn using the input parameters; (2) missing values are randomly generated; (3) an imputation method is used to fill in the missing data; and (4) the prediction error is computed. Steps 1 to 4 are repeated in order to estimate the distribution of the prediction error. The control chart was established for the three imputation methods studied here, assuming a multivariate normal distribution of the data. The use of this tool on a large medical database was then investigated. We show how the control chart can be used to assess the quality of the imputation process in the pre-processing step upstream of data mining procedures.
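
The four-step loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `simulate_error_curve`, the use of column-mean imputation as a stand-in for the three methods studied, the choice of root-mean-square error as the "prediction error", and the 5th/50th/95th percentile summary are all assumptions made for illustration. The only parts taken from the abstract are the multivariate normal input, the random generation of missing values, and the repetition of steps 1 to 4.

```python
import numpy as np


def simulate_error_curve(mean, cov, proportions, n_samples=200, n_repeats=50, seed=0):
    """Summarise imputation error for each proportion of missing values.

    Hypothetical helper: returns {proportion: [5th, 50th, 95th percentile of RMSE]}.
    Proportions are expected to lie strictly between 0 and 1.
    """
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    chart = {}
    for p in proportions:
        errors = []
        for _ in range(n_repeats):
            # Step 1: draw a simulated sample from the input parameters.
            complete = rng.multivariate_normal(mean, cov, size=n_samples)
            # Step 2: mark cells as missing completely at random with probability p.
            mask = rng.random(complete.shape) < p
            if not mask.any():
                continue  # nothing to impute in this draw
            observed = np.where(mask, np.nan, complete)
            # Step 3: impute. Column means are a placeholder (assumption),
            # not one of the three methods compared in the paper.
            col_means = np.nanmean(observed, axis=0)
            imputed = np.where(mask, col_means, observed)
            # Step 4: prediction error on the imputed cells only (RMSE, an assumption).
            diff = imputed[mask] - complete[mask]
            errors.append(np.sqrt(np.mean(diff ** 2)))
        # Repeating steps 1 to 4 gives an empirical error distribution per proportion.
        chart[p] = np.percentile(errors, [5, 50, 95])
    return chart


if __name__ == "__main__":
    curve = simulate_error_curve(np.zeros(3), np.eye(3), [0.1, 0.3, 0.5])
    for p, (low, median, high) in curve.items():
        print(f"missing={p:.0%}  RMSE 5%={low:.3f}  median={median:.3f}  95%={high:.3f}")
```

Plotting the percentile bands against the proportion of missing values would give a chart of the kind the abstract describes. Swapping the step 3 placeholder for another imputation method would produce one curve per method.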
