Tools for Statistical Analysis with Missing Data: Application to a Large Medical Database

Cristian Preda; Alain Duhamel; Monique Picavet; Tahar Kechadi

Abstract

Missing data is a common feature of large data sets in general and medical data sets in particular. Depending on the goal of the statistical analysis, various techniques can be used to tackle this problem. Imputation methods replace the missing values with plausible or predicted values so that the completed data can then be analysed with any chosen data mining procedure. In this work, we study imputation in the context of multivariate data and evaluate a number of methods that can be used with today's standard statistical software packages. Imputation using multivariate classification, multiple imputation and imputation by factorial analysis are compared using simulated data and a large medical database (from the diabetes field) with numerous missing values. Our main result is a control chart for assessing data quality after the imputation process. To this end, we developed an algorithm whose input is a set of parameters describing the underlying data (e.g., covariance matrix, distribution) and whose output is a chart that plots the change in the prediction error with respect to the proportion of missing values. The chart is built by means of an iterative algorithm involving four steps: (1) a sample of simulated data is drawn using the input parameters; (2) missing values are randomly generated; (3) an imputation method is used to fill in the missing data; and (4) the prediction error is computed. Steps 1 to 4 are repeated in order to estimate the distribution of the prediction error. The control chart was established for the three imputation methods studied here, assuming a multivariate normal distribution of the data. The use of this tool on a large medical database was then investigated. We show how the control chart can be used to assess the quality of the imputation process in the pre-processing step upstream of data mining procedures.
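
The four-step loop described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how values are deleted, how the error is measured, or how the chart is summarised, so the sketch assumes cells are deleted completely at random, uses the root mean squared error over the deleted cells, and reports the 5%, 50% and 95% quantiles of that error. Column-mean imputation stands in as a placeholder and is not one of the three methods compared in the paper. All function names (`control_chart`, `mask_at_random`, and so on) are hypothetical.

```python
import math
import random
import statistics


def cholesky(cov):
    """Lower-triangular factor of a positive definite covariance matrix."""
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(cov[i][i] - s)
            else:
                low[i][j] = (cov[i][j] - s) / low[j][j]
    return low


def draw_sample(mean, chol, n_rows, rng):
    """Step 1: draw n_rows observations from a multivariate normal."""
    dim = len(mean)
    rows = []
    for _ in range(n_rows):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        rows.append([mean[i] + sum(chol[i][k] * z[k] for k in range(i + 1))
                     for i in range(dim)])
    return rows


def mask_at_random(rows, proportion, rng):
    """Step 2: pick cells to delete, completely at random (assumption)."""
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    return set(rng.sample(cells, round(proportion * len(cells))))


def impute_column_mean(rows, missing):
    """Step 3: placeholder imputation using the observed column mean."""
    completed = [row[:] for row in rows]
    for j in range(len(rows[0])):
        observed = [rows[i][j] for i in range(len(rows)) if (i, j) not in missing]
        # Fallback of 0.0 for a fully deleted column is an arbitrary choice.
        fill = statistics.fmean(observed) if observed else 0.0
        for i in range(len(rows)):
            if (i, j) in missing:
                completed[i][j] = fill
    return completed


def prediction_error(rows, completed, missing):
    """Step 4: RMSE between true and imputed values on the deleted cells."""
    if not missing:
        return 0.0
    return math.sqrt(statistics.fmean(
        (rows[i][j] - completed[i][j]) ** 2 for (i, j) in missing))


def control_chart(mean, cov, proportions, n_rows=200, n_repeats=50, seed=0):
    """Repeat steps 1 to 4 and summarise the error for each proportion."""
    rng = random.Random(seed)
    chol = cholesky(cov)
    chart = {}
    for p in proportions:
        errors = []
        for _ in range(n_repeats):
            rows = draw_sample(mean, chol, n_rows, rng)
            missing = mask_at_random(rows, p, rng)
            completed = impute_column_mean(rows, missing)
            errors.append(prediction_error(rows, completed, missing))
        cuts = statistics.quantiles(errors, n=20)  # cut points at 5%, 10%, ..., 95%
        chart[p] = (cuts[0], cuts[9], cuts[18])    # 5%, 50%, 95% quantiles
    return chart


if __name__ == "__main__":
    chart = control_chart([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], [0.1, 0.3, 0.5])
    for p, (low, median, high) in chart.items():
        print(f"missing={p:.0%}  error 5%={low:.3f}  50%={median:.3f}  95%={high:.3f}")
```

Swapping `impute_column_mean` for another imputation function with the same signature would give one curve per method, which is roughly how a chart comparing several methods could be assembled.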

Bibliographic record

Source: Studies in Health Technology and Informatics, 2005, issue 2005, pp. 181-186 (6 pages)
Authors: Cristian Preda; Alain Duhamel; Monique Picavet; Tahar Kechadi
Affiliation: Faculte de Medecine, France
Original format: PDF
Language: English
Chinese Library Classification: computing technology, computer technology
Keywords: statistical models; databases; data mining; missing values; imputation
