Conference: Annual International Conference of the IEEE Engineering in Medicine and Biology Society

Data Quality Improvement of a Multicenter Clinical Trial Dataset

Abstract

Medical datasets often suffer from problems such as missing values, inconsistencies, and redundancies, which can compromise the data mining process and the extraction of useful knowledge. A preprocessing phase is therefore needed to improve the overall quality of the data and, consequently, of the information that can be discovered from them. In this study we applied five data preprocessing steps to improve the quality of a large dataset derived from a multicenter clinical trial. The dataset included 298 patients enrolled in a prospective, multicenter clinical trial and was characterized by 22 input variables and one class variable (the MIPI value). First, data from the different medical centers were integrated to obtain a homogeneous dataset. This dataset was then normalized to scale all variables into smaller, comparable intervals. Next, all missing values were estimated in an imputation step. Finally, the complete dataset was discretized and reduced to remove redundant variables and decrease the amount of data to be managed. The improvement in data quality after each step was evaluated through the patients' classification accuracy obtained with a KNN classifier. The proposed pipeline increased classification performance by more than 20%. The largest gain in accuracy was obtained after missing value imputation, whereas the discretization and feature selection steps significantly reduced the number of variables to be managed without any loss of the information contained in the data.
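The preprocessing steps described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not name the specific methods, so min-max scaling, column-mean imputation, equal-width binning, duplicate-column removal, and leave-one-out evaluation are all stand-in assumptions. The data are synthetic, every function name is hypothetical, and the multicenter integration step is omitted.

```python
import math
import random
from collections import Counter

def min_max_normalize(rows):
    """Scale each column to [0, 1], leaving missing values (None) untouched."""
    out = [list(r) for r in rows]
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        lo, hi = min(observed), max(observed)
        span = (hi - lo) or 1.0  # avoid division by zero on constant columns
        for r in out:
            if r[j] is not None:
                r[j] = (r[j] - lo) / span
    return out

def impute_mean(rows):
    """Replace each None with the mean of the observed values in its column."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def discretize(rows, bins=4):
    """Equal-width binning of values that are already scaled to [0, 1]."""
    return [[min(int(v * bins), bins - 1) for v in r] for r in rows]

def drop_duplicate_columns(rows):
    """Crude redundancy filter: keep only the first of any identical columns."""
    cols = list(zip(*rows))
    keep = [j for j, c in enumerate(cols) if c not in cols[:j]]
    return [[r[j] for j in keep] for r in rows], keep

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(rows, labels, k=3):
    """Leave-one-out classification accuracy of the KNN classifier."""
    hits = 0
    for i, (x, y) in enumerate(zip(rows, labels)):
        train_x = rows[:i] + rows[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += knn_predict(train_x, train_y, x, k) == y
    return hits / len(rows)

# Synthetic stand-in data (NOT the trial data): two informative variables,
# a third column that duplicates the second, and about 10% missing values in the first.
random.seed(0)
raw, labels = [], []
for i in range(60):
    label = i % 2
    x0 = random.gauss(55 + 15 * label, 5)
    x1 = random.gauss(200 + 100 * label, 30)
    if random.random() < 0.1:
        x0 = None
    raw.append([x0, x1, x1])
    labels.append(label)

normalized = min_max_normalize(raw)
imputed = impute_mean(normalized)
binned = discretize(imputed, bins=4)
reduced, keep = drop_duplicate_columns(binned)
accuracy = loo_accuracy(reduced, labels, k=3)
print(f"kept columns: {keep}, leave-one-out KNN accuracy: {accuracy:.2f}")
```

Calling loo_accuracy on the output of each intermediate step (once no missing values remain) would mirror the step-by-step evaluation the abstract describes. Real data would call for the imputation and feature selection methods actually used in the study, which the abstract does not specify.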
