Conference: Medical Technologies National Congress

A Missing Data Imputation Approach Using Clustering and Maximum Likelihood Estimation

Abstract

Missing data is a data mining problem that adversely affects data analysis and decision-making, and it is frequently encountered in healthcare data for a variety of reasons. It remains an important research topic because the success of an imputation method depends on many factors, such as the characteristics of the data and the type of missingness. In this study, an approach to the missing data problem based on clustering and maximum likelihood estimation (MLE) is proposed. To test the proposed method, the "Mesothelioma" data set, prepared by the Dicle University Medical School and uploaded to the UCI international open-source database, was used. In the first step, new data sets were created that follow the missing data patterns missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In the second step, these new data sets were divided into clusters by a k-means clustering process that excludes the 3 features with missing data, in order to improve the performance of the MLE method. In the last step, the features with missing values were added back to the clusters, the missing data were completed with the MLE method within each cluster, and the clusters were merged to obtain the complete data set. The data sets produced by these three steps (data reduction, clustering, and data completion) were compared with the original data set according to the root mean square error (RMSE) criterion, and an average success of 96.5% was achieved.
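
The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the MLE model, so the sketch assumes each feature is univariate Gaussian within a cluster, in which case the MLE of the mean is simply the mean of the observed values. The synthetic data, the single masked feature (the paper masks 3), the 20% MCAR rate, k = 2, and all function names are assumptions made for illustration only.

```python
import math
import random

def make_mcar(rows, cols, rate, rng):
    """Copy rows and blank out entries in `cols` with probability `rate` (MCAR)."""
    out = [list(r) for r in rows]
    for r in out:
        for c in cols:
            if rng.random() < rate:
                r[c] = None
    return out

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(v) / len(members) for v in zip(*members)]
    return labels

def impute_by_cluster(rows, labels, cols):
    """Fill None entries with the per-cluster Gaussian MLE of the mean.

    Assumption: a univariate Gaussian per feature and cluster, so the
    MLE of the mean is the mean of the observed values in that cluster.
    """
    out = [list(r) for r in rows]
    for c in cols:
        overall = [r[c] for r in rows if r[c] is not None]
        for j in set(labels):
            obs = [r[c] for r, l in zip(rows, labels)
                   if l == j and r[c] is not None]
            # Fall back to the overall mean if the cluster has no observed value.
            mu = sum(obs) / len(obs) if obs else sum(overall) / len(overall)
            for r, l in zip(out, labels):
                if l == j and r[c] is None:
                    r[c] = mu
    return out

def rmse(original, completed, masked, cols):
    """RMSE between original and imputed values, over the masked entries only."""
    errs = [(o[c] - f[c]) ** 2
            for o, f, m in zip(original, completed, masked)
            for c in cols if m[c] is None]
    return math.sqrt(sum(errs) / len(errs)) if errs else 0.0

rng = random.Random(0)
# Synthetic stand-in for the real data: two groups of 30 rows, 3 features each.
original = [[rng.gauss(m, 1.0) for _ in range(3)]
            for m in (0.0, 10.0) for _ in range(30)]
miss_cols = [2]                                      # feature that receives missing values
masked = make_mcar(original, miss_cols, 0.2, rng)    # step 1: create missing data
complete_part = [[r[0], r[1]] for r in masked]       # drop the feature with gaps
labels = kmeans(complete_part, 2, rng)               # step 2: cluster on complete features
completed = impute_by_cluster(masked, labels, miss_cols)  # step 3: impute and merge
error = rmse(original, completed, masked, miss_cols)
print(f"RMSE on imputed entries: {error:.3f}")
```

Clustering on the complete features first means each imputed value is estimated from rows that resemble the incomplete row, rather than from the whole data set. MAR and MNAR variants would replace `make_mcar` with a mask that depends on observed or unobserved values, respectively.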
