Foreign-language journal: Data Technologies and Applications
A systematic review of machine learning-based missing value imputation techniques

Abstract

Purpose: The primary aim of this study is to review studies along several dimensions, including the type of method, the experimental setup and the evaluation metrics used in novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This provides an understanding of how well the proposed frameworks are evaluated and what types and ratios of missingness the proposals address. The review questions are: (1) What ML-based imputation methods were studied and proposed during 2010-2020? (2) How are the experimental setup, the characteristics of the data sets and the missingness employed in these studies? (3) What metrics were used to evaluate the imputation methods?

Design/methodology/approach: The review followed the standard identification, screening and selection process. The initial search of electronic databases for missing value imputation (MVI) based on ML algorithms returned 2,883 papers. Most of the papers at this stage did not describe an MVI technique relevant to this study. The papers were first screened by title for relevance, and 306 were identified as appropriate. After the abstracts were reviewed, 151 papers that were not eligible for this study were dropped, leaving 155 research papers suitable for full-text review. Of these, 117 papers were used to assess the review questions.

Findings: This study shows that clustering- and instance-based algorithms are the most frequently proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of the studies sourced their data sets from publicly available data set repositories. A common approach is to take the complete data set as the baseline and evaluate the effectiveness of imputation on test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experiments, while the missing data type and missingness mechanism addressed depended on the capability of the imputation method. Computational expense is a concern, and experimentation with large data sets appears to be a challenge.

Originality/value: The review indicates that there is no single universal solution to the missing data problem. Variants of ML approaches handle missingness well depending on the characteristics of the data set. Most of the methods reviewed lack generality in their applicability. Another concern related to applicability is the complexity of formulating and implementing the algorithm. Imputation methods based on k-nearest neighbors (kNN) and clustering algorithms are simple and easy to implement, which makes them popular across various domains.
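The evaluation protocol described in the Findings (a complete data set as baseline, artificially induced missingness, imputation, then RMSE on the masked cells) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from any of the reviewed papers: the synthetic data, the 10% missing-completely-at-random ratio, `k=3`, the rescaled Euclidean distance, and the helper names `induce_mcar`, `knn_impute` and `rmse`.

```python
import math
import random


def induce_mcar(rows, ratio, rng):
    """Copy `rows` and set about `ratio` of all cells to None, chosen uniformly."""
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    masked = set(rng.sample(cells, int(round(ratio * len(cells)))))
    holed = [[None if (i, j) in masked else v for j, v in enumerate(row)]
             for i, row in enumerate(rows)]
    return holed, sorted(masked)


def distance(a, b):
    """Euclidean distance over features observed in both rows, rescaled by the
    share of usable features; infinite if the rows share no observed feature."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(rows, k=3):
    """Fill each missing cell with the mean of that feature over the k nearest
    rows that observe it; fall back to the column mean if no donor exists."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = sorted((d for d in donors if d[0] < math.inf),
                            key=lambda d: d[0])[:k]
            if donors:
                imputed[i][j] = sum(v for _, v in donors) / len(donors)
            else:
                column = [r[j] for r in rows if r[j] is not None]
                imputed[i][j] = sum(column) / len(column) if column else 0.0
    return imputed


def rmse(truth, imputed, masked):
    """Root mean square error over the artificially masked cells only."""
    errors = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in masked]
    return math.sqrt(sum(errors) / len(errors))


# Synthetic complete data set (assumption): 50 rows, 4 partly correlated features.
rng = random.Random(0)
complete = []
for _ in range(50):
    base = rng.uniform(0, 10)
    complete.append([base,
                     2 * base + rng.gauss(0, 0.5),
                     rng.uniform(0, 1),
                     base - rng.gauss(0, 0.5)])

holed, masked = induce_mcar(complete, ratio=0.1, rng=rng)
imputed = knn_impute(holed, k=3)
score = rmse(complete, imputed, masked)
print(f"masked cells: {len(masked)}, RMSE on masked cells: {score:.3f}")
```

RMSE suits numeric features like these; for categorical features, PCP would instead count the fraction of masked cells whose imputed category matches the original. Repeating the run over several missingness ratios gives the kind of comparison the reviewed studies report.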