
AN EMPIRICAL COMPARISON OF TECHNIQUES FOR HANDLING INCOMPLETE DATA USING DECISION TREES


Abstract

Growing awareness of how incomplete data affects learning and classification accuracy has led to an increasing number of missing data techniques. This article investigates the robustness and accuracy of seven popular techniques for tolerating incomplete training and test data, under different proportions, patterns, and mechanisms of missing data, in the resulting tree-based models. The seven missing data techniques were compared by artificially simulating different proportions, patterns, and mechanisms of missing data using 21 complete datasets (i.e., with no missing values) obtained from the University of California, Irvine repository of machine-learning databases (Blake and Merz 1998). A four-way repeated measures design was employed to analyze the data. The simulation results suggest important differences. All methods have their strengths and weaknesses. However, listwise deletion is substantially inferior to the other six techniques, while multiple imputation, which utilizes the expectation maximization algorithm, represents a superior approach to handling incomplete data. Decision tree single imputation and surrogate variables splitting are more severely impacted by missing values distributed among all attributes than by missing values on a single attribute. Otherwise, the imputation- versus model-based imputation procedures gave reasonably good results, although some discrepancies remained. Different techniques for addressing missing values when using decision trees can give substantially diverse results, and must be carefully considered to protect against biases and spurious findings. Multiple imputation should always be used, especially if the data contain many missing values. If few values are missing, any of the missing data techniques might be considered. The choice of technique should be guided by the proportion, pattern, and mechanisms of missing data, especially the latter two. However, the use of older techniques like listwise deletion and mean or mode single imputation is no longer justifiable given the accessibility and ease of use of more advanced techniques, such as multiple imputation and supervised learning imputation.
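The simulation setup described in the abstract (deleting values from complete datasets, then applying a missing data technique) can be illustrated with a minimal Python sketch. It is not the article's procedure: the toy dataset, the 15% proportion, the function names `inject_mcar`, `listwise_deletion`, and `mean_imputation`, and the choice of a missing-completely-at-random (MCAR) mechanism are all assumptions made for illustration. It contrasts the two patterns the abstract mentions, missing values on a single attribute versus missing values spread among all attributes, and applies two of the older techniques named there.

```python
import random
from statistics import mean


def inject_mcar(rows, proportion, columns=None, seed=0):
    """Return a copy of `rows` with `proportion` of the cells in `columns`
    replaced by None, chosen uniformly at random (an MCAR assumption)."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    columns = list(range(n_cols)) if columns is None else list(columns)
    cells = [(i, j) for i in range(len(rows)) for j in columns]
    masked = set(rng.sample(cells, round(proportion * len(cells))))
    return [
        [None if (i, j) in masked else value for j, value in enumerate(row)]
        for i, row in enumerate(rows)
    ]


def listwise_deletion(rows):
    """Keep only the rows that have no missing value."""
    return [row for row in rows if all(value is not None for value in row)]


def mean_imputation(rows):
    """Replace each missing value with the mean of the observed values
    in the same column (single imputation for numeric attributes)."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed) if observed else None)
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical complete dataset: 200 instances, 4 numeric attributes.
data_rng = random.Random(42)
complete = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]

# Same proportion of missing cells, two different patterns.
single = inject_mcar(complete, 0.15, columns=[0])  # one attribute only
spread = inject_mcar(complete, 0.15)               # all attributes

print("rows kept, single attribute:", len(listwise_deletion(single)))
print("rows kept, all attributes:  ", len(listwise_deletion(spread)))
print("rows after mean imputation: ", len(mean_imputation(spread)))
```

With missing values confined to one attribute, listwise deletion drops exactly 30 of the 200 rows here, because each missing cell sits in a different row. When the same proportion is spread over all four attributes, four times as many cells are missing and they can fall in many more rows, so deletion discards more of the data, while mean imputation keeps every row at the cost of replacing real variation with column averages.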

Bibliographic details

  • Source
    Applied Artificial Intelligence | 2009, Issue 5 | pp. 373-405 | 33 pages
  • Author

    Bhekisipho Twala

  • Affiliation

    Modelling and Digital Intelligence, CSIR, P.O. Box 395, Pretoria 0001, South Africa

  • Original format: PDF
  • Language of text: English