Journal: Journal of Animal Science

Handling of missing data to improve the mining of large feed databases.

Abstract

Feed databases often have missing data. Despite the potentially major effect of missing data on data analysis (e.g., as a source of biased results and loss of statistical power), database managers and nutrition researchers have paid little attention to the problem. This study evaluated various methods of handling missing data using mining outputs from a database containing data on chemical composition and nutritive value for 18,864 alfalfa samples. A complete reference dataset was obtained comprising the 2,303 cases with no missing data for the attributes crude protein (CP), crude fiber (CF), neutral detergent fiber (NDF), acid detergent fiber (ADF), and acid detergent lignin (ADL). This dataset was used to simulate 2 types of missing data (at random and not at random), each with 2 loss intensities (33 and 66%), thus yielding a total of 4 incomplete datasets. Missing data from these datasets were handled using 2 deletion methods and 4 imputation methods, and outputs in terms of the identification and typing of alfalfa (using ANOVA and descriptive statistics) and of correlations between attributes (using regressions) were compared with outputs from the complete dataset. Imputation methods, particularly model-based versions, were found to perform better than deletion methods in terms of maximizing information use and minimizing bias, although the extent of differences between methods depended on the type of missing data. The best approximation to the uncertainty value was provided by multiple imputation methods. It was concluded that the choice of the most suitable method for handling missing data depended both on the type of missing data and on the purpose of data analysis.
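
The evaluation design described in the abstract (start from complete cases, remove values under a known mechanism, then compare each handling method against the complete data) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the data are synthetic rather than the alfalfa records, the two attributes x and y are hypothetical, "at random" is modeled as uniform random loss, "not at random" as loss of the largest values, and listwise deletion and single regression imputation merely stand in for the 2 deletion and 4 imputation methods, which the abstract does not name.

```python
import random
import statistics


def make_complete(n=2303, seed=0):
    """Synthetic stand-in for a complete dataset: two correlated attributes."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(30.0, 5.0)            # hypothetical attribute, never lost
        y = 0.8 * x + rng.gauss(0.0, 2.0)   # hypothetical attribute that loses values
        rows.append((x, y))
    return rows


def mask(rows, lost):
    """Replace y with None for every row index in `lost`."""
    return [(x, None if i in lost else y) for i, (x, y) in enumerate(rows)]


def drop_at_random(rows, fraction, seed=1):
    """Lose a uniformly random subset of y values (assumed 'at random' mechanism)."""
    rng = random.Random(seed)
    k = int(fraction * len(rows))
    return mask(rows, set(rng.sample(range(len(rows)), k)))


def drop_not_at_random(rows, fraction):
    """Lose the largest y values, so loss depends on the missing value itself."""
    k = int(fraction * len(rows))
    order = sorted(range(len(rows)), key=lambda i: rows[i][1], reverse=True)
    return mask(rows, set(order[:k]))


def mean_after_deletion(rows):
    """Deletion: estimate the mean of y from complete rows only."""
    return statistics.fmean([y for _, y in rows if y is not None])


def mean_after_regression_imputation(rows):
    """Imputation: fill each missing y with a least-squares prediction from x."""
    observed = [(x, y) for x, y in rows if y is not None]
    mx = statistics.fmean([x for x, _ in observed])
    my = statistics.fmean([y for _, y in observed])
    sxy = sum((x - mx) * (y - my) for x, y in observed)
    sxx = sum((x - mx) ** 2 for x, _ in observed)
    slope = sxy / sxx
    intercept = my - slope * mx
    filled = [y if y is not None else intercept + slope * x for x, y in rows]
    return statistics.fmean(filled)


complete = make_complete()
true_mean = statistics.fmean([y for _, y in complete])
scenarios = {
    "at random, 33% lost": drop_at_random(complete, 0.33),
    "not at random, 33% lost": drop_not_at_random(complete, 0.33),
}
for name, incomplete in scenarios.items():
    bias_del = mean_after_deletion(incomplete) - true_mean
    bias_imp = mean_after_regression_imputation(incomplete) - true_mean
    print(f"{name}: deletion bias {bias_del:+.2f}, imputation bias {bias_imp:+.2f}")
```

In this synthetic setup, deletion should stay close to the complete-data mean when loss is uniform, and should underestimate it when the largest values are lost. Regression imputation is expected to recover part of that gap because the retained attribute carries information about the lost one. A single regression fill also understates variability, which is the gap multiple imputation is designed to address.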