Journal: Journal of Statistical Software

Developer-Friendly and Computationally Efficient Predictive Modeling without Information Leakage: The emil Package for R



Abstract

Data-driven machine learning for predictive modeling problems (classification, regression, or survival analysis) typically involves a number of steps, beginning with data preprocessing and ending with performance evaluation. A large number of packages providing tools for the individual steps are available for R, but there is a lack of tools for rigorously evaluating, by means of cross-validation, bootstrap, or similar methods, the performance of the complete procedures assembled from them. Such a tool should strictly prevent test set observations from influencing model training and meta-parameter tuning (so-called information leakage), so as not to produce overly optimistic performance estimates. Here we present a new package for R, emil (evaluation of modeling without information leakage), that offers this form of performance evaluation. It provides a transparent and highly customizable framework for facilitating the assembly, execution, performance evaluation, and interpretation of complete procedures for classification, regression, and survival analysis. The components of package emil have been designed to be as modular and general as possible, so that users can combine, replace, and extend them if needed. Package emil was also developed with scalability in mind and has a small computational overhead, which is a key requirement for analyzing the very large data sets now available in fields like medicine, physics, and finance. First, the functionality and usage of package emil are explained. Then, three specific application examples are presented to show its potential in terms of parallelization, customization for survival analysis, and development of ensemble models. Finally, a brief comparison to similar software is provided.
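The notion of information leakage described in the abstract can be illustrated with a minimal sketch. The code below is not the emil API and is not taken from the article; it is a plain Python illustration (standard library only) in which every function name is hypothetical. It assumes a toy one-dimensional, two-class data set and a simple threshold classifier. The point it tries to show is that, in each cross-validation fold, both the pre-processing parameters and the model are estimated from the training observations alone, while the test observations are only transformed and predicted on.

```python
from statistics import mean, pstdev


def k_fold_indices(n, k):
    """Split range(n) into k disjoint test folds (interleaved, deterministic)."""
    return [list(range(i, n, k)) for i in range(k)]


def fit_scaler(values):
    """Estimate centring and scaling parameters from the given values only."""
    mu = mean(values)
    sd = pstdev(values) or 1.0  # guard against zero spread
    return mu, sd


def standardize(values, mu, sd):
    """Apply previously estimated parameters; nothing is re-estimated here."""
    return [(v - mu) / sd for v in values]


def fit_threshold_classifier(xs, ys):
    """Toy model: a cut-off halfway between the two class means."""
    m0 = mean(x for x, y in zip(xs, ys) if y == 0)
    m1 = mean(x for x, y in zip(xs, ys) if y == 1)
    return (m0 + m1) / 2, m1 >= m0


def predict(model, xs):
    cut, one_is_high = model
    return [int((x > cut) == one_is_high) for x in xs]


def cross_validate(x, y, k=3):
    """Return the per-fold error rate of the complete procedure."""
    errors = []
    for test_idx in k_fold_indices(len(x), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]

        # Pre-processing parameters are estimated on the training fold only.
        mu, sd = fit_scaler([x[i] for i in train_idx])
        x_train = standardize([x[i] for i in train_idx], mu, sd)
        x_test = standardize([x[i] for i in test_idx], mu, sd)

        # The model is also fitted on the training fold only.
        model = fit_threshold_classifier(x_train, [y[i] for i in train_idx])

        # Test observations are used solely for prediction and scoring.
        pred = predict(model, x_test)
        truth = [y[i] for i in test_idx]
        errors.append(mean(int(p != t) for p, t in zip(pred, truth)))
    return errors


# Made-up illustrative data: one feature, two well-separated classes.
x = [0.1, 0.4, 0.3, 2.1, 2.5, 1.9, 0.2, 2.2, 0.5]
y = [0, 0, 0, 1, 1, 1, 0, 1, 0]
fold_errors = cross_validate(x, y, k=3)
print(fold_errors)
```

A leaky variant would call fit_scaler on all of x before the loop, which lets the test observations influence the transformation applied during training. Keeping every estimation step inside the loop is the property the abstract asks of an evaluation tool.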
