European Conference on Genetic Programming

Balancing Learning and Overfitting in Genetic Programming with Interleaved Sampling of Training Data



Abstract

Generalization is the ability of a model to perform well on cases not seen during the training phase. In Genetic Programming, generalization has recently been recognized as an important open issue, and increased efforts are being made towards evolving models that do not overfit. In this work we expand on recent developments showing that using a small and frequently changing subset of the training data is effective in reducing overfitting and improving generalization. In particular, we build upon the idea of randomly choosing a single training instance at each generation and balance it with periodically using all training data. The motivation for this approach is to keep overfitting low (by using a single training instance) while still presenting enough information for a general pattern to be found (by using all training data). We propose two approaches, called interleaved sampling and random interleaved sampling, that perform this balancing in a deterministic or a probabilistic way, respectively. Experiments are conducted on three high-dimensional real-life datasets from the pharmacokinetics domain. Results show that most variants of the proposed approaches consistently improve generalization and reduce overfitting when compared to standard Genetic Programming. The best variants are even capable of such improvements on a dataset where a recent and representative state-of-the-art method could not achieve them. Furthermore, the resulting models are short and hence easier to interpret, an important achievement from the applications' point of view.

