Computational Intelligence

Reducing correlation of random forest-based learning-to-rank algorithms using subsample size


Abstract

Learning-to-rank (LtR) has become an integral part of modern ranking systems. In this field, random forest-based rank-learning algorithms are shown to be among the top performers. Traditionally, each tree of a random forest is learnt using a bootstrapped copy of the training set, in which approximately 63% of the examples are unique. The goal of using a bootstrapped copy instead of the original training set is to reduce the correlation between individual trees, thereby making the prediction of the ensemble more accurate. In this regard, the following question may be raised: how can we leverage the correlation between the trees in favor of the performance and scalability of a random forest-based LtR algorithm? In this article, we investigate whether we can further decrease the correlation between the trees while maintaining or possibly improving accuracy. Among several potential options for achieving this goal, we investigate the size of the subsamples used for learning individual trees. We examine the performance of a random forest-based LtR algorithm as we control the correlation using this parameter. Experiments on LtR data sets reveal that for small- to moderate-sized data sets, a substantial reduction in training time can be achieved using only a small amount of training data per tree. Moreover, due to the positive relationship between the variability across the trees and the performance of a random forest, we observe an increase in accuracy while maintaining the same level of model stability as the baseline. For big data sets, although our experiments did not show an increase in accuracy (because, with larger data sets, the variance of individual trees is already comparatively smaller), our technique is still applicable as it allows for greater scalability.
