International Conference on Algorithms and Architectures for Parallel Processing

Parallel Training GBRT Based on KMeans Histogram Approximation for Big Data


Abstract

Gradient Boosting Regression Tree (GBRT), one of the state-of-the-art ranking algorithms widely used in industry, faces challenges in the big data era. As dataset sizes grow rapidly, the iterative training process of GBRT becomes very time-consuming on large-scale data. In this paper, we aim to speed up the training of each tree in the GBRT framework. First, we propose a novel KMeans histogram building algorithm that has lower time complexity and is more efficient than the state-of-the-art histogram building method. Further, we put forward an approximation algorithm that combines kernel density estimation with the histogram technique to improve accuracy. We conduct a variety of experiments on both public Learning To Rank (LTR) benchmark datasets and large-scale real-world datasets from the Baidu search engine. The experimental results show that our proposed parallel training algorithm outperforms the state-of-the-art parallel GBRT algorithm, with a nearly 2x speedup and better accuracy. Our algorithm also achieves near-linear scalability.
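To make the abstract's core idea concrete, the sketch below illustrates histogram-based split finding with data-adaptive (KMeans-style) bins: feature values are clustered into k bins with 1-D Lloyd iterations, per-bin gradient/hessian sums are accumulated, and the bins are scanned for the best split. This is a hypothetical illustration of the general technique, not the paper's algorithm; all function names and parameters (`kmeans_bins`, `histogram_split_gain`, `lam`) are assumptions for this sketch.

```python
import numpy as np

def kmeans_bins(values, k=8, iters=10, seed=0):
    """Cluster 1-D feature values into k bins with Lloyd's algorithm.

    Unlike equal-width histograms, the bin centers adapt to the data
    distribution, so fewer bins can cover skewed features.
    """
    rng = np.random.default_rng(seed)
    centers = rng.choice(values, size=k, replace=False).astype(float)
    for _ in range(iters):
        # Assign each value to its nearest center, then recenter.
        assign = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = values[mask].mean()
    centers.sort()  # ordered centers give ordered bin ids
    assign = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
    return centers, assign

def histogram_split_gain(values, gradients, hessians, k=8, lam=1.0):
    """Accumulate per-bin gradient/hessian sums and scan for the best
    split point, as in histogram-based GBRT tree training."""
    centers, bins = kmeans_bins(values, k)
    g, h = np.zeros(k), np.zeros(k)
    np.add.at(g, bins, gradients)
    np.add.at(h, bins, hessians)
    G, H = g.sum(), h.sum()
    best_gain, best_bin = 0.0, None
    gl = hl = 0.0
    for b in range(k - 1):  # candidate split between bin b and b + 1
        gl += g[b]; hl += h[b]
        gr, hr = G - gl, H - hl
        gain = gl**2 / (hl + lam) + gr**2 / (hr + lam) - G**2 / (H + lam)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin, centers
```

Because split search runs over k bins instead of all distinct feature values, the per-bin sums are cheap to merge across workers, which is what makes histogram approaches attractive for parallel GBRT training.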
