Journal: IEEE Transactions on Parallel and Distributed Systems

A GPU-Architecture Optimized Hierarchical Decomposition Algorithm for Support Vector Machine Training


Abstract

In the last decade, several GPU implementations of Support Vector Machine (SVM) training with nonlinear kernels have been published, some of them even with source code. The most effective ones are based on Sequential Minimal Optimization (SMO). They decompose the constrained quadratic problem into a series of smallest possible subproblems, which are then solved analytically. For large datasets, the majority of the elapsed time is spent on a large number of matrix-vector multiplications, which cannot be computed efficiently on current GPUs because of limited memory bandwidth. In this paper, we introduce a novel GPU approach to SVM training that we call Optimized Hierarchical Decomposition SVM (OHD-SVM). It uses a hierarchical decomposition iterative algorithm that fits the actual GPU architecture better. The low decomposition level uses a single GPU multiprocessor to efficiently solve a local subproblem. Nowadays, a single GPU multiprocessor can run a thousand or more threads that are able to synchronize quickly, which makes it an ideal platform for a single-kernel SMO-based local solver with fast local iterations. The high decomposition level updates the gradients of the entire training set and selects a new local working set. The gradient update requires many kernel values that are costly to compute. However, solving a large local subproblem allows the kernel values to be computed via a matrix-matrix multiplication, which is much more efficient than the matrix-vector multiplication used in previously published implementations. Along with a description of our implementation, the paper includes an exact comparison of five publicly available C++ GPU implementations of SVM training. As is usual in most recent papers, we consider the binary classification task and the RBF kernel function. According to the results measured on a wide set of publicly available datasets, our proposed approach significantly outperformed the other methods on all datasets. The biggest difference was on the largest dataset, where we achieved a speed-up of up to 12 times over the fastest previously published GPU implementation. Moreover, our OHD-SVM is the only one that can handle dense as well as sparse datasets. Along with this paper, we have published the source code at https://github.com/OrcusCZ/OHD-SVM.
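To illustrate the matrix-matrix formulation mentioned in the abstract, below is a minimal CUDA/cuBLAS sketch of how a block of RBF kernel values K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) for a local working set against the whole training set could be obtained from a single GEMM followed by an element-wise finish, using ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i.x_j. All names (computeKernelBlock, rbfFinish, workingSet, gamma) are illustrative assumptions for the dense case and are not taken from the OHD-SVM sources.

    // Sketch: one GEMM yields all dot products between the working-set rows
    // and the full training set; a light element-wise kernel then turns them
    // into RBF kernel values. This is the matrix-matrix alternative to
    // issuing one matrix-vector product per selected sample.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // After the GEMM, K[i * n + j] holds dot(xws_i, x_j); replace it with
    // exp(-gamma * (||xws_i||^2 + ||x_j||^2 - 2 * dot)).
    __global__ void rbfFinish(float *K, const float *sqNorms, const int *workingSet,
                              int wsSize, int n, float gamma)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // training-sample index
        int i = blockIdx.y;                              // working-set row
        if (i < wsSize && j < n) {
            float d2 = sqNorms[workingSet[i]] + sqNorms[j] - 2.0f * K[i * n + j];
            K[i * n + j] = expf(-gamma * d2);
        }
    }

    // X:       n x d training matrix, dense, row-major, already on the GPU
    // Xws:     wsSize x d rows gathered for the current working set
    // sqNorms: precomputed squared norms ||x_j||^2 of all n samples
    // K:       output block of kernel values, wsSize x n, row-major
    void computeKernelBlock(cublasHandle_t handle,
                            const float *X, const float *Xws,
                            const float *sqNorms, const int *workingSet,
                            float *K, int n, int d, int wsSize, float gamma)
    {
        const float one = 1.0f, zero = 0.0f;
        // cuBLAS is column-major, so we request K^T (n x wsSize), which is
        // exactly the row-major wsSize x n block K = Xws * X^T.
        cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    n, wsSize, d,
                    &one, X, d, Xws, d,
                    &zero, K, n);
        dim3 block(256), grid((n + 255) / 256, wsSize);
        rbfFinish<<<grid, block>>>(K, sqNorms, workingSet, wsSize, n, gamma);
    }

Because the GEMM reuses each loaded element of X across all working-set rows, it is far less bandwidth-bound than repeated matrix-vector products, which is why a larger local subproblem maps better to the GPU.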
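Likewise, a minimal sketch of the kind of gradient update the high decomposition level performs once the local subproblem is solved: every gradient entry is corrected by the change of the working-set dual variables, weighted by the labels and the corresponding kernel values. The formulation and names (updateGradient, deltaAlpha, yWs) are assumptions for illustration, not the actual OHD-SVM code.

    // grad:       gradient of the dual objective for all n training samples
    // K:          wsSize x n block of kernel values from the sketch above
    // deltaAlpha: change of the working-set dual variables after the local solve
    // yWs:        labels (+1 / -1) of the working-set samples
    __global__ void updateGradient(float *grad, const float *K,
                                   const float *deltaAlpha, const float *yWs,
                                   int wsSize, int n)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // training-sample index
        if (j < n) {
            float acc = 0.0f;
            for (int i = 0; i < wsSize; ++i)            // sum over the working set
                acc += yWs[i] * deltaAlpha[i] * K[i * n + j];
            grad[j] += acc;
        }
    }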

