IEEE Transactions on Parallel and Distributed Systems

cuTensor-Tubal: Efficient Primitives for Tubal-Rank Tensor Learning Operations on GPUs


Abstract

Tensors are the cornerstone data structures in high-performance computing, big data analysis, and machine learning. However, tensor computations are compute-intensive and their running time increases rapidly with the tensor size. Therefore, designing high-performance primitives on parallel architectures such as GPUs is critical for meeting ever-growing data processing demands efficiently. Existing GPU basic linear algebra subroutine (BLAS) libraries (e.g., NVIDIA cuBLAS) do not provide tensor primitives. Researchers have to implement and optimize their own tensor algorithms in a case-by-case manner, which is inefficient and error-prone. In this paper, we develop the cuTensor-tubal library of seven key primitives for the tubal-rank tensor model on GPUs: t-FFT, inverse t-FFT, t-product, t-SVD, t-QR, t-inverse, and t-normalization. cuTensor-tubal adopts a frequency-domain computation scheme to expose the separability in the frequency domain, then maps the tube-wise and slice-wise parallelisms onto the single instruction multiple thread (SIMT) GPU architecture. To achieve good performance, we optimize the data transfers and memory accesses, and design batched and streamed parallelization schemes for tensor operations with data-independent and data-dependent computation patterns, respectively. In the evaluations of t-product, t-SVD, t-QR, t-inverse, and t-normalization, cuTensor-tubal achieves maximum speedups of 16.91×, 27.03×, 38.97×, 22.36×, and 15.43×, respectively, over CPU implementations running on dual 10-core Xeon CPUs. Two applications, namely t-SVD-based video compression and low-tubal-rank tensor completion, are tested using our library and achieve maximum speedups of 9.80× and 269.26× over multi-core CPU implementations.
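The frequency-domain scheme described in the abstract can be illustrated with a minimal CPU sketch of the t-product: a t-FFT along the tubes (third mode) turns the t-product into independent per-slice matrix multiplications, which is exactly the tube-wise and slice-wise parallelism the library maps onto the GPU. This is a NumPy illustration of the standard t-product definition, not the cuTensor-tubal API; the function name `t_product` is ours.

```python
import numpy as np

def t_product(A, B):
    """Tubal t-product of A (n1 x n2 x n3) and B (n2 x n4 x n3).

    1. t-FFT: FFT each tube (along axis 2) -- tube-wise parallelism.
    2. Multiply corresponding frontal slices in the frequency
       domain -- slice-wise parallelism, independent per slice.
    3. Inverse t-FFT to return to the original domain.
    """
    n3 = A.shape[2]
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.fft.fft(B, axis=2)
    C_hat = np.empty((A.shape[0], B.shape[1], n3), dtype=complex)
    for k in range(n3):  # each slice is an ordinary matrix product
        C_hat[:, :, k] = A_hat[:, :, k] @ B_hat[:, :, k]
    # Result is real for real inputs; discard numerical imaginary noise.
    return np.real(np.fft.ifft(C_hat, axis=2))
```

On a GPU, step 1 maps to a batched FFT over all tubes and step 2 to a batched GEMM over all frontal slices, which is the separability the paper exploits.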
