IEEE Transactions on Parallel and Distributed Systems

cuTensor-Tubal: Efficient Primitives for Tubal-Rank Tensor Learning Operations on GPUs


Abstract

Tensors are the cornerstone data structures in high-performance computing, big data analysis, and machine learning. However, tensor computations are compute-intensive and their running time increases rapidly with the tensor size. Therefore, designing high-performance primitives on parallel architectures such as GPUs is critical for meeting ever-growing data processing demands efficiently. Existing GPU basic linear algebra subroutine (BLAS) libraries (e.g., NVIDIA cuBLAS) do not provide tensor primitives. Researchers have to implement and optimize their own tensor algorithms in a case-by-case manner, which is inefficient and error-prone. In this paper, we develop the cuTensor-tubal library of seven key primitives for the tubal-rank tensor model on GPUs: t-FFT, inverse t-FFT, t-product, t-SVD, t-QR, t-inverse, and t-normalization. cuTensor-tubal adopts a frequency-domain computation scheme to expose the separability in the frequency domain, then maps the tube-wise and slice-wise parallelisms onto the single instruction multiple thread (SIMT) GPU architecture. To achieve good performance, we optimize the data transfers and memory accesses, and design batched and streamed parallelization schemes for tensor operations with data-independent and data-dependent computation patterns, respectively. In the evaluations of t-product, t-SVD, t-QR, t-inverse, and t-normalization, cuTensor-tubal achieves maximum speedups of 16.91×, 27.03×, 38.97×, 22.36×, and 15.43×, respectively, over CPU implementations running on dual 10-core Xeon CPUs. Two applications, namely t-SVD-based video compression and low-tubal-rank tensor completion, are tested using our library and achieve maximum speedups of 9.80× and 269.26× over multi-core CPU implementations.
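The frequency-domain scheme described in the abstract can be illustrated with a minimal CPU sketch of the t-product: a t-FFT along the tubes (third mode) turns the t-product into independent per-slice matrix multiplications, which is exactly the tube-wise and slice-wise parallelism the library maps onto the GPU. This is a NumPy illustration of the standard t-product definition, not the cuTensor-tubal API; the function name `t_product` is ours.

```python
import numpy as np

def t_product(A, B):
    """Tubal t-product of A (n1 x n2 x n3) and B (n2 x n4 x n3).

    1. t-FFT: FFT each tube (along axis 2) -- tube-wise parallelism.
    2. Multiply corresponding frontal slices in the frequency
       domain -- slice-wise parallelism, independent per slice.
    3. Inverse t-FFT to return to the original domain.
    """
    n3 = A.shape[2]
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.fft.fft(B, axis=2)
    C_hat = np.empty((A.shape[0], B.shape[1], n3), dtype=complex)
    for k in range(n3):  # each slice is an ordinary matrix product
        C_hat[:, :, k] = A_hat[:, :, k] @ B_hat[:, :, k]
    # Result is real for real inputs; discard numerical imaginary noise.
    return np.real(np.fft.ifft(C_hat, axis=2))
```

On a GPU, step 1 maps to a batched FFT over all tubes and step 2 to a batched GEMM over all frontal slices, which is the separability the paper exploits.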
