Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards

Francesc Massanes; Marie Cadennes; Jovan G. Brankov

Abstract

We describe and evaluate a fast implementation of a classical block-matching motion estimation algorithm for multiple graphical processing units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block-matching algorithm uses the summed absolute difference (SAD) error criterion and a full grid search (FS) to find the optimal block displacement. In this evaluation, we compared the execution time of a GPU and a CPU implementation for images of various sizes, using integer and noninteger search grids. The results show that a GPU card can shorten computation time by a factor of 200 for an integer search grid and by a factor of 1000 for a noninteger search grid. The additional speedup for a noninteger search grid comes from the GPU's built-in hardware for image interpolation. Further, when using multiple GPU cards, the evaluation shows that the method used to split the data across cards matters, but that an almost linear speedup with the number of cards is achievable. In addition, we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized CPU-based motion estimation methods that do not use a full grid search, namely the implementation of the pyramidal Lucas-Kanade optical flow algorithm in OpenCV and the simplified unsymmetrical multi-hexagon search in the H.264/AVC standard. In these comparisons, the FS GPU implementation still showed a modest improvement, even though its computational complexity is substantially higher than that of the non-FS CPU implementations. We also demonstrated that for an image sequence with a resolution of 720 × 480 pixels, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames per second using two NVIDIA C1060 Tesla GPU cards.
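
To make the search described in the abstract concrete, the following is a minimal CPU-only sketch of full-search block matching with a SAD criterion on an integer grid. It is not the authors' CUDA implementation: the function names (`sad`, `full_search`), the parameters (`block`, `radius`), and the synthetic ramp frames are all assumptions chosen for illustration, and the noninteger (interpolated) grid and multi-GPU data splitting are not covered.

```python
def sad(ref, cur, bx, by, dx, dy, block):
    """Summed absolute difference between the block at (bx, by) in `cur`
    and the block displaced by (dx, dy) in `ref`."""
    total = 0
    for y in range(block):
        for x in range(block):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(ref, cur, bx, by, block=4, radius=2):
    """Test every integer displacement in a (2*radius+1)^2 window and
    return (dx, dy, cost) for the candidate with the smallest SAD."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip candidates whose displaced block leaves the reference frame.
            if not (0 <= bx + dx and bx + dx + block <= width):
                continue
            if not (0 <= by + dy and by + dy + block <= height):
                continue
            cost = sad(ref, cur, bx, by, dx, dy, block)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


# Hypothetical test frames: a 12 x 12 ramp in which every pixel value is unique.
SIZE = 12
ref = [[100 * y + x for x in range(SIZE)] for y in range(SIZE)]
# `cur` shows the content of `ref` found 2 pixels right and 1 pixel down
# (indices are clamped at the border; the tested block stays clear of it).
cur = [[ref[min(y + 1, SIZE - 1)][min(x + 2, SIZE - 1)] for x in range(SIZE)]
       for y in range(SIZE)]

result = full_search(ref, cur, bx=4, by=4)
print(result)  # (2, 1, 0)
```

Every candidate displacement is scored independently of the others, which is what makes a full search straightforward to parallelize; how the work is actually mapped onto GPU threads and split across cards is specific to the paper and is not reproduced here.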

Bibliographic record

Source: Journal of Electronic Imaging, 2011, No. 3, pp. 033004.1-033004.10 (10 pages)
Authors: Francesc Massanes; Marie Cadennes; Jovan G. Brankov
Affiliation (all three authors): Illinois Institute of Technology, Medical Imaging Research Center, Chicago, Illinois 60616
Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language of text: English