Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards

Francesc Massanes, Marie Cadennes, and Jovan G. Brankov

Abstract

We describe and evaluate a fast implementation of a classical block-matching motion estimation algorithm for multiple graphical processing units (GPUs) using the compute unified device architecture (CUDA) computing engine. The implemented block-matching algorithm uses the summed absolute difference (SAD) error criterion and full grid search (FS) to find the optimal block displacement. In this evaluation, we compared the execution time of GPU and CPU implementations for images of various sizes, using integer and noninteger search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 for an integer search grid and 1000 for a noninteger search grid. The additional speedup for a noninteger search grid comes from the fact that the GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the method used to split data across the cards, but an almost linear speedup with the number of cards is achievable. In addition, we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized non-full-grid-search CPU-based motion estimation methods, namely the implementation of the pyramidal Lucas-Kanade optical flow algorithm in OpenCV and the simplified unsymmetrical multi-hexagon search in the H.264/AVC standard. In these comparisons, the FS GPU implementation still showed modest improvement even though its computational complexity is substantially higher than that of the non-FS CPU implementations. We also demonstrated that for an image sequence with a resolution of 720 × 480 pixels, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames per second using two NVIDIA C1060 Tesla GPU cards.
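The two ingredients named in the abstract, the SAD error criterion and the full grid search, can be illustrated with a minimal CPU sketch in plain Python. This is not the authors' CUDA implementation: the function names `sad` and `full_search`, the block size, the search radius, and the boundary handling are all assumptions chosen for illustration, and only the integer search grid is covered.

```python
def sad(ref, cur, bx, by, dx, dy, block):
    """Summed absolute difference between the block of `cur` whose top-left
    corner is (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(block):
        for x in range(block):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def full_search(ref, cur, bx, by, block=4, radius=2):
    """Test every integer displacement within +/- radius (hypothetical
    defaults) and return (dx, dy, cost) for the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best = (None, None, float("inf"))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip candidates whose displaced block leaves the reference frame.
            if not (0 <= by + dy and by + dy + block <= height):
                continue
            if not (0 <= bx + dx and bx + dx + block <= width):
                continue
            cost = sad(ref, cur, bx, by, dx, dy, block)
            if cost < best[2]:
                best = (dx, dy, cost)
    return best
```

Every candidate displacement is evaluated, so the cost per block grows with the square of the search radius, and each block's search is independent of every other block's. A noninteger grid would additionally require interpolating `ref` at fractional positions, which this sketch does not attempt.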


Bibliographic details

Source: Journal of Electronic Imaging, 2011, Issue 3, pp. 1-11 (11 pages)
Authors: Francesc Massanes, Marie Cadennes, and Jovan G. Brankov
Affiliation: Illinois Institute of Technology, Medical Imaging Research Center, Chicago, Illinois 60616. E-mail: brankov@iit.edu
Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English

