Journal of Electronic Imaging

Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards

Abstract

We describe and evaluate a fast implementation of a classical block-matching motion estimation algorithm for multiple graphical processing units (GPUs) using the compute unified device architecture (CUDA) computing engine. The implemented block-matching algorithm uses the summed absolute difference (SAD) error criterion and a full grid search (FS) to find the optimal block displacement. In this evaluation, we compared the execution time of a GPU and a CPU implementation for images of various sizes, using integer and noninteger search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 for an integer search grid and by a factor of 1000 for a noninteger search grid. The additional speedup for a noninteger search grid comes from the fact that the GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the method used to split data across the cards, but an almost linear speedup with the number of cards is achievable. In addition, we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized, non-full-grid-search, CPU-based motion estimation methods, namely the implementation of the pyramidal Lucas-Kanade optical flow algorithm in OpenCV and the simplified unsymmetrical multi-hexagon search in the H.264/AVC standard. In these comparisons, the FS GPU implementation still showed a modest improvement, even though its computational complexity is substantially higher than that of the non-FS CPU implementations. We also demonstrated that for an image sequence with a resolution of 720 × 480 pixels, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames per second using two NVIDIA C1060 Tesla GPU cards.
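To make the two ingredients named in the abstract concrete, the following is a minimal CPU-side sketch of SAD-based full search on an integer grid. It is not the paper's CUDA implementation: the function names (`sad`, `full_search`, `shifted_copy`), the block size, the search radius, and the synthetic test frame are all assumptions chosen for illustration, and the noninteger (interpolated) grid and the multi-GPU data split are not covered.

```python
import random


def sad(ref, cur, bx, by, dx, dy, bsize):
    """SAD between the block at (bx, by) in `cur` and the block displaced by (dx, dy) in `ref`."""
    total = 0
    for y in range(bsize):
        for x in range(bsize):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(ref, cur, bx, by, bsize, radius):
    """Evaluate every integer displacement within +/-radius; return (dx, dy, sad) of the best."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip candidates whose block would fall outside the reference frame.
            if bx + dx < 0 or bx + dx + bsize > width:
                continue
            if by + dy < 0 or by + dy + bsize > height:
                continue
            cost = sad(ref, cur, bx, by, dx, dy, bsize)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


def shifted_copy(frame, shift_x, shift_y):
    """Return out with out[y][x] = frame[y + shift_y][x + shift_x], zero where out of range."""
    height, width = len(frame), len(frame[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            sy, sx = y + shift_y, x + shift_x
            if 0 <= sy < height and 0 <= sx < width:
                out[y][x] = frame[sy][sx]
    return out


# Hypothetical test data: a 16 x 16 random reference frame and a shifted copy of it.
rng = random.Random(0)
ref = [[rng.randrange(256) for _ in range(16)] for _ in range(16)]
cur = shifted_copy(ref, 2, 1)

# Search a 4 x 4 block at (4, 4) over a +/-3 pixel window.
motion = full_search(ref, cur, bx=4, by=4, bsize=4, radius=3)
print(motion)
```

Every candidate displacement is scored independently of the others, which is what makes full search expensive on a CPU and, plausibly, a good fit for the many parallel threads of a GPU. How the paper actually maps blocks and candidates to threads is not stated in the abstract.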
