Journal: The Visual Computer

A SIMD-efficient 14 instruction shader program for high-throughput microtriangle rasterization



Abstract

This paper shows that the barrier of 1 triangle/clock rasterization rate for microtriangles on modern GPU architectures can be broken efficiently. The fixed throughput of the special-purpose culling and triangle setup stages of the classic pipeline limits how well a GPU scales when rasterizing many triangles in parallel that each cover very few pixels. In contrast, the growing shader core counts and GFLOPs of modern GPUs clearly suggest parallelizing this computation entirely across multiple shader threads, making use of the powerful wide-ALU instructions. We present a very efficient SIMD-like rasterization code targeted at very small triangles that scales well with the number of shader cores and outperforms traditional edge-equation-based algorithms. We have extended the ATTILA GPU shader ISA (del Barrio et al. in IEEE International Symposium on Performance Analysis of Systems and Software, pp. 231-241, 2006) with two fixed-point instructions to meet the rasterization precision requirement. This paper also introduces a novel subpixel bounding box size optimization that adjusts the bounds much more finely, which is critical for small triangles, and doubles the efficiency of the 2 × 2-pixel stamp test. The proposed shader rasterization program can run on top of the original pixel shader program, so that selected fragments are rasterized, attribute-interpolated, and pixel-shaded in the same pass. Our results show that the technique outperforms a classic rasterizer at 8 or more shader cores, with speedups as high as 4× for 16 shader cores.
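As a rough illustration of two ideas the abstract mentions (fixed-point precision and a bounding box tightened at subpixel granularity), the following is a minimal sketch. It is not the paper's 14-instruction shader program: it uses the conventional edge-function test that the paper compares against, and the 4-bit subpixel precision, the inclusive edge rule (no top-left fill rule), and all function names are assumptions made for this example.

```python
SUBPIXEL_BITS = 4                 # assumed precision: 1/16 of a pixel
ONE = 1 << SUBPIXEL_BITS          # one pixel in fixed-point units
HALF = ONE >> 1                   # offset of a pixel center


def to_fixed(v):
    """Convert a float pixel coordinate to fixed point (hypothetical helper)."""
    return int(round(v * ONE))


def edge(ax, ay, bx, by, px, py):
    """Integer edge function: >= 0 when p lies on the inner side of edge a->b."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize_microtriangle(v0, v1, v2):
    """Return the (x, y) pixels whose centers lie inside the triangle."""
    (x0, y0), (x1, y1), (x2, y2) = [
        (to_fixed(x), to_fixed(y)) for x, y in (v0, v1, v2)
    ]

    # Signed double area; swap two vertices so the winding is consistent.
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return []                 # degenerate triangle
    if area < 0:
        x1, y1, x2, y2 = x2, y2, x1, y1

    # Bounding box in subpixel units, then snapped to the first and last
    # pixel whose *center* falls inside it. A triangle that fits between
    # pixel centers yields an empty range and needs no edge tests at all.
    min_x, max_x = min(x0, x1, x2), max(x0, x1, x2)
    min_y, max_y = min(y0, y1, y2), max(y0, y1, y2)
    px_lo = (min_x - HALF + ONE - 1) >> SUBPIXEL_BITS   # ceil
    px_hi = (max_x - HALF) >> SUBPIXEL_BITS             # floor
    py_lo = (min_y - HALF + ONE - 1) >> SUBPIXEL_BITS
    py_hi = (max_y - HALF) >> SUBPIXEL_BITS

    covered = []
    for py in range(py_lo, py_hi + 1):
        for px in range(px_lo, px_hi + 1):
            cx = (px << SUBPIXEL_BITS) + HALF
            cy = (py << SUBPIXEL_BITS) + HALF
            if (edge(x0, y0, x1, y1, cx, cy) >= 0
                    and edge(x1, y1, x2, y2, cx, cy) >= 0
                    and edge(x2, y2, x0, y0, cx, cy) >= 0):
                covered.append((px, py))
    return covered
```

For example, under these assumptions `rasterize_microtriangle((0.25, 0.25), (2.5, 0.25), (0.25, 2.5))` visits a 3 × 3 pixel box and keeps three pixels, while a triangle confined to the corner of a single pixel is rejected by the bounding box alone.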
