This paper is devoted to GPU kernel optimization and performance analysis of three tensor-product operators arising in finite element methods. We provide a mathematical background to these operations and implementation details. Achieving close-to-peak performance for these operators requires extensive optimization because of the operators' properties: low arithmetic intensity, tiered structure, and the need to store intermediate results inside the kernel. We give a guided overview of optimization strategies, and we present a performance model that allows us to compare the efficacy of these optimizations against an empirically calibrated roofline.