首页> 外文会议>International conference on Parallel architectures and compilation techniques >Program generation for the all-pairs shortest path problem
【24h】

Program generation for the all-pairs shortest path problem

机译:全对最短路径问题的程序生成

获取原文

摘要

A recent trend in computing are domain-specific program generators, designed to alleviate the effort of porting and re-optimizing libraries for fast-changing and increasingly complex computing platforms. Examples include ATLAS, SPIRAL, and the codelet generator in FFTW. Each of these generators produces highly optimized source code directly from a problem specification. In this paper, we extend this list by a program generator for the well-known Floyd-Warshall (FW) algorithm that solves the all-pairs shortest path problem, which is important in a wide range of engineering applications. As the first contribution, we derive variants of the FW algorithm that make it possible to apply many of the optimization techniques developed for matrix-matrix multiplication. The second contribution is the actual program generator, which uses tiling, loop unrolling, and SIMD vectorization combined with a hill climbing search to produce the best code (float or integer) for a given platform. Using the program generator, we demonstrate a speedup over a straightforward single-precision implementation of up to a factor of 1.3 on Pentium 4 and 1.8 on Athlon 64. Use of 4-way vectorization further improves the performance by another factor of up to 5.7 on Pentium 4 and 3.0 on Athlon 64. For data type short integers, 8-way vectorization provides a speed-up of up to 4.6 on Pentium 4 and 5.0 on Athlon 64 over the best scalar code.
机译:计算领域的最新趋势是特定领域的程序生成器,旨在减轻为快速变化和日益复杂的计算平台移植和重新优化库的工作。示例包括ATLAS,SPIRAL和FFTW中的小码生成器。这些生成器中的每一个都直接根据问题说明生成高度优化的源代码。在本文中,我们通过程序生成器扩展了此列表,该程序生成器用于解决所有对最短路径问题的著名的Floyd-Warshall(FW)算法,这在广泛的工程应用中很重要。作为第一个贡献,我们得出了FW算法的变体,这些变体使得可以应用为矩阵矩阵乘法开发的许多优化技术。第二个贡献是实际的程序生成器,它使用平铺,循环展开和SIMD矢量化结合爬山搜索为给定平台生成最佳代码(浮点数或整数)。使用程序生成器,我们演示了奔腾4上高达1.3的直接单精度实现的加速以及Athlon 64上高达1.8的加速的效果。4路矢量化的使用进一步将性能提高了5.7倍。 Athlon 64上的Pentium 4和3.0。对于短整数数据类型,8路矢量化可以在最佳标量代码上将奔腾4的速度提高到4.6,在Athlon 64的速度上提高到5.0。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号