PERFORMANCE IMPROVEMENT OF HOT-PATH BASED THREAD PARTITIONING TECHNIQUE BY UNIFYING LOOP PARALLELIZATION

Abstract

Multi-core processors have recently become widespread, and their performance has increased. To speed up a single program by making effective use of the potentially high performance of multi-core processors, thread-level parallelism must be exploited. However, extracting parallelism from non-numerical programs is difficult because such programs have complex program structures and complicated data dependencies. On the other hand, even a program with a complex control structure generally executes only a few of its paths in practice. Based on this property, we have developed the hot-path based thread partitioning technique, which parallelizes non-numerical program code at the thread level along the most frequently executed path and can speed up programs with poor loop-level parallelism. Because the technique assumes that non-numerical programs tend to have poor loop-level parallelism, it never parallelizes loops, even when a loop is parallelizable and loop-level parallel execution would actually yield a speedup. As a result, the speedup of program code containing loops is insufficient. In this paper, we therefore improve the hot-path based thread partitioning technique so that loops on the hot path can be parallelized by applying loop sectioning. We also present a preliminary performance evaluation of the improved technique, using accurate cycle-based simulation of practical program code. The results show that the improved technique achieves higher performance than the original hot-path based thread partitioning technique.
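The following Python sketch illustrates, under stated assumptions, the two ideas the abstract combines. It assumes the program's control flow is given as a hypothetical edge profile (`succ_counts`: each basic block mapped to its successors and how often each edge was taken). It picks a hot path by greedily following the most frequently taken edge to a block not yet on the path, flags blocks on that path that carry a back edge (loops), and parallelizes such a loop by loop sectioning, read here as strip-mining the iteration space into fixed-size sections handed to worker threads. All function names, the profile format, and the section size are hypothetical; this is not the authors' implementation, which targets speculative multithreading on a simulated multi-core processor.

```python
from concurrent.futures import ThreadPoolExecutor


def hot_path(succ_counts, entry, exit_block):
    """Greedily follow the most frequently taken edge to a block not yet on the path."""
    path, seen, block = [entry], {entry}, entry
    while block != exit_block:
        # Ignore edges back into the path so a loop does not trap the walk.
        candidates = {s: c for s, c in succ_counts.get(block, {}).items() if s not in seen}
        if not candidates:
            break
        block = max(candidates, key=candidates.get)
        path.append(block)
        seen.add(block)
    return path


def loops_on_path(succ_counts, path):
    """Blocks on the hot path with a back edge to themselves or an earlier path block."""
    pos = {b: i for i, b in enumerate(path)}
    return [b for b in path
            if any(s in pos and pos[s] <= pos[b] for s in succ_counts.get(b, {}))]


def loop_sections(n, section_size):
    """Strip-mine the iteration space [0, n) into contiguous (lo, hi) sections."""
    return [(lo, min(lo + section_size, n)) for lo in range(0, n, section_size)]


def sectioned_sum(data, section_size=8, num_threads=4):
    """Run the sections of a summation loop on worker threads, then combine the results."""
    def run_section(bounds):
        lo, hi = bounds
        return sum(data[lo:hi])  # one worker executes one section of the loop

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = pool.map(run_section, loop_sections(len(data), section_size))
        return sum(partials)  # partial results come back in section order


if __name__ == "__main__":
    # Hypothetical edge profile: block -> {successor: times the edge was taken}.
    profile = {
        "entry": {"A": 90, "B": 10},
        "A": {"loop": 90},
        "loop": {"loop": 900, "join": 90},  # self edge is the loop's back edge
        "B": {"join": 10},
        "join": {"exit": 100},
    }
    path = hot_path(profile, "entry", "exit")
    print(path)                             # ['entry', 'A', 'loop', 'join', 'exit']
    print(loops_on_path(profile, path))     # ['loop']
    print(sectioned_sum(list(range(100))))  # 4950
```

Because Python threads share the global interpreter lock, `sectioned_sum` shows the structure of the partitioning rather than a real speedup. The section size trades per-section scheduling overhead against load balance across threads.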

Bibliographic Record

Source: IASTED International Conference on Parallel and Distributed Computing and Systems, 2011, 10 pages
Authors: Kanemitsu Ootsu; Takashi Yokota; Takanobu Baba
Original format: PDF
CLC classification: Computer applications
Keywords: multi-core processor; thread-level parallelism; thread partitioning; speculative multithreading; program hot path

Similar Literature

1. Performance comparisons of bonding box-based contact detection algorithms and a new improvement technique based on parallelization [J]. Yazdani Mahmoud, Paseh Hamidreza, Sharifzadeh Mostafa. Engineering Computations, 2016, no. 1.

2. A New Model Exploiting Loop Parallelization Using Knowledge-Based Techniques [J]. Chao-Tung Yang, Shian-Shyong Tseng, Sun-Wen Chuang. Proceedings of the National Science Council, Republic of China, Part A: Physical Science and Engineering, 1998, no. 3.

3. Thread partitioning and value prediction for exploiting speculative thread-level parallelism [J]. Marcuello P., Gonzalez A., Tubella J. IEEE Transactions on Computers, 2004, no. 2.

4. Performance Improvement of Hot-Path Based Thread Partitioning Technique by Unifying Loop Parallelization [C]. Kanemitsu Ootsu, Takashi Yokota, Takanobu Baba. Proceedings of the 23rd IASTED International Conference on Parallel and Distributed Computing and Systems, 2011.

5. Loop unrolling along wavefronts and wavefront based techniques for exploiting instruction and thread level parallelism [D]. Johann Steinbrecher, 2013.

6. Meta-Alignment with Crumble and Prune: Partitioning very large alignment problems for performance and parallelization [O]. Krishna M Roskin, Benedict Paten, David Haussler. 2011.

7. Table 5: Performance of proposed GPU-based parallel implementation of permutation testing depending on whether memory coalescing technique was used (the number of CUDA blocks = 16, the number of threads per block = 256) [O].
