Emerging Symmetric Multi-Processing (SMP) architectures have shifted from the shared-bus topology to Cache-coherent Non-Uniform Memory Access (ccNUMA), where Processing Elements (PEs) access the distributed memories with different delays. This shift potentially impacts the performance of current SMP software packages and the way they address the complexity of the new architecture. In this work, we compare our user-level thread scheduling mechanism [16] with the OpenMP scheduler when multiplying two large matrices on a dual-socket NUMA architecture. We analyzed and evaluated an optimized, multi-threaded implementation of the Level-3 BLAS general matrix multiplication routine (DGEMM). We show that making such a memory-intensive operation architecture-aware minimizes memory bottlenecks, improves the utilization of the memory caches, and consequently raises overall performance. Specifically, we present a thread-scheduling and data-alignment strategy that reduces the number of cache misses to one-third of those incurred by the non-tuned implementation and cuts the required computation time by up to 22%. Finally, we show the relationship between the number of cache misses and the speedup percentage gained by our implementation, which supports our hypothesis about the data-locality problem and memory bottleneck in a non-NUMA-aware implementation.