Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort

机译：在CPU和GPU上进行快速排序：带宽忽略型SIMD排序的案例

获取原文

页面导航

摘要
著录项
相似文献
相关主题

摘要

Sort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and non-comparison based sorting algorithms on two modern architectures - the latest CPU and GPU architectures. We propose novel CPU radix sort and GPU merge sort implementations which are 2X faster than previously published results. We perform a fair comparison of the algorithms using these best performing implementations on both architectures. While radix sort is faster on current architectures, the gap narrows from CPU to GPU architectures. Merge sort performs better than radix sort for sorting keys of large sizes - such keys will be required to accommodate the increasing cardinality of future databases. We present analytical models for analyzing the performance of our implementations in terms of architectural features such as core count, SIMD and bandwidth. Our obtained performance results are successfully predicted by our models. Our analysis points to merge sort winning over radix sort on future architectures due to its efficient utilization of SIMD and low bandwidth utilization. We simulate a 64-core platform with varying SIMD widths under constant bandwidth per core constraints, and show that large data sizes of 2~(40) (one trillion records), merge sort performance on large key sizes is up to 3X better than radix sort for large SIMD widths on future architectures. Therefore, merge sort should be the sorting method of choice for future databases.

机译：排序是许多数据库操作中使用的基本内核。内存中排序现在是可行的。排序性能受计算触发器和主内存带宽而不是I / O的限制。在本文中，我们对两种现代架构（最新的CPU和GPU架构）上基于比较和非比较的排序算法进行了竞争性分析。我们提出了新颖的CPU基数排序和GPU合并排序实现，这些实现比以前发布的结果快2倍。我们使用两种体系结构上性能最佳的实施方案对算法进行公平的比较。尽管在当前架构上基数排序更快，但是从CPU到GPU架构的差距在缩小。对于大尺寸的键排序，Merge排序比基数排序要好-需要这种键来适应未来数据库不断增长的基数。我们提供了分析模型，用于根据体系结构特征（例如核心数，SIMD和带宽）来分析实现的性能。我们的模型成功地预测了我们获得的性能结果。我们的分析指出，由于对SIMD的有效利用和低带宽利用率，在未来的体系结构中合并排序胜过基数排序。我们模拟了一个64核心平台，该平台在每个核心约束恒定带宽下具有可变SIMD宽度，并显示2〜（40）个大数据大小（一万亿条记录），大键大小上的合并排序性能比基数高3倍。在将来的体系结构上对较大的SIMD宽度进行排序。因此，合并排序应该是将来数据库选择的排序方法。

著录项

来源
《ACM SIGMOD international conference on management of data;SIGMOD 2010》|2010年|P.351-362|共12页
会议地点
作者
Nadathur Satish; Changkyu Kim; Jatin Chhugani; Anthony D. Nguyen; Victor W. Lee; Daehyun Kim; Pradeep Dubey;
展开▼
作者单位

展开▼
会议组织
原文格式 PDF
正文语种
中图分类程序设计、软件工程;
关键词
performance; algorithms;

机译：性能;算法;

相似文献

外文文献
中文文献
专利

1. SIML: A Fast SIMD Algorithm for Calculating LINGO Chemical Similarities on GPUs and CPUs [J] . Haque IS, Pande VS, Walters WP Journal of chemical information and modeling . 2010,第4期

机译：SIML：一种快速SIMD算法，用于计算GPU和CPU上的LINGO化学相似性
2. A comparison-free sorting algorithm on CPUs and GPUs [J] . Abdel-hafeez Saleh, Gordon-Ross Ann, Abubaker Samer Journal of supercomputing . 2018,第11期

机译：CPU和GPU上的免比较排序算法
3. Interaction sorting method for molecular dynamics on multi-core SIMD CPU architecture [J] . Matvienko Sergey, Alemasov Nikolay, Fomin Eduard Journal of Bioinformatics and Computational Biology . 2015,第1期

机译：基于多核simd CPU架构的分子动力学交互排序方法
4. Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort [C] . ACM SIGMOD international conference on management of data . 2010

机译：快速排序CPU和GPU：带宽忘记SIMD排序的情况
5. Efficient Viewshed Computation Algorithms on GPUs and CPUs [D] . Qarah, Faisal F. 2020

机译：GPU和CPU上有效的viewShed计算算法
6. SIML: A Fast SIMD Algorithm for Calculating LINGO Chemical Similarities on GPUs and CPUs [O] . Imran S. Haque, Vijay S. Pande, W. Patrick Walters -1

机译：sImL：一种快速sImD算法计算在GpU和CpU LINGO化学相似性
7. SIML: a fast SIMD algorithm for calculating LINGO chemical similarities on GPUs and CPUs [O] . Imran S. Haque, Vijay S. P, W. Patrick Walters 2010

机译：sImL：一种快速sImD算法，用于计算GpU和CpU上的LINGO化学相似性

Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort

摘要

著录项

相似文献

相关主题

期刊订阅