A Framework for Productive, Efficient and Portable Parallel Computing


Abstract

Developing efficient parallel implementations and fully utilizing the available resources of parallel platforms are now required for software applications to scale to new generations of processors. Yet parallel programming remains challenging for programmers because of the requisite low-level knowledge of the underlying hardware and parallel computing constructs. Developing applications that effectively utilize parallel hardware is restricted by poor programmer productivity, low-level implementation requirements, and limited portability of the application code. These restrictions in turn impede experimentation with various algorithmic approaches for a given application. Currently, the programming world is divided into two types of programmers: application writers, who focus on designing and prototyping applications and algorithms, and efficiency programmers, who focus on extracting performance for particular compute kernels. The gap between these two types of programmers is referred to as "the implementation gap".

In this dissertation, we present a software environment that aims to bridge the implementation gap and enable application writers to productively utilize parallel hardware by reusing the work of efficiency programmers. Specifically, we present PyCASP, a Python-based software framework that automatically maps Python application code to a variety of parallel platforms. PyCASP is an application-domain-specific framework that uses a systematic, pattern-oriented approach to offer a single productive software development environment for application writers. PyCASP targets audio content analysis applications, but our methodology is designed to be applicable to any application domain. Using PyCASP, applications can be prototyped in Python code, and our environment enables them to automatically scale their performance to modern parallel processors such as GPUs, multicore CPUs and compute clusters. We use the Selective Embedded JIT Specialization (SEJITS) mechanism to realize the pattern-based design of PyCASP in software. We use SEJITS to implement PyCASP's components and to enable automatic parallelization of specific audio content analysis application patterns on a variety of parallel hardware. By focusing on one application domain, we enable efficient composition of computations using three structural patterns: MapReduce, Iterator and Pipe-and-Filter.

To illustrate our approach, we study a set of four example audio content analysis applications that are architected and implemented using PyCASP: a speaker verification system, a speaker diarization system, a music recommendation system and a video event detection system. We describe the detailed implementation of two computational components of PyCASP, a Gaussian Mixture Model (GMM) component and a Support Vector Machine (SVM) component, and their use in implementing the example applications. We also analyze composition of computations using the three structural patterns and implement the available optimizations for composing computations in audio analysis applications.

We evaluate our approach with results on productivity and performance using the two computational components and the four example applications. Our results illustrate that we can prototype the fully functioning applications in Python using 10-60x fewer lines of code than equivalent implementations in low-level languages. Our PyCASP components and example applications achieve and often exceed the efficiency of comparable hand-tuned low-level implementations. In addition to specialization, adding the optimizations for composing components in these applications can give up to a 30% performance improvement. We show that applications written using PyCASP can be run on multiple parallel hardware backends with little or no change to the application code. PyCASP also enables applications to scale from one desktop GPU to a cluster of GPUs with little programmer effort. Combining all of the specialization and composition techniques, our example applications are able to automatically achieve 50-1000x faster-than-real-time performance on both multi-core CPU and GPU platforms and a 15.5x speedup on a 16-node cluster of GPUs, showing near-optimal scaling.
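The three structural patterns named in the abstract (MapReduce, Iterator and Pipe-and-Filter) can be illustrated with a minimal, sequential sketch in plain Python. This is not PyCASP's API and involves no SEJITS specialization; the helper names `map_reduce`, `iterate` and `pipe`, the toy signal, and the Newton iteration standing in for an iterative training loop are all assumptions made for illustration.

```python
from functools import reduce

# Hypothetical helpers (not part of PyCASP): plain-Python versions of the
# three structural patterns, executed sequentially.

def map_reduce(mapper, reducer, items, initial):
    """MapReduce: apply `mapper` to each item independently, then fold the results."""
    return reduce(reducer, map(mapper, items), initial)

def iterate(step, state, converged, max_iters=100):
    """Iterator: repeat `step` until `converged(old, new)` holds or the cap is hit."""
    for _ in range(max_iters):
        new_state = step(state)
        if converged(state, new_state):
            return new_state
        state = new_state
    return state

def pipe(*filters):
    """Pipe-and-Filter: chain stages so each consumes the previous stage's output."""
    def run(data):
        for stage in filters:
            data = stage(data)
        return data
    return run

def frame(signal, size):
    """Split a signal into non-overlapping frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def frame_energy(samples):
    """Sum of squared samples in one frame."""
    return sum(x * x for x in samples)

# Toy "audio" signal (made-up values).
signal = [1.0, -1.0, 2.0, -2.0, 3.0, -3.0]

# MapReduce: per-frame energies are independent (map), then summed (reduce).
total = map_reduce(frame_energy, lambda a, b: a + b, frame(signal, 2), 0.0)

# Pipe-and-Filter: framing -> per-frame energy -> peak energy.
peak = pipe(
    lambda s: frame(s, 2),
    lambda frames: [frame_energy(f) for f in frames],
    max,
)(signal)

# Iterator: Newton's method for sqrt(total), a stand-in for an iterative
# refinement loop that runs until its state stops changing.
root = iterate(
    lambda x: 0.5 * (x + total / x),
    1.0,
    lambda old, new: abs(new - old) < 1e-12,
)

print(total, peak, root)
```

In a sketch like this, the map step and each pipeline stage are the natural places where a framework could substitute a parallel implementation without the calling code changing, which is the kind of separation between application code and efficiency code the abstract describes.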


Bibliographic record

Author: Gonina, Ekaterina I.
Author affiliation: University of California, Berkeley
Degree-granting institution: University of California, Berkeley
Subject: Computer science
Degree: Ph.D.
Year: 2013
Pages: 169 p.
Total pages: 169
Original format: PDF
Language of text: English

