TPC Technology Conference on Performance Evaluation and Benchmarking

Towards Evaluation of Tensorflow Performance in a Distributed Compute Environment



Abstract

Tensorflow (TF) is a highly popular Deep Learning (DL) software framework. Neural network training, a critical part of the DL workflow, is a computationally intensive process that can take days or even weeks. Achieving faster training times is therefore an active area of research and practice. TF supports multi-GPU parallelization, both within a single machine and across multiple physical servers. However, the distributed case is hard to use, and consequently almost all published performance data comes from the single-machine use case. To fill this gap, we benchmark Tensorflow in a GPU-equipped distributed environment. Our work evaluates the performance of various hardware and software combinations. In particular, we examine several types of interconnect technologies to determine their impact on performance. Our results show that with the right choice of input parameters and appropriate hardware, GPU-equipped general-purpose compute clusters can provide deep learning training performance comparable to specialized machines designed for AI workloads.
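The "hard to use" distributed setup the abstract alludes to typically starts with TF's `TF_CONFIG` cluster specification, which every worker must export before multi-worker training can begin. A minimal sketch follows; the host names and ports are hypothetical placeholders, not values from the paper:

```python
import json
import os

# Hypothetical two-node cluster spec. Each physical server runs the same
# training script, differing only in its "index" within the worker list.
cluster_spec = {
    "cluster": {
        "worker": ["node0.example:12345", "node1.example:12345"],
    },
    # This process is worker 0 (the chief); the other node would set index 1.
    "task": {"type": "worker", "index": 0},
}

# TF reads the cluster layout from the TF_CONFIG environment variable.
os.environ["TF_CONFIG"] = json.dumps(cluster_spec)

# With TF_CONFIG set, model code built under
# tf.distribute.MultiWorkerMirroredStrategy().scope() is replicated across
# the GPUs of all listed workers, with gradients synchronized over the
# interconnect -- the component whose impact the paper benchmarks.
```

The choice of interconnect matters here because `MultiWorkerMirroredStrategy` performs synchronous all-reduce of gradients between servers on every step, so inter-node bandwidth and latency directly bound the achievable scaling.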

