International conference on embedded computer systems: architectures, modeling and simulation

Fully Distributed Deep Learning Inference on Resource-Constrained Edge Devices



Abstract

Performing inference tasks of deep learning applications on IoT edge devices ensures privacy of input data and can result in shorter latency compared to a cloud solution. As most edge devices are memory- and compute-constrained, they cannot store and execute a complete Deep Neural Network (DNN). One possible solution is to distribute the DNN across multiple edge devices. For a complete distribution, both fully-connected layers and feature- and weight-intensive convolutional layers need to be partitioned to reduce the amount of computation and data on each resource-constrained edge device. At the same time, the resulting communication overheads need to be considered. Existing work on distributed DNN execution either cannot support all types of networks and layers or does not account for layer fusion opportunities that reduce communication. In this paper, we jointly optimize the memory, computation, and communication demands of distributed execution for complete neural networks covering all layers. This is achieved through techniques that combine feature and weight partitioning with a communication-aware layer fusion approach, enabling holistic optimization across layers. For a given number of edge devices, the schemes are applied jointly such that the amount of data exchanged between devices is minimized, optimizing run time. Experimental results for a simulation of six edge devices on 100 Mbit/s connections running the YOLOv2 DNN model show that the schemes evenly balance the memory footprint between devices. The integration of layer fusion additionally reduces communication demands by 14.8%, resulting in a 1.15x speed-up of the inference task compared to partitioning without fusion.
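
To make the partitioning idea concrete: weight partitioning splits a layer's parameters across devices, so each device stores and computes only a slice of the output. Below is a minimal NumPy sketch of this for a fully-connected layer; the helper names and the even row-wise split are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def partition_fc_by_output(W, b, num_devices):
    # Split a fully-connected layer row-wise: each device keeps only the
    # weight rows (output neurons) it is responsible for. Hypothetical
    # helper for illustration; the paper's scheme may differ in detail.
    W_parts = np.array_split(W, num_devices, axis=0)
    b_parts = np.array_split(b, num_devices, axis=0)
    return list(zip(W_parts, b_parts))

def device_fc_forward(x, W_part, b_part):
    # Each device computes its slice of the output; the slices are then
    # concatenated (one gather of activations between devices).
    return W_part @ x + b_part

# Example: a 1000x4096 fully-connected layer split across six devices,
# matching the six-device setup used in the experiments.
rng = np.random.default_rng(0)
W, b = rng.standard_normal((1000, 4096)), rng.standard_normal(1000)
x = rng.standard_normal(4096)
parts = partition_fc_by_output(W, b, num_devices=6)
y = np.concatenate([device_fc_forward(x, Wp, bp) for Wp, bp in parts])
assert np.allclose(y, W @ x + b)  # identical to the unpartitioned layer
```

Under such a split, each device stores roughly one sixth of the layer's weights, which is the kind of even memory balance the experiments report.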
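The communication saving from layer fusion can be sketched in the same way. Under feature (spatial) partitioning, a device that fuses several consecutive convolutional layers fetches its input tile plus a halo covering the fused receptive field only once, and exchanges no intermediate feature maps between layers. The 1-D convolutions and tile boundaries below are simplifying assumptions for illustration only.

```python
import numpy as np

def conv1d_valid(x, k):
    # Plain 'valid' 1-D convolution as a stand-in for a conv layer.
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

def fused_tile(x, start, stop, kernels):
    # Compute one spatial tile of the output of several fused conv layers.
    # The device fetches its tile plus a halo once, then runs all fused
    # layers locally with no intermediate communication.
    halo = sum(len(k) - 1 for k in kernels)  # growth of the receptive field
    t = x[start:stop + halo]                 # one input transfer per fused block
    for k in kernels:
        t = conv1d_valid(t, k)
    return t

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
kernels = [rng.standard_normal(3) for _ in range(2)]  # two fused layers

# Reference: unpartitioned execution of both layers in sequence.
ref = conv1d_valid(conv1d_valid(x, kernels[0]), kernels[1])

# Two devices, each producing one spatial half of the final output.
half = len(ref) // 2
out = np.concatenate([fused_tile(x, 0, half, kernels),
                      fused_tile(x, half, len(ref), kernels)])
assert np.allclose(out, ref)
```

Without fusion, devices would exchange border regions of the intermediate feature map after every layer; with fusion, a slightly larger halo is transferred once per fused block, which is the mechanism behind the reported 14.8% reduction in communication demands.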
