International Conference on Computational Science

Fully-Asynchronous Cache-Efficient Simulation of Detailed Neural Networks



Abstract

Modern asynchronous runtime systems allow the re-thinking of large-scale scientific applications. Using a simulator of morphologically detailed neural networks as an example, we show how detaching from the commonly used bulk-synchronous parallel (BSP) execution enables increased prefetching, better cache locality, and an overlap of computation and communication, consequently leading to a lower time to solution. Our strategy removes the collective synchronization of the ODEs' coupling information and takes advantage of the pairwise time dependency between equations, leading to a fully-asynchronous, exhaustive yet non-speculative stepping model. Combined with fully linear data structures, communication reduction at the compute-node level, and an earliest-equation-steps-first scheduler, we achieve cache-level acceleration that reduces communication and time to solution by maximizing the number of timesteps taken per neuron at each iteration. Our methods were implemented on the core kernel of the NEURON scientific application. Asynchronicity and a distributed memory space are provided by the HPX runtime system for the ParalleX execution model. Benchmark results demonstrate a superlinear speed-up that reduces runtime compared to the bulk-synchronous execution, yielding a speed-up between 25% and 65% across different compute architectures, and on the order of 15% to 40% for distributed executions.
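The scheduling idea described in the abstract, namely letting each neuron take every timestep currently permitted by its pairwise time dependencies, with the earliest (least-advanced) equation stepping first, can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the NEURON/HPX implementation: the function `async_step_schedule`, the integer step counts, and the `deps`/`delay_steps` parameters are hypothetical, and real synaptic delays and communication are omitted.

```python
import heapq

def async_step_schedule(deps, delay_steps, total_steps):
    """Sketch of earliest-equation-steps-first asynchronous stepping.

    deps[n] lists the presynaptic neurons whose state neuron n consumes;
    delay_steps (>= 1) is the minimum synaptic delay in timesteps, which
    bounds how far n may run ahead of its dependencies without speculation.
    Returns the order in which neurons advanced, as (neuron, from, to).
    """
    steps = {n: 0 for n in deps}      # local timestep count per neuron
    heap = [(0, n) for n in deps]     # priority queue keyed by local time
    heapq.heapify(heap)
    trace = []
    while heap:
        s, n = heapq.heappop(heap)
        if s != steps[n]:             # stale heap entry; skip it
            continue
        # Pairwise limit: a neuron may not step past its earliest
        # dependency's local time plus the synaptic delay.
        limit = min((steps[p] for p in deps[n]), default=total_steps)
        target = min(limit + delay_steps, total_steps)
        trace.append((n, steps[n], target))   # take all allowed steps at once
        steps[n] = target
        if target < total_steps:
            heapq.heappush(heap, (target, n))
    return trace
```

Because the least-advanced neuron is always scheduled first and `delay_steps >= 1`, the popped neuron can always make progress, so the loop terminates without speculative stepping; maximizing the steps taken per pop is what amortizes cache loads across many timesteps.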
