PLoS Computational Biology

Spike-Based Reinforcement Learning in Continuous State and Action Space: When Policy Gradient Methods Fail


Abstract

Changes of synaptic connections between neurons are thought to be the physiological basis of learning. These changes can be gated by neuromodulators that encode the presence of reward. We study a family of reward-modulated synaptic learning rules for spiking neurons on a learning task in continuous space inspired by the Morris water maze. The synaptic update rule modifies the release probability of synaptic transmission and depends on the timing of presynaptic spike arrival and postsynaptic action potentials, as well as on the membrane potential of the postsynaptic neuron. The family of learning rules includes an optimal rule derived from policy gradient methods as well as reward-modulated Hebbian learning. The synaptic update rule is implemented in a population of spiking neurons using a network architecture that combines feedforward input with lateral connections. Actions are represented by a population of hypothetical action cells with strong Mexican-hat connectivity and are read out at theta frequency. We show that in this architecture, a standard policy gradient rule fails to solve the Morris water maze task, whereas a variant with a Hebbian bias can learn the task within 20 trials, consistent with experiments. This result does not depend on implementation details such as the size of the neuronal populations. Our theoretical approach shows how learning new behaviors can be linked to reward-modulated plasticity at the level of single synapses, and it makes predictions about the voltage and spike-timing dependence of synaptic plasticity and the influence of neuromodulators such as dopamine. It is an important step towards connecting formal theories of reinforcement learning with neuronal and synaptic properties.
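The kind of rule the abstract describes, where pre- and postsynaptic spike timing and the postsynaptic membrane potential build an eligibility trace that a later reward signal converts into a weight change, can be sketched compactly. The following Python sketch is not the paper's implementation: the escape-rate neuron model and all parameter values are assumptions, and alpha is a hypothetical knob interpolating between the pure policy-gradient form (alpha = 1, which subtracts the full expected firing rate) and a Hebbian-biased variant (alpha < 1).

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy sizes and constants (hypothetical, not taken from the paper).
    n_pre = 50          # number of presynaptic inputs
    dt = 1e-3           # time step [s]
    tau_m = 20e-3       # PSP filter time constant [s]
    tau_e = 0.5         # eligibility-trace time constant [s]
    rho0, u0, du = 60.0, -50.0, 2.0   # escape rate: max rate [Hz], soft threshold, slope [mV]
    u_rest = -65.0

    w = rng.normal(0.0, 0.1, n_pre)   # synaptic weights
    psp = np.zeros(n_pre)             # filtered presynaptic spike trains (PSP traces)
    elig = np.zeros(n_pre)            # per-synapse eligibility trace

    def escape_rate(u):
        # Instantaneous firing rate of a stochastic (escape-noise) neuron.
        return rho0 * np.exp((u - u0) / du)

    def step(pre_spikes, alpha):
        # One Euler step. alpha = 1 gives the policy-gradient form
        # (spike minus expected spike count); alpha < 1 under-subtracts
        # the expectation, which adds a Hebbian bias.
        global psp, elig
        psp += -psp * (dt / tau_m) + pre_spikes
        u = u_rest + w @ psp                       # membrane potential [mV]
        rho = escape_rate(u)
        post = float(rng.random() < rho * dt)      # stochastic postsynaptic spike
        elig += -elig * (dt / tau_e) + (post - alpha * rho * dt) * psp
        return post

    def apply_reward(R, lr=0.01):
        # Reward gates the accumulated eligibility into a weight change.
        global w
        w += lr * R * elig

    # Usage: simulate one trial with 5 Hz Poisson inputs, then reinforce.
    for _ in range(1000):
        step(pre_spikes=(rng.random(n_pre) < 5.0 * dt).astype(float), alpha=0.8)
    apply_reward(R=1.0)

The subtracted rho * dt term is what makes the alpha = 1 rule an unbiased policy-gradient estimator; the abstract's central finding is that weakening exactly this term, i.e. adding a Hebbian bias, is what allows the network to solve the water maze task.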
