Spike-Based Reinforcement Learning in Continuous State and Action Space: When Policy Gradient Methods Fail

Eleni Vasilaki; Nicolas Frémaux; Robert Urbanczik; Walter Senn; Wulfram Gerstner


Abstract

Changes of synaptic connections between neurons are thought to be the physiological basis of learning. These changes can be gated by neuromodulators that encode the presence of reward. We study a family of reward-modulated synaptic learning rules for spiking neurons on a learning task in continuous space inspired by the Morris water maze. The synaptic update rule modifies the release probability of synaptic transmission and depends on the timing of presynaptic spike arrival, postsynaptic action potentials, as well as the membrane potential of the postsynaptic neuron. The family of learning rules includes an optimal rule derived from policy gradient methods as well as reward-modulated Hebbian learning. The synaptic update rule is implemented in a population of spiking neurons using a network architecture that combines feedforward input with lateral connections. Actions are represented by a population of hypothetical action cells with strong Mexican-hat connectivity and are read out at theta frequency. We show that in this architecture, a standard policy gradient rule fails to solve the Morris water maze task, whereas a variant with a Hebbian bias can learn the task within 20 trials, consistent with experiments. This result does not depend on implementation details such as the size of the neuronal populations. Our theoretical approach shows how learning new behaviors can be linked to reward-modulated plasticity at the level of single synapses and makes predictions about the voltage and spike-timing dependence of synaptic plasticity and the influence of neuromodulators such as dopamine. It is an important step towards connecting formal theories of reinforcement learning with neuronal and synaptic properties.

Bibliographic details

Source: PLoS Computational Biology, 2009, Issue 12, 17 pages
Authors: Eleni Vasilaki; Nicolas Frémaux; Robert Urbanczik; Walter Senn; Wulfram Gerstner
Original format: PDF
Chinese Library Classification: Cell biology

Similar literature

1. Policy Derivation Methods for Critic-Only Reinforcement Learning in Continuous Action Spaces [J]. Eduard Alibekov, Jiri Kubalik, Robert Babuska. IFAC PapersOnLine, 2016, Issue 5.
2. Policy derivation methods for critic-only reinforcement learning in continuous spaces [J]. Eduard Alibekov, Jiří Kubalík, Robert Babuška. Engineering Applications of Artificial Intelligence, 2018, Issue MARa.
3. Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation [J]. Zhong Shan, Liu Quan, Zhang Zongzhang. Frontiers of computer science in China, 2019, Issue 1.
4. A comparison of action selection methods for implicit policy method reinforcement learning in continuous action-space [C]. Barry D. Nichols. International Joint Conference on Neural Networks, 2016.
5. Learning control policies from demonstration in continuous sensory and action space [D]. McLeod, Adam M. 2015.
6. Correction: Spike-Based Reinforcement Learning in Continuous State and Action Space: When Policy Gradient Methods Fail [O]. Eleni Vasilaki, Nicolas Frémaux, Robert Urbanczik. 2009.
