JMLR: Workshop and Conference Proceedings

Path Consistency Learning in Tsallis Entropy Regularized MDPs


Abstract

We study the sparse entropy-regularized reinforcement learning (ERL) problem in which the entropy term is a special form of the Tsallis entropy. The optimal policy of this formulation is sparse, i.e., at each state, it has non-zero probability for only a small number of actions. This addresses the main drawback of the standard Shannon entropy-regularized RL (soft ERL) formulation, in which the optimal policy is softmax, and thus, may assign a non-negligible probability mass to non-optimal actions. This problem is aggravated as the number of actions is increased. In this paper, we follow the work of Nachum et al. (2017) in the soft ERL setting, and propose a class of novel path consistency learning (PCL) algorithms, called sparse PCL, for the sparse ERL problem that can work with both on-policy and off-policy data. We first derive a sparse consistency equation that specifies a relationship between the optimal value function and policy of the sparse ERL along any system trajectory. Crucially, a weak form of the converse is also true, and we quantify the sub-optimality of a policy which satisfies sparse consistency, and show that as we increase the number of actions, this sub-optimality is better than that of the soft ERL optimal policy. We then use this result to derive the sparse PCL algorithms. We empirically compare sparse PCL with its soft counterpart, and show its advantage, especially in problems with a large number of actions.
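For intuition about the sparsity the abstract refers to, below is a minimal, self-contained sketch (not taken from the paper): with the entropic-index-2 Tsallis entropy, the one-step regularized greedy policy is the sparsemax projection of the (scaled) Q-values onto the probability simplex, which assigns exactly zero probability to low-value actions, whereas the softmax policy of soft ERL never does. The Q-values, the regularization weight lam, and the helper names are illustrative assumptions, not quantities from the paper.

    import numpy as np

    def sparsemax(z):
        # Euclidean projection of z onto the probability simplex
        # (Martins & Astudillo, 2016); entries below the threshold become exactly zero.
        z_sorted = np.sort(z)[::-1]
        cumsum = np.cumsum(z_sorted)
        k = np.arange(1, z.size + 1)
        support = 1.0 + k * z_sorted > cumsum
        k_z = k[support][-1]
        tau = (cumsum[support][-1] - 1.0) / k_z
        return np.maximum(z - tau, 0.0)

    def tsallis_entropy(p):
        # Tsallis entropy with entropic index 2: H(p) = 0.5 * (1 - sum_a p(a)^2)
        # (scaling conventions vary; this is one common form).
        return 0.5 * (1.0 - np.sum(p ** 2))

    q = np.array([2.0, 1.9, 0.3, 0.1, -0.5])   # hypothetical Q-values for 5 actions
    lam = 1.0                                   # assumed regularization weight
    pi_sparse = sparsemax(q / lam)              # -> [0.55, 0.45, 0.0, 0.0, 0.0]
    pi_soft = np.exp(q / lam) / np.sum(np.exp(q / lam))  # softmax: every action > 0
    print(pi_sparse, tsallis_entropy(pi_sparse), pi_soft)

In this toy example only the two best actions receive probability mass under the sparse policy, while softmax spreads mass over all five; as the number of actions grows, that spread is exactly the drawback of soft ERL that the paper targets.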
