Journal: Machine Learning

Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning

Abstract

We consider the problem of learning in repeated general-sum matrix games when a learning algorithm can observe the actions but not the payoffs of its associates. Due to the non-stationarity of the environment caused by learning associates in these games, most state-of-the-art algorithms perform poorly in some important repeated games because they are unable to make profitable compromises. To make these compromises, an agent must effectively balance competing objectives, including bounding losses, playing optimally with respect to current beliefs, and taking calculated, but profitable, risks. In this paper, we present, discuss, and analyze M-Qubed, a reinforcement learning algorithm designed to overcome these deficiencies by encoding and balancing best-response, cautious, and optimistic learning biases. We show that M-Qubed learns to make profitable compromises across a wide range of repeated matrix games played with many kinds of learners. Specifically, we prove that M-Qubed's average payoffs meet or exceed its maximin value in the limit. Additionally, we show that, in two-player games, M-Qubed's average payoffs approach the value of the Nash bargaining solution in self-play. Furthermore, it performs very well when associating with other learners, as evidenced by its robust behavior in round-robin and evolutionary tournaments of two-player games. These results demonstrate that an agent can learn to make good compromises, and hence receive high payoffs, in repeated games by effectively encoding and balancing best-response, cautious, and optimistic learning biases.
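The abstract names three learning biases that M-Qubed balances: best-response (play optimally against current beliefs), cautious (bound losses relative to the maximin value), and optimistic (keep value estimates high enough to keep exploring cooperative outcomes). The paper's algorithm itself is not reproduced here; as a minimal sketch of two of these ingredients under standard definitions, the hypothetical Python below computes a matrix game's maximin (security) value by linear programming and implements an optimistically initialized Q-learner over joint-action histories. All names (maximin_value, OptimisticQLearner) are illustrative, not from the paper.

```python
# Illustrative sketch, not the authors' M-Qubed implementation.
import numpy as np
from scipy.optimize import linprog

def maximin_value(R):
    """Security (maximin) value of a matrix game for the row player.

    R[i, j] = row player's payoff when row plays i and column plays j.
    Solves: max_x min_j x . R[:, j]  subject to  x in the probability simplex.
    """
    n, m = R.shape
    # Variables: [x_0 .. x_{n-1}, v]; linprog minimizes, so use cost -v.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # One constraint per column j:  v - x . R[:, j] <= 0.
    A_ub = np.hstack([-R.T, np.ones((m, 1))])
    b_ub = np.zeros(m)
    # Probabilities sum to 1.
    A_eq = np.zeros((1, n + 1))
    A_eq[0, :n] = 1.0
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]  # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]

class OptimisticQLearner:
    """Q-learner whose state is a short joint-action history, with optimistic
    initial Q-values: the optimism bias keeps the agent probing cooperative
    outcomes instead of settling immediately for the myopic best response."""

    def __init__(self, n_actions, q_init, alpha=0.1, gamma=0.95, epsilon=0.05):
        self.n_actions = n_actions
        self.q_init = q_init  # set at or above the highest possible payoff
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.Q = {}           # state (joint-action history) -> Q-value array

    def q(self, state):
        return self.Q.setdefault(state, np.full(self.n_actions, float(self.q_init)))

    def act(self, state, rng):
        if rng.random() < self.epsilon:
            return int(rng.integers(self.n_actions))
        return int(np.argmax(self.q(state)))

    def update(self, state, action, reward, next_state):
        q_sa = self.q(state)
        target = reward + self.gamma * np.max(self.q(next_state))
        q_sa[action] += self.alpha * (target - q_sa[action])
```

For the row payoffs of the prisoner's dilemma, maximin_value(np.array([[3.0, 0.0], [5.0, 1.0]])) returns the pure strategy "defect" with security value 1. A caution bias in the spirit of the abstract would retreat toward this maximin strategy whenever the agent's running average payoff falls too far below that value, which is the mechanism that bounds losses in the limit.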