Recently we proposed a new exploration technique for individual reinforcement learners that helps them coordinate on the Pareto Optimal Nash equilibrium of a game. This technique, in which agents may exclude one or more actions from their action space, can be seen as a discrete version of the traditional ε-greedy exploration technique. In this paper we refine this exploration technique further with a standard technique from general search problems, namely random restarts. Due to this refinement, we are able to prove convergence to the Pareto Optimal Nash equilibrium in general stochastic common interest games. Moreover, communication becomes unnecessary. Experiments demonstrate this technique on two challenging test problems and examine its use in larger joint action spaces.
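As a rough illustration of the idea described above, the following Python sketch shows a single learner in a repeated (stateless) game that excludes low-valued actions from its action set and occasionally performs a random restart that re-admits all actions. The class name ExcludingLearner and the parameters exclude_prob and restart_prob are assumptions made for illustration only; this is not the paper's exact algorithm.

```python
import random

class ExcludingLearner:
    """Illustrative sketch: exploration via action exclusion with random restarts.

    Instead of taking a uniformly random action with probability epsilon
    (classic epsilon-greedy), the agent may drop actions from its action
    set; a random restart re-admits all actions.  All names and parameter
    values here are assumptions for illustration.
    """

    def __init__(self, n_actions, alpha=0.1, exclude_prob=0.05, restart_prob=0.01):
        self.q = [0.0] * n_actions            # value estimate per action
        self.active = set(range(n_actions))   # actions not yet excluded
        self.alpha = alpha                    # learning rate
        self.exclude_prob = exclude_prob      # chance of dropping the worst active action
        self.restart_prob = restart_prob      # chance of a random restart

    def select_action(self):
        # Random restart: re-admit every action so the search can leave a local optimum.
        if random.random() < self.restart_prob:
            self.active = set(range(len(self.q)))
        # Discrete analogue of epsilon-greedy: occasionally exclude the
        # worst-valued active action instead of randomising over all actions.
        if len(self.active) > 1 and random.random() < self.exclude_prob:
            worst = min(self.active, key=lambda a: self.q[a])
            self.active.discard(worst)
        # Act greedily over the remaining (non-excluded) actions.
        return max(self.active, key=lambda a: self.q[a])

    def update(self, action, reward):
        # Simple incremental value update for a repeated (stateless) game.
        self.q[action] += self.alpha * (reward - self.q[action])
```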