Monte-Carlo Tree Search (MCTS) has achieved great success in combinatorial games, which are characterized by finite state-action spaces, deterministic state transitions, and sparse rewards. AlphaGo, which combined MCTS with deep neural networks, defeated the world champion Lee Sedol at Go, demonstrating the advantage of tree search in combinatorial games with enormous search spaces. However, when the search space is continuous, and especially when chance factors are involved, tree search methods such as UCT fail: each state is revisited with probability zero, so the information stored in the tree is never reused, and the UCT algorithm degrades to plain Monte Carlo rollouts. Moreover, previous exploration experience cannot be used to correct subsequent tree searches, which greatly increases the demand for computing resources. To address this problem, this paper proposes a step-by-step Reverse Curriculum Learning with Truncated Tree Search method (RevCuT Tree Search). To retain previous exploration experience, we use a deep neural network to learn the state-action values at explored states and then use them to guide the next tree search. In addition, to limit computational cost, we build a truncated search tree over the continuous state space rather than over the whole trajectory. This method effectively reduces the number of explorations required and achieves performance beyond the human level in our purpose-built single-player game with a continuous state space and probabilistic state transitions.
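The degradation argument can be illustrated with a minimal sketch (hypothetical code, not the paper's implementation): if tree nodes are keyed by exact states and transitions are continuous and stochastic, sampled successor states essentially never repeat, so the visit statistics UCT relies on are never reused.

```python
import random

# Illustrative sketch: with a continuous, probabilistic transition,
# a search tree keyed by exact states is almost never revisited,
# so stored UCT statistics go unused and the search degrades to
# independent Monte Carlo rollouts.
random.seed(0)

def step(state, action):
    # Hypothetical continuous stochastic transition: Gaussian noise.
    return state + action + random.gauss(0.0, 1.0)

tree = {}          # state -> visit count (a degenerate "tree" of root children)
revisits = 0
for _ in range(10_000):
    s = step(0.0, 1.0)              # sample a successor of the root
    if s in tree:
        revisits += 1               # an exactly repeated continuous state
    tree[s] = tree.get(s, 0) + 1

print(revisits)                     # stays 0 with probability one
```

Every node ends up with a visit count of 1, so the UCB selection rule never has past statistics to exploit; this is the situation the proposed method addresses by generalizing across states with a neural value function instead of relying on exact-state revisits.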