Point-based value iteration methods are a family of effective algorithms for solving POMDP models and their performance mainly depends on the exploration of the search space. Although global optimization can be obtained by algorithms such as HSVI and GapMin, their exploration of the optimal action is overly optimistic which therefore slows down the efficiency. In this paper, we propose a novel heuristic search method POPVI (Probability-based Optimal Policy Value Iteration) which explores the optimal action based on probability. In depth-first heuristic exploration, this algorithm uses a Monte-Carlo method to estimate the probabilities that actions are optimal according to the distribution of actions' Q-value function, applies the action of the maximum probability and greedily explores subsequent belief point of the greatest uncertainty. Experimental results show that POPVI outperforms HSVI, and by a large margin when the scale of the POMDP increases.
展开▼