Most contemporary work in deep reinforcement learning (DRL) trains agents in simulated environments. Simulated environments are not only fast and inexpensive but also 'safe'. By contrast, training in a real-world environment (using robots, for example) is slow and costly, and actions can cause irreversible damage to the environment or to the agent (robot) itself. In this paper, we take advantage of the inherent safety of computer simulation by extending the Deep Q-Network (DQN) algorithm with the ability to measure and take risk. In essence, we propose a novel DRL algorithm that encourages risk-seeking behaviour to enhance information acquisition during training. We demonstrate the merit of this exploration heuristic by (i) arguing that our risk estimator implicitly captures both parametric uncertainty and the inherent uncertainty of the environment, which are propagated back through the temporal-difference (TD) error across many time steps, and (ii) evaluating our method on three games in the Atari domain, showing that the technique works well on Montezuma's Revenge, a game that epitomises the challenge of sparse rewards.
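To make the idea of risk-seeking exploration concrete, the following is a minimal tabular sketch, not the algorithm proposed in the paper. It assumes one simple, well-known way of estimating risk: a second table that accumulates squared TD errors and bootstraps from the successor state, so that uncertainty is propagated back across time steps in the same way as value. The class name `RiskSeekingQ`, the table `u`, the weight `beta`, and the action rule `Q + beta * sqrt(U)` are all illustrative assumptions; the actual method extends DQN with a neural network rather than tables.

```python
import math


class RiskSeekingQ:
    """Tabular Q-learning with a TD-propagated risk estimate (illustrative only)."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, beta=1.0):
        self.alpha = alpha  # learning rate, assumed to lie in [0, 1]
        self.gamma = gamma  # discount factor
        self.beta = beta    # hypothetical risk-seeking weight (beta > 0 favours risk)
        self.q = [[0.0] * n_actions for _ in range(n_states)]  # value estimates
        self.u = [[0.0] * n_actions for _ in range(n_states)]  # risk estimates

    @staticmethod
    def _argmax(values):
        return max(range(len(values)), key=values.__getitem__)

    def act(self, state):
        # Risk-seeking choice: prefer actions whose value is high OR uncertain.
        scores = [q + self.beta * math.sqrt(u)
                  for q, u in zip(self.q[state], self.u[state])]
        return self._argmax(scores)

    def update(self, state, action, reward, next_state, done):
        if done:
            next_q, next_u = 0.0, 0.0
        else:
            a_next = self._argmax(self.q[next_state])
            next_q = self.q[next_state][a_next]
            next_u = self.u[next_state][a_next]

        # Standard Q-learning TD error.
        delta = reward + self.gamma * next_q - self.q[state][action]
        self.q[state][action] += self.alpha * delta

        # Risk target: squared TD error plus discounted successor risk.
        # With alpha in [0, 1] this keeps u non-negative, so sqrt() is safe.
        risk_target = delta ** 2 + self.gamma ** 2 * next_u
        self.u[state][action] += self.alpha * (risk_target - self.u[state][action])
        return delta


agent = RiskSeekingQ(n_states=2, n_actions=2, alpha=0.5, gamma=0.9, beta=1.0)
agent.update(state=0, action=1, reward=1.0, next_state=1, done=True)
agent.update(state=1, action=0, reward=0.0, next_state=0, done=False)
```

In this sketch the squared TD error mixes noise from the environment with error in the current value estimates, which loosely mirrors the claim that a single risk estimate can carry both kinds of uncertainty. Setting `beta` to zero recovers greedy Q-learning.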