A digital medium environment includes an action processing application that performs actions such as personalized recommendation. A learning algorithm operates on a sample-by-sample basis (e.g., each time a user visits a web page) and recommends either an optimistic action, such as an action found by maximizing an expected reward, or a base action, such as an action from a baseline policy with known expected reward, subject to a safety constraint. The safety constraint requires that the expected performance of playing optimistic actions be at least a predetermined percentage of the known performance of playing base actions. Thus, the learning algorithm is conservative during the exploratory early stages of learning and does not play unsafe actions. Furthermore, because the learning algorithm is online and learns from each sample, it converges quickly and tracks time-varying parameters better than learning algorithms that learn on a block basis.
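The per-sample choice between an optimistic action and a base action can be illustrated with a minimal sketch. It is one plausible reading of the safety constraint, not the claimed implementation. The function name `choose_action` and all of its parameters are hypothetical. The sketch assumes that a lower confidence bound on the cumulative expected reward of the optimistic actions (those already played plus the current candidate) is computed elsewhere.

```python
def choose_action(optimistic_lower_bound, base_reward, base_plays,
                  total_rounds, fraction):
    """Pick "optimistic" only if the safety constraint would still hold.

    All names here are illustrative assumptions:
      optimistic_lower_bound: pessimistic estimate of the cumulative expected
          reward of every optimistic action played so far plus the candidate.
      base_reward: known expected reward of one base action.
      base_plays: number of earlier rounds in which the base action was played.
      total_rounds: number of rounds including the current one.
      fraction: predetermined percentage, expressed in [0, 1].
    """
    # Worst-case cumulative reward if the optimistic action is played now.
    guaranteed = optimistic_lower_bound + base_plays * base_reward
    # Reward the baseline policy alone would have earned, scaled by fraction.
    required = fraction * total_rounds * base_reward
    return "optimistic" if guaranteed >= required else "base"


if __name__ == "__main__":
    # Early on, little is known, so the conservative choice is the base action.
    print(choose_action(0.0, 1.0, base_plays=0, total_rounds=1, fraction=0.9))
    # After enough base plays, there is slack to explore safely.
    print(choose_action(0.0, 1.0, base_plays=10, total_rounds=11, fraction=0.9))
```

Under these assumptions, each base play accumulates slack against the constraint, which is why the sketch leans on the base action in early rounds and plays optimistic actions more often as the lower bound tightens.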