PROBLEM TO BE SOLVED: To learn a decision-making model while suppressing an unnecessary increase in mixing time.

SOLUTION: A technique for updating a parameter that defines a policy (policy parameter) in an environment modeled as a Markov decision process includes updating the policy parameter according to an update equation. The update equation includes a term that decreases a weighted sum (weighted expected hitting time sum), taken over a first state (s) and a second state (s'), of a statistic (expected hitting time function) of the number of steps (hitting time) required to first reach the second state (s') from the first state (s).
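
To make the penalized quantity concrete, the following is a minimal sketch, not the claimed implementation. It assumes that a fixed policy induces a finite Markov chain with a row-stochastic transition matrix `p`, takes the statistic to be the plain expected hitting time, and uses an arbitrary weight matrix `w`. The function names and the linear-system formulation are illustrative assumptions; the abstract does not specify how the statistic or its gradient is computed.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: a target state may be unreachable")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def expected_hitting_times(p):
    """Return h with h[s][t] = expected steps to first reach t from s.

    For each target t, h[t][t] = 0 and, for s != t,
    h[s][t] = 1 + sum over s2 != t of p[s][s2] * h[s2][t].
    """
    n = len(p)
    h = [[0.0] * n for _ in range(n)]
    for t in range(n):
        others = [s for s in range(n) if s != t]
        # (I - P restricted to non-target states) h = 1
        a = [[(1.0 if i == j else 0.0) - p[i][j] for j in others] for i in others]
        solution = solve_linear(a, [1.0] * len(others))
        for s, value in zip(others, solution):
            h[s][t] = value
    return h


def weighted_hitting_time_sum(p, w):
    """Weighted sum over all state pairs (s, t) of the expected hitting time."""
    h = expected_hitting_times(p)
    n = len(p)
    return sum(w[s][t] * h[s][t] for s in range(n) for t in range(n))
```

Under these assumptions, a policy update in the spirit of the abstract would combine the usual learning step with a step that moves the policy parameter in a direction that lowers `weighted_hitting_time_sum` for the chain induced by the policy. How that direction is obtained is left open here.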