An object of the present invention is to obtain the states or the state transition rule needed for selecting an action, even in an environment where the states or the state transition rule are unknown. A state acquisition unit 210 acquires the state of the environment after a selected action has been performed. A reward calculation unit calculates, based on the acquired state and the selected action, the reward for performing that action. A parameter updating unit 240 updates, based on the selected action and the reward, the parameters of a model that takes a state as input and selects an action. An action selecting unit 270 uses the model, with the post-action state as input, to select the next action. The acquisition, calculation, updating, and selection are repeated until an iteration end condition is satisfied. The state acquisition unit 210 compares each acquired state with the set of states already acquired and, if the state is new, adds it to the set of states; the state transition rule is acquired based on the set of states. [Selected figure] Figure 2
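The loop described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the source: the model is taken to be a tabular Q-function, the environment (`ChainEnv`) is a hypothetical five-position line with deterministic moves, the reward function and the fixed step budget used as the iteration end condition are invented, and the transition rule is simply recorded as observed `(state, action) -> next state` pairs over the discovered state set.

```python
import random
from collections import defaultdict


class ChainEnv:
    """Hypothetical toy environment: positions 0..length-1 on a line.
    The agent is not told in advance which states exist."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); moves are clamped at both ends
        self.pos = max(0, min(self.length - 1, self.pos + action))
        return self.pos


def calc_reward(state, action, goal=4):
    # Assumed reward: 1 when the post-action state is the goal, else 0
    return 1.0 if state == goal else 0.0


def run(env, actions=(-1, 1), max_steps=500,
        alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)   # model parameters: Q[(state, action)]
    known_states = set()     # set of states acquired so far
    transitions = {}         # observed rule: (state, action) -> next state

    def select_action(state):
        # Epsilon-greedy selection with random tie-breaking
        if rng.random() < epsilon:
            return rng.choice(actions)
        best = max(q[(state, a)] for a in actions)
        return rng.choice([a for a in actions if q[(state, a)] == best])

    state = env.pos
    known_states.add(state)
    action = select_action(state)
    for _ in range(max_steps):  # assumed end condition: fixed step budget
        next_state = env.step(action)            # state acquisition
        if next_state not in known_states:       # new state: extend the set
            known_states.add(next_state)
        transitions[(state, action)] = next_state
        reward = calc_reward(next_state, action)  # reward calculation
        # Parameter update (standard Q-learning rule, assumed here)
        best_next = max(q[(next_state, a)] for a in actions)
        q[(state, action)] += alpha * (
            reward + gamma * best_next - q[(state, action)]
        )
        state = next_state
        action = select_action(state)             # action selection
    return known_states, transitions, q
```

Calling `run(ChainEnv())` returns the discovered state set, the recorded transitions, and the learned Q-table. A stochastic environment would need transition counts or probabilities rather than the single-successor table assumed here.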