A machine learning device (2) that learns an operating program of a robot (1), comprising: a state observation unit (21) that observes, as a state variable, a wobbling of an arm (11) of the robot (1) and/or a length of an operating trajectory of the arm (11) of the robot (1); a determination data acquisition unit (22) that acquires, as determination data, a cycle time in which the robot (1) performs processing; and a learning unit (23) that learns the operating program of the robot (1) based on an output of the state observation unit (21) and an output of the determination data acquisition unit (22), the learning unit (23) comprising: a reward calculation unit (231) that calculates a reward based on the output of the state observation unit (21) and the output of the determination data acquisition unit (22); and a value function update unit (232) that updates a value function that determines a value of the operating program of the robot (1) based on the output of the state observation unit (21), the output of the determination data acquisition unit (22), and an output of the reward calculation unit (231), wherein the reward calculation unit (231) sets a negative reward when the cycle time is longer than a previously obtained cycle time and a positive reward when the cycle time is shorter than the previously obtained cycle time.
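The reward rule and value function update recited above can be illustrated with a minimal sketch. The claim fixes only the sign of the reward with respect to the cycle time, so everything else here is an assumption made for illustration: the names `cycle_time_reward` and `ValueFunctionUpdater`, the reward magnitude of 1.0, the zero reward for an unchanged cycle time, and the use of a tabular Q-learning update with learning rate `alpha` and discount `gamma`. The contribution of the arm wobbling and trajectory length to the reward is not modeled.

```python
from collections import defaultdict


def cycle_time_reward(cycle_time, previous_cycle_time, magnitude=1.0):
    """Sign-based reward on cycle time (magnitude is an assumed value)."""
    if cycle_time > previous_cycle_time:
        return -magnitude  # slower than before: negative reward
    if cycle_time < previous_cycle_time:
        return magnitude   # faster than before: positive reward
    return 0.0             # unchanged: not specified by the claim, assumed neutral


class ValueFunctionUpdater:
    """Hypothetical tabular Q-learning stand-in for the value function update unit."""

    def __init__(self, alpha=0.1, gamma=0.9):
        self.alpha = alpha           # learning rate (assumed)
        self.gamma = gamma           # discount factor (assumed)
        self.q = defaultdict(float)  # value of each (state, action) pair, default 0.0

    def update(self, state, action, reward, next_state, next_actions):
        # Best estimated value reachable from the next state (0.0 if no actions).
        best_next = max(
            (self.q[(next_state, a)] for a in next_actions), default=0.0
        )
        key = (state, action)
        # Standard Q-learning step toward reward + discounted best next value.
        self.q[key] += self.alpha * (reward + self.gamma * best_next - self.q[key])
        return self.q[key]


if __name__ == "__main__":
    updater = ValueFunctionUpdater()
    r = cycle_time_reward(cycle_time=8.0, previous_cycle_time=10.0)
    print(r, updater.update("s0", "a0", r, "s1", ["a0", "a1"]))
```

Here a state would encode the observed wobbling and trajectory length, and an action would correspond to a candidate modification of the operating program; both are represented by opaque labels in this sketch.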