A system is configured to label computer vision datasets by tracking the eyes of users as they follow objects depicted in imagery. The imagery may include moving images (e.g., video) or still images. By using eye tracking, users may be able to label large amounts of imagery more efficiently than by labeling it manually with conventional input devices. A user may be instructed to watch a particular object during playback of a video while an imaging device determines the direction of the user's gaze, which correlates with a location in the imagery. An application may then associate the gaze-derived location with the location of the object, either on a frame-by-frame basis or for certain frames.
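The association of gaze locations with frames described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `GazeSample` type, the `label_frames` function, the normalized gaze coordinates, and the choice to average all samples that fall within one frame interval are all hypothetical and do not come from the source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GazeSample:
    """One eye-tracker reading (hypothetical format, not from the source)."""
    t: float  # seconds since the start of playback
    x: float  # horizontal gaze position, normalized to [0, 1]
    y: float  # vertical gaze position, normalized to [0, 1]


def label_frames(samples, fps, width, height):
    """Map gaze samples to per-frame object locations in pixels.

    Returns {frame_index: (x_px, y_px)}. Frames with no usable gaze
    sample are omitted, so only certain frames may receive a label.
    """
    buckets = defaultdict(list)
    for s in samples:
        # Skip readings taken before playback or with the gaze off-screen.
        if s.t < 0 or not (0.0 <= s.x <= 1.0 and 0.0 <= s.y <= 1.0):
            continue
        # Assumption: frame i is displayed during [i / fps, (i + 1) / fps).
        buckets[int(s.t * fps)].append(s)

    labels = {}
    for frame_index, group in sorted(buckets.items()):
        # Assumption: average the samples in a frame to reduce tracker jitter.
        mean_x = sum(g.x for g in group) / len(group)
        mean_y = sum(g.y for g in group) / len(group)
        labels[frame_index] = (
            round(mean_x * (width - 1)),
            round(mean_y * (height - 1)),
        )
    return labels


if __name__ == "__main__":
    demo = [
        GazeSample(t=0.00, x=0.5, y=0.5),
        GazeSample(t=0.05, x=0.7, y=0.5),
        GazeSample(t=0.25, x=0.0, y=1.0),
        GazeSample(t=0.35, x=1.2, y=0.5),  # off-screen, ignored
    ]
    print(label_frames(demo, fps=10, width=101, height=51))
```

In this sketch, a frame that receives no on-screen gaze sample is simply left unlabeled, which corresponds to labeling only certain frames rather than every frame.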