Methods and systems for detecting visual relations in a video are disclosed. A method comprises: decomposing the video sequence into a plurality of segments; for each segment, detecting objects in frames of the segment; tracking the detected objects over the segment to form a set of object tracklets for the segment; for the detected objects, extracting object features; for pairs of object tracklets of the set of object tracklets, extracting relativity features indicative of a relation between the objects corresponding to the pair of object tracklets; forming relation feature vectors for pairs of object tracklets using the object features of objects corresponding to respective pairs of object tracklets and the relativity features of the respective pairs of object tracklets; and generating a set of segment relation prediction results from the relation features vectors; generating a set of visual relation instances for the video sequence by merging the segment prediction results from different segments; and generating a set of visual relation detection results from the set of visual relation instances.
展开▼