Long-term temporal interactions among objects are an important cue for video understanding. To capture such object relations, we propose a novel method for spatiotemporal video segmentation based on dense trajectory clustering that remains effective when objects articulate. We use superpixels of homogeneous size, together with optical flow information, to ease the matching of regions from one frame to the next. Our second main contribution is a hierarchical fusion algorithm that yields segmentations at multiple linked scales. We evaluate the algorithm on several web videos exhibiting a wide variety of challenges.
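The core idea of matching regions across frames with optical flow can be sketched as follows. This is a minimal illustrative example, not the paper's actual pipeline: the function name, the nearest-pixel flow warping, and the majority-overlap vote are all assumptions made for the sake of a self-contained demo.

```python
import numpy as np

def match_superpixels(labels_t, labels_t1, flow):
    """Match each superpixel in frame t to a superpixel in frame t+1.

    labels_t, labels_t1 : (H, W) integer label maps of the two frames.
    flow               : (H, W, 2) dense optical flow (dx, dy) from t to t+1.

    Each pixel of frame t is propagated along the flow (hypothetical
    nearest-pixel warping); a region is matched to the frame-t+1 label
    that the majority of its propagated pixels land on.
    """
    h, w = labels_t.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Follow the flow: where does each pixel of frame t land in frame t+1?
    xt = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    matches = {}
    for sp in np.unique(labels_t):
        mask = labels_t == sp
        # Frame-t+1 labels hit by this superpixel's propagated pixels.
        hit = labels_t1[yt[mask], xt[mask]]
        vals, counts = np.unique(hit, return_counts=True)
        matches[int(sp)] = int(vals[np.argmax(counts)])
    return matches
```

Superpixels of roughly homogeneous size help here because each region then contributes a comparable number of flow votes, so no single large region dominates the matching.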