We present a novel approach for image semantic segmentation of street scenes into coherent regions, while simultaneously categorizing each region as one of the predefined categories representing commonly encountered object and background classes. We formulate the segmentation on small blob-based superpixels and exploit a visual vocabulary tree as an intermediate image representation. The main novelty of this generative approach is the introduction of an explicit model of spatial co-occurrence of visual words associated with super-pixels and utilization of appearance, geometry and contextual cues in a probabilistic framework We demonstrate how individual cues contribute towards global segmentation accuracy and how their combination yields superior performance to the best known method on the challenging benchmark dataset which exhibits diversity of street scenes with varying viewpoints, large number of categories, captured in daylight and dusk.
展开▼