A system and a method for multi-modal-fusion-based, fault-tolerant video content recognition are disclosed. The method performs multi-modal recognition on an input video to extract multiple components and their respective appearance times in the video. The components are then categorized, and each category is recognized with a different algorithm. When the recognition confidence of any component is insufficient, the component is cross-validated against other components to raise its recognition confidence and improve fault tolerance. In addition, when the recognition confidence of an individual component is insufficient, recognition continues and the component is tracked, spatially and temporally where applicable, until frames with high recognition confidence are reached within the continuous time period. Finally, multi-modal fusion is performed to summarize the results and resolve any recognition discrepancies among the components, and to generate indices for every time frame to facilitate future text-based queries.
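The cross-validation and indexing steps can be illustrated with a minimal sketch. It is not the disclosed implementation: the `Detection` record, the modality names, the 0.6 threshold, and the noisy-OR rule used to combine confidences are all assumptions chosen for illustration, since the abstract does not specify how confidences are combined or what counts as sufficient.

```python
from collections import defaultdict
from dataclasses import dataclass

THRESHOLD = 0.6  # assumed cut-off for "sufficient" confidence


@dataclass(frozen=True)
class Detection:
    """Hypothetical record for one recognized component."""
    modality: str      # e.g. "face", "speech", "text" (illustrative categories)
    label: str         # recognized identity or keyword
    start: int         # first second in which the component appears
    end: int           # last second in which it appears (inclusive)
    confidence: float  # recognizer score in [0, 1]


def overlaps(a: Detection, b: Detection) -> bool:
    """True when the two appearance intervals share at least one second."""
    return a.start <= b.end and b.start <= a.end


def cross_validate(detections, threshold=THRESHOLD):
    """Raise low confidences that are corroborated by another modality.

    Assumption: agreeing evidence is combined with a noisy-OR rule,
    1 - (1 - a) * (1 - b). The source does not name a specific rule.
    """
    result = []
    for d in detections:
        conf = d.confidence
        if conf < threshold:
            for other in detections:
                same_claim = other.label == d.label and overlaps(d, other)
                if other.modality != d.modality and same_claim:
                    conf = 1.0 - (1.0 - conf) * (1.0 - other.confidence)
        result.append(Detection(d.modality, d.label, d.start, d.end, conf))
    return result


def build_index(detections, threshold=THRESHOLD):
    """Map each second to the labels accepted there, keeping the best score."""
    index = defaultdict(dict)
    for d in detections:
        if d.confidence < threshold:
            continue  # still unconfirmed after cross-validation
        for t in range(d.start, d.end + 1):
            index[t][d.label] = max(d.confidence, index[t].get(d.label, 0.0))
    return dict(index)


detections = [
    Detection("face", "alice", 0, 4, 0.5),    # too weak on its own
    Detection("speech", "alice", 2, 6, 0.7),  # corroborates the face result
    Detection("text", "bob", 1, 2, 0.3),      # uncorroborated, stays rejected
]
index = build_index(cross_validate(detections))
```

In this sketch the weak face detection is accepted only because an overlapping speech detection carries the same label, while the uncorroborated text detection is left out of the index. A text query for a label then reduces to a lookup over the per-second entries of `index`.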