Despite the availability of a huge amount of video data accompanied by descriptive texts, it is not always easy to exploit the information contained in natural language in order to automatically recognize video concepts. Towards this goal, in this paper we use textual cues as a means of supervision, introducing two weakly supervised techniques that extend the Multiple Instance Learning (MIL) framework: Fuzzy Sets Multiple Instance Learning (FSMIL) and Probabilistic Labels Multiple Instance Learning (PLMIL). The former encodes the spatio-temporal imprecision of the linguistic descriptions with Fuzzy Sets, while the latter models different interpretations of each description's semantics with Probabilistic Labels; both are formulated through a convex optimization algorithm. In addition, we provide a novel technique, based on semantic similarity computations, for extracting weak labels in the presence of complex semantics. We evaluate our methods on two distinct problems, namely face and action recognition, in the challenging and realistic setting of movies accompanied by their screenplays, contained in the COGNIMUSE database. We show that, on both tasks, our method considerably outperforms a state-of-the-art weakly supervised approach, as well as other baselines.
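To make the weak-label extraction step concrete: the abstract does not specify how the semantic similarity is computed, so the sketch below is only an illustrative toy, using bag-of-words cosine similarity between a screenplay description and a small concept vocabulary (the function names, the example concepts, and the threshold value are all hypothetical; the actual method presumably uses a richer semantic similarity measure).

```python
from collections import Counter
from math import sqrt

def cosine_sim(a: str, b: str) -> float:
    """Toy lexical similarity: cosine over bag-of-words term counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def weak_labels(description: str, concepts: list[str], threshold: float = 0.2) -> list[str]:
    """Assign every concept whose similarity to the description clears the
    threshold -- a (possibly noisy) weak label set, in the spirit of the paper."""
    return [c for c in concepts if cosine_sim(description, c) >= threshold]
```

For example, `weak_labels("John walks to the door", ["walk to door", "sit down"])` keeps only the first concept: the overlapping words "to" and "door" push its similarity above the threshold, while "sit down" shares no words with the description.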