In Computational Linguistics, work on understanding or generating language has been based primarily on textual information. However, when we humans process a text, be it written or spoken, we also take into account cues from the context in which the text appears, in addition to our background and common-sense knowledge. This is also the case when we translate text. For example, a news article will often contain images and may also contain a short video and/or audio clip. Users of social media often post photos and videos accompanied by short textual descriptions. This additional information can help minimise ambiguities and elicit unknown words. In this talk I will introduce a recent area of research that addresses the automatic translation of texts using rich context models that incorporate multimodal information, focusing on visual cues from images. I will cover some of our recent work analysing how humans perform translation in the presence or absence of visual cues, and then move on to the datasets and computational models proposed for this problem.