A key task in understanding an image and its corresponding caption is not only to determine what is shown in the picture and described in the text, but also the exact relationship between these two elements. The long-term objective of our work is to be able to distinguish different types of relationships, including literal vs. non-literal usages, as well as fine-grained non-literal usages (i.e., symbolic vs. iconic). Here, we approach this challenging problem by answering the question: 'How can we quantify the degree of similarity between the literal meanings expressed within images and their captions?'. We formulate this problem as a ranking task, where links between entities and potential regions are created and ranked for relevance. Using a Ranking SVM allows us to leverage the preference ordering of the links, which helps with the similarity calculation in cases of visual or textual ambiguity, as well as misclassified data. Our experiments show that aggregating different features using a supervised ranker achieves better results than a baseline knowledge-base method. However, much work still lies ahead, and we accordingly conclude the paper with a detailed discussion of a short- and long-term outlook on how to push our work on relationship classification one step further.
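To make the ranking formulation concrete, the following is a minimal sketch of the standard pairwise Ranking SVM reduction (in the style of Joachims), applied to scoring entity-region links. It is not the paper's implementation: the feature values, link set, and preference pairs below are hypothetical placeholders, and the reduction to a binary linear SVM over feature-vector differences is the textbook construction, assumed here for illustration.

```python
# Sketch: Ranking SVM via pairwise reduction to binary classification.
# Each candidate entity-region link gets a feature vector; preference
# pairs (i, j) say "link i should rank above link j". Training examples
# are the differences x_i - x_j with label +1 (and x_j - x_i with -1).
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical feature vectors for three candidate entity-region links
# (e.g., visual similarity, spatial overlap, label match -- placeholders).
links = np.array([
    [0.9, 0.2, 0.7],  # link A
    [0.4, 0.8, 0.1],  # link B
    [0.1, 0.3, 0.5],  # link C
])

# Hypothetical preference ordering: A above B, A above C, B above C.
preferences = [(0, 1), (0, 2), (1, 2)]

# Build the pairwise difference dataset.
X, y = [], []
for i, j in preferences:
    X.append(links[i] - links[j]); y.append(+1)
    X.append(links[j] - links[i]); y.append(-1)

model = LinearSVC().fit(np.array(X), np.array(y))

# The learned weight vector induces a relevance score for any link,
# so new links can be ranked by their projection onto it.
scores = links @ model.coef_.ravel()
print(np.argsort(-scores))  # link indices from most to least relevant
```

Because the learned model is a single weight vector, scoring unseen links is a dot product, which is what makes this reduction a convenient way to aggregate heterogeneous features into one relevance ordering.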