Image description generation, or image captioning (IC), is the task of automatically generating a textual description for a given image. The generated text is expected to describe, generally in a single sentence, what is visually depicted in the image: the entities/objects present, their attributes, the actions/activities performed, entity/object interactions (including quantification), the location/scene, etc. (e.g. "a man riding a bike on the street"). Significant progress has been made with end-to-end approaches to this problem, in which parallel image-description datasets such as Flickr30k (Young et al., 2014) and MSCOCO (Chen et al., 2015) are used to train a CNN-RNN based neural network IC system (Vinyals et al., 2017; Karpathy and Fei-Fei, 2015; Xu et al., 2015). Such systems have demonstrated impressive performance in the COCO captioning challenge according to automatic metrics, seemingly even surpassing human performance in many instances (e.g. CIDEr score > 1.0 vs. humans' 0.85) (Chen et al., 2015). In reality, however, the performance of end-to-end systems remains far from satisfactory when judged by humans, so the task is far from being a solved problem.