Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, improving the best published result in terms of CIDEr score from 114.7 to 117.9 and BLEU-4 from 35.2 to 36.9. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
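The division of labour described above can be illustrated with a minimal sketch: a bottom-up stage is assumed to have already produced one feature vector per image region, and a top-down stage turns a task-dependent query into a normalized weight per region. The dot-product scoring, the function names `softmax` and `attend`, and the toy values are illustrative assumptions, not the formulation or the Faster R-CNN features used in this work.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, query):
    """Weight region features by their relevance to a top-down query.

    region_features: list of k feature vectors, one per proposed region
                     (assumed to come from a bottom-up detector).
    query:           task context vector of the same dimension
                     (hypothetical; e.g. a captioning or question state).
    Returns the per-region weights and the attended (weighted-sum) feature.
    """
    # Assumption: plain dot-product scoring stands in for a learned scorer.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, attended


# Toy example: three regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, attended = attend(regions, query)
```

In this toy setting the first and third regions align with the query equally and receive equal weight, while the second region is down-weighted; the attended feature is the resulting weighted average over regions rather than over a uniform grid of image locations.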