Due to the rise of deep learning, reasoning across various domains, such as vision, language, robotics, and control, has seen major progress in recent years. A popular benchmark for evaluating models for visual reasoning is Visual Question Answering (VQA), which aims at answering questions about a given input image by joining the two modalities: (1) the text representing the question and (2) the visual information extracted from the input image. In this work, we propose a structured approach for VQA that is based on dynamic graphs learned automatically from the input. Unlike the common approach for VQA, which relies on an attention mechanism applied to a cell-structured global embedding of the image, our model leverages the rich structure of the image captured by the object instances and their interactions. In our model, nodes in the graph correspond to object instances present in the image, while the edges represent relations among them. Our model automatically constructs the scene graph and attends to the relations among the nodes to answer the given question. Hence, our model can be trained end-to-end and does not require additional training labels in the form of predefined graphs or relations. We demonstrate the effectiveness of our approach on the challenging open-ended Visual Genome benchmark for VQA.
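To make the described pipeline concrete, the following is a minimal sketch (not the authors' implementation) of the core idea: detected object instances become graph nodes, a soft adjacency over node pairs is learned from the input itself, and a question embedding guides attention over the resulting node states before an open-ended answer is predicted. All module names, dimensions, and the single round of message passing are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of question-conditioned graph learning for VQA.
# Assumed inputs: per-object detector features and a question embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphVQASketch(nn.Module):
    def __init__(self, obj_dim=2048, q_dim=1024, hid=512, n_answers=3000):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, hid)      # object features -> node states
        self.q_proj = nn.Linear(q_dim, hid)          # question embedding -> hidden space
        self.edge_score = nn.Linear(2 * hid, 1)      # pairwise edge logits (learned graph)
        self.att = nn.Linear(2 * hid, 1)             # question-guided node attention
        self.classifier = nn.Linear(hid, n_answers)  # open-ended answer classifier

    def forward(self, obj_feats, q_emb):
        # obj_feats: (B, N, obj_dim) detected object instances; q_emb: (B, q_dim)
        nodes = torch.relu(self.obj_proj(obj_feats))           # (B, N, hid)
        B, N, H = nodes.shape
        # Build a dense, soft adjacency from pairwise node states:
        # the graph structure is learned, not given as a label.
        pair = torch.cat([nodes.unsqueeze(2).expand(B, N, N, H),
                          nodes.unsqueeze(1).expand(B, N, N, H)], dim=-1)
        adj = F.softmax(self.edge_score(pair).squeeze(-1), dim=-1)  # (B, N, N)
        # One round of message passing along the learned relations.
        nodes = nodes + torch.bmm(adj, nodes)
        # Attend to node states conditioned on the question, then classify.
        q = torch.relu(self.q_proj(q_emb)).unsqueeze(1).expand(B, N, H)
        a = F.softmax(self.att(torch.cat([nodes, q], dim=-1)).squeeze(-1), dim=-1)
        pooled = (a.unsqueeze(-1) * nodes).sum(dim=1)          # (B, hid)
        return self.classifier(pooled)

# Example: a batch of 2 images with 36 detected objects each.
logits = GraphVQASketch()(torch.randn(2, 36, 2048), torch.randn(2, 1024))
```

Because both the edge scores and the attention weights are produced by differentiable layers, the whole sketch trains end-to-end from question-answer supervision alone, mirroring the paper's claim that no predefined graphs or relation labels are required.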