Multi-Scale Relation Reasoning for Multi-Modal Visual Question Answering

Abstract

The goal of Visual Question Answering (VQA) is to answer questions about images. For the same image, questions of completely different types can be asked. The main difficulty of the VQA task therefore lies in how to properly reason about relationships among multiple visual objects according to the type of input question. To address this difficulty, this paper proposes a deep neural network that performs multi-modal relation reasoning at multiple scales, built around a regional attention scheme that focuses on informative, question-related regions for better answering. Specifically, we first design a regional attention scheme that selects regions of interest based on informativeness scores computed by a question-guided soft attention module. Afterwards, the features produced by the regional attention scheme are fused in scaled combinations, generating more distinctive features that carry information at multiple scales. Owing to the regional attention and multi-scale designs, the proposed method is capable of describing scaled relationships between multi-modal inputs and offering accurate question-guided answers. Experiments on the VQA v1 and VQA v2 datasets show that the proposed method compares favorably in efficiency with most existing methods.
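To make the pipeline concrete, below is a minimal PyTorch sketch of the two ideas the abstract describes: question-guided soft attention that scores and selects informative regions, followed by fusion of the selected region features in scaled combinations. All class names, dimensions, the top-k selection, and the pooling scales are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedAttention(nn.Module):
    """Soft attention over image regions, guided by a question embedding.

    Hypothetical sketch: dimensions, layer names, and the top-k region
    selection are assumptions, not the paper's exact design.
    """

    def __init__(self, region_dim=2048, question_dim=1024, hidden_dim=512, k=8):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden_dim)    # project region features
        self.proj_q = nn.Linear(question_dim, hidden_dim)  # project question feature
        self.score = nn.Linear(hidden_dim, 1)              # informativeness per region
        self.k = k

    def forward(self, regions, question):
        # regions: (B, N, region_dim); question: (B, question_dim)
        joint = torch.tanh(self.proj_v(regions) + self.proj_q(question).unsqueeze(1))
        weights = F.softmax(self.score(joint).squeeze(-1), dim=1)  # (B, N)
        # keep the k most informative, question-related regions (sorted by weight)
        top_w, idx = weights.topk(self.k, dim=1)
        idx = idx.unsqueeze(-1).expand(-1, -1, regions.size(-1))
        selected = regions.gather(1, idx)                  # (B, k, region_dim)
        return selected * top_w.unsqueeze(-1), weights


class MultiScaleFusion(nn.Module):
    """Fuse the attended regions in scaled combinations (assumed here:
    mean-pool the top-1, top-4, and top-8 regions, then mix linearly)."""

    def __init__(self, region_dim=2048, scales=(1, 4, 8)):
        super().__init__()
        self.scales = scales
        self.fuse = nn.Linear(region_dim * len(scales), region_dim)

    def forward(self, selected):
        # selected: (B, k, region_dim), ordered by attention weight
        pooled = [selected[:, :s].mean(dim=1) for s in self.scales]
        return self.fuse(torch.cat(pooled, dim=-1))        # (B, region_dim)


if __name__ == "__main__":
    regions = torch.randn(2, 36, 2048)   # e.g. 36 Faster R-CNN region features
    question = torch.randn(2, 1024)      # question embedding (e.g. GRU output)
    attend = QuestionGuidedAttention()
    fuse = MultiScaleFusion()
    selected, _ = attend(regions, question)
    answer_feat = fuse(selected)         # question-aware visual feature: (2, 2048)
```

In a full VQA model, a feature like answer_feat would be combined with the question embedding and passed to an answer classifier; the pooling scales control how local or global the modeled relationships are.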

Publication
Signal Processing: Image Communication
巫义锐
Young Professor, CCF Senior Member

My research interests include Computer Vision, Artificial Intelligence, Multimedia Computing, and Intelligent Water Conservancy.