Announcing VQA v2.0
A More Balanced and Bigger VQA Dataset


Paper

Download the paper
BibTeX


VQA v2.0 Dataset teaser (More examples)


The VQA (v1.0) dataset, introduced in Antol et al. ICCV 2015, has received a lot of attention from the community (academic groups, industry research labs, startups, etc.). While it has prompted significant progress in visual question answering, it exhibits strong language priors. For instance, machine answers (“guesses”) based on the question alone, ignoring the image entirely, can achieve ~51% accuracy. In other words, at least half the questions can be answered correctly without even looking at the image!
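To make this concrete, here is a minimal Python sketch (not the authors' code) of such a language-only baseline: it answers every question with the most frequent training answer for a crude question “type”, never looking at the image. All names and the data format are hypothetical illustrations.

```python
from collections import Counter, defaultdict

def question_type(question):
    # Crude proxy for a question type: the first three words of the question.
    return " ".join(question.lower().rstrip("?").split()[:3])

def train_prior(train_qa):
    """train_qa: iterable of (question, answer) pairs from a training split."""
    counts = defaultdict(Counter)
    for q, a in train_qa:
        counts[question_type(q)][a] += 1
    return {t: c.most_common(1)[0][0] for t, c in counts.items()}

def predict(prior, question, fallback="yes"):
    # The image is never consulted; this is the language prior at work.
    return prior.get(question_type(question), fallback)

# Example:
# prior = train_prior([("What color is the banana?", "yellow"), ...])
# predict(prior, "What color is the lemon?")  # likely "yellow", image unseen
```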

The VQA v2.0 dataset is designed to counter this language bias. Our idea in creating this new “balanced” VQA dataset is the following:
For every (image, question, answer) triplet (I,Q,A) in the VQA dataset, we identify an image I’ that is similar to I and has an answer A’ to Q such that A and A’ are different. Since I and I’ are semantically similar, a VQA model has to understand the subtle differences between I and I’ to answer Q correctly for both images. It cannot succeed as easily by making “guesses” based on the language alone.
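As a rough illustration of what each balanced entry pairs together, here is a small Python sketch; the field names and example values are hypothetical and do not reflect the released annotation format.

```python
from dataclasses import dataclass

@dataclass
class BalancedPair:
    question: str        # Q, shared by both images
    image_id: int        # I
    answer: str          # A, the answer to Q for I
    comp_image_id: int   # I', a semantically similar image
    comp_answer: str     # A', the answer to Q for I', with A' != A

pair = BalancedPair(
    question="What color is the fire hydrant?",
    image_id=111, answer="red",
    comp_image_id=222, comp_answer="yellow",
)
assert pair.answer != pair.comp_answer  # the pair only helps if the answers differ
```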

To identify such an I’, Amazon Mechanical Turk workers are shown the 24 nearest neighbor images to I. Workers are also shown the question Q and the answer A, and are asked to pick an image I’ from the 24 for which Q “makes sense” and the answer to Q is NOT A. By “makes sense”, we mean that the premise of the question must be satisfied. For instance, “What is the woman doing?” does not make sense for an image that does not have a woman in it. Image neighbors are computed in the penultimate-layer activation space of a convolutional neural network (CNN).
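Below is a minimal sketch of this nearest-neighbor step, assuming penultimate-layer CNN features have already been extracted into a matrix `feats` of shape (num_images, feat_dim); the specific CNN and distance metric used for the dataset are not stated here, so cosine similarity is only an illustrative choice.

```python
import numpy as np

def nearest_neighbors(feats, query_idx, k=24):
    """Return the indices of the k images closest to feats[query_idx]."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]   # cosine similarity to the query image
    sims[query_idx] = -np.inf           # exclude the query image itself
    return np.argsort(-sims)[:k]

# Example: candidates = nearest_neighbors(feats, query_idx=0)
# These 24 candidate images are what workers see when choosing I'.
```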

We then have 10 other independent workers answer Q for I’, to match the setup of the VQA v1.0 dataset, which has 10 answers for every (I,Q) pair.
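For context on why 10 answers are collected: the standard VQA evaluation (from v1.0) scores a predicted answer against the 10 human answers as min(#matching humans / 3, 1). The sketch below shows that scoring rule for illustration only; the official evaluation code additionally normalizes answers and averages over answer subsets, so consult it for the exact protocol.

```python
def vqa_accuracy(predicted, human_answers):
    """Score one prediction against the 10 crowd answers for an (I,Q) pair."""
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)

# Examples:
# vqa_accuracy("yellow", ["yellow"] * 7 + ["gold"] * 3)  # -> 1.0
# vqa_accuracy("gold",   ["yellow"] * 8 + ["gold"] * 2)  # -> 0.67 (2 of 10 humans agree)
```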

This procedure results in pairs of complementary scenes I and I’ that are semantically similar but have different answers to the same question Q.

Some examples from a preliminary version of the VQA v2.0 dataset are shown below.




VQA v2.0 Dataset teaser (More examples)


Interface


You can play with the live interface here.


Feedback

Any feedback is very welcome! Please send it to visualqa@gmail.com.


People


Yash Goyal
Virginia Tech


Tejas Khot
Virginia Tech


Douglas Summers-Stay
Army Research Lab


Dhruv Batra
Georgia Tech


Devi Parikh
Georgia Tech

Crowd workers
Amazon Mechanical Turk