VQA: Visual Question Answering

1 Million Full-Sentences Visual Question Answering (FSVQA)

Full-Sentence Visual Question Answering (FSVQA) consists of nearly 1 million pairs of questions and full-sentence answers for images, built by applying a number of rule-based natural language processing techniques to the original VQA dataset and captions in the MS COCO dataset.

VQA Human Attention (VQA-HAT)

The VQA-HAT dataset consists of ~60k attention annotations from humans of where they choose to look while answering questions about images, collected via a game-inspired interface that requires subjects to sharpen regions of a blurred image to answer a question.

Easy Visual Question Answering (easy-VQA)

The easy-VQA dataset is a beginner-friendly way to get started — a “Hello World” for VQA. It contains 5k simple, geometric images and 48k questions with only 13 possible answers. Extremely accurate models can be trained on this dataset relatively quickly, even with unimpressive hardware.

Datasets built on top of VQA