Download
VQA Annotations
Real Images
- Training annotations 2015 v1.0 (2,483,490 answers)
- Validation annotations 2015 v1.0 (1,215,120 answers)
Abstract Scenes
- Training annotations 2015 v1.0 (600,000 answers)
- Validation annotations 2015 v1.0 (300,000 answers)
VQA Input Questions
Real Images
- Training questions 2015 v1.0 (248,349 questions)
- Validation questions 2015 v1.0 (121,512 questions)
- Testing questions 2015 v1.0 (244,302 questions)
Abstract Scenes
- Training questions 2015 v1.0 (60,000 questions)
- Validation questions 2015 v1.0 (30,000 questions)
- Testing questions 2015 v1.0 (60,000 questions)
VQA Input Images
Tools (Real & Abstract)
The captions for training and validation sets of the abstract scenes can be downloaded from here.
Overview
For every image, we collected 3 free-form natural-language questions with 10 concise open-ended answers each. We provide two formats of the VQA task: open-ended and multiple-choice. For additional details, please see the VQA paper.
The annotations we release are the result of the following post-processing steps on the raw crowdsourced data:
- Spelling correction (using Bing Speller) of question and answer strings
- Question normalization (first char uppercase, last char ‘?’)
- Answer normalization (all chars lowercase, no period except as decimal point, number words -> digits, strip articles (a, an, the))
- Adding apostrophe if a contraction is missing it (e.g., convert "dont" to "don't")
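To make these steps concrete, here is a minimal, illustrative sketch of the answer normalization in Python. The word maps are small example subsets (not the full lists actually used), and the spelling-correction step (Bing Speller, a web service) is omitted:

import re

# Illustrative sketch of the answer normalization described above.
# The maps below are small example subsets, not the full lists.
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3",
                "four": "4", "five": "5", "six": "6", "seven": "7",
                "eight": "8", "nine": "9", "ten": "10"}
ARTICLES = {"a", "an", "the"}
CONTRACTIONS = {"dont": "don't", "isnt": "isn't", "wont": "won't"}

def normalize_answer(ans: str) -> str:
    ans = ans.lower().strip()
    # Drop periods unless they act as a decimal point (keep "1.5", drop "yes.").
    ans = re.sub(r"(?<!\d)\.|\.(?!\d)", "", ans)
    words = []
    for w in ans.split():
        w = CONTRACTIONS.get(w, w)   # "dont" -> "don't"
        w = NUMBER_WORDS.get(w, w)   # "two"  -> "2"
        if w not in ARTICLES:        # strip "a", "an", "the"
            words.append(w)
    return " ".join(words)

print(normalize_answer("The two dogs."))  # -> "2 dogs"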
Please follow the instructions in the README to download and set up the VQA data (annotations and images).
By downloading this dataset, you agree to our Terms of Use.
VQA API
- getQuesIds - Get question ids that satisfy the given filter conditions.
- getImgIds - Get image ids that satisfy the given filter conditions.
- loadQA - Load questions and answers with the specified question ids.
- showQA - Display the specified questions and answers.
- loadRes - Load a result file and create a result object.
Here is a link to the Python API demo script.
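As a quick illustration, below is a minimal usage sketch modeled on the demo script. The file paths are assumptions about where you unpacked the downloaded data, and vqaTools must be importable from the VQA GitHub repository:

from vqaTools.vqa import VQA

# Assumed local paths to the downloaded annotation and question files.
ann_file  = "Annotations/mscoco_train2014_annotations.json"
ques_file = "Questions/OpenEnded_mscoco_train2014_questions.json"

vqa = VQA(ann_file, ques_file)

# Get the ids of all yes/no questions, then display a few of them.
ques_ids = vqa.getQuesIds(ansTypes="yes/no")
anns = vqa.loadQA(ques_ids[:3])
vqa.showQA(anns)

# Image ids associated with those questions.
print(vqa.getImgIds(quesIds=ques_ids[:3]))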
Input Questions Format
VQA currently has two different question formats: OpenEnded and MultipleChoice. The questions are stored using the JSON file format.
The OpenEnded format has the following data structure:
{
"info" : info,
"task_type" : str,
"data_type": str,
"data_subtype": str,
"questions" : [question],
"license" : license
}
info {
"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : datetime
}
license{
"name" : str,
"url" : str
}
question{
"question_id" : int,
"image_id" : int,
"question" : str
}
The MultipleChoice format has the same data structure as the OpenEnded format above, except it has the following two extra fields:
"num_choices": int
question{
"multiple_choices" : [str]
}
task_type: type of the annotations in the JSON file (OpenEnded/MultipleChoice).
data_type: source of the images (mscoco or abstract_v002).
data_subtype: the data subset (train2014/val2014/test2015/test-dev2015 for mscoco, train2015/val2015 for abstract_v002).
num_choices: (only in the MultipleChoice format) number of choices for each question (=18). For details on how the 18 choices are created, please see the VQA paper.
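A question file can be read with any standard JSON library; for example, the short sketch below prints the metadata and the first question (the file name is an assumption about your local copy):

import json

# Assumed local path to a downloaded question file.
ques_file = "OpenEnded_mscoco_train2014_questions.json"

with open(ques_file) as f:
    data = json.load(f)

print(data["task_type"], data["data_type"], data["data_subtype"])
q = data["questions"][0]
print(q["question_id"], q["image_id"], q["question"])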
Annotation Format
There is a common annotation file for the Open-Ended and Multiple-Choice tasks. The annotations are stored using the JSON file format.
The annotations format has the following data structure:
{
"info" : info,
"data_type": str,
"data_subtype": str,
"annotations" : [annotation],
"license" : license
}
info {
"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : datetime
}
license{
"name" : str,
"url" : str
}
annotation{
"question_id" : int,
"image_id" : int,
"question_type" : str,
"answer_type" : str,
"answers" : [answer],
"multiple_choice_answer" : str
}
answer{
"answer_id" : int,
"answer" : str,
"answer_confidence": str
}
data_type: source of the images (mscoco or abstract_v002).
data_subtype: the data subset (train2014/val2014 for mscoco, train2015/val2015 for abstract_v002).
question_type: type of the question, determined by the first few words of the question. For details, please see the README.
answer_type: type of the answer; currently "yes/no", "number", or "other".
multiple_choice_answer: the correct answer for the multiple-choice task.
answer_confidence: the subject's confidence in answering the question. For details, please see the VQA paper.
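As an illustration of the structure above, the sketch below loads an annotation file and tallies the 10 crowdsourced answers for one question; the path is an assumption about your local copy:

import json
from collections import Counter

# Assumed local path to a downloaded annotation file.
ann_file = "mscoco_train2014_annotations.json"

with open(ann_file) as f:
    anns = json.load(f)["annotations"]

ann = anns[0]
print(ann["question_id"], ann["answer_type"], ann["multiple_choice_answer"])

# Each annotation carries 10 answers; tally how often each string occurs.
counts = Counter(a["answer"] for a in ann["answers"])
print(counts.most_common(3))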
Abstract Scenes and Captions
This section provides more information about the abstract scenes' composition files (e.g., the (x, y) pixel coordinates of each clipart object and whether it faces left or right) and the abstract-scene captions. If you are using any data (images, questions, answers, or captions) associated with the abstract scenes, please cite the VQA paper. An example BibTeX entry is:
@InProceedings{VQA,
author = {Stanislaw Antol and Aishwarya Agrawal and Jiasen Lu and Margaret Mitchell and Dhruv Batra and C. Lawrence Zitnick and Devi Parikh},
title = {VQA: Visual Question Answering},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2015},
}
Each of the links to the abstract scenes' composition files contains the following:
- A file of the form "abstract_v002_[datasubset]_scene_information.json", where [datasubset] is "train2015", "val2015", or "test2015".
- A folder of the form "scene_composition_abstract_v002_[datasubset]", where [datasubset] is "train2015", "val2015", or "test2015". This folder contains the scene composition files for the corresponding [datasubset].
The scene information file has the following data structure:
{
"info" : info,
"data_type": str,
"data_subtype": str,
"compositions" : [composition],
"images" : [image],
"license" : license
}
info {
"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : datetime
}
license{
"name" : str,
"url" : str
}
image{
"image_id" : int,
"file_name" : str,
"url" : str,
"height" : int,
"width" : int
}
composition{
"image_id" : int,
"file_name" : str
}
data_type: source of the images (abstract_v002).
data_subtype: the data subset (train2015/val2015/test2015).
The file_name in the images list is the name of the image file for the corresponding abstract scene. These image files can be downloaded from the links provided in the "Download" section of this page.
The file_name in the compositions list is the name of the scene composition file for the corresponding abstract scene (these files live in the folder described in the second bullet above).
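For instance, the images and compositions lists can be joined on image_id as in the sketch below (the file name is an assumption about your local copy):

import json

# Assumed local path to a scene-information file.
info_file = "abstract_v002_train2015_scene_information.json"

with open(info_file) as f:
    data = json.load(f)

# Join the two lists on image_id: image file on one side,
# scene composition file on the other.
img_files  = {im["image_id"]: im["file_name"] for im in data["images"]}
comp_files = {c["image_id"]: c["file_name"] for c in data["compositions"]}

image_id = data["images"][0]["image_id"]
print(image_id, img_files[image_id], comp_files.get(image_id))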
For more information on how to render the scenes from annotation files and to obtain API support for abstract scenes, please visit the GitHub repository.
The JSON files containing the captions for the training and validation sets of the abstract scenes can be downloaded from the link provided in the "Download" section of this page. These files have the following data structure:
{
"info" : info,
"task_type": str,
"data_type": str,
"data_subtype": str,
"annotations" : [annotation],
"images" : [image],
"license" : license
}
info {
"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : datetime
}
license{
"name" : str,
"url" : str
}
image{
"image_id" : int,
"file_name" : str,
"url" : str,
"height" : int,
"width" : int
}
annotation{
"id" : int,
"image_id" : int,
"caption" : str
}
task_type: Captioning.
data_type: source of the images (abstract_v002).
data_subtype: the data subset (train2015/val2015).
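As with the other files, the captions can be grouped by the image they describe with a few lines of Python; the file name below is an assumption about your local copy:

import json
from collections import defaultdict

# Assumed local path to an abstract-scene captions file.
cap_file = "captions_abstract_v002_train2015.json"

with open(cap_file) as f:
    data = json.load(f)

# Group the captions by the image they describe.
captions_by_image = defaultdict(list)
for ann in data["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

image_id = data["images"][0]["image_id"]
print(image_id, captions_by_image[image_id])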