What is VQA?

VQA is a new dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer.

  • 265,016 images (COCO and abstract scenes)
  • At least 3 questions (5.4 questions on average) per image
  • 10 ground truth answers per question
  • 3 plausible (but likely incorrect) answers per question
  • Automatic evaluation metric
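
The list above names an automatic evaluation metric without defining it. As a rough illustration, the VQA paper (ICCV 2015) scores a predicted answer as min(number of human annotators who gave that answer / 3, 1), so an answer given by at least 3 of the 10 annotators counts as fully correct. The sketch below assumes that formula. The function name `vqa_accuracy` and the simple lowercase/strip normalization are illustrative choices, not the official evaluation code, which may apply further answer normalization and averaging.

```python
from collections import Counter
from typing import List


def vqa_accuracy(predicted: str, human_answers: List[str]) -> float:
    """Score one predicted answer against the human ground-truth answers.

    Assumed formula: min(#humans who gave the predicted answer / 3, 1).
    """
    # Illustrative normalization only: trim whitespace and lowercase.
    normalized = [answer.strip().lower() for answer in human_answers]
    matches = Counter(normalized)[predicted.strip().lower()]
    return min(matches / 3.0, 1.0)


if __name__ == "__main__":
    # Hypothetical ground truth: 10 answers for a single question.
    ground_truth = ["yes"] * 2 + ["no"] * 8
    print(vqa_accuracy("yes", ground_truth))
    print(vqa_accuracy("no", ground_truth))
```

Under this formula, an answer matching 2 of the 10 ground-truth answers scores 2/3, and an answer matching 3 or more scores 1.0.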

Subscribe to our group for updates!


Details on downloading the latest dataset may be found on the download webpage.

  • April 2017: Full release (v2.0)

Balanced Real Images
  • 204,721 COCO images (all of current train/val/test)
  • 1,105,904 questions
  • 11,059,040 ground truth answers

  • March 2017: Beta v1.9 release

  • October 2015: Full release (v1.0)

  • July 2015: Beta v0.9 release

  • June 2015: Beta v0.1 release


Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (CVPR 2017)


Yin and Yang: Balancing and Answering Binary Visual Questions (CVPR 2016)


VQA: Visual Question Answering (ICCV 2015)


Any feedback is very welcome! Please send it to visualqa@gmail.com.