VQA v2.0: A More Balanced and Bigger VQA Dataset!

What is VQA?

VQA is a new dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer.

  • 265,016 images (COCO and abstract scenes)
  • At least 3 questions (5.4 questions on average) per image
  • 10 ground truth answers per question
  • 3 plausible (but likely incorrect) answers per question
  • Automatic evaluation metric
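
The page does not spell out the evaluation metric. As a hedged sketch, the accuracy commonly reported for VQA scores a predicted answer against the 10 human answers as min(number of humans who gave that answer / 3, 1), so an answer given by at least 3 annotators counts as fully correct. The function name `vqa_accuracy` and the simple lowercase/strip normalization are assumptions made for illustration; the official evaluation code applies more elaborate answer normalization and averages over subsets of annotators.

```python
def vqa_accuracy(predicted, human_answers):
    """Score one predicted answer against the human answers for a question.

    Illustrative sketch: returns min(#matching human answers / 3, 1).
    """
    # Assumed normalization: trim whitespace and lowercase only.
    def normalize(text):
        return text.strip().lower()

    pred = normalize(predicted)
    matches = sum(1 for answer in human_answers if normalize(answer) == pred)
    return min(matches / 3.0, 1.0)


# Hypothetical example: 10 ground truth answers for one question.
answers = ["yes"] * 2 + ["no"] * 8
print(vqa_accuracy("yes", answers))  # 2 of 10 humans agree: partial credit
print(vqa_accuracy("no", answers))   # 8 of 10 humans agree: full credit
```

Under this scheme an answer receives partial credit when only one or two annotators agree with it, which is why each question carries 10 ground truth answers rather than one.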

Subscribe to our group for updates!


Details on downloading the latest dataset may be found on the download webpage.

  • March 2017: Beta v1.9 release

    Balanced Real Images
      • 123,287 COCO images (only train/val)
      • 658,111 questions
      • 6,581,110 ground truth answers
      • 1,974,333 plausible answers

    Balanced Binary Abstract Scenes
      • 31,325 abstract scenes (only train/val)
      • 33,383 questions
      • 333,830 ground truth answers

  • October 2015: Full release (v1.0)

  • July 2015: Beta v0.9 release

  • June 2015: Beta v0.1 release
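
The v1.9 totals follow from the per-question figures quoted at the top of the page (10 ground truth answers and 3 plausible answers per question). A quick arithmetic check, with the counts copied from the v1.9 entry and the variable and function names chosen here purely for illustration:

```python
GT_PER_QUESTION = 10        # ground truth answers per question
PLAUSIBLE_PER_QUESTION = 3  # plausible (but likely incorrect) answers per question

# Counts copied from the Beta v1.9 release entry.
real_images = {
    "questions": 658_111,
    "ground_truth": 6_581_110,
    "plausible": 1_974_333,
}
abstract_scenes = {
    "questions": 33_383,
    "ground_truth": 333_830,
}


def is_consistent(stats):
    """Check that the answer totals match the per-question counts."""
    ok = stats["ground_truth"] == stats["questions"] * GT_PER_QUESTION
    if "plausible" in stats:
        ok = ok and stats["plausible"] == stats["questions"] * PLAUSIBLE_PER_QUESTION
    return ok


print(is_consistent(real_images))
print(is_consistent(abstract_scenes))
```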


Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (CVPR 2017)



Yin and Yang: Balancing and Answering Binary Visual Questions (CVPR 2016)



VQA: Visual Question Answering (ICCV 2015)





Any feedback is very welcome! Please send it to visualqa@gmail.com.