Can AI help computers to see?

Visual Question Answering (VQA): Part-1 EDA

Overview

 

In recent years we have witnessed significant progress in AI fields such as computer vision and language understanding. This progress has motivated researchers to address a more challenging problem: Visual Question Answering (VQA).

 

 

Before we jump to what VQA is, let’s try to answer some questions from the image below:

 

 

Q: What is the colour of the wall?

Ans: Blue

 

 

Q: How many paintings are there on the wall?

Ans: 3

 

 

Pretty easy, right?

 

 

Now imagine you have a smartphone, a computer or some other smart device, and you want to test how intelligent it is by showing it this image and asking the same questions.

 

What do you expect? Will it be able to answer?

 

Pretty difficult to say, right?

 

 

This type of problem is called Visual Question Answering (VQA).

 

 

VQA is defined as follows: an image and a question about that image are given as input to the AI system, and the system is expected to output a correct answer to the question with respect to the input image.

 

 

A Visual Question Answering (VQA) system takes as input an image and a free-form, open-ended, natural-language question about the image, and produces a natural-language answer as the output (as shown in the example above).

 

 

As seen in the example above, answering questions about visual content requires a variety of skills: recognizing entities and objects, reasoning about their interactions with each other both spatially and temporally, reading text, parsing audio, interpreting abstract and graphical illustrations, and using external knowledge not directly present in the given content.

Business Use of VQA

 

 

As humans, it is easy for us to look at an image and answer questions about it using our common sense. However, there are also scenarios where a visually impaired user or an intelligence analyst wants to actively elicit visual information from a given image.

 

 

How would VQA help in such cases?

 

 

VQA and its extensions open up potential applications such as generic object recognition, holistic scene understanding, narrating information and stories from images, and interactive educational applications that ask questions about images.

 

 

VQA is interesting because it requires combining visual and language understanding.

 

 

A model that solves this task demonstrates a more general understanding of images: it must be able to answer completely different questions about an image, oftentimes addressing different regions of the image.

 

 

Hence, we build an algorithm that takes as input an image and a natural language question about the image and generates a natural language answer as the output.

 

 

The system will answer a question similarly to a human in the following respects:

  • It will learn visual and textual knowledge from the inputs (the image and the question respectively)
  • It will combine the two data streams
  • It will use this combined knowledge to generate the answer

 

 

Take for example the following sentence:

 

 

How many bridges are there in London?

 

 

A Natural Language Processing Q&A program is typically going to:

  1. Classify the question: this is a ‘how many’ question, so the answer must be a number.
  2. Extract the object to count: ‘bridges’.
  3. Extract the context in which the count must be performed: in this case, London (see the toy sketch below).
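As a toy illustration of those three steps (purely a sketch of the idea; a real NLP Q&A system would use a proper question classifier and parser, and the function name and regex here are our own):

```python
import re

def parse_count_question(question: str):
    """Toy illustration of the classify / extract steps above."""
    q = question.lower().rstrip("?")
    if q.startswith("how many"):
        answer_type = "number"                               # step 1: classify the question
        match = re.match(r"how many (\w+) are there in (.+)", q)
        target, context = match.group(1), match.group(2)     # steps 2 and 3: object and context
        return answer_type, target, context
    return "other", None, None

print(parse_count_question("How many bridges are there in London?"))
# -> ('number', 'bridges', 'london')
```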

 

 

The relevant difference in VQA is that the search and the reasoning must be performed over the content of an image: the system must be able to detect and recognise objects and classify scenes.

Understanding the data

 

As we know, there are many well-defined datasets available for NLP and computer vision tasks such as object detection, machine translation and image captioning. Combined with well-defined metrics, these datasets make it possible to fairly compare different approaches, compare them with human decisions, measure their performance in absolute terms, and determine the empirical limitations of the state of the art.

 

However, the VQA field is so complex that a good dataset must be large enough to capture the wide range of possibilities within questions and image content in real-world scenarios.

 

Dataset overview:

 

  1. The data that we are using is from the official VQA data dump.
  2. We will perform only one type of task from this dataset: open-ended visual question answering.

VQA v2.0 release details:

(This release consists of Real images)

 

 

  • 82,783 MS COCO training images, 40,504 MS COCO validation images and 81,434 MS COCO testing images (images are obtained from the MS COCO website)
  • 443,757 questions for training, 214,354 questions for validation and 447,793 questions for testing
  • 4,437,570 answers for training and 2,143,540 answers for validation (10 per question)

 

How to download the data?

 

We created a helper function to make things easy; using the function below you can download the data to your local system.
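A minimal sketch of such a helper, assuming the question/annotation zip URLs listed on the official VQA v2 download page (only the training split is shown here; the function name and folder layout are our own):

```python
import os
import urllib.request
import zipfile

# URLs are assumptions based on the official VQA v2 download page; add the
# validation/test splits and the MS COCO image zips in the same way if needed.
VQA_URLS = {
    "train_questions": "https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Train_mscoco.zip",
    "train_annotations": "https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip",
}

def download_vqa_data(dest_dir="vqa_data"):
    """Download and extract the VQA v2 question/annotation archives."""
    os.makedirs(dest_dir, exist_ok=True)
    for name, url in VQA_URLS.items():
        zip_path = os.path.join(dest_dir, name + ".zip")
        if not os.path.exists(zip_path):
            print(f"Downloading {name} ...")
            urllib.request.urlretrieve(url, zip_path)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(dest_dir)   # leaves the .json files in dest_dir
        print(f"{name} ready.")

download_vqa_data()
```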

Calling this function downloads all the data into a local folder.
Once that is done, we can use the .json files to extract the information.

 

 


 

Annotations overview:

 

After you have downloaded the data (which is in .json format), the next step is to inspect it and understand exactly what it contains:

  • Load the data as a dict object using the json module, for example:
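A minimal example for the training annotations (the file path below is an assumption based on the layout of the extracted VQA v2 archive; adjust it for your setup):

```python
import json

# Path is an assumption; it matches the file name inside the extracted annotations zip.
with open("vqa_data/v2_mscoco_train2014_annotations.json") as f:
    train_anno = json.load(f)

print(train_anno.keys())               # metadata keys plus the 'annotations' list
print(len(train_anno["annotations"]))  # one entry per question
```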

Now let’s understand what we have here.

The annotations list contains dicts whose fields are described below.
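A single entry in that list looks roughly like this (the values shown are illustrative, not copied from the dataset):

```python
{
    "question_type": "what color is the",
    "multiple_choice_answer": "blue",
    "answers": [
        {"answer": "blue", "answer_confidence": "yes", "answer_id": 1},
        # ... 10 answers in total, one per human annotator
    ],
    "image_id": 262148,
    "answer_type": "other",
    "question_id": 262148000,
}
```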

Description:

 

  • The question_type field describes the type of question this annotation answers.
  • multiple_choice_answer holds the actual answer to the question.
  • The remaining fields (answers, image_id, answer_type, question_id) are the metadata required for processing the annotation.
  • There is one such dictionary for every annotated question, across all the images.
  • Each image ID can therefore have many (question, answer) pairs.

Questions overview:

 

Just as you downloaded the annotations and loaded them into a dict object, do the same for the questions.

An entry in the questions dict looks something like this:
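(The values below are illustrative; the questions file wraps this list under a `questions` key.)

```python
{
    "image_id": 262148,
    "question": "What color is the wall?",
    "question_id": 262148000,
}
```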

Description:

  • Each dictionary in the questions list contains three keys: “image_id”, “question” and “question_id”.
  • We can join on these IDs to get the question corresponding to each image and its annotation.
  • We simply pick the “annotations” key from the train_anno variable and the “questions” key from train_ques.

Steps before performing EDA

 

 

STEP : 1

 

Let’s start by loading the annotations and questions into pandas DataFrames (a sketch follows).
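A minimal sketch, assuming `train_anno` and `train_ques` are the dicts loaded from the annotations and questions JSON files above (the DataFrame variable names are our own):

```python
import pandas as pd

# One DataFrame for the annotations, one for the questions.
anno_df = pd.DataFrame(train_anno["annotations"])
ques_df = pd.DataFrame(train_ques["questions"])

print(anno_df.columns.tolist())   # answer_type, answers, image_id, multiple_choice_answer, question_id, question_type
print(ques_df.columns.tolist())   # image_id, question, question_id
print(anno_df.isnull().values.any(), ques_df.isnull().values.any())  # null check
print(anno_df.head())
```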

 

It looks like this:

 

Some inferences we can draw after a few basic pandas checks:

 

  1. There are no null values in the dataset.
  2. The annotations file contains 6 columns: [‘answer_type’, ‘answers’, ‘image_id’, ‘multiple_choice_answer’, ‘question_id’, ‘question_type’].
  3. The questions file contains 3 columns: [‘image_id’, ‘question’, ‘question_id’].
  4. For every question there are 10 answers, each with an answer_confidence; based on these, the final answer is recorded in the multiple_choice_answer column.

 

STEP : 2

 

Split the data into training, validation and test sets. The dataset already comes separated into these splits; we just need to download each one and load it into its own variable. For that, we created a helper function.

STEP : 3

 

After running the step above, all the questions are merged with their respective annotations and answers. Once that is done, load the training data into a DataFrame; a sketch of the merge is shown below.
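A minimal sketch of that merge, assuming the `anno_df` and `ques_df` DataFrames from Step 1 (variable names are our own):

```python
# Attach each question string to its annotation; question_id uniquely identifies a row.
train_df = pd.merge(ques_df, anno_df, on=["question_id", "image_id"], how="inner")

print(train_df.shape)    # one row per question (443,757 for the training split)
print(train_df.head())
```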

It looks like this:

Inference:

  1. Number of data points in the dataset: 443,757
  2. Number of unique question IDs: 443,757
  3. Number of unique question types: 65
  4. Number of unique questions in the dataset: 152,050
  5. Number of unique answer types: 3
  6. Number of unique answers: 22,531
  7. Number of unique images: 82,783
  8. There are no null values in the dataset.

 

 

Univariate Analysis for Questions

 

This is where we learn about the details of the dataset. We will start with the basic questions that pop into our heads and then move on to more complex ones.

 

We will start by analysing the questions.

 

1. What is the distribution of the duplicate questions?

Inference:

 

 

  • Out of 443,757 training examples, 291,707 contain duplicate questions; that is roughly 66% (see the sketch below).
  • Most of the questions are repeated across different images: there are 82,783 images in the training dataset and each image has multiple questions.
  • This means there are also many similar images for which the same questions are asked, which is why we see so many repeated questions.
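A quick way to reproduce the duplicate count, assuming the merged `train_df` from above:

```python
# A question string is a duplicate if the exact same text already appeared in an earlier row.
num_dupes = train_df["question"].duplicated().sum()
print(num_dupes, f"({100 * num_dupes / len(train_df):.0f}%)")   # ~291,707 (~66%)
```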

2. How many questions are there per question type?

 

Inference:

 

 

  • We want to make sure the data is balanced, with all types of questions included, so that the model doesn’t overfit to a specific question type.
  • The pie chart shows that the question types are well distributed and no single question type dominates the others.

3. Distribution of questions by their first four words

Inference:

 

 

  • This plot tells us what kinds of words questions generally begin with.
  • It also shows the distribution of question types and of the words that follow the question word.

4. What are the most repeated questions and what is their distribution?

Inference:

 

 

  • From this we can see that many images contain similar scenes.
  • This is why the same question repeats across many images with similar scenes.
  • For example, a question such as “What room is this?” is asked of many different images.

 

 

  1. The more often a question repeats, the more opportunities the model has to learn which images it applies to and what kinds of answers it expects.

 

5. Distribution of word length in the questions

Inference:

 

 

  • As we can see from the PDF, most of the questions are about 5 words long (more than 70%).
  • There are very few questions that are very short or very long.
  • Most questions range from 5 to 7.5 words.
  • Maximum question length: 22 words
  • Minimum question length: 2 words
  • Mean question length: 6 words (these figures can be computed as in the sketch below)
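The length statistics can be computed directly from the question strings; a sketch assuming the `train_df` DataFrame from above:

```python
import matplotlib.pyplot as plt

# Question length measured in words.
q_len = train_df["question"].str.split().str.len()

print(q_len.max(), q_len.min(), round(q_len.mean(), 1))   # max, min, mean word counts
q_len.plot(kind="hist", bins=q_len.max(), density=True, title="Question length (words)")
plt.show()
```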

 

6. A word cloud of the most frequent words in the question corpus

Inference:

 

 

  • It shows that words like “People” and “Picture” appear very frequently.
  • A word cloud shows the frequency of words in a given corpus: the more frequent a word is, the larger it appears.
  • Using this we can infer how often words repeat in the question corpus.

7. The distribution of duplicate and non-duplicate question types

Inference:

 

 

  • This shows that almost all questions start with one of the given starting phrases, except for a very small number of questions with unique question types.
  • These rare question types will add variance while training the model: because there are very few examples of them, the model will underfit on such question types.
  • We will remove such non-duplicate question types.

 

8. Top 50 most frequent starting phrases

Inference:

 

 

  1. This again shows that almost all questions start with one of these common starting phrases; only a very small number of questions have a unique question type.

9. Top 50 questions:

10. Number of questions per average confidence

Inference:

 

 

  • On average each question has 2.70 unique answers for real images
  • The agreement is significantly higher (> 95%) for “yes/no” questions
  • It is lower for other questions (< 76%), possibly because we perform exact string matching and do not account for synonyms or plurality.

CONCLUSION FOR QUESTION ANALYSIS:

 

 

  • Multiple questions are asked about the same image.
  • Questions repeat because many images contain scenes that are similar to scenes in other images. This is helpful: the model gets to see different images together with the kinds of questions asked in different scenarios.
  • This diversity of questions and images will help the model generalise better and avoid overfitting.
  • For modelling, we will have to turn these questions into vectors and pass them to an LSTM-based neural network; the question embeddings will be created using GloVe (a sketch follows below).
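As a preview of that modelling step, here is a minimal sketch of building a GloVe embedding matrix for the question vocabulary (the GloVe file name, the 300-d size and the Keras tokenizer are assumptions, not necessarily the exact setup used later):

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

EMBED_DIM = 300                    # assumption: 300-d GloVe vectors
GLOVE_PATH = "glove.6B.300d.txt"   # assumption: path to a downloaded GloVe file

# Fit a word index on the training questions.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_df["question"])

# Load GloVe vectors into a dict: word -> vector.
glove = {}
with open(GLOVE_PATH, encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Row i of the matrix is the GloVe vector of the word with index i (zeros if unseen).
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, EMBED_DIM))
for word, idx in tokenizer.word_index.items():
    if word in glove:
        embedding_matrix[idx] = glove[word]
```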

 

 

 

Univariate Analysis for Answers

 

Here, we will analyse the answers for each question and image, and try to understand how the answers are provided in the dataset.

 

1. Distribution of the answer types

Inference:

 

 

  • There are 65 question categories and 3 answer categories.
  • Of the 3 answer categories, “OTHER” has the highest number of answers. This means most questions have an answer that is specific to the question asked about the image.
  • If we turn this into a classification task there will be a very large number of classes to choose from, and the same answer can come from very different images and questions, so a naive multi-class classification task will be pretty difficult.

 

2. Distribution of the answer length

Inference:

 

  • Maximum answer length: 18 words
  • Minimum answer length: 1 word
  • Mean answer length: 1.1 words
  • As we can see, most answers are 1 or 2 words long.
  • There are very few questions with long answers.
  • Most answers range from 1 to 3 words.

3. What kinds of questions are generally answered by these 3 categories?

 

Inference:

 

 

  • We can clearly see that more than 40K questions start with “How many...”, which means most answers to these questions will be numerical values.
  • For “What...” questions the answers are more diverse, which increases the diversity of the answer set. As the plot shows, most questions starting with “What” fall into the “OTHER” answer category.
  • Questions starting with “Is the...”, “Are...”, “Does...” etc. are typically answered with Yes/No.
  • Other question types such as “What color” or “Which” have more specialised responses, such as colours or “left” and “right”, so their answers are more diverse.

 

4. Answer distribution with question and images

Inference:

 

  • There are currently 23,234 unique one-word answers in our dataset for real images and 3,770 for abstract scenes.
  • As we can see for most questions starting with “IS”, “ARE”, “DOES”, “DO” etc, the answers are typically Yes or No.
  • For questions with “HOW”, the answers are numbers.
  • For questions with “WHAT”, “WHERE”, “WHY”, “WHICH” “WHO”, there are a wide range of answers as these questions can have a variety of answers as per the given question and the image.

 

5. Top 50 answers:

CONCLUSION FOR ANSWER ANALYSIS:

 

 

  • From the experiments above, we can see that there are basically three types of answers in our dataset: Yes/No, Numbers and Others.
  • A large share of the answers are Yes/No type.
  • There is a wide range of answers for the different question types, but this will lead to a bias-variance trade-off, since the numbers of training examples available for Yes/No and Other answers differ by a large margin.

 

 

We will reduce the answer set for convenience, taking only the top 1000 answers and making this a 1000-class classification problem.

So the question that arises here is: what kind of classification problem is it?

 

To answer that, we just have to look at the distribution of the answers, because the answer is the label that the model will predict using the question and image as inputs.

 

1. If we want to make this a multi-class classification task, how many classes do we need?

Inference:

 

 

  • Percentage of questions covered by the top 1000 answers: 87.47% (computed as in the sketch below).
  • We can therefore pose a 1000-class classification task, since most questions (around 388,158 out of 443,757) are covered by it.
  • Most of the answers in the dataset are YES/NO answers or numerical values; the rest are mostly one-word answers, each occurring only a few times.
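The coverage figure can be reproduced with a simple value_counts; a sketch assuming the merged `train_df` from earlier:

```python
# Fraction of training questions whose answer is among the 1000 most frequent answers.
top_answers = train_df["multiple_choice_answer"].value_counts().head(1000)
covered = train_df["multiple_choice_answer"].isin(top_answers.index).sum()

print(covered, f"({100 * covered / len(train_df):.2f}%)")   # ~388k questions, ~87.47%
```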

 

 

Finally, we will pose this problem as a multi-class classification problem with the QUESTIONS and IMAGES as inputs to the model and the ANSWERS as the labels the model will predict.

 

 

FINAL THOUGHTS:

 

Now the question arises:

 

 

Q: What kind of model are we going to build?

 

 

Ans: We will develop a two-channel deep learning model with two input pipelines: one for vision (the VQA image) and one for the natural-language question. The outputs of the two pipelines are combined and fed to a softmax over K possible outputs. We choose the top K = 1000 most frequent answers as the possible outputs; this set covers 87.47% of the train-set answers.

 


We describe the different components of our model below (a minimal sketch follows the list):

  • Image channel: this channel provides an embedding for the image. We take the L2-normalised activations from the last hidden layer of a VGG-16 model (with pretrained weights) to encode the image.
  • Question channel: we build a custom LSTM model to encode the questions, using pre-trained GloVe word embeddings.
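A minimal Keras sketch of this two-channel architecture (the layer sizes, the element-wise-multiplication fusion and the variable names are our assumptions; it consumes precomputed 4096-d VGG-16 features and the `embedding_matrix` built earlier, rather than raw pixels):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_Q_LEN = 22        # longest question seen in the EDA
NUM_CLASSES = 1000    # top-1000 most frequent answers

# Image channel: precomputed, L2-normalised VGG-16 last-hidden-layer activations.
image_in = layers.Input(shape=(4096,), name="image_features")
image_feat = layers.Dense(1024, activation="tanh")(image_in)

# Question channel: GloVe-initialised embedding followed by an LSTM encoder.
question_in = layers.Input(shape=(MAX_Q_LEN,), name="question_tokens")
embedded = layers.Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=embedding_matrix.shape[1],
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)(question_in)
question_feat = layers.Dense(1024, activation="tanh")(layers.LSTM(1024)(embedded))

# Fuse the two channels and classify over the top-1000 answers.
fused = layers.multiply([image_feat, question_feat])
fused = layers.Dense(1024, activation="tanh")(fused)
answer_probs = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = Model(inputs=[image_in, question_in], outputs=answer_probs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```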

 

An overview of the model:


 

 

 

There are many potential applications for VQA. Probably the most direct application is to help blind and visually-impaired users. A VQA system could provide information about an image on the Web or any social media. Another obvious application is to integrate VQA into image retrieval systems. This could have a huge impact on social media or e-commerce. VQA can also be used for educational or recreational purposes.

 


 

In case you’re looking to do something similar for your business, you can contact Rubikon Labs.

 

 

We will discuss the model in more detail, including how and why we arrived at this approach, along with the metrics and the accuracy of the model, in the upcoming blog post.