Image caption generation is a challenging problem in AI that connects computer vision and natural language processing (NLP): given a photo, a text description must be generated. In general terms, our model takes an image as input and produces an accurate description of it. This requires both image understanding, provided by a convolutional neural network (CNN) from computer vision, and a language model from the field of natural language processing.
It is important to try and test several ways of framing a given predictive modeling problem, and there are indeed many ways to frame the problem of captioning photographs. We will stick with one of them, which we explain at the end of this article, so bear with us. Can you lift Thor's hammer? No! But you can at least stay for the jokes.
In short, when we feed an image to our combined CNN and RNN architecture, it generates a natural-language description of that image.
We present a generative model based on a deep recurrent architecture that borrows from machine translation and can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on various datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions.
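The training objective can be written out as a formula. This is a sketch in notation of our own choosing, which the text does not define explicitly: $I$ is an image, $S = (S_0, \dots, S_N)$ is its caption as a sequence of words, and $\theta$ are the model parameters.

```latex
\theta^{\star} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I; \theta) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \dots, S_{t-1}; \theta)
```

The second equality is the chain rule of probability: the likelihood of a whole sentence is the product of the likelihoods of each word given the image and the words before it. This is what allows a recurrent network to model the caption one word at a time.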
Before diving in, let's go over the basic terminology needed to understand this algorithm.
The models involved are basically of two types:
Image-based model: extracts features from an image.
Language-based model: translates the features and objects produced by the image-based model into a natural sentence.
The compressed feature vector is produced by a convolutional neural network. In general terms, this feature vector is called an embedding, and the CNN model is referred to as the encoder. In a sequence model, the encoder encodes a given set of words and generates a sequence that is passed to the decoder network. In the next step, we will use these embeddings from the CNN layer as input to the decoder, which decodes the input sequence and generates the output.
For example: translating text from French to English.
In the sentence language model, an LSTM predicts the next word in a sentence. Given the initial embedding of the image, the LSTM is trained to predict the most likely next value in the sequence. It is like showing a person a series of images and asking them to remember the details, then showing them a new image with content similar to the previous ones and asking them to recall that content. This "remember" and "recall" work is done by our LSTM network, which is very useful here. Later, in the implementation part, we will show how it actually works.
In this article, we will use a pretrained convolutional neural network that was trained on the ImageNet dataset. Images are resized to a standard 224 x 224 x 3 (n_h x n_w x n_c), so the model receives an input of constant size for any given image.
Technically, we also insert <start> and <stop> tokens to mark the beginning and the end of the caption.
If the image description is "Tony Stark is standing with Doctor Strange", the source sequence is the list ['<start>', 'Tony', 'Stark', 'is', 'standing', 'with', 'Doctor', 'Strange'] and the target sequence is the list ['Tony', 'Stark', 'is', 'standing', 'with', 'Doctor', 'Strange', '<stop>']. Using these source and target sequences and the feature vector, the LSTM decoder is trained as a language model conditioned on the feature vector.
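Building the source and target sequences amounts to shifting the caption by one position. The sketch below assumes the markers are spelled `<start>` and `<stop>` and that captions are split on whitespace; `make_source_target` is a hypothetical helper name, not part of any library.

```python
START, STOP = "<start>", "<stop>"  # assumed spellings of the marker tokens


def make_source_target(caption):
    """Split a caption into words and build the shifted source/target pair."""
    words = caption.split()
    source = [START] + words   # what the decoder sees at each time step
    target = words + [STOP]    # what the decoder should predict at each step
    return source, target


source, target = make_source_target("Tony Stark is standing with Doctor Strange")
for seen, expected in zip(source, target):
    print(f"{seen:>10} -> {expected}")
```

Each printed row pairs the word the decoder receives with the word it is expected to predict next, which is exactly the supervision a language model needs.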
The image below explains this better:
During the testing phase, the encoder part is almost the same as during the training phase. The only difference is that the batchnorm layer uses the moving-average mean and variance rather than mini-batch statistics. This is easily done with the encoder.eval() function. For the decoder part, there is a vital difference between the training phase and the testing phase. During testing, the LSTM decoder cannot see the image description. To deal with this, the LSTM decoder feeds the previously generated word back in as the next input. This can be done with a for loop.
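The feedback loop used at test time can be sketched in a few lines. This is a toy illustration only: `TOY_NEXT_WORD` is a made-up lookup table standing in for a trained decoder, which in reality would also use the image features and its LSTM state, and the `<start>` and `<stop>` spellings are assumptions.

```python
# Hypothetical stand-in for a trained decoder: maps the previous word to
# the most likely next word.
TOY_NEXT_WORD = {
    "<start>": "a",
    "a": "dog",
    "dog": "runs",
    "runs": "<stop>",
}


def greedy_decode(next_word, max_len=20):
    """Generate a caption by feeding each predicted word back as input."""
    caption = []
    word = "<start>"
    for _ in range(max_len):       # hard cap so decoding always terminates
        word = next_word(word)     # previous output becomes the next input
        if word == "<stop>":
            break
        caption.append(word)
    return caption


caption = greedy_decode(TOY_NEXT_WORD.get)
print(" ".join(caption))
```

The `max_len` cap matters in practice: a model that never emits the stop token would otherwise loop forever.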
There are two models for generating captions:
Generating the whole sequence: the first approach is to generate the full text description of the photo in one pass.
Input: photograph. Output: complete textual description.
This is a one-to-many sequence prediction model that generates the entire output in one go.
- This model puts a heavy burden on the language model, which has to generate the correct words in the correct order.
- Images pass through a feature extraction model, such as a model pretrained on the ImageNet dataset.
- One-hot encoding is used for the output sequence, allowing the model to predict a probability distribution over the entire vocabulary for each word in the sequence.
- All sequences are padded to the same length, which means the model is forced to generate multiple "no word" time steps in the output sequence.
- While testing this method, we found that a very large language model is required, and even then it is hard to stop the model from producing the NLP equivalent of persistence, e.g. generating the same word repeated for the entire length of the output sequence.
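The one-hot encoding and padding steps from the list above can be sketched as follows. The `<pad>` token name and the helper functions are illustrative assumptions, and the two captions are made-up toy data.

```python
PAD = "<pad>"  # assumed name for the "no word" filler token


def build_vocab(captions):
    """Map every distinct word to an integer index; index 0 is padding."""
    vocab = {PAD: 0}
    for caption in captions:
        for word in caption:
            vocab.setdefault(word, len(vocab))
    return vocab


def pad(caption, length):
    """Right-pad a caption with PAD tokens up to a fixed length."""
    return caption + [PAD] * (length - len(caption))


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(size)]


captions = [["a", "dog", "runs"], ["a", "cat"]]
vocab = build_vocab(captions)
max_len = max(len(c) for c in captions)
encoded = [[one_hot(vocab[w], len(vocab)) for w in pad(c, max_len)]
           for c in captions]
print(encoded[1])
```

The shorter caption ends with a padding vector, which is the extra "no word" time step the model is forced to learn to produce.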
Generating word by word: in this second approach, the LSTM predicts one word given an image and one word as input.
Input 1: image. Input 2: previously generated word, or start-of-sequence token. Output: next word in the sequence.
This is a one-to-one sequence prediction model that generates a textual description through recursive calls to the model.
- The one-word input is either a token that marks the beginning of the sequence, on the first call to the model, or the word generated by the previous call to the model.
- The image goes through a feature extraction model, for example a model pretrained on the ImageNet dataset, and the input word is integer-encoded and passed through a word embedding.
- The output word is one-hot encoded, which allows the model to predict probabilities for words across the entire vocabulary.
- The recursive word generation process is repeated until an end-of-sequence token is generated.
- After testing this method, we found that the model generates some good n-gram sequences but gets caught in a loop, repeating the same word sequences in long descriptions. The model does not have enough memory to remember what it has already generated.
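The step where the input word is integer-encoded and passed through a word embedding can be illustrated like this. The vocabulary, the embedding size of 4, and the random table are all placeholders; in a trained model the table would be a learned embedding matrix.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Random table standing in for a learned embedding matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(vocab_size)]


word_to_index = {"<start>": 0, "<stop>": 1, "a": 2, "dog": 3}  # toy vocabulary
embeddings = make_embedding_table(len(word_to_index), dim=4)


def embed(word):
    """Encode the word as an integer, then look up its embedding vector."""
    return embeddings[word_to_index[word]]


print(word_to_index["dog"], embed("dog"))
```

The lookup is just row indexing: the integer code selects one row of the table, and that row is what the LSTM receives as its word input.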
Let's build a deeper intuition with an image captioning example.
To develop an image captioning model, we break the work down into three parts:
1) Extracting image features to use in the model.
2) Training the model on the features we extracted from the image.
3) Using the trained model to generate caption text when we pass the input image’s features to the network.
We use two different networks for this:
1. A Visual Geometry Group (VGG) neural network to extract features from an image.
2. A Recurrent Neural Network (RNN) to train on and generate caption text.
Using a pretrained VGG model, the image is read and resized to 224 x 224 x 3 (three color channels) and then fed into the VGG neural network, where the features are extracted as a NumPy array. Since the VGG network was designed for image classification, we do not take the output of the last layer; instead we take the output of the fully connected FC-2 layer, which contains the image feature data.
For image captioning with Keras, we create a single LSTM (Long Short-Term Memory) cell with 256 neurons. This cell has four inputs: image features, captions, a mask, and the current position. First, the caption input and the position input are merged (concatenated) and passed through a word embedding layer. Then the image features and the embedded words are also combined (by concatenation) with the mask input. Together they pass through the LSTM cell, whose output then goes through Dropout and Batch Normalization layers to prevent the model from overfitting. Finally, a softmax nonlinearity is applied and we get the predicted result.
As a result, we get a vector in which each entry represents the probability of one word in the vocabulary. The word with the highest probability is our current "best word". Together with the prebuilt vocabulary, this vector is used to "interpret" the next generated word, which can be treated as a kind of ground truth for training on the true caption. The mask plays an important role in all of this: it "records" the previous words used in the caption so that the model knows the words before the current word, and it feeds the model the current position in the sentence so that it does not get caught in a loop.
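Turning the network's raw output into a probability vector and picking the "best word" can be sketched as follows. The vocabulary and the scores are made up for illustration; a real model would produce one score per entry of its full vocabulary.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)                           # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vocabulary = ["<start>", "<stop>", "dog", "cat", "runs"]  # toy vocabulary
scores = [0.1, 0.3, 2.5, 1.0, 0.2]                        # made-up decoder scores

probs = softmax(scores)
best_index = max(range(len(probs)), key=probs.__getitem__)
best_word = vocabulary[best_index]
print(best_word)
```

Since softmax preserves the ordering of the scores, the best word is simply the one with the highest raw score; the probabilities become important when comparing whole sequences.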
As in training, we also need to extract the features of each image to be captioned. So the images first go through the VGG-16 network architecture to generate the features. We use the same LSTM model for captioning. The first word input to this model is the "#start#" tag, and each subsequent input is the prediction from the previous iteration.
We encourage you to take a look at this research paper to understand exactly what is going on.
The memory block contains a cell "c", which is controlled by three gates. Recurrent connections are shown in blue: the output "m" at time t-1 is fed back into the memory at time t through the three gates; the cell value is fed back through the forget gate; and the predicted word at time t-1 is fed back, in addition to the memory output "m" at time t, into the softmax function for word prediction. The gates control whether the cell reads its input (input gate "i") and whether it outputs the new cell value (output gate "o").
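One common way to write the gate updates described here is given below. The weight matrices $W$ are named by us for illustration, $x_t$ is the input at time $t$, $\sigma$ is the sigmoid function, $h$ is the hyperbolic tangent, and $\odot$ is the element-wise product.

```latex
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{im} m_{t-1}) \\
f_t &= \sigma(W_{fx} x_t + W_{fm} m_{t-1}) \\
o_t &= \sigma(W_{ox} x_t + W_{om} m_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot h(W_{cx} x_t + W_{cm} m_{t-1}) \\
m_t &= o_t \odot c_t \\
p_{t+1} &= \mathrm{Softmax}(m_t)
\end{aligned}
```

The forget gate $f_t$ scales the old cell value, the input gate $i_t$ scales the new candidate value, and the output gate $o_t$ decides how much of the cell is exposed as $m_t$. Many implementations apply an extra $\tanh$ to $c_t$ before the output gate, so treat this as one variant rather than the only formulation.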
- Encoder-Decoder architecture: typically, a model that generates sequences uses an encoder to encode the input into a fixed form and a decoder to decode it, word by word, into a sequence.
- Attention: the use of attention networks is widespread in deep learning, and for good reason. It is a way for the model to select only those parts of the encoding that it considers relevant to the task at hand. The same mechanism you see here can be used in any model where the encoder output has multiple points in space or time. In image captioning, we consider some pixels more important than others. In sequence-to-sequence tasks such as machine translation, we consider some words more important than others.
- Transfer learning: this means borrowing from an existing model by using parts of it in a new model. It is almost always better than training a new model from scratch (i.e. knowing nothing), and, as we will see, we can always fine-tune this second-hand knowledge to the specific problem at hand. Using pretrained word embeddings is a simple but valid example. We will use a pretrained encoder and then fine-tune it as needed.
- Beam search: here we do not let the decoder be lazy and simply pick the highest-scoring word at each decoding step. Beam search is useful for any language modeling task because it usually finds a better overall sequence than picking the best word one step at a time.
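To make the beam search idea concrete, here is a minimal sketch on a toy model. `TOY_MODEL` is a made-up table of next-word probabilities keyed only by the previous word; a real decoder would condition on the image and the full history. The `<start>` and `<stop>` spellings are assumptions.

```python
import math

# Hypothetical next-word distributions, keyed by the previous word.
TOY_MODEL = {
    "<start>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"<stop>": 1.0},
    "cat": {"<stop>": 1.0},
}


def beam_search(model, beam_width=2, max_len=10):
    """Keep the beam_width highest-scoring partial captions at each step."""
    beams = [(0.0, ["<start>"])]        # (log-probability, words so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            for word, prob in model[words[-1]].items():
                candidates.append((score + math.log(prob), words + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, words in candidates[:beam_width]:
            if words[-1] == "<stop>":
                finished.append((score, words))   # complete caption
            else:
                beams.append((score, words))      # keep extending
        if not beams:
            break
    finished = finished or beams        # fall back if nothing reached <stop>
    best_score, best_words = max(finished, key=lambda c: c[0])
    return [w for w in best_words if w not in ("<start>", "<stop>")]


print(beam_search(TOY_MODEL, beam_width=1))  # same as greedy decoding
print(beam_search(TOY_MODEL, beam_width=2))
```

With a beam width of 1 the search commits to "a" because it is the most likely first word, and ends with a caption of probability 0.30. With a beam width of 2 it also keeps "the" alive and finds "the dog", whose overall probability of 0.36 is higher even though its first word looked worse.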
Let's walk through the code:
Prerequisites:
- Anaconda
- PyTorch
- MSCOCO dataset
To replicate the results of this article, please make sure the prerequisites are installed. Now let's train the model from scratch by following the instructions below.
$ git clone https://github.com/pdollar/coco.git
$ cd coco/PythonAPI/
$ make
$ python setup.py build
$ python setup.py install
$ cd ../../
$ git clone https://github.com/yunjey/pytorch-tutorial.git
$ cd pytorch-tutorial/tutorials/03-advanced/image_captioning/
$ pip install -r requirements.txt
Note: we suggest using Google Colab.
Pretrained model —
Let's download the pretrained model and the vocabulary file from here, then extract pretrained_model.zip to ./models/ and vocab.pkl to ./data/ using the unzip command.
Now the model is ready and can predict captions using:
$ python sample.py --image='/example.png'
Let’s start the show!
Import all libraries and make sure the notebook is in the root of the repository:
Hard-code the model settings, which cannot be changed:
To load an image, add this configuration code:
Now let's write a PyTorch function that uses the pretrained data files to predict the output:
Let's start by captioning some scenes from Avengers: Endgame and see how well the model generalizes. Remember to enjoy!
Use the following code to predict the captions: