Image caption generation is a challenging AI problem that connects computer vision and NLP: given a photograph, the model must generate an accurate textual description of it. This requires both an understanding of the image, provided by a convolutional neural network (CNN) from computer vision, and a language model from the field of natural language processing.
It is important to frame and test several formulations of a given predictive modeling problem, and there are indeed many ways to formulate the photo captioning problem. We will stick with the formulation explained at the end of this article, so hang in there. You may not be able to hold Thor's hammer, but you can hold on to a joke here!
So, in essence, when we feed an image into our combined CNN and RNN architecture, the model generates a natural-language description of the image using NLP.
We present a generative model based on a deep recurrent architecture that combines ideas from machine translation and can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions.
Before diving in, let's cover the basic terminology needed to understand this algorithm.
There are basically two types of models:
Image-based model: extracts features from an image.
Language-based model: translates the features and objects produced by the image-based model into a natural sentence.
A convolutional neural network compresses the image into a feature vector. In general terms, this feature vector is called an embedding, and the CNN model is referred to as the encoder. In the next step, we use this embedding from the CNN as input to the decoder, an RNN that consumes the encoded input and generates the output sequence.
For example: translating a sentence from French to English.
In the sentence language model, the LSTM predicts the next word in a sentence. Given the initial embedding of the image, the LSTM is trained to predict the most likely next element of the sequence. It's like showing a person a group of images and asking them to remember their details, then showing them a new image with content similar to the previous ones and asking them to recall that content. This "recall" and "remember" work is done by our LSTM network, which is very useful here. Later, in the implementation section, you will see how it actually works.
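As a minimal sketch of this next-word prediction idea, the toy PyTorch snippet below (illustrative sizes, not the article's exact model) embeds a token sequence, runs it through an LSTM, and scores the next word over a small vocabulary:

```python
import torch
import torch.nn as nn

# Toy next-word prediction with an LSTM: embed the tokens seen so far,
# run the LSTM, and take the most likely word from the final hidden state.
vocab_size, embed_dim, hidden_dim = 10, 8, 16
embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)

tokens = torch.tensor([[1, 4, 7]])          # (batch=1, seq_len=3)
hidden_states, _ = lstm(embedding(tokens))  # (1, 3, hidden_dim)
logits = to_vocab(hidden_states[:, -1, :])  # scores for the next word
next_word = logits.argmax(dim=-1)           # greedy choice
```

With trained weights, `next_word` would be the model's best guess for the word that follows the input sequence.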
In this article, we will use a convolutional neural network pretrained on the ImageNet dataset. Images are resized to a standard 224 x 224 x 3 shape (nh x nw x nc), which keeps the input to the model constant for any given image.
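The resizing step can be sketched in a few lines with PIL and NumPy (the input size here is a stand-in for a real photo):

```python
import numpy as np
from PIL import Image

# Any input photo is resized to the fixed 224 x 224 x 3 shape
# (height x width x channels) that the pretrained CNN expects.
photo = Image.new("RGB", (640, 480))  # stand-in for a real image
resized = photo.resize((224, 224))
array = np.asarray(resized)
print(array.shape)  # (224, 224, 3)
```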
Technically, we also insert <start> and <end> tokens to signal the beginning and end of the caption.
If the image description is "Tony Stark is standing with Doctor Strange", the source sequence is the list ['<start>', 'Tony', 'Stark', 'is', 'standing', 'with', 'Doctor', 'Strange'] and the target sequence is the list ['Tony', 'Stark', 'is', 'standing', 'with', 'Doctor', 'Strange', '<end>']. Using these source and target sequences together with the feature vector, the LSTM decoder is trained as a language model conditioned on the feature vector.
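Building those shifted source/target pairs is simple; here is a hypothetical helper (the function name is an illustration, not from the repo):

```python
# Wrap a caption with <start>/<end> tokens and derive the shifted
# source/target pair used to train the decoder as a language model.
def make_pairs(caption):
    words = caption.split()
    source = ["<start>"] + words   # what the decoder reads
    target = words + ["<end>"]     # what the decoder should predict
    return source, target

src, tgt = make_pairs("Tony Stark is standing with Doctor Strange")
print(src)  # ['<start>', 'Tony', ..., 'Strange']
print(tgt)  # ['Tony', ..., 'Strange', '<end>']
```

At every time step the decoder sees `source[t]` and is trained to predict `target[t]`, so the two lists are the same sentence shifted by one token.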
The image below explains this better —
During the testing phase, the encoder works almost the same as in the training phase. The only difference is that the batchnorm layer uses its running mean and variance rather than mini-batch statistics; this is easily done by calling encoder.eval(). For the decoder, there is a vital difference between the training phase and the testing phase: at test time, the LSTM decoder cannot observe the image description. To handle this, the LSTM decoder feeds the previously generated word back in as the next input. This can be done with a for loop.
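That test-time for loop can be sketched as greedy decoding: the image feature is the first LSTM input, and each predicted word is fed back in. All sizes below are illustrative assumptions, not the repo's exact code:

```python
import torch
import torch.nn as nn

# Greedy decoding sketch: start from the encoder's feature vector,
# then repeatedly feed the previous prediction back into the LSTM.
vocab_size, embed_dim, hidden_dim, max_len = 12, 256, 256, 5
embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)

feature = torch.randn(1, 1, embed_dim)  # encoder output acts as the first "word"
states, words = None, []
inputs = feature
for _ in range(max_len):
    out, states = lstm(inputs, states)
    predicted = to_vocab(out.squeeze(1)).argmax(dim=-1)  # greedy word choice
    words.append(predicted.item())
    inputs = embedding(predicted).unsqueeze(1)  # feed the prediction back in
```

In practice the loop would stop early when the <end> token is produced.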
There are two approaches for generating captions:
Generating the whole sequence. The first approach is to generate the full textual description of the image in one go.
Input: photograph. Output: complete textual description.
This is a one-to-many sequence prediction model that generates all the output in one go.
Generating word by word: the second approach, in which the LSTM predicts one word at a time given the image and the previous word as input.
Input 1: image. Input 2: previously generated word or start-of-sequence token. Output: next word in the sequence.
This is a one-to-one sequence prediction model that generates a textual description by recursive calls to the model.
Let's build a deeper intuition with an image captioning example.
To develop an image captioning model, we break the task down into three parts:
1) Extracting image features to use in the model.
2) Training the model on the features we extracted from the image.
3) Using the trained model to generate caption text when we pass the input image`s features to the network.
We have two different methods for this —
1. A Visual Geometry Group (VGG) neural network to extract features from an image.
2. A Recurrent Neural Network (RNN) to train on those features and generate the caption text.
Using a pre-trained VGG model, the image is read, resized to 224 x 224 x 3 with its three color channels, and fed into the VGG network, where the features are extracted as a NumPy array. Since the VGG network was built for image classification, instead of taking the output of its last layer, we take the output of the second fully connected layer (FC-2), which encodes the image features.
For image captioning with Keras, we create one LSTM (Long Short-Term Memory) cell with 256 neurons. This cell has four inputs: image features, captions, mask, and current position. First, the caption input and the position input are concatenated (merged) and passed through the word-embedding layer; then the image features and the embedded words are also concatenated with the mask input. Together they pass through the LSTM cell, and the cell's output then goes through Dropout and Batch Normalization layers to prevent the model from overfitting. Finally, a Softmax nonlinearity is applied and we get the expected result.
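A hedged PyTorch sketch of such a decoder's training forward pass is shown below (the 256 hidden units follow the article; the class itself, the toy vocabulary size, and the simplified input handling, prepending the image feature rather than using a separate mask/position input, are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Sketch of a caption decoder: prepend the image feature to the embedded
# caption, run the LSTM, apply Dropout against overfitting, then project
# each time step to vocabulary scores for the Softmax.
class DecoderRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.drop = nn.Dropout(0.5)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):
        emb = self.embed(captions)                        # (B, T, E)
        emb = torch.cat([features.unsqueeze(1), emb], 1)  # image first, then words
        out, _ = self.lstm(emb)
        return self.fc(self.drop(out))                    # (B, T+1, vocab)

decoder = DecoderRNN(vocab_size=20)
scores = decoder(torch.randn(2, 256), torch.randint(0, 20, (2, 7)))
print(scores.shape)  # torch.Size([2, 8, 20])
```

Applying Softmax to the last dimension of `scores` gives, per time step, a probability for every word in the vocabulary.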
As a result, we get a vector in which each entry represents the probability of each word in the vocabulary. The most probable word is our current "best word." Along with the pre-built vocabulary, this vector is used to "interpret" the next generated word, which can be seen as a kind of ground truth for teaching the true caption. The mask plays an important role in all of this: it "writes down" the previous words used in the caption so that the model knows what precedes the current word, and it tells the model the current position in the sentence so it doesn't get caught in a loop.
As in training, we also need to extract features for each image we want to caption. So the images first go through the VGG-16 network to generate the features. For captioning we use the same LSTM model: the first word input to this model is the "#start#" tag, and each subsequent input is the prediction from the previous iteration.
We encourage you to take a look at this research work, to understand what exactly is going on.
The memory block contains a cell 'c' that is controlled by three gates. Recurrent connections are shown in blue: the output 'm' at time t-1 is fed back into the memory at time t through the three gates; the cell value is fed back through the forget gate; and the word predicted at time t-1 is fed back, in addition to the memory output 'm' at time t, into the Softmax for word prediction. The input gate 'i' decides whether to read the new input, the forget gate 'f' whether to keep the current cell value, and the output gate 'o' whether to emit the new cell value.
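The gating just described can be written compactly (notation follows the Show and Tell paper: sigma is the sigmoid, h the hyperbolic tangent, the circled dot an element-wise product, x_t the input and m_t the memory output at time t):

```latex
i_t = \sigma(W_{ix} x_t + W_{im} m_{t-1})
f_t = \sigma(W_{fx} x_t + W_{fm} m_{t-1})
o_t = \sigma(W_{ox} x_t + W_{om} m_{t-1})
c_t = f_t \odot c_{t-1} + i_t \odot h(W_{cx} x_t + W_{cm} m_{t-1})
m_t = o_t \odot c_t
p_{t+1} = \mathrm{Softmax}(m_t)
```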
Prerequisites: Anaconda, PyTorch, MSCOCO Dataset
To replicate the results of this article, please make sure you set the prerequisites. Now let`s train the model from scratch by following the instructions below.
git clone https://github.com/pdollar/coco.git
cd coco/PythonAPI/
make
python setup.py build
python setup.py install
cd ../../
git clone https://github.com/yunjey/pytorch-tutorial.git
cd pytorch-tutorial/tutorials/03-advanced/image_captioning/
pip install -r requirements.txt
Note: we suggest you use Google Colab.
Pretrained model —
Let's download the pretrained model and the vocabulary file from here, then extract pretrained_model.zip to ./models/ and vocab.pkl to ./data/ using the unzip command.
Now the model can predict captions using:
$ python sample.py --image='/example.png'
Import all libraries and make sure the notebook is in the root of the repository:
Hard-code the model configuration, which cannot be changed:
To load an image, add this configuration code:
Now let's code a PyTorch function that uses the prepared data files to predict the output:
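A hedged sketch in the spirit of the tutorial's sample.py is shown below. Here `encoder`, `decoder`, and `idx2word` are assumed to be the trained modules and id-to-word mapping loaded from the pretrained files; `decoder.sample` is the tutorial's greedy-decoding method, and the helper name itself is illustrative:

```python
import torch

# Encode the image, greedily decode word ids, then map ids back to words,
# stopping at the <end> token and skipping the <start> token.
def predict_caption(image_tensor, encoder, decoder, idx2word):
    encoder.eval()  # use running batchnorm statistics at test time
    decoder.eval()
    with torch.no_grad():
        feature = encoder(image_tensor)
        word_ids = decoder.sample(feature)  # greedy decoding loop
    words = []
    for idx in word_ids:
        word = idx2word[int(idx)]
        if word == "<end>":
            break
        if word != "<start>":
            words.append(word)
    return " ".join(words)
```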
Let's start by grabbing some scenes from Avengers: Endgame and see how well the model generalizes. Remember to enjoy!
Use the following code to predict the captions: