artificial intelligence, related to the interaction of computers and human (natural) languages.
In NLP techniques, we map words and phrases (from a dictionary or corpus) to vectors of numbers to facilitate processing. These types of language modeling techniques are called word embedding .
In 2013, Google announced word2vec , a group of related models that use for word embedding.
Let’s implement our own skip gram model (in Python) by deriving the backpropagation equations of our neural network.
In the skip gram architecture word2vec input — it is the central word, and predictions — context words. Consider an array of words W, if W (i) is the input (central word), then W (i-2), W (i-1), W (i + 1) and W (i + 2) are context words, if it is 2.
Let’s define some variables: V Number of unique words in our corpus of text ( V ocabulary) x Input layer ( One hot encoding of our input word). N Number of neurons in the hidden layer of neural network W Weights between input layer and hidden layer W’ Weights between hidden layer and output layer y A softmax output layer having probabilities of every word in our vocabulary
Our neural network architecture has been defined, now let’s do some maths to derive the equations needed for gradient descent.
Forward propagation :
Multiply one hot coding of the central word (denoted by x ) by the first weight matrix W, to get the hidden matrix of the layer h (size N x 1).
(Vx1) (NxV) (Vx1)
Now we multiply the vector h of the hidden layer by the second weight matrix W & # 39 ;, to get a new matrix u
(Vx1) (VxN) (Nx1)
Please note that we must use softmax" to get ours.
Let J be at the th neuron of the layer at
Let w j be the j- m word in our dictionary, where j — any index
Let V w j be j- m column of matrix W & # 39; (by the column corresponding to the word w j )
(1 × 1) (1 × N) (N × 1)
y = softmax ( i)
y j = softmax (u j )
y j denotes the probability that w j is the context word
P (w j | w i ) — this is the probability that w j is a context word, given that w i is an input word.
So our goal is to maximize P (w j * | w i ) , where j * represents the indices of the context words
It is clear that we want as much as possible
where j * c sub> — dictionary indexes of context words. Context words range from c = 1, 2, 3..C
Let’s take the negative log probability of this function to get our loss function we want to minify
Let’s be the actual day off vector from our training data, for a particular center word. It will have 1 in context word positions and 0 in all other places. t j * c — 1st words of context.
We can multiply with
Solving this equation gives our loss function as —
The configurable parameters are in the W matrices and W & # 39 ;, so we have to find the partial derivatives of our loss function in W and W & # 39; in order to apply the gradient descent algorithm.
We should find
Now, looking for
Bottom The implementation is not given: