Natural language processing (NLP) is a field of artificial intelligence concerned with the interaction between computers and human (natural) languages.
Using NLP techniques, we map words and phrases (from a vocabulary or corpus) to vectors of numbers to make them easier to process. These kinds of language modeling techniques are called word embeddings.
In 2013, Google announced word2vec, a group of related models used to produce word embeddings.
Let's implement our own skip-gram model (in Python) by deriving the backpropagation equations of our neural network.
In the skip-gram architecture of word2vec, the input is the center word and the predictions are the context words. Consider an array of words W: if W(i) is the input (center word), then W(i-2), W(i-1), W(i+1) and W(i+2) are the context words when the window size is 2.
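For example, a small helper (hypothetical, just for illustration) that generates (center word, context words) pairs with a window size of 2 could look like this:

```python
# Hypothetical helper: build (center word, context words) pairs for a given window size.
def generate_training_pairs(words, window_size=2):
    pairs = []
    for i, center in enumerate(words):
        context = [words[j] for j in range(i - window_size, i + window_size + 1)
                   if j != i and 0 <= j < len(words)]
        pairs.append((center, context))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for center, context in generate_training_pairs(sentence)[:3]:
    print(center, "->", context)
```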
Let's define some variables:

V: number of unique words in our corpus of text (the Vocabulary)
x: input layer (one-hot encoding of our input word)
N: number of neurons in the hidden layer of the neural network
W: weights between the input layer and the hidden layer
W': weights between the hidden layer and the output layer
y: the softmax output layer, holding the probabilities of every word in our vocabulary
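As a minimal sketch in NumPy, these variables could be set up like this (the toy corpus, N = 10 and the random initialization scale are arbitrary choices):

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
V = len(vocab)                      # number of unique words (vocabulary size)
N = 10                              # number of neurons in the hidden layer
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot encoding x of an input word, shape (V, 1)."""
    x = np.zeros((V, 1))
    x[word_to_index[word]] = 1.0
    return x

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))        # weights: input layer -> hidden layer
W_prime = rng.normal(scale=0.1, size=(N, V))  # weights: hidden layer -> output layer
```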
Our neural network architecture is defined; now let's do some math to derive the equations needed for gradient descent.
Multiply the one-hot encoding of the center word (denoted by x) by the first weight matrix W to get the hidden layer vector h (size N×1):

h = W^T x        (N×1) = (N×V)(V×1)
Now multiply the hidden layer vector h by the second weight matrix W' to get a new vector u (size V×1):

u = W'^T h        (V×1) = (V×N)(N×1)
Note that we have to apply softmax to u to get our output probabilities y.
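Continuing the sketch above, the forward pass for one input word is just these two matrix products followed by a softmax:

```python
def softmax(u):
    e = np.exp(u - np.max(u))   # subtract the max for numerical stability
    return e / e.sum()

def forward(x):
    h = W.T @ x          # hidden layer,  (N x 1) = (N x V)(V x 1)
    u = W_prime.T @ h    # output scores, (V x 1) = (V x N)(N x 1)
    y = softmax(u)       # probabilities over the vocabulary
    return h, u, y
```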
Let u_j be the value at the j-th neuron of the layer u.
Let w_j be the j-th word in our vocabulary, where j is any index.
Let V_{w_j} be the j-th column of the matrix W' (the column corresponding to the word w_j).

u_j = V_{w_j}^T h        (1×1) = (1×N)(N×1)
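In the sketch, u_j is therefore simply the dot product of the j-th column of W_prime with h, which we can verify directly (the words used here are just examples from the toy corpus):

```python
h, u, y = forward(one_hot("fox"))
j = word_to_index["quick"]
v_wj = W_prime[:, j:j + 1]        # j-th column of W', shape (N, 1)
u_j = (v_wj.T @ h).item()         # (1 x 1) = (1 x N)(N x 1)
assert np.isclose(u_j, u[j, 0])
```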
y = softmax(u)
y_j = exp(u_j) / Σ_{k=1..V} exp(u_k)

y_j denotes the probability that w_j is a context word.
P(w_j | w_i) is the probability that w_j is a context word, given that w_i is the input word, so y_j = P(w_j | w_i).
So our goal is to maximize P(w_{j*} | w_i), where j* represents the indices of the context words.
Clearly, we want to maximize

∏_{c=1..C} P(w_{j*_c} | w_i)

where j*_c are the vocabulary indices of the context words, and the context words range over c = 1, 2, ..., C.
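With the forward pass from the sketch above, this product can be read straight off the softmax output y (the center word and its context here are just an example from the toy corpus):

```python
center = "fox"
context_words = ["quick", "brown", "jumps", "over"]   # window size 2
h, u, y = forward(one_hot(center))
context_idx = [word_to_index[w] for w in context_words]
prob_of_context = np.prod(y[context_idx, 0])   # the quantity we want to maximize
```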
Let's take the negative log of this probability to get our loss function E, which we want to minimize.
Let t be the actual output vector from our training data for a particular center word. It has 1 at the positions of the context words and 0 everywhere else, so t_{j*_c} = 1 for each of the C context words.

Multiplying by t therefore picks out exactly the context-word terms. Substituting the softmax expression for P(w_{j*_c} | w_i) and simplifying gives our loss function:

E = -Σ_{c=1..C} u_{j*_c} + C · log Σ_{k=1..V} exp(u_k)
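In code, the loss for one training pair can be computed either directly from the probabilities or in the simplified form above; both give the same number:

```python
C = len(context_idx)
# Direct form: negative log of the product of context-word probabilities
E_direct = -np.log(y[context_idx, 0]).sum()
# Simplified form: E = -sum(u_{j*_c}) + C * log(sum_k exp(u_k))
E = -u[context_idx, 0].sum() + C * np.log(np.exp(u).sum())
assert np.isclose(E, E_direct)
```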
The trainable parameters are the matrices W and W', so we have to find the partial derivatives of our loss function with respect to W and W' in order to apply the gradient descent algorithm.
First, we need to find ∂E/∂W'. Differentiating the loss with respect to u_j gives

∂E/∂u_j = C · y_j - t_j = e_j

and since u = W'^T h, the chain rule gives

∂E/∂W' = h · e^T        (an N×V matrix, like W')
Now we find ∂E/∂W. Backpropagating through h = W^T x gives

∂E/∂h = W' · e

and therefore

∂E/∂W = x · (W' · e)^T        (a V×N matrix, like W)
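Using the variables from the sketch, the two gradients translate into a few lines of NumPy (e is the prediction error summed over the C context positions):

```python
# t: target vector with a 1 at each context-word position
t = np.zeros((V, 1))
for j in context_idx:
    t[j] += 1.0

e = C * y - t                             # dE/du,  shape (V, 1)
dW_prime = h @ e.T                        # dE/dW', shape (N, V)
dW = one_hot(center) @ (W_prime @ e).T    # dE/dW,  shape (V, N)
```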
The implementation is given below:
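One possible end-to-end version, assuming the toy corpus and helpers introduced in the sketches above (the learning rate and number of epochs are arbitrary choices):

```python
learning_rate = 0.05
for epoch in range(50):
    total_loss = 0.0
    for center, context in generate_training_pairs(corpus, window_size=2):
        x = one_hot(center)
        h, u, y = forward(x)
        context_idx = [word_to_index[w] for w in context]
        C = len(context_idx)

        # Target vector: 1 at every context-word position
        t = np.zeros((V, 1))
        for j in context_idx:
            t[j] += 1.0

        # Gradients derived above
        e = C * y - t
        dW_prime = h @ e.T
        dW = x @ (W_prime @ e).T

        # Gradient descent update
        W -= learning_rate * dW
        W_prime -= learning_rate * dW_prime

        # Loss for monitoring; it should decrease over the epochs
        total_loss += -u[context_idx, 0].sum() + C * np.log(np.exp(u).sum())
    print(f"epoch {epoch}: loss {total_loss:.4f}")
```

After training, the rows of W serve as the learned word embeddings: the i-th row is the N-dimensional vector for the i-th vocabulary word.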