artificial intelligence concerned with the interaction between computers and human (natural) languages.

In NLP, we map words and phrases (from a vocabulary or corpus) to vectors of numbers to make them easier to process. These **language modeling** techniques are called **word embeddings**.

In 2013, Google announced **word2vec**, a group of related models used to produce word embeddings.

Let's implement our own skip-gram model (in Python) by deriving the backpropagation equations of our neural network.

In the word2vec **skip-gram architecture**, the input is the **center word** and the predictions are the context words. Consider an array of words W: if W(i) is the input (center word), then W(i-2), W(i-1), W(i+1) and W(i+2) are the context words when the window size is 2.
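As a sketch, the sliding-window pair generation described above could look like this in Python (the `generate_pairs` helper and the toy sentence are illustrative, not from the article):

```python
# Hypothetical helper that generates (center word, context word) training
# pairs with a window size of 2, matching the W(i-2) ... W(i+2) description.
def generate_pairs(words, window=2):
    pairs = []
    for i, center in enumerate(words):
        # context positions: up to `window` words on each side of position i
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

corpus = "the quick brown fox jumps".split()
print(generate_pairs(corpus)[:4])
# first pairs: ('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...
```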

Let's define some variables:

| Variable | Meaning |
| --- | --- |
| V | Number of unique words in our corpus of text (the vocabulary) |
| x | Input layer (one-hot encoding of the input word) |
| N | Number of neurons in the hidden layer of the neural network |
| W | Weights between the input layer and the hidden layer |
| W' | Weights between the hidden layer and the output layer |
| y | Softmax output layer containing the probability of every word in our vocabulary |
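For illustration, here is how the one-hot input **x** could be built with NumPy (the toy vocabulary and the `one_hot` helper are assumptions for the example):

```python
import numpy as np

# Illustrative vocabulary of V = 5 unique words (not from the article)
vocab = ["brown", "fox", "jumps", "quick", "the"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, V):
    """Return the V x 1 input vector x for the given word."""
    x = np.zeros((V, 1))
    x[word_to_index[word]] = 1
    return x

x = one_hot("fox", len(vocab))
print(x.ravel())  # [0. 1. 0. 0. 0.]
```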

Our neural network architecture is now defined, so let's do some math to derive the equations needed for gradient descent.

Multiply the one-hot encoding of the center word (denoted by **x**) by the first weight matrix **W** to get the hidden layer vector **h** (size N × 1):

**h = Wᵀx**

Now we multiply the hidden layer vector **h** by the second weight matrix **W'** to get a new vector **u** (size V × 1):

**u = W'ᵀh**

**(V × 1) = (V × N) (N × 1)**
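A quick NumPy sketch of these two multiplications, with illustrative sizes V = 5 and N = 3, confirms the shapes:

```python
import numpy as np

V, N = 5, 3                            # illustrative sizes (assumptions)
rng = np.random.default_rng(42)
W = rng.standard_normal((V, N))        # weights: input -> hidden (V x N)
W_prime = rng.standard_normal((N, V))  # weights: hidden -> output (N x V)

x = np.zeros((V, 1)); x[2, 0] = 1      # one-hot center word
h = W.T @ x                            # (N x 1) = (N x V)(V x 1)
u = W_prime.T @ h                      # (V x 1) = (V x N)(N x 1)
print(h.shape, u.shape)                # (3, 1) (5, 1)
```

Note that because **x** is one-hot, **h** is simply the row of **W** that corresponds to the center word.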

Please note that we must apply the **softmax** function to **u** to obtain our output probabilities **y**.

Let **u_{j}** be the jth neuron of the layer **u**.

Let **w_{j}** be the jth word in our vocabulary.

Let **V_{w_j}** be the jth column of the matrix **W'** (the column corresponding to the word **w_{j}**). Then:

**u_{j} = V_{w_j}ᵀ · h**

**(1 × 1) = (1 × N) (N × 1)**

**y = softmax(u)**

**y_{j} = softmax(u_{j})**

**y_{j}** denotes the probability that **w_{j}** is a context word.
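A minimal softmax implementation in NumPy (subtracting max(u) is a standard numerical-stability trick, not part of the derivation above):

```python
import numpy as np

def softmax(u):
    """Turn a vector of scores u into probabilities that sum to 1."""
    e = np.exp(u - np.max(u))   # shift by max(u) to avoid overflow
    return e / np.sum(e)

u = np.array([2.0, 1.0, 0.1])   # illustrative scores
y = softmax(u)
print(y, y.sum())               # the entries of y sum to 1
```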

**P(w_{j} | w_{i})** is the probability that **w_{j}** is a context word, given the input center word **w_{i}**. It follows that **y_{j} = P(w_{j} | w_{i})**.

So our goal is to maximize **P(w_{j*} | w_{i})**, where j* represents the indices of the context words.

It is clear that we want to maximize the product

**∏_{c=1}^{C} P(w_{j*_c} | w_{i})**

where **j*_{c}** are the vocabulary indices of the context words, which range over c = 1, 2, …, C.

Let's take the negative log of this probability, turning the product into a sum; this will become our loss function **E**:

**E = −log ∏_{c=1}^{C} P(w_{j*_c} | w_{i})**

Let **t** be the actual output vector from our training data, for a particular center word. It will have 1 at the positions of the context words and 0 everywhere else; **t_{j*_c}** are the 1's for the context words.

We can multiply **t_{j}** with **u_{j}** and sum over the vocabulary: because **t** is zero everywhere except the context positions, this selects exactly the terms **u_{j*_c}**.

Solving this equation gives our loss function:

**E = −∑_{c=1}^{C} u_{j*_c} + C · log ∑_{j'=1}^{V} exp(u_{j'})**
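As a sanity check, the loss can be computed directly from this formula; the scores `u` and context indices below are made-up values for illustration:

```python
import numpy as np

# E = -sum_c u_{j*_c} + C * log(sum_j' exp(u_{j'})), with toy values
u = np.array([0.5, -1.2, 2.0, 0.3, -0.4])   # output scores, V = 5
context_indices = [1, 3]                     # j*_1, j*_2 (C = 2 context words)
C = len(context_indices)

E = -np.sum(u[context_indices]) + C * np.log(np.sum(np.exp(u)))
print(E)
```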

The trainable parameters are the matrices **W** and **W'**, so we have to find the partial derivatives of our loss function with respect to **W** and **W'** in order to apply the gradient descent algorithm.

We should find **∂E/∂W'**. Differentiating the loss with respect to **u_{j}** gives the prediction error **e_{j} = ∂E/∂u_{j} = C · y_{j} − t_{j}**, and since **u = W'ᵀh**:

**∂E/∂W' = h · eᵀ**

Now we find **∂E/∂W**. The error propagated back to the hidden layer is **∂E/∂h = W'e**, and since **h = Wᵀx** with a one-hot **x**:

**∂E/∂W = x · (W'e)ᵀ**

Only the row of **W** corresponding to the center word receives a non-zero gradient.
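These gradients translate into a short NumPy backward pass. The shapes (W as V × N, W' as N × V) follow the derivation above, while the specific values are illustrative:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))
    return e / np.sum(e)

V, N, C = 5, 3, 2                       # illustrative sizes
rng = np.random.default_rng(0)
W = rng.standard_normal((V, N))         # input -> hidden
W_prime = rng.standard_normal((N, V))   # hidden -> output

x = np.zeros((V, 1)); x[2, 0] = 1       # one-hot center word
t = np.zeros((V, 1)); t[1, 0] = 1; t[3, 0] = 1   # multi-hot context words

h = W.T @ x                  # forward pass
u = W_prime.T @ h
y = softmax(u)

e = C * y - t                # dE/du, the prediction error (V x 1)
dW_prime = h @ e.T           # dE/dW' (N x V)
dW = x @ (W_prime @ e).T     # dE/dW  (V x N)
print(dW_prime.shape, dW.shape)
```

Because **x** is one-hot, `dW` is zero everywhere except the row of the center word, as noted above.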

The implementation is given below:

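Putting everything together, a minimal end-to-end training sketch of the skip-gram model might look as follows; the corpus, hidden size N, learning rate, and epoch count are illustrative choices, not values from the article:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))
    return e / np.sum(e)

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word_to_index = {w: i for i, w in enumerate(vocab)}
V, N, window, lr, epochs = len(vocab), 10, 2, 0.05, 50

rng = np.random.default_rng(1)
W = rng.standard_normal((V, N)) * 0.1        # input -> hidden weights (V x N)
W_prime = rng.standard_normal((N, V)) * 0.1  # hidden -> output weights (N x V)

losses = []
for epoch in range(epochs):
    epoch_loss = 0.0
    for i, center in enumerate(corpus):
        # vocabulary indices of the context words within the window
        ctx = [word_to_index[corpus[j]]
               for j in range(max(0, i - window), min(len(corpus), i + window + 1))
               if j != i]
        C = len(ctx)

        x = np.zeros((V, 1)); x[word_to_index[center], 0] = 1
        t = np.zeros((V, 1))
        for j in ctx:
            t[j, 0] += 1                     # multi-hot target (counts repeats)

        h = W.T @ x                          # forward pass
        u = W_prime.T @ h
        y = softmax(u)

        e = C * y - t                        # dE/du
        dW_prime = h @ e.T                   # dE/dW'
        dW = x @ (W_prime @ e).T             # dE/dW
        W_prime -= lr * dW_prime             # gradient descent step
        W -= lr * dW

        # E = -sum_c u_{j*_c} + C * log(sum_j' exp(u_{j'}))
        epoch_loss += float(-u[ctx, 0].sum() + C * np.log(np.exp(u).sum()))
    losses.append(epoch_loss)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

After training, row i of **W** serves as the embedding of `vocab[i]`; on a real corpus, words that appear in similar contexts end up with nearby vectors.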