Implement your own word2vec (skip-gram) model in Python

Natural language processing (NLP) is a subfield of artificial intelligence concerned with the interaction between computers and human (natural) languages.
In NLP we map words and phrases (from a vocabulary or corpus) to vectors of numbers to make them easier to process. These kinds of language modeling techniques are called word embeddings.
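For instance, an embedding is just a mapping from each word to a fixed-length vector of numbers. The vectors below are toy values made up purely for illustration:

# Toy example of a word embedding: each word maps to a vector of numbers.
embedding = {
    "king":  [0.52, 0.91, -0.30],
    "queen": [0.55, 0.89, -0.28],
    "apple": [-0.41, 0.10, 0.77],
}
print(embedding["king"])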

In 2013, Google announced word2vec, a group of related models used to produce word embeddings.

Let's implement our own skip-gram model (in Python) by deriving the backpropagation equations of our neural network.

In the skip-gram architecture of word2vec, the input is the center word and the predictions are the context words. Consider an array of words W: if W(i) is the input (center word), then W(i-2), W(i-1), W(i+1) and W(i+2) are the context words when the window size is 2. A short sketch of how such (center, context) pairs look is given below.
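As a quick illustration (the helper skipgram_pairs and the sample sentence are mine, not part of the model below), this is how the (center word, context word) pairs come out for a window size of 2:

# Illustration only: list the (center, context) pairs for a window size of 2.
def skipgram_pairs(words, window_size=2):
    pairs = []
    for i, center in enumerate(words):
        for j in range(i - window_size, i + window_size + 1):
            if j != i and 0 <= j < len(words):
                pairs.append((center, words[j]))
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox", "jumps"]))
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ...]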

Let's define some variables:

V   Number of unique words in our corpus of text (Vocabulary)
x   Input layer (one-hot encoding of our input word)
N   Number of neurons in the hidden layer of the neural network
W   Weights between the input layer and the hidden layer
W'  Weights between the hidden layer and the output layer
y   A softmax output layer holding the probabilities of every word in our vocabulary
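To make the input concrete, here is a tiny sketch (the vocabulary and the chosen word are arbitrary) of what the one-hot vector x looks like:

import numpy as np

# One-hot encoding of a word for a toy vocabulary of V = 5 words.
vocab = ["brown", "fox", "jumps", "quick", "the"]
V = len(vocab)

x = np.zeros(V)
x[vocab.index("fox")] = 1
print(x)   # [0. 1. 0. 0. 0.]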

Skip-gram architecture

Now that our neural network architecture is defined, let's do some math to derive the equations needed for gradient descent.

Forward propagation:

Multiply the one-hot encoding of the center word (denoted by x) by the first weight matrix W to get the hidden layer vector h (of size N x 1).

h = W^T x        (N x 1) = (N x V) (V x 1)

Now we multiply the hidden layer vector h by the second weight matrix W' to get a new vector u.


u = W'^T h        (V x 1) = (V x N) (N x 1)
Note that we have to apply the softmax function to u to get our output y.
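To see that the shapes line up, here is a small numpy sketch of the forward pass with made-up dimensions (V = 5, N = 3); the names mirror the implementation further below, but the numbers are arbitrary:

import numpy as np

V, N = 5, 3
W  = np.random.uniform(-0.8, 0.8, (V, N))   # input -> hidden weights
W1 = np.random.uniform(-0.8, 0.8, (N, V))   # hidden -> output weights (W')

x = np.zeros((V, 1))
x[1] = 1                                    # one-hot center word

h = W.T @ x                                 # (N x 1) = (N x V)(V x 1)
u = W1.T @ h                                # (V x 1) = (V x N)(N x 1)
y = np.exp(u - u.max()) / np.exp(u - u.max()).sum()   # softmax over u
print(h.shape, u.shape, round(float(y.sum()), 6))     # (3, 1) (5, 1) 1.0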

Let u_j be the j-th neuron of the layer u.
Let w_j be the j-th word in our vocabulary, where j is any index.
Let V_wj be the j-th column of the matrix W' (the column corresponding to the word w_j).


u_j = V_wj^T h        (1 x 1) = (1 x N) (N x 1)

y = softmax(u)
y_j = the j-th component of softmax(u)
y_j denotes the probability that w_j is a context word

P(w_j | w_i) is the probability that w_j is a context word, given that w_i is the input word.

So our goal is to maximize P(w_j* | w_i), where j* represents the indices of the context words.

It is clear that we want to maximize the product

P(w_j*1 | w_i) * P(w_j*2 | w_i) * ... * P(w_j*C | w_i)

where j*c are the vocabulary indices of the context words and the context words range over c = 1, 2, ..., C.
Let's take the negative log of this probability to get the loss function we want to minimize.

Let t be the actual output vector from our training data for a particular center word. It has 1's at the positions of the context words and 0's everywhere else; t_j*c are the 1's corresponding to the context words. We can multiply u_j*c with t_j*c so that only the context-word terms remain.

Solving this equation gives our loss function as

E = - sum over c = 1..C of u_j*c  +  C * log( sum over j = 1..V of exp(u_j) )
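As a sanity check (my own addition, not part of the original derivation), this expression equals the negative log of the product of the softmax probabilities of the context words; the context indices below are hypothetical:

import numpy as np

# Sanity check (illustration only): for random scores u and two hypothetical
# context positions, -sum(u[j*]) + C*log(sum(exp(u))) equals the negative log
# of the product of the softmax probabilities of the context words.
u = np.random.randn(5)
context_idx = [1, 3]
C = len(context_idx)

loss_derived = -u[context_idx].sum() + C * np.log(np.sum(np.exp(u)))
softmax_u = np.exp(u) / np.exp(u).sum()
loss_nll = -np.log(softmax_u[context_idx]).sum()
print(np.isclose(loss_derived, loss_nll))   # True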

Backpropagation:

The adjustable parameters are the matrices W and W', so we have to find the partial derivatives of our loss function with respect to W and W' in order to apply the gradient descent algorithm.
We need to find dE/dW' and dE/dW.

Writing e for the output-layer error vector (in the implementation below, e = y - t), the chain rule gives

dE/dW' = h · e^T          (size N x V)
dE/dW  = x · (W' · e)^T   (size V x N)

and the gradient descent updates are

W'_new = W'_old - alpha · dE/dW'
W_new  = W_old  - alpha · dE/dW
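These expressions can be spot-checked numerically. The sketch below uses made-up sizes and a single context word (so C = 1); it is only an illustration, not part of the model code:

import numpy as np

# Rough numerical check of dE/dW' = h . e^T and dE/dW = x . (W' . e)^T,
# using made-up sizes and a single context word (so C = 1). Illustration only.
rng = np.random.default_rng(0)
V, N = 5, 3
W = rng.uniform(-0.8, 0.8, (V, N))
W1 = rng.uniform(-0.8, 0.8, (N, V))        # W1 plays the role of W'
x = np.zeros((V, 1)); x[1] = 1             # one-hot center word
t = np.zeros((V, 1)); t[3] = 1             # hypothetical single context word

def loss(W, W1):
    u = W1.T @ (W.T @ x)
    return float(-(t * u).sum() + np.log(np.sum(np.exp(u))))

# Analytic gradients from the derivation above.
h = W.T @ x
u = W1.T @ h
y = np.exp(u - u.max()) / np.exp(u - u.max()).sum()
e = y - t
dLdW1 = h @ e.T
dLdW = x @ (W1 @ e).T

# Finite-difference check of one entry of each gradient matrix.
eps = 1e-6
Wp = W.copy();  Wp[1, 2] += eps
W1p = W1.copy(); W1p[2, 3] += eps
print(np.isclose((loss(Wp, W1) - loss(W, W1)) / eps, dLdW[1, 2], atol=1e-4))
print(np.isclose((loss(W, W1p) - loss(W, W1)) / eps, dLdW1[2, 3], atol=1e-4))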

Below, the implementation is given:

import numpy as np
import string
from nltk.corpus import stopwords


def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

 

class word2vec(object):

    def __init__(self):
        self.N = 10             # number of hidden-layer neurons
        self.X_train = []
        self.y_train = []
        self.window_size = 2
        self.alpha = 0.001      # learning rate
        self.words = []
        self.word_index = {}

 

    def initialize(self, V, data):
        self.V = V
        # weight matrices initialized uniformly in [-0.8, 0.8]
        self.W = np.random.uniform(-0.8, 0.8, (self.V, self.N))
        self.W1 = np.random.uniform(-0.8, 0.8, (self.N, self.V))

        self.words = data
        for i in range(len(data)):
            self.word_index[data[i]] = i

 

 

    def feed_forward(self, X):
        # h = W^T x, u = W'^T h, y = softmax(u)
        self.h = np.dot(self.W.T, X).reshape(self.N, 1)
        self.u = np.dot(self.W1.T, self.h)
        self.y = softmax(self.u)
        return self.y

  

    def backpropagate(self, x, t):
        # e = y - t, shape V x 1
        e = self.y - np.asarray(t).reshape(self.V, 1)
        dLdW1 = np.dot(self.h, e.T)             # dE/dW' = h . e^T
        X = np.array(x).reshape(self.V, 1)
        dLdW = np.dot(X, np.dot(self.W1, e).T)  # dE/dW = x . (W' . e)^T
        self.W1 = self.W1 - self.alpha * dLdW1
        self.W = self.W - self.alpha * dLdW

  

    def train(self, epochs):
        for x in range(1, epochs):
            self.loss = 0
            for j in range(len(self.X_train)):
                self.feed_forward(self.X_train[j])
                self.backpropagate(self.X_train[j], self.y_train[j])
                # accumulate the loss: E = -sum(u_j*) + C * log(sum(exp(u)))
                C = 0
                for m in range(self.V):
                    if self.y_train[j][m]:
                        self.loss += -1 * self.u[m][0]
                        C += 1
                self.loss += C * np.log(np.sum(np.exp(self.u)))
            print("epoch", x, "loss =", self.loss)
            # decay the learning rate
            self.alpha *= 1 / (1 + self.alpha * x)

 

    def predict(self, word, number_of_predictions):
        if word in self.words:
            index = self.word_index[word]
            X = [0 for i in range(self.V)]
            X[index] = 1
            prediction = self.feed_forward(X)
            output = {}
            for i in range(self.V):
                output[prediction[i][0]] = i

            # return the words with the highest predicted probabilities
            top_context_words = []
            for k in sorted(output, reverse=True):
                top_context_words.append(self.words[output[k]])
                if len(top_context_words) >= number_of_predictions:
                    break

            return top_context_words
        else:
            print("Word not found in dictionary")

def preprocessing(corpus):
    stop_words = set(stopwords.words('english'))
    training_data = []
    sentences = corpus.split(".")
    for i in range(len(sentences)):
        sentences[i] = sentences[i].strip()
        sentence = sentences[i].split()
        # drop stop words and punctuation, lowercase the rest
        x = [word.strip(string.punctuation) for word in sentence
             if word not in stop_words]
        x = [word.lower() for word in x]
        training_data.append(x)
    return training_data

 

 

def prepare_data_for_training(sentences, w2v):
    data = {}
    # count word frequencies to build the vocabulary
    for sentence in sentences:
        for word in sentence:
            if word not in data:
                data[word] = 1
            else:
                data[word] += 1
    V = len(data)
    data = sorted(list(data.keys()))
    vocab = {}
    for i in range(len(data)):
        vocab[data[i]] = i

    # build one-hot center-word vectors and multi-hot context vectors
    for sentence in sentences:
        for i in range(len(sentence)):
            center_word = [0 for x in range(V)]
            center_word[vocab[sentence[i]]] = 1
            context = [0 for x in range(V)]

  

            # window of size window_size on each side of the center word
            for j in range(i - w2v.window_size, i + w2v.window_size + 1):
                if i != j and 0 <= j < len(sentence):
                    context[vocab[sentence[j]]] += 1
            w2v.X_train.append(center_word)
            w2v.y_train.append(context)
    w2v.initialize(V, data)

    return w2v.X_train, w2v.y_train
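Finally, a hedged usage sketch of how these pieces fit together; the corpus string, the epoch count, and the prediction count below are arbitrary choices for illustration, not values from the original:

# Requires the NLTK stopword list: nltk.download('stopwords')
corpus = "earth revolves around the sun. moon revolves around the earth"

training_data = preprocessing(corpus)
w2v = word2vec()
prepare_data_for_training(training_data, w2v)

w2v.train(100)                        # arbitrary number of epochs
print(w2v.predict("revolves", 3))     # a few likely context words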