
Implement your own word2vec (skip-gram) model in Python


Natural language processing (NLP) is a subfield of artificial intelligence concerned with the interaction between computers and human (natural) languages.
In NLP we map words and phrases (from a vocabulary or corpus) to vectors of numbers to make them easier to process. These kinds of language modeling techniques are called word embeddings.

In 2013, Google announced word2vec, a group of related models used to produce word embeddings.

Let's implement our own skip-gram model (in Python) by deriving the backpropagation equations of our neural network.

In the word2vec skip-gram architecture, the input is the center word and the predictions are the context words. Consider an array of words W: if W(i) is the input (center word), then W(i-2), W(i-1), W(i+1) and W(i+2) are the context words for a window size of 2.
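
For illustration, here is a minimal sketch (with a made-up sentence) of how the (center word, context words) pairs look with a window size of 2:

sentence = ["the", "earth", "revolves", "around", "the", "sun"]
window = 2

# for every position i, the context is the words within `window`
# positions to the left and right of the center word
for i, center in enumerate(sentence):
    context = [sentence[j] for j in range(i - window, i + window + 1)
               if j != i and 0 <= j < len(sentence)]
    print(center, "->", context)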

Let's define some variables:

V: the number of unique words in our corpus of text (the Vocabulary)
x: the input layer (a one-hot encoding of our input word)
N: the number of neurons in the hidden layer of the neural network
W: the weights between the input layer and the hidden layer
W': the weights between the hidden layer and the output layer
y: the softmax output layer holding the probability of every word in our vocabulary
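
As a small sketch of the one-hot input x (assuming a toy vocabulary of V = 5 made-up words):

vocab = ["around", "earth", "revolves", "sun", "the"]    # V = 5
word_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # V-dimensional vector with a 1 at the word's index and 0 elsewhere
    x = [0] * len(vocab)
    x[word_index[word]] = 1
    return x

print(one_hot("earth"))    # [0, 1, 0, 0, 0]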

Skip-gram architecture

Our neural network architecture has been defined; now let's do some maths to derive the equations needed for gradient descent.

Forward propagation:

Multiply the one-hot encoding of the center word (denoted by x) by the first weight matrix W to get the hidden layer vector h (of size N x 1):

h = Wᵀ · x
(N x 1) = (N x V) · (V x 1)

Now we multiply the hidden layer vector h by the second weight matrix W' to get a new vector u:

u = W'ᵀ · h
(V x 1) = (V x N) · (N x 1)

Note that we must apply softmax to u to get our output y.

Let u_j be the j-th neuron of the layer u.
Let w_j be the j-th word in our vocabulary, where j is any index.
Let Vw_j be the j-th column of the matrix W' (the column corresponding to the word w_j). Then

u_j = Vw_jᵀ · h
(1 x 1) = (1 x N) · (N x 1)

y = softmax(u)
y_j is the j-th component of the softmax output.
y_j denotes the probability that w_j is a context word.
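
Putting the forward pass together as a small NumPy sketch (random weights and a toy vocabulary, just to check the shapes; the variable names follow the definitions above):

import numpy as np

V, N = 5, 3                                  # toy vocabulary size and hidden size
W = np.random.uniform(-0.8, 0.8, (V, N))     # input -> hidden weights
W1 = np.random.uniform(-0.8, 0.8, (N, V))    # hidden -> output weights (W')

x = np.zeros((V, 1))
x[1] = 1                                     # one-hot center word

h = W.T @ x                                  # (N x 1) hidden layer
u = W1.T @ h                                 # (V x 1) scores
y = np.exp(u - u.max()) / np.exp(u - u.max()).sum()    # softmax probabilities
print(y.shape, y.sum())                      # (5, 1) 1.0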

P(w_j | w_i) is the probability that w_j is a context word, given that w_i is the input word.

So our goal is to maximize P(w_j* | w_i), where j* represents the indices of the context words.

It is clear that we want to maximize the product

∏_{c=1..C} P(w_{j*c} | w_i)

where j*c are the vocabulary indices of the context words, and the context words range over c = 1, 2, ..., C.
Let's take the negative log of this likelihood to get the loss function we want to minimize.

Let t be the actual output vector from our training data for a particular center word. It has a 1 at the position of each context word and 0 everywhere else; the entries t_{j*c} are the 1s corresponding to the context words.
We can multiply u element-wise with t so that only the scores of the context words survive.

Solving this gives our loss function:

L = - ∑_{c=1..C} u_{j*c} + C · log ∑_{k=1..V} exp(u_k)
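
As a quick check, this loss can be computed directly from the score vector u and the target vector t (a sketch with made-up numbers; it mirrors the loss accumulation in the training loop further below):

import numpy as np

u = np.array([0.5, -1.2, 2.0, 0.1, -0.3])    # scores for a toy vocabulary of V = 5
t = np.array([0, 1, 0, 1, 0])                # 1s at the context word positions

C = t.sum()                                  # number of context words
loss = -np.sum(u[t == 1]) + C * np.log(np.sum(np.exp(u)))
print(loss)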

Backpropagation:

The adjustable parameters are the matrices W and W', so we have to find the partial derivatives of our loss function with respect to W and W' in order to apply the gradient descent algorithm.
We need to find ∂L/∂W'. Differentiating L with respect to u gives the error vector e = y - t (of size V x 1), and then

∂L/∂W' = h · eᵀ

Now, looking for ∂L/∂W, we get

∂L/∂W = x · (W' · e)ᵀ
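
Both gradients fit in a few NumPy lines (a self-contained sketch with toy shapes, matching the derivation above; e = y - t is the prediction error):

import numpy as np

V, N = 5, 3
W = np.random.uniform(-0.8, 0.8, (V, N))     # input -> hidden
W1 = np.random.uniform(-0.8, 0.8, (N, V))    # hidden -> output (W')
alpha = 0.001                                # learning rate

x = np.zeros((V, 1)); x[1] = 1               # one-hot center word
t = np.zeros((V, 1)); t[2] = t[3] = 1        # 1s at the context word positions

# forward pass
h = W.T @ x
u = W1.T @ h
y = np.exp(u - u.max()) / np.exp(u - u.max()).sum()

# backward pass
e = y - t
dLdW1 = h @ e.T                              # dL/dW', shape (N x V)
dLdW = x @ (W1 @ e).T                        # dL/dW,  shape (V x N)

# gradient descent update
W1 -= alpha * dLdW1
W -= alpha * dLdW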

The implementation is given below:

import numpy as np
import string
from nltk.corpus import stopwords


def softmax(x):
    """Calculate the softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e_x / e_x.sum()

 

class word2vec(object):

    def __init__(self):
        self.N = 10               # number of neurons in the hidden layer
        self.X_train = []
        self.y_train = []
        self.window_size = 2
        self.alpha = 0.001        # learning rate
        self.words = []
        self.word_index = {}

 

    def initialize(self, V, data):
        self.V = V
        self.W = np.random.uniform(-0.8, 0.8, (self.V, self.N))     # input -> hidden weights
        self.W1 = np.random.uniform(-0.8, 0.8, (self.N, self.V))    # hidden -> output weights (W')

        self.words = data
        for i in range(len(data)):
            self.word_index[data[i]] = i

 

 

    def feed_forward(self, X):
        self.h = np.dot(self.W.T, X).reshape(self.N, 1)    # hidden layer, N x 1
        self.u = np.dot(self.W1.T, self.h)                 # scores, V x 1
        self.y = softmax(self.u)                           # output probabilities
        return self.y

  

    def backpropagate(self, x, t):
        e = self.y - np.asarray(t).reshape(self.V, 1)      # error vector, V x 1
        dLdW1 = np.dot(self.h, e.T)                        # gradient for W'
        X = np.array(x).reshape(self.V, 1)
        dLdW = np.dot(X, np.dot(self.W1, e).T)             # gradient for W
        self.W1 = self.W1 - self.alpha * dLdW1
        self.W = self.W - self.alpha * dLdW

  

    def train(self, epochs):
        for x in range(1, epochs):
            self.loss = 0
            for j in range(len(self.X_train)):
                self.feed_forward(self.X_train[j])
                self.backpropagate(self.X_train[j], self.y_train[j])
                # loss = -sum of scores of the context words
                #        + C * log(sum of exp of all scores)
                C = 0
                for m in range(self.V):
                    if self.y_train[j][m]:
                        self.loss += -1 * self.u[m][0]
                        C += 1
                self.loss += C * np.log(np.sum(np.exp(self.u)))
            print("epoch", x, "loss =", self.loss)
            self.alpha *= 1 / (1 + self.alpha * x)    # decay the learning rate

 

    def predict(self, word, number_of_predictions):
        if word in self.words:
            index = self.word_index[word]
            X = [0 for i in range(self.V)]
            X[index] = 1                             # one-hot encode the query word
            prediction = self.feed_forward(X)
            output = {}
            for i in range(self.V):
                output[prediction[i][0]] = i         # map probability -> word index

            top_context_words = []
            for k in sorted(output, reverse=True):   # highest probabilities first
                top_context_words.append(self.words[output[k]])
                if len(top_context_words) >= number_of_predictions:
                    break

            return top_context_words
        else:
            print("Word not found in dictionary")

def preprocessing(corpus):
    stop_words = set(stopwords.words('english'))
    training_data = []
    sentences = corpus.split(".")
    for i in range(len(sentences)):
        sentences[i] = sentences[i].strip()
        sentence = sentences[i].split()
        # drop stop words, strip punctuation and lower-case the rest
        x = [word.strip(string.punctuation) for word in sentence
             if word not in stop_words]
        x = [word.lower() for word in x]
        training_data.append(x)
    return training_data

 

 

def prepare_data_for_training(sentences, w2v):
    data = {}
    # count word frequencies to build the vocabulary
    for sentence in sentences:
        for word in sentence:
            if word not in data:
                data[word] = 1
            else:
                data[word] += 1
    V = len(data)
    data = sorted(list(data.keys()))
    vocab = {}
    for i in range(len(data)):
        vocab[data[i]] = i

    # build one-hot center-word vectors and their context vectors
    for sentence in sentences:
        for i in range(len(sentence)):
            center_word = [0 for x in range(V)]
            center_word[vocab[sentence[i]]] = 1
            context = [0 for x in range(V)]

            for j in range(i - w2v.window_size, i + w2v.window_size + 1):
                if i != j and j >= 0 and j < len(sentence):
                    context[vocab[sentence[j]]] += 1
            w2v.X_train.append(center_word)
            w2v.y_train.append(context)
    w2v.initialize(V, data)

    return w2v.X_train, w2v.y_train
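
Finally, a hypothetical driver to tie everything together (the corpus, epoch count, and query word below are made up for illustration; the NLTK stopwords corpus must be downloaded once with nltk.download('stopwords')):

corpus = "The earth revolves around the sun. The moon revolves around the earth"
epochs = 1000

training_data = preprocessing(corpus)
w2v = word2vec()

prepare_data_for_training(training_data, w2v)
w2v.train(epochs)

print(w2v.predict("around", 3))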