Python | Embedding words with Word2Vec

Word2Vec is a family of models for generating word embeddings. These models are shallow, two-layer neural networks with one input layer, one hidden layer, and one output layer. Word2Vec comes in two architectures:

  1. CBOW (Continuous Bag of Words): the CBOW model predicts the current word from the context words within a given window. The input layer holds the context words and the output layer holds the current word. The hidden layer has as many units as the number of dimensions in which we want to represent the current word at the output layer.
  2. Skip Gram: the Skip Gram model predicts the context words within a given window from the current word. The input layer holds the current word and the output layer holds the context words. The hidden layer has as many units as the number of dimensions in which we want to represent the current word at the input layer.
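
To make the difference between the two architectures concrete, here is a minimal sketch of how training examples could be built from one tokenized sentence. The helper names (context_windows, cbow_examples, skipgram_examples) are hypothetical and exist only for this illustration; real Word2Vec implementations add refinements that this sketch leaves out.

```python
def context_windows(tokens, window=2):
    """Yield (context_words, current_word) for every position in tokens."""
    for i, current in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, current


def cbow_examples(tokens, window=2):
    # CBOW: the whole context is the input, the current word is the target.
    return [(context, current)
            for context, current in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip Gram: the current word is the input, each context word is a target.
    return [(current, ctx)
            for context, current in context_windows(tokens, window)
            for ctx in context]


sentence = ["alice", "was", "beginning", "to", "get"]
for example in cbow_examples(sentence, window=1):
    print("CBOW     ", example)
for example in skipgram_examples(sentence, window=1):
    print("Skip Gram", example)
```

Each CBOW example pairs a list of context words with a single target word, while each Skip Gram example pairs a single input word with one of its context words.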

The main idea behind word embedding is that words that appear in a similar context tend to be closer together in vector space. Generating word vectors in Python requires the nltk and gensim modules.
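
Closeness in vector space is commonly measured with cosine similarity, the quantity this article reports in its output. The sketch below computes it in plain Python. The three-dimensional vectors are invented for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, made up for this example only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0.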

Run these commands in the terminal to install nltk and gensim:

 pip install nltk
 pip install gensim

Download the text file (alice.txt) used to generate word vectors from here.

Below is the implementation:

# Python program to generate word vectors using Word2Vec

# Import all required modules
import warnings

from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models import Word2Vec

warnings.filterwarnings(action="ignore")

# Read the alice.txt file
# (the NLTK tokenizers need the "punkt" data, "punkt_tab" on recent NLTK
# releases; if it is missing, download it once with nltk.download(...))
with open(r"C:\Users\Admin\Desktop\alice.txt", "r") as sample:
    s = sample.read()

# Replace newline characters with spaces
f = s.replace("\n", " ")

data = []

# Iterate over each sentence in the file
for i in sent_tokenize(f):
    temp = []

    # Split the sentence into lowercase words
    for j in word_tokenize(i):
        temp.append(j.lower())

    data.append(temp)

# Create the CBOW model (sg=0 is the default)
# Note: gensim 4.0 renamed the `size` parameter to `vector_size`;
# on older versions pass size=100 instead.
model1 = Word2Vec(data, min_count=1, vector_size=100, window=5)

# Print results (the word vectors are stored on the model's `wv` attribute)
print("Cosine similarity between 'alice' and 'wonderland' - CBOW:",
      model1.wv.similarity("alice", "wonderland"))

print("Cosine similarity between 'alice' and 'machines' - CBOW:",
      model1.wv.similarity("alice", "machines"))

# Create the Skip Gram model (sg=1)
model2 = Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1)

# Print results
print("Cosine similarity between 'alice' and 'wonderland' - Skip Gram:",
      model2.wv.similarity("alice", "wonderland"))

print("Cosine similarity between 'alice' and 'machines' - Skip Gram:",
      model2.wv.similarity("alice", "machines"))

Output:

 Cosine similarity between 'alice' and 'wonderland' - CBOW: 0.999249298413
 Cosine similarity between 'alice' and 'machines' - CBOW: 0.974911910445
 Cosine similarity between 'alice' and 'wonderland' - Skip Gram: 0.885471373104
 Cosine similarity between 'alice' and 'machines' - Skip Gram: 0.856892599521

The output shows the cosine similarity between the vectors of the words "alice", "wonderland" and "machines" for the two models. An interesting exercise is to change the vector size and window values and observe how the cosine similarity changes.
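
One possible way to run that experiment is sketched below. The helper names parameter_grid and compare_settings are hypothetical, the default values are arbitrary choices, and the code assumes gensim 4.x, where the dimensionality parameter is called vector_size (older releases call it size). Treat it as a starting point, not a tuned procedure.

```python
from itertools import product


def parameter_grid(sizes, windows):
    """Return every (vector_size, window) combination to try."""
    return list(product(sizes, windows))


def compare_settings(sentences, word_a, word_b,
                     sizes=(50, 100, 200), windows=(2, 5, 10), sg=0):
    """Train one model per setting and collect the similarity of two words."""
    # Imported here so parameter_grid stays usable without gensim installed.
    from gensim.models import Word2Vec

    results = {}
    for size, window in parameter_grid(sizes, windows):
        model = Word2Vec(sentences, min_count=1,
                         vector_size=size, window=window, sg=sg)
        results[(size, window)] = model.wv.similarity(word_a, word_b)
    return results


# Usage, where `data` is a list of tokenized, lowercased sentences:
# for (size, window), score in compare_settings(data, "alice", "wonderland").items():
#     print(size, window, score)
```

Passing sg=1 repeats the same comparison for Skip Gram. Because training involves random initialization, the scores will differ somewhat from run to run.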

Applications of Word Embedding:

  1. Sentiment Analysis
  2. Speech Recognition
  3. Information Retrieval
  4. Question Answering
