Word2Vec is a family of models for generating word embeddings. Each model is a shallow neural network with an input layer, a single hidden (projection) layer, and an output layer. Word2Vec uses two architectures:
- CBOW (Continuous Bag of Words): CBOW predicts the current word from the context words inside a fixed window. The input layer holds the context words and the output layer holds the current word; the size of the hidden layer sets the number of dimensions used to represent the current word.
- Skip Gram: Skip Gram does the reverse and predicts the surrounding context words inside the window given the current word. The input layer holds the current word and the output layer holds the context words; the size of the hidden layer again sets the number of embedding dimensions. In gensim, the two architectures are selected with a single flag, as shown in the sketch after this list.
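Both architectures are exposed in gensim through the same Word2Vec class. The minimal sketch below (a tiny made-up corpus and illustrative parameter values) shows how the sg flag switches between them; gensim 4.x names the embedding-dimension parameter vector_size, while older 3.x releases call it size.

```python
from gensim.models import Word2Vec

# Tiny made-up corpus: a list of tokenized sentences (illustrative only).
sentences = [["alice", "went", "down", "the", "rabbit", "hole"],
             ["alice", "met", "the", "mad", "hatter"]]

# sg=0 trains the CBOW architecture (the default); sg=1 trains Skip Gram.
cbow_model = Word2Vec(sentences, min_count=1, vector_size=100, window=5, sg=0)
skip_gram_model = Word2Vec(sentences, min_count=1, vector_size=100, window=5, sg=1)
```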
The main idea behind word embeddings is that words appearing in similar contexts tend to lie close together in vector space. Generating word vectors in Python requires the nltk and gensim modules.
Run these commands in the terminal to install nltk and gensim:

```
pip install nltk
pip install gensim
```
Download the text file used to generate the word vectors from here.
Below is the implementation:
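The sketch below is one way such an implementation can look. It assumes the downloaded text has been saved as alice.txt in the working directory (the filename is an assumption) and that the words 'alice', 'wonderland' and 'machines' occur in it; parameter names follow gensim 4.x.

```python
import warnings
warnings.filterwarnings(action='ignore')

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models import Word2Vec

nltk.download('punkt')  # tokenizer data used by sent_tokenize / word_tokenize

# Read the corpus; "alice.txt" is an assumed filename for the downloaded text.
with open("alice.txt", encoding="utf-8") as f:
    text = f.read().replace("\n", " ")

# Split the text into sentences, then each sentence into lowercase word tokens.
data = [[word.lower() for word in word_tokenize(sentence)]
        for sentence in sent_tokenize(text)]

# Train a CBOW model (sg=0, the default) and a Skip Gram model (sg=1).
cbow = Word2Vec(data, min_count=1, vector_size=100, window=5, sg=0)
skip_gram = Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1)

# Cosine similarity between pairs of word vectors for both models.
print("Cosine similarity between 'alice' and 'wonderland' - CBOW:",
      cbow.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' and 'machines' - CBOW:",
      cbow.wv.similarity('alice', 'machines'))
print("Cosine similarity between 'alice' and 'wonderland' - Skip Gram:",
      skip_gram.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' and 'machines' - Skip Gram:",
      skip_gram.wv.similarity('alice', 'machines'))
```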
Output:
```
Cosine similarity between 'alice' and 'wonderland' - CBOW: 0.999249298413
Cosine similarity between 'alice' and 'machines' - CBOW: 0.974911910445
Cosine similarity between 'alice' and 'wonderland' - Skip Gram: 0.885471373104
Cosine similarity between 'alice' and 'machines' - Skip Gram: 0.856892599521
```
The output shows the cosine similarity between the vectors of the words 'alice' and 'wonderland', and between 'alice' and 'machines', for each model. One interesting exercise is to change the vector size and window values and observe how the cosine similarities change.
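For example, reusing the tokenized data list from the sketch above, a model can be retrained with different values (the numbers here are purely illustrative):

```python
# Smaller embedding dimension and wider context window, just as an experiment.
cbow_small = Word2Vec(data, min_count=1, vector_size=50, window=10, sg=0)
print("CBOW (vector_size=50, window=10):",
      cbow_small.wv.similarity('alice', 'wonderland'))
```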
Applications of Word Embedding:
- Sentiment Analysis
- Speech Recognition
- Information Retrieval
- Question Answering