Python | Word similarities using spaCy

The SpaCy model —
spaCy supports two methods for finding word similarity: context-sensitive tensors and word vectors. The commands below download the models.

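# Download the small model, which contains context-sensitive tensors.
python -m spacy download en_core_web_sm
# Download the medium model, which the code in this article loads.
python -m spacy download en_core_web_md
# Download the large model, which has a much bigger set of word vectors.
python -m spacy download en_core_web_lg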

Below is the code to find word similarity that can be extended to sentences and documents.

import spacy

# Load the medium English model, which includes word vectors.
nlp = spacy.load("en_core_web_md")

print("Enter two space-separated words")
words = input()

tokens = nlp(words)

for token in tokens:
    # Print the following attributes of each token.
    # text: the word string,
    # has_vector: whether the model has a vector for the word,
    # vector_norm: the L2 norm of the word's vector,
    # is_oov: whether the word is out of vocabulary.
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

token1, token2 = tokens[0], tokens[1]

print("Similarity:", token1.similarity(token2))

Output (for the input "cat dog"):

cat True 6.6808186 False
dog True 7.0336733 False
Similarity: 0.80168545
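The similarity score printed above is the cosine similarity of the two word vectors, and vector_norm is each vector's Euclidean length. Here is a minimal sketch in plain Python, using made-up 3-dimensional vectors rather than real model output. The names norm and cosine_similarity, and the choice to return 0.0 for an all-zero vector, are assumptions for illustration.

```python
import math


def norm(vec):
    """Euclidean (L2) length of a vector, the quantity reported as vector_norm."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    norm_a, norm_b = norm(a), norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a missing (all-zero) vector as "no similarity".
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration; real model vectors have 300 values.
cat = [1.0, 2.0, 2.0]
dog = [2.0, 1.0, 2.0]

print(norm(cat))                    # 3.0
print(cosine_similarity(cat, dog))  # 8 / 9, roughly 0.889
```

A score close to 1 means the vectors point in nearly the same direction. The vector lengths cancel out of the result.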

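The "en_core_web_md" model gives 300-dimensional vectors for "dog" and "cat". You can also use a larger model such as en_core_web_lg (or en_vectors_web_lg in older spaCy releases), which covers a much larger vocabulary. The vectors are still 300-dimensional, so the gain is in coverage rather than in dimension.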
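The same similarity call extends to sentences and documents because, by default, spaCy represents a document by the average of its token vectors and compares those averages. Below is a rough sketch of that idea with hypothetical 2-dimensional vectors. The toy_vectors table and the helper names are invented for illustration and are not spaCy internals.

```python
import math


def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)


# Hypothetical word vectors, chosen only to keep the arithmetic easy to follow.
toy_vectors = {
    "cats": [1.0, 0.0],
    "dogs": [0.8, 0.6],
    "purr": [0.6, 0.8],
    "bark": [0.0, 1.0],
}


def sentence_vector(sentence):
    """Average the vectors of the space-separated words in a sentence."""
    return average_vector([toy_vectors[word] for word in sentence.split()])


first = sentence_vector("cats purr")    # approximately [0.8, 0.4]
second = sentence_vector("dogs bark")   # approximately [0.4, 0.8]
print(cosine_similarity(first, second)) # approximately 0.8
```

With a loaded model, the equivalent call is nlp("cats purr").similarity(nlp("dogs bark")).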

Using custom language models —
By simply switching the language model, we can find similarities between Latin, French or German documents. At the time of writing, spaCy supported 49 languages. spaCy also lets you assign custom vectors to words as needed. Below is an example.

import spacy
import numpy as np

nlp = spacy.load("en_core_web_md")
vocab = nlp.vocab  # the vocabulary that holds the model's word vectors

new_word = "bucrest"

print("Before custom setting")
print(vocab.get_vector(new_word))

# Draw a random 300-dimensional vector to match the model's vector width.
custom_vector = np.random.uniform(-1, 1, (300,))
vocab.set_vector(new_word, custom_vector)

print("After custom setting")
print(vocab.get_vector(new_word))

Output:

Before custom setting
array([0., 0., 0., 0., 0., 0., 0., 0., ---])
After custom setting
array([0.68106073, 0.6037007, 0.9526876, -0.25600302, -0.24049562, ---])

Because the custom vector is random, the values after setting differ on each run.
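As the output shows, a word with no stored vector comes back as all zeros until a vector is set for it. Here is a minimal sketch of that lookup-with-default pattern, using a plain dictionary as a hypothetical stand-in for the vocabulary's vector table. ToyVectorTable and its methods are invented for illustration and do not reflect how spaCy stores vectors.

```python
class ToyVectorTable:
    """Hypothetical stand-in for a vocabulary's vector table."""

    def __init__(self, width):
        self.width = width   # number of values in every vector
        self._rows = {}      # word -> vector

    def has_vector(self, word):
        return word in self._rows

    def get_vector(self, word):
        # Unknown words fall back to an all-zero vector of the right width.
        return list(self._rows.get(word, [0.0] * self.width))

    def set_vector(self, word, vector):
        if len(vector) != self.width:
            raise ValueError(f"expected {self.width} values, got {len(vector)}")
        self._rows[word] = list(vector)


table = ToyVectorTable(width=4)
print(table.has_vector("bucrest"), table.get_vector("bucrest"))
# False [0.0, 0.0, 0.0, 0.0]

table.set_vector("bucrest", [0.5, -0.25, 0.0, 1.0])
print(table.has_vector("bucrest"), table.get_vector("bucrest"))
# True [0.5, -0.25, 0.0, 1.0]
```

The width check mirrors the reason the example above draws exactly 300 values: a custom vector has to match the width of the vectors already in the model.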