
Python | Measure the similarity between two sentences using cosine similarity


This program uses cosine similarity and the nltk toolkit module. To run this program, nltk must be installed on your system. To install the nltk module, follow these steps:

1. Open a terminal (Linux).
2. sudo pip3 install nltk
3. python3
4. import nltk
5. nltk.download('all')

Functions used —

nltk.tokenize: It is used for tokenization. Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens. word_tokenize(X) splits the given sentence X into words and returns a list.

nltk.corpus: In this program, it is used to get a list of stopwords. A stop word is a commonly used word (such as “the”, “a”, “an”, “in”).

Below is the Python implementation —

# Program to measure the similarity between
# two sentences using cosine similarity.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# X = input("Enter the first line: ").lower()
# Y = input("Enter the second line: ").lower()
X = "I love horror movies"
Y = "Lights out is a horror movie"

# tokenization
X_list = word_tokenize(X)
Y_list = word_tokenize(Y)

# sw contains the list of stopwords
sw = stopwords.words('english')
l1 = []
l2 = []

# remove stop words from the sentences
X_set = {w for w in X_list if w not in sw}
Y_set = {w for w in Y_list if w not in sw}

# form a set containing keywords of both sentences
rvector = X_set.union(Y_set)
for w in rvector:
    if w in X_set:
        l1.append(1)  # create a vector
    else:
        l1.append(0)
    if w in Y_set:
        l2.append(1)
    else:
        l2.append(0)
c = 0

# cosine formula
for i in range(len(rvector)):
    c += l1[i] * l2[i]
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity:", cosine)

Output:

 similarity: 0.2886751345948129
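The cosine formula itself does not depend on nltk. As a sanity check, here is a minimal sketch of the same computation on the binary keyword vectors; the ordering of rvector shown in the comment is hypothetical, since Python sets are unordered:

```python
import math

def binary_cosine(l1, l2):
    # dot product of the two binary vectors
    c = sum(x * y for x, y in zip(l1, l2))
    # divide by the product of the vector norms;
    # for 0/1 vectors, sum(v) equals the squared norm of v
    return c / math.sqrt(sum(l1) * sum(l2))

# one possible ordering of rvector for the sentences above:
# [I, love, movies, horror, Lights, movie]
l1 = [1, 1, 1, 1, 0, 0]  # keywords of X
l2 = [0, 0, 0, 1, 1, 1]  # keywords of Y
print(binary_cosine(l1, l2))  # 1 / sqrt(4 * 3) ≈ 0.2887
```

Only "horror" is shared between the two keyword sets, so the dot product is 1 and the similarity is 1 / sqrt(4 * 3).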