
The tf-idf value increases in proportion to the number of times a word appears in a document, and is offset by the number of documents in the corpus that contain the word. This corrects for the fact that some words appear more often than others in general.

TF-IDF combines two statistics: Term Frequency (TF) and Inverse Document Frequency (IDF). Term frequency is the number of times a given term t appears in a document, relative to the total number of words in that document. Inverse document frequency measures how much information a word provides: it indicates how common or rare the word is across all documents in the corpus, so that rare words receive a higher weight.

TF-IDF can be calculated as tf * idf
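The product tf * idf can be worked through by hand on a tiny corpus. The following is a minimal sketch, assuming whitespace tokenization and the plain textbook form idf = log(N / df); the helper names and the toy corpus are made up for illustration. scikit-learn's TfidfVectorizer uses a smoothed idf and normalizes each vector by default, so its numbers will differ.

```python
import math

def term_frequency(term, document):
    """Share of the words in `document` that are `term`."""
    words = document.lower().split()
    return words.count(term) / len(words)

def inverse_document_frequency(term, corpus):
    """log(N / df): larger when fewer documents contain `term`."""
    n_containing = sum(1 for doc in corpus if term in doc.lower().split())
    if n_containing == 0:
        return 0.0  # term never occurs; avoid division by zero
    return math.log(len(corpus) / n_containing)

def tf_idf(term, document, corpus):
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

# Hypothetical toy corpus, used only for this illustration.
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

score_the = tf_idf("the", toy_corpus[0], toy_corpus)  # occurs in every document
score_mat = tf_idf("mat", toy_corpus[0], toy_corpus)  # occurs in one document only
print(score_the, score_mat)
```

Although "the" occurs twice in the first sentence, its idf is log(3 / 3) = 0, so its score is zero. The word "mat" occurs once but in only one document, so it receives a positive weight.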

TF-IDF does not by itself turn raw data into useful features. First, it converts the raw strings of the dataset into vectors, in which each word has its own dimension. Then a method that works on vectors, such as cosine similarity, is used to compute a feature. Since a string cannot be passed directly to a model, TF-IDF provides a numeric representation of each document.

To extract features from text documents, we import TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer
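Before working with a full dataset, the string-to-vector step and a vector-based measure can be tried on a few sentences held in memory. This is a small sketch rather than the article's own method: the three example sentences are invented, and the vectorizer is fitted on them directly instead of on a separate corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical example sentences, used only for this illustration.
docs = [
    "how do i learn python",
    "what is the best way to learn python",
    "where can i buy a bicycle",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix, one row per sentence

# Each distinct word kept by the tokenizer becomes one column.
n_terms = len(vectorizer.vocabulary_)

# 3 x 3 matrix: entry [i, j] is the cosine similarity of sentences i and j.
similarity = cosine_similarity(matrix)

print(matrix.shape, n_terms)
print(similarity.round(3))
```

The first two sentences share the words "learn" and "python", so their similarity is above zero. The first and third share only "i", which the default tokenizer drops because it keeps tokens of two or more characters, so their similarity is zero.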

** Input: **

1st Sentence - "hello i am pulkit"

2nd Sentence - "your name is akshit"

** Code: Python code to find similarity measures **

```python
# importing libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import euclidean_distances
import pandas as pd
import numpy as np


## Convert 3D array to 1D array
def arr_convert_1d(arr):
    arr = np.array(arr)
    arr = np.concatenate(arr, axis=0)
    arr = np.concatenate(arr, axis=0)
    return arr


## Cosine Similarity
cos = []

def cosine(trans):
    cos.append(cosine_similarity(trans[0], trans[1]))


## Manhattan Distance
manhatten = []

def manhatten_distance(trans):
    manhatten.append(pairwise_distances(trans[0], trans[1],
                                        metric='manhattan'))


## Euclidean distance
euclidean = []

def euclidean_function(vectors):
    euc = euclidean_distances(vectors[0], vectors[1])
    euclidean.append(euc)


# This function finds similarities between the two
# sentences using the above functions.
## TF - IDF
def tfidf(str1, str2):
    ques = []

    # You must provide a dataset. Dataset link
    # is given at the end of this article.
    # If you are using a different dataset, adjust
    # according to columns and rows of your dataset.
    dataset = pd.read_csv(
        r'C:\Users\dell\Desktop\quora_duplicate_questions.tsv',
        delimiter='\t', encoding='utf-8')

    x = dataset.iloc[:, 1:5]
    x = x.dropna(how='any')

    # Collect both question columns into one list of strings
    for k in range(len(x)):
        for j in [2, 3]:
            ques.append(x.iloc[k, j])

    vect = TfidfVectorizer()

    # Fit on the entire dataset. This builds the vocabulary and
    # idf weights from the words in the corpus / dataset.
    vect.fit(ques)

    # Transform only the two input sentences into TF-IDF vectors
    corpus = [str1, str2]
    trans = vect.transform(corpus)

    euclidean_function(trans)
    cosine(trans)
    manhatten_distance(trans)
    return convert()


# Gather the collected measures into a single DataFrame
def convert():
    dataf = pd.DataFrame()
    dataf['manhatten'] = arr_convert_1d(manhatten)
    dataf['cos_sim'] = arr_convert_1d(cos)
    dataf['euclidean'] = arr_convert_1d(euclidean)
    return dataf


str1 = "hello i am pulkit"
str2 = "your name is akshit"
newData = tfidf(str1, str2)
print(newData)
```

** Output: **

       manhatten  cos_sim  euclidean
    0   2.955813      0.0   1.414214
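The cosine and Euclidean values in this output are what one would expect for two sentences that share no words, given that TfidfVectorizer scales each vector to unit length by default. The sketch below checks this with two made-up unit vectors that have no non-zero positions in common; the numbers in `a` and `b` are hypothetical and are not the actual TF-IDF weights of the two input sentences.

```python
import numpy as np

# Hypothetical TF-IDF rows for two sentences with no words in common.
# Both have length 1, matching TfidfVectorizer's default L2 normalization.
a = np.array([0.6, 0.8, 0.0, 0.0])
b = np.array([0.0, 0.0, 0.28, 0.96])

# No shared non-zero positions, so the dot product (and the cosine) is 0.
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# For orthogonal unit vectors the Euclidean distance is sqrt(1 + 1) = sqrt(2).
euclidean_dist = np.linalg.norm(a - b)

# The Manhattan distance is the sum of absolute weights, so it depends
# on the individual values and not only on the overlap.
manhattan_dist = np.abs(a - b).sum()

print(cosine_sim, euclidean_dist, manhattan_dist)
```

The Euclidean distance comes out as the square root of 2, about 1.414214, which matches the value in the output above. The Manhattan distance has no such fixed value, which is why it differs between this toy example and the real output.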

** Dataset: ** Google Drive link

** Note: ** The dataset is large, so the code takes 30-40 seconds to produce output. The code will not run as it is: it works only when you copy it into your IDE and provide the path to your dataset in the tfidf function.
