ML | T-distributed stochastic neighbor embedding (t-SNE) algorithm

What is dimensionality reduction?
Dimensionality reduction is a method of representing n-dimensional data (high-dimensional data with many features) in 2 or 3 dimensions.

As an example, consider a classification problem: whether a student will play football or not, depending on both temperature and humidity. Since the two features are highly correlated, they can be summarized by a single underlying feature, so the number of features in such a task can be reduced. A three-dimensional classification problem is hard to visualize, whereas a two-dimensional one can be mapped onto a simple 2-D plane, and a one-dimensional one onto a simple line.
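The correlated-features idea above can be sketched with a tiny synthetic example (the data and the use of PCA here are illustrative, not part of the original article): two strongly correlated features collapse into a single component with almost no loss of information.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: "humidity" is almost a linear function of "temperature",
# so the two features carry essentially one piece of information.
rng = np.random.default_rng(0)
temperature = rng.normal(25, 5, size=100)
humidity = 2 * temperature + rng.normal(0, 1, size=100)
X = np.column_stack([temperature, humidity])

# Reduce the two correlated features to a single component.
pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)

print(X.shape, "->", X_1d.shape)
print(pca.explained_variance_ratio_[0])  # close to 1: one component suffices
```

Because the features are nearly collinear, the first principal component explains almost all of the variance.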

How does t-SNE work?
t-SNE is a nonlinear dimensionality-reduction algorithm that finds patterns in the data based on the similarity of data points: the similarity of a pair of points is computed as the conditional probability that point A would pick point B as its neighbor.
It then minimizes the difference between these conditional probabilities (similarities) in the high-dimensional and the low-dimensional space, so that the low-dimensional embedding represents the data points as faithfully as possible.
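The high-dimensional similarities described above can be sketched in a few lines of NumPy. This is a simplified illustration: a single fixed sigma is used for all points, whereas the real algorithm tunes sigma per point to match a target perplexity.

```python
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    # Squared Euclidean distances between all pairs of points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian similarity centred on each point.
    P = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)  # a point never picks itself as a neighbour
    # Normalize each row: P[i, j] = probability that point i picks point j.
    return P / P.sum(axis=1, keepdims=True)

# Two nearby points and one far-away point (made-up coordinates).
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P = conditional_probabilities(X)
print(P.round(3))  # each row sums to 1; nearby points get higher probability
```

t-SNE then defines an analogous (Student-t based) similarity in the low-dimensional space and moves the embedded points to make the two distributions match.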

Space and time complexity

t-SNE computes pairwise similarities between all data points, so both time and memory scale quadratically, O(n²), in the number of points; the Barnes-Hut approximation (scikit-learn's default, method='barnes_hut') reduces the running time to roughly O(n log n). This quadratic cost is why only a subset of MNIST is embedded below.
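A small sketch contrasting scikit-learn's two t-SNE solvers, the exact O(n²) gradient and the Barnes-Hut approximation (both are real values of the `method` parameter); the random dataset and its size are made up for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Small random dataset so the exact solver finishes quickly.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))

# Exact gradient: O(n^2) per iteration.
exact = TSNE(n_components=2, method="exact", random_state=0).fit_transform(X)

# Barnes-Hut approximation (the default): roughly O(n log n) per iteration,
# available only for n_components <= 3.
approx = TSNE(n_components=2, method="barnes_hut", random_state=0).fit_transform(X)

print(exact.shape, approx.shape)
```

For a few hundred points the difference is negligible, but for tens of thousands of points the exact solver becomes impractical.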

Applying t-SNE to the MNIST dataset

# Import required modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

Code # 1: Reading data

# Reading the data using pandas
df = pd.read_csv('mnist_train.csv')

# print the first four rows of df
print(df.head(4))

# save the labels to the variable labels
labels = df['label']

# drop the label column and keep the pixel data in data
data = df.drop('label', axis=1)

Output:

Code # 2: data preprocessing

# Data preprocessing: standardizing the data
standardized_data = StandardScaler().fit_transform(data)

print(standardized_data.shape)

Output:

Code # 3: applying t-SNE

# t-SNE
# Pick the first 1000 points, since t-SNE
# takes a long time for 15K points
data_1000 = standardized_data[0:1000, :]
labels_1000 = labels[0:1000]

model = TSNE(n_components=2, random_state=0)
# parameters:
# n_components = 2
# default perplexity = 30
# default learning rate = 200
# default maximum number of iterations
# for optimization = 1000

tsne_data = model.fit_transform(data_1000)

# create a new data frame to
# help us plot the result
tsne_data = np.vstack((tsne_data.T, labels_1000)).T
tsne_df = pd.DataFrame(data=tsne_data,
                       columns=("Dim_1", "Dim_2", "label"))

# Plotting the t-SNE result
# (on older seaborn versions use size=6 instead of height=6)
sn.FacetGrid(tsne_df, hue="label", height=6).map(
    plt.scatter, 'Dim_1', 'Dim_2').add_legend()

plt.show()

Output:
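One caveat worth adding: the t-SNE picture depends strongly on the perplexity parameter, so it is good practice to re-run the embedding with a few values rather than trusting a single plot. The sketch below uses synthetic data as a stand-in for the standardized MNIST subset above.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for data_1000 (200 points, 50 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Re-embed with several perplexities; perplexity must stay below n_samples.
embeddings = {}
for perplexity in (5, 30, 50):
    embeddings[perplexity] = TSNE(n_components=2, perplexity=perplexity,
                                  random_state=0).fit_transform(X)
    print(perplexity, embeddings[perplexity].shape)
```

Low perplexity emphasizes very local structure and tends to fragment clusters; higher values emphasize more global structure.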
