ML | Data classification using auto-encoder

This article shows how to use an auto-encoder to classify data. The dataset used below contains credit card transactions, and the task is to predict whether a given transaction is fraudulent. The data can be downloaded here.

Step 1: Import the required libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
from keras.layers import Input, Dense
from keras.models import Model, Sequential
from keras import regularizers
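The imports above assume the standalone keras package. On a recent TensorFlow installation, the same classes can be imported from TensorFlow's bundled Keras instead (an environment-dependent detail, not part of the original code):

# Equivalent imports via TensorFlow's bundled Keras
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import regularizers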

Step 2: Load data

# Change the workspace to the data location
cd C:\Users\Dev\Desktop\Kaggle\Credit Card Fraud

# Load the dataset
df = pd.read_csv('creditcard.csv')

# Convert the Time values (seconds) into the hour of the day
df['Time'] = df['Time'].apply(lambda x: (x / 3600) % 24)

# Separate normal and fraudulent transactions
fraud = df[df['Class'] == 1]
normal = df[df['Class'] == 0].sample(2500)

# Reduce the dataset due to machine limitations
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
df = pd.concat([normal, fraud]).reset_index(drop=True)

# Separate the dependent and independent variables
y = df['Class']
X = df.drop('Class', axis=1)
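Since .sample(2500) draws a random subset, the reduced dataset will differ between runs. For reproducibility, you can seed the sampling (the random_state value below is an arbitrary choice of ours):

# Reproducible variant of the sampling line above
normal = df[df['Class'] == 0].sample(2500, random_state=42)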

Step 3: Examine the data

a)

df.head()

b)

df.info()

c)

df.describe()
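While examining the data, it is also worth checking how imbalanced the classes are (a quick check we suggest, not part of the original walkthrough):

# Count the normal (0) and fraudulent (1) transactions
print(df['Class'].value_counts())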

Step 4: Defining a utility function for plotting data

def tsne_plot(x, y):

    # Setting the plot background
    sns.set(style="whitegrid")

    tsne = TSNE(n_components=2, random_state=0)

    # Reducing the dimensionality of the data
    X_transformed = tsne.fit_transform(x)

    plt.figure(figsize=(12, 8))

    # Building the scatter plot
    plt.scatter(X_transformed[np.where(y == 0), 0],
                X_transformed[np.where(y == 0), 1],
                marker='o', color='y', linewidth=1,
                alpha=0.8, label='Normal')
    plt.scatter(X_transformed[np.where(y == 1), 0],
                X_transformed[np.where(y == 1), 1],
                marker='o', color='k', linewidth=1,
                alpha=0.8, label='Fraud')

    # Specifying the location of the legend
    plt.legend(loc='best')

    # Displaying the plot
    plt.show()

Step 5: Visualize the raw data

tsne_plot(X, y)

Note that the data is not easily linearly separable at this point. In the next steps, we will encode the data with the auto-encoder and analyze the results.
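To put a rough number on this, one can fit a quick linear baseline on the raw features (this baseline is our addition, not part of the original walkthrough; max_iter is raised only to avoid convergence warnings):

# Hypothetical baseline: a linear model on the raw, unencoded features
Xb_train, Xb_test, yb_train, yb_test = train_test_split(X, y, test_size=0.2)
baseline = LogisticRegression(max_iter=1000).fit(Xb_train, yb_train)
print('Raw-data baseline accuracy:', accuracy_score(yb_test, baseline.predict(Xb_test)))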

Step 6: Clean up the data to make it suitable for the auto-encoder

# Scale the data to make it suitable for the auto-encoder
X_scaled = MinMaxScaler().fit_transform(X)
X_normal_scaled = X_scaled[y == 0]
X_fraud_scaled = X_scaled[y == 1]
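One caveat worth noting (our observation, not from the original article): the scaler here is fit on the entire dataset before any train/test split, which leaks test-set statistics into training. A stricter variant fits the scaler on the training portion only:

# Stricter variant (sketch): fit the scaler on training data only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)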

Step 7: Building an auto-encoder neural network

# Building the input layer
input_layer = Input(shape=(X.shape[1], ))

# Building the encoder network
encoded = Dense(100, activation='tanh',
                activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoded = Dense(50, activation='tanh',
                activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(25, activation='tanh',
                activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(12, activation='tanh',
                activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(6, activation='relu')(encoded)

# Building the decoder network
decoded = Dense(12, activation='tanh')(encoded)
decoded = Dense(25, activation='tanh')(decoded)
decoded = Dense(50, activation='tanh')(decoded)
decoded = Dense(100, activation='tanh')(decoded)

# Building the output layer
output_layer = Dense(X.shape[1], activation='relu')(decoded)

Step 8: Define and train the auto-encoder

# Defining the auto-encoder network parameters
autoencoder = Model(input_layer, output_layer)
autoencoder.compile(optimizer="adadelta", loss="mse")

# Training the auto-encoder network
autoencoder.fit(X_normal_scaled, X_normal_scaled,
                batch_size=16, epochs=10,
                shuffle=True, validation_split=0.20)
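As a quick sanity check on the trained network (a sketch we add here, not part of the original steps), the reconstruction error should typically be higher on fraudulent transactions, since the auto-encoder was trained only on normal ones:

# Comparing mean reconstruction error on normal vs. fraudulent transactions
normal_mse = np.mean(np.square(X_normal_scaled - autoencoder.predict(X_normal_scaled)))
fraud_mse = np.mean(np.square(X_fraud_scaled - autoencoder.predict(X_fraud_scaled)))
print('Normal MSE:', normal_mse, ' Fraud MSE:', fraud_mse)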

Step 9: Retain the encoder part of the auto-encoder to encode data

hidden_representation = Sequential()
hidden_representation.add(autoencoder.layers[0])
hidden_representation.add(autoencoder.layers[1])
hidden_representation.add(autoencoder.layers[2])
hidden_representation.add(autoencoder.layers[3])
hidden_representation.add(autoencoder.layers[4])
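An equivalent way to extract the encoder, given the functional model built above, is to define a new Model over the trained layers. Note that autoencoder.layers[0] is the input layer, so the five layers copied above end at the 12-unit Dense layer rather than the 6-unit bottleneck; use autoencoder.layers[5].output instead if you want the 6-dimensional encoding:

# Equivalent functional-API encoder (sketch); layers[4] is the 12-unit layer
encoder = Model(autoencoder.input, autoencoder.layers[4].output)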

Step 10: Encode the data and visualize the encoded data

# Separating the auto-encoded points into normal and fraudulent

normal_hidden_rep = hidden_representation.predict(X_normal_scaled)
fraud_hidden_rep = hidden_representation.predict(X_fraud_scaled)

# Combining the encoded points into one table
encoded_X = np.append(normal_hidden_rep, fraud_hidden_rep, axis=0)
y_normal = np.zeros(normal_hidden_rep.shape[0])
y_fraud = np.ones(fraud_hidden_rep.shape[0])
encoded_y = np.append(y_normal, y_fraud)

# Plotting the encoded points
tsne_plot(encoded_X, encoded_y)

Note that after encoding, the data is much closer to being linearly separable. Thus, in some cases, encoding the data can make the classification boundary for the data linear. To analyze this point numerically, we will fit a linear logistic regression model on the encoded data and a support vector classifier on the original data.

Step 11: Separate raw and encoded data into training and test data

# Splitting the encoded data for linear classification
X_train_encoded, X_test_encoded, y_train_encoded, y_test_encoded = train_test_split(encoded_X, encoded_y, test_size=0.2)

# Splitting the original data for non-linear classification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Step 12: Build a logistic regression model and evaluate its performance

# Building the logistic regression model
lrclf = LogisticRegression()
lrclf.fit(X_train_encoded, y_train_encoded)

# Storing the predictions of the linear model
y_pred_lrclf = lrclf.predict(X_test_encoded)

# Evaluating the performance of the linear model
print('Accuracy : ' + str(accuracy_score(y_test_encoded, y_pred_lrclf)))

Step 13: Building a support vector classifier model and evaluating its performance

# Building the SVM model
svmclf = SVC()
svmclf.fit(X_train, y_train)

# Storing the predictions of the non-linear model
y_pred_svmclf = svmclf.predict(X_test)

# Evaluating the performance of the non-linear model
print('Accuracy : ' + str(accuracy_score(y_test, y_pred_svmclf)))

Thus, the performance metrics support the point made above: encoding the data can sometimes help make the data linearly separable, since the performance of the linear logistic regression model on the encoded data is very close to that of the non-linear support vector classifier on the raw data.
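Because the classes are imbalanced (2500 sampled normal transactions versus the much smaller fraud set), accuracy alone can be optimistic. A per-class report (a suggested addition, using scikit-learn's classification_report) gives a fuller picture:

# Per-class precision and recall for both models
from sklearn.metrics import classification_report

print(classification_report(y_test_encoded, y_pred_lrclf))
print(classification_report(y_test, y_pred_svmclf))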