ML | Data classification using auto-encoder


This article shows how to use an auto-encoder to classify data. The dataset used below contains credit card transactions, and the goal is to predict whether a given transaction is fraudulent. The data can be downloaded here .

Step 1: Import the required libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
from keras.layers import Input, Dense
from keras.models import Model, Sequential
from keras import regularizers

Step 2: Load data

# Change the working directory to the data location
cd C:\Users\Dev\Desktop\Kaggle\Credit Card Fraud

# Load the dataset
df = pd.read_csv('creditcard.csv')

# Convert the Time column (seconds since the first transaction) to hour of day
df['Time'] = df['Time'].apply(lambda x: (x / 3600) % 24)

# Separate normal and fraudulent transactions
fraud = df[df['Class'] == 1]
normal = df[df['Class'] == 0].sample(2500)

# Reduce the dataset due to machine limitations
df = normal.append(fraud).reset_index(drop=True)

# Separate dependent and independent variables
y = df['Class']
X = df.drop('Class', axis=1)
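The Time transform above is worth a quick sanity check: dividing by 3600 converts seconds since the first transaction into hours, and taking the result modulo 24 wraps it into an hour of day. A minimal check (the helper name is ours, not part of the dataset or the pipeline):

```python
def to_hour_of_day(seconds):
    # Seconds elapsed -> hours, wrapped into the range [0, 24)
    return (seconds / 3600) % 24

print(to_hour_of_day(3600))    # one hour in -> 1.0
print(to_hour_of_day(90000))   # 25 hours in wraps around to 1.0
```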

Step 3: Examine the data

a) df.head()

b) df.info()

c) df.describe()
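Besides head/info/describe, it is worth checking the class balance: in the full Kaggle dataset frauds are a tiny fraction of all transactions, which is why the code above downsamples the normal class. A small sketch on a toy frame (in practice you would run value_counts on the real df):

```python
import pandas as pd

# Toy stand-in for df; the real loaded dataset would be used in practice
df_toy = pd.DataFrame({'Class': [0] * 97 + [1] * 3})

counts = df_toy['Class'].value_counts()
print(counts)                                        # 0: 97, 1: 3
print('Fraud fraction:', counts[1] / len(df_toy))    # 0.03
```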

Step 4: Defining a utility function for plotting data

def tsne_plot(x, y):
    # Set the plot style
    sns.set(style="whitegrid")

    tsne = TSNE(n_components=2, random_state=0)

    # Reduce the data to two dimensions
    X_transformed = tsne.fit_transform(x)

    plt.figure(figsize=(12, 8))

    # Build the scatter plot
    plt.scatter(X_transformed[np.where(y == 0), 0],
                X_transformed[np.where(y == 0), 1],
                marker='o', color='y', linewidth=1,
                alpha=0.8, label='Normal')
    plt.scatter(X_transformed[np.where(y == 1), 0],
                X_transformed[np.where(y == 1), 1],
                marker='o', color='k', linewidth=1,
                alpha=0.8, label='Fraud')

    # Place the legend
    plt.legend(loc='best')

    # Show the plot
    plt.show()

Step 5: Visualize the raw data

tsne_plot (X, y)

Note that the classes are not easily separable in the raw feature space. In the next steps we will encode the data with the auto-encoder and analyze the results.

Step 6: Clean up the data to make it suitable for the auto-encoder

# Scale the data to make it suitable for the auto-encoder
X_scaled = MinMaxScaler().fit_transform(X)
X_normal_scaled = X_scaled[y == 0]
X_fraud_scaled = X_scaled[y == 1]
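MinMaxScaler maps each column independently onto [0, 1], which keeps features with very different ranges (like Time and Amount) from dominating the tanh/relu layers below. A quick sketch on toy numbers confirming the behavior:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix with very different column scales
X_toy = np.array([[1.0, 200.0],
                  [5.0, 400.0],
                  [3.0, 300.0]])

X_toy_scaled = MinMaxScaler().fit_transform(X_toy)
print(X_toy_scaled.min(axis=0))   # [0. 0.]
print(X_toy_scaled.max(axis=0))   # [1. 1.]
print(X_toy_scaled[2])            # midpoints of both columns -> [0.5 0.5]
```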

Step 7: Building an auto-encoder neural network

# Build the input layer
input_layer = Input(shape=(X.shape[1], ))

# Build the encoder network
encoded = Dense(100, activation='tanh',
                activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoded = Dense(50, activation='tanh',
                activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(25, activation='tanh',
                activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(12, activation='tanh',
                activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(6, activation='relu')(encoded)

# Build the decoder network
decoded = Dense(12, activation='tanh')(encoded)
decoded = Dense(25, activation='tanh')(decoded)
decoded = Dense(50, activation='tanh')(decoded)
decoded = Dense(100, activation='tanh')(decoded)

# Build the output layer
output_layer = Dense(X.shape[1], activation='relu')(decoded)

Step 8: Define and train the auto-encoder

# Define the auto-encoder model
autoencoder = Model(input_layer, output_layer)
autoencoder.compile(optimizer="adadelta", loss="mse")

# Train the auto-encoder on normal transactions only
autoencoder.fit(X_normal_scaled, X_normal_scaled,
                batch_size=16, epochs=10,
                shuffle=True, validation_split=0.20)
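Because the network is trained to reconstruct only normal transactions, its per-sample reconstruction error (the same MSE used as the loss above) tends to come out higher on fraudulent inputs. This is a common complementary way to use such a model; the sketch below illustrates the per-sample error computation on toy numbers, not on the trained network:

```python
import numpy as np

# Toy inputs and their (hypothetical) reconstructions
X_in  = np.array([[0.10, 0.20],    # "normal"-like row, reconstructed well
                  [0.90, 0.80]])   # "fraud"-like row, reconstructed poorly
X_rec = np.array([[0.10, 0.25],
                  [0.50, 0.40]])

# Per-sample mean squared reconstruction error
errors = np.mean((X_in - X_rec) ** 2, axis=1)
print(errors)   # the second row has the larger error
```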

Step 9: Save the encoder part of the auto-encoder for encoding data

Step 10: Encoding data and visualizing encoded data

hidden_representation = Sequential()
hidden_representation.add(autoencoder.layers[0])
hidden_representation.add(autoencoder.layers[1])
hidden_representation.add(autoencoder.layers[2])
hidden_representation.add(autoencoder.layers[3])
hidden_representation.add(autoencoder.layers[4])

# Separate the auto-encoded points into normal and fraudulent
normal_hidden_rep = hidden_representation.predict(X_normal_scaled)
fraud_hidden_rep = hidden_representation.predict(X_fraud_scaled)

# Combine the encoded points into one array
encoded_X = np.append(normal_hidden_rep, fraud_hidden_rep, axis=0)
y_normal = np.zeros(normal_hidden_rep.shape[0])
y_fraud = np.ones(fraud_hidden_rep.shape[0])
encoded_y = np.append(y_normal, y_fraud)

# Plot the encoded points
tsne_plot(encoded_X, encoded_y)

Note that after encoding, the data is much closer to being linearly separable. Thus, in some cases, encoding the data can make the classification boundary linear. To verify this numerically, we will fit a linear logistic regression model on the encoded data and a support vector classifier on the original data.
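The effect described above can be reproduced on a toy problem: two concentric circles are not linearly separable as-is, but after a simple nonlinear encoding (here a hand-picked squared-radius feature, standing in for the learned auto-encoder features) a plain logistic regression separates them almost perfectly. This is an illustration of the principle, not the article's pipeline:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Two concentric circles: no straight line separates the classes
X_c, y_c = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear model on the raw coordinates
acc_raw = accuracy_score(y_c, LogisticRegression().fit(X_c, y_c).predict(X_c))

# "Encoded" feature: squared distance from the origin
X_enc = (X_c ** 2).sum(axis=1, keepdims=True)
acc_enc = accuracy_score(y_c, LogisticRegression().fit(X_enc, y_c).predict(X_enc))

print('raw:', acc_raw, 'encoded:', acc_enc)
```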

Step 11: Separate raw and encoded data into training and test data

Step 12: Build a logistic regression model and evaluate its performance

# Split the encoded data for linear classification
X_train_encoded, X_test_encoded, y_train_encoded, y_test_encoded = train_test_split(encoded_X, encoded_y, test_size=0.2)

# Split the original data for nonlinear classification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Build the logistic regression model
lrclf = LogisticRegression()
lrclf.fit(X_train_encoded, y_train_encoded)

# Store the linear model's predictions
y_pred_lrclf = lrclf.predict(X_test_encoded)

# Evaluate the linear model's performance
print('Accuracy: ' + str(accuracy_score(y_test_encoded, y_pred_lrclf)))

Step 13: Building a support vector classifier model and evaluating its performance

# Build the SVM model
svmclf = SVC()
svmclf.fit(X_train, y_train)

# Store the nonlinear model's predictions
y_pred_svmclf = svmclf.predict(X_test)

# Evaluate the nonlinear model's performance
print('Accuracy: ' + str(accuracy_score(y_test, y_pred_svmclf)))

The performance metrics support the point above: encoding the data can sometimes yield features that are close to linearly separable, since the accuracy of the linear logistic regression model on the encoded data is very close to that of the nonlinear support vector classifier on the original data.
