ML | Variational Bayesian inference for a Gaussian mixture

The Gaussian mixture model assumes that the data is generated from a mixture of multivariate Gaussian distributions, with each cluster corresponding to one of the Gaussian components. To cluster data under this model, one must compute the posterior probability that a data point belongs to a given cluster, conditioned on the observed data. The exact approach is Bayesian inference, but for large datasets computing the required marginal probabilities is very expensive. Since we only need to find the most likely cluster for each point, approximation methods can be used to reduce the computational work. One of the best approximate methods is variational Bayesian inference, which builds on the concepts of KL divergence and the mean-field approximation.
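For reference, the idea can be stated in its standard general form (this is textbook variational inference, not anything specific to this article): the approximating distribution q is chosen to minimize the KL divergence to the true posterior, which is equivalent to maximizing the evidence lower bound (ELBO), and the mean-field assumption factorizes q over the latent variables:

\[ q^{*}(Z) = \arg\min_{q} \, \mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X)\big) \]

\[ \log p(X) = \underbrace{\mathbb{E}_{q}[\log p(X, Z)] - \mathbb{E}_{q}[\log q(Z)]}_{\mathrm{ELBO}(q)} + \mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X)\big) \]

\[ q(Z) = \prod_{i} q_{i}(Z_{i}) \qquad \text{(mean-field factorization)} \]

Since the KL divergence is non-negative, maximizing the ELBO tightens the bound on log p(X), so the intractable marginal probability never needs to be computed directly.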

The following steps demonstrate how to implement variational Bayesian inference for a Gaussian mixture model using scikit-learn. The data used is the Credit Card dataset, which can be downloaded from Kaggle.

Step 1: Import required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.mixture import BayesianGaussianMixture
from sklearn.preprocessing import normalize, StandardScaler
from sklearn.decomposition import PCA

Step 2: Load and clean the data

# Change the working directory to the data location
cd "C:\Users\Dev\Desktop\Kaggle\Credit_Card"

# Load the data
X = pd.read_csv('CC_GENERAL.csv')

# Drop the CUST_ID column from the data
X = X.drop('CUST_ID', axis=1)

# Handle missing values by forward fill
X.fillna(method='ffill', inplace=True)

X.head()

Step 3: Data preprocessing

# Scale the data to bring all attributes to a comparable level
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalize the data so that it approximately
# follows a Gaussian distribution
X_normalized = normalize(X_scaled)

# Convert the numpy array into a pandas DataFrame
X_normalized = pd.DataFrame(X_normalized)

# Rename the columns
X_normalized.columns = X.columns

X_normalized.head()

Step 4: Reduce the dimensionality of the data to make it visualizable

# Reduce the data to two dimensions
pca = PCA(n_components=2)
X_principal = pca.fit_transform(X_normalized)

# Convert the reduced data into a pandas DataFrame
X_principal = pd.DataFrame(X_principal)

# Rename the columns
X_principal.columns = ['P1', 'P2']

X_principal.head()
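As an optional sanity check (not part of the original walkthrough), it is worth inspecting how much of the variance the two principal components actually retain; explained_variance_ratio_ is a standard attribute of a fitted PCA:

# Optional: fraction of variance captured by each of the two components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())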

The two primary parameters of the BayesianGaussianMixture class are n_components and covariance_type.

  1. n_components: defines the maximum number of mixture components (clusters) in the data.
  2. covariance_type: describes the type of covariance parameters to use: 'full' (each component has its own full covariance matrix), 'tied' (all components share one covariance matrix), 'diag' (each component has its own diagonal covariance matrix) or 'spherical' (each component has its own single variance).

You can read about all the other parameters in the scikit-learn documentation.
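Note that n_components is only an upper bound: the variational treatment can drive the weights of unneeded components toward zero, so fewer clusters may effectively be used. A minimal sketch of how to check this after fitting (weights_ is part of the fitted model's scikit-learn API; the 0.01 cutoff is an arbitrary illustrative threshold):

# Fit a model and count the components that received non-negligible weight
model = BayesianGaussianMixture(n_components=5, covariance_type='full')
model.fit(X_normalized)
print(model.weights_)                 # mixture weights of all 5 components
print((model.weights_ > 0.01).sum())  # effective number of clusters (0.01 is arbitrary)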

In the next steps, the n_components parameter will be fixed at 5, while the covariance_type parameter will be varied over all of its possible values to visualize the effect of this parameter on clustering.
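As an aside, the four models built in Step 5 could equivalently be fit in a single loop; a compact sketch (fit_predict is a standard scikit-learn convenience method):

# Fit one model per covariance type and collect the labels
labels_by_type = {}
for cov_type in ['full', 'tied', 'diag', 'spherical']:
    model = BayesianGaussianMixture(n_components=5, covariance_type=cov_type)
    labels_by_type[cov_type] = model.fit_predict(X_normalized)
    print(cov_type, set(labels_by_type[cov_type]))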

Step 5: Build clustering models for different covariance_type values and visualize the results

a) covariance_type = 'full'

# Build and train the model
vbgm_model_full = BayesianGaussianMixture(n_components=5, covariance_type='full')
vbgm_model_full.fit(X_normalized)

# Store the labels (predict on the same data the model was fit on)
labels_full = vbgm_model_full.predict(X_normalized)
print(set(labels_full))

colors = {}
colors[0] = 'r'
colors[1] = 'g'
colors[2] = 'b'
colors[3] = 'k'

# Build the color vector for each data point
cvec = [colors[label] for label in labels_full]

# Define scatter plots for each color, used to build the legend
r = plt.scatter(X_principal['P1'], X_principal['P2'], color='r')
g = plt.scatter(X_principal['P1'], X_principal['P2'], color='g')
b = plt.scatter(X_principal['P1'], X_principal['P2'], color='b')
k = plt.scatter(X_principal['P1'], X_principal['P2'], color='k')

# Plot the clustered data
plt.figure(figsize=(9, 9))
plt.scatter(X_principal['P1'], X_principal['P2'], c=cvec)
plt.legend((r, g, b, k), ('Label 0', 'Label 1', 'Label 2', 'Label 3'))
plt.show()

b) covariance_type = 'tied'

# Build and train the model
vbgm_model_tied = BayesianGaussianMixture(n_components=5, covariance_type='tied')
vbgm_model_tied.fit(X_normalized)

# Store the labels (predict on the same data the model was fit on)
labels_tied = vbgm_model_tied.predict(X_normalized)
print(set(labels_tied))

colors = {}
colors[0] = 'r'
colors[2] = 'g'
colors[3] = 'b'
colors[4] = 'k'

# Build the color vector for each data point
cvec = [colors[label] for label in labels_tied]

# Define scatter plots for each color, used to build the legend
r = plt.scatter(X_principal['P1'], X_principal['P2'], color='r')
g = plt.scatter(X_principal['P1'], X_principal['P2'], color='g')
b = plt.scatter(X_principal['P1'], X_principal['P2'], color='b')
k = plt.scatter(X_principal['P1'], X_principal['P2'], color='k')

# Plot the clustered data
plt.figure(figsize=(9, 9))
plt.scatter(X_principal['P1'], X_principal['P2'], c=cvec)
plt.legend((r, g, b, k), ('Label 0', 'Label 2', 'Label 3', 'Label 4'))
plt.show()

c) covariance_type = 'diag'

# Build and train the model
vbgm_model_diag = BayesianGaussianMixture(n_components=5, covariance_type='diag')
vbgm_model_diag.fit(X_normalized)

# Store the labels (predict on the same data the model was fit on)
labels_diag = vbgm_model_diag.predict(X_normalized)
print(set(labels_diag))

colors = {}
colors[0] = 'r'
colors[2] = 'g'
colors[4] = 'k'

# Build the color vector for each data point
cvec = [colors[label] for label in labels_diag]

# Define scatter plots for each color, used to build the legend
r = plt.scatter(X_principal['P1'], X_principal['P2'], color='r')
g = plt.scatter(X_principal['P1'], X_principal['P2'], color='g')
k = plt.scatter(X_principal['P1'], X_principal['P2'], color='k')

# Plot the clustered data
plt.figure(figsize=(9, 9))
plt.scatter(X_principal['P1'], X_principal['P2'], c=cvec)
plt.legend((r, g, k), ('Label 0', 'Label 2', 'Label 4'))
plt.show()

d) covariance_type = 'spherical'

# Build and train the model
vbgm_model_spherical = BayesianGaussianMixture(n_components=5,
                                               covariance_type='spherical')
vbgm_model_spherical.fit(X_normalized)

# Store the labels (predict on the same data the model was fit on)
labels_spherical = vbgm_model_spherical.predict(X_normalized)
print(set(labels_spherical))

colors = {}
colors[2] = 'r'
colors[3] = 'b'

# Build the color vector for each data point
cvec = [colors[label] for label in labels_spherical]

# Define scatter plots for each color, used to build the legend
r = plt.scatter(X_principal['P1'], X_principal['P2'], color='r')
b = plt.scatter(X_principal['P1'], X_principal['P2'], color='b')

# Plot the clustered data
plt.figure(figsize=(9, 9))
plt.scatter(X_principal['P1'], X_principal['P2'], c=cvec)
plt.legend((r, b), ('Label 2', 'Label 3'))
plt.show()
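As an optional follow-up beyond the original steps, the four clusterings can be compared quantitatively, for example with the silhouette score from sklearn.metrics (higher is better). This assumes the label arrays from steps a) through d) are still in scope and that each model found at least two clusters:

from sklearn.metrics import silhouette_score

# Compare the clusterings produced by the four covariance types
for name, labels in [('full', labels_full), ('tied', labels_tied),
                     ('diag', labels_diag), ('spherical', labels_spherical)]:
    print(name, 'clusters:', len(set(labels)),
          'silhouette:', silhouette_score(X_normalized, labels))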