Python | Generating test datasets for machine learning



In this article, we will create random datasets using the NumPy library in Python.

Libraries required:

 -> NumPy: sudo pip install numpy
 -> Pandas: sudo pip install pandas
 -> Matplotlib: sudo pip install matplotlib

Normal distribution:

In probability theory, the normal (or Gaussian) distribution is a very common continuous probability distribution. It is symmetric about the mean, so values near the mean occur more often than values far from it. Normal distributions are widely used in statistics and are often used to model real-valued random variables.

The normal distribution is one of the most common distributions in statistical analysis. It has two parameters: the mean and the standard deviation. (The standard normal distribution is the special case with mean 0 and standard deviation 1.) The mean is the central tendency of the distribution. The standard deviation is a measure of variability: it sets the width of the distribution and determines how far from the mean the values tend to fall, representing the typical distance between an observation and the mean. Many natural phenomena, such as height, blood pressure, measurement error and IQ scores, approximately follow a normal distribution.
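To make the roles of the two parameters concrete, the sketch below draws a large sample and compares its empirical mean and standard deviation with the requested values. It assumes NumPy 1.17 or later for `np.random.default_rng`. The sample size and the names `rng`, `samples` and `within_one_sigma` are illustrative choices and do not come from the article.

```python
import numpy as np

# Seeded generator so the run is repeatable (assumes NumPy >= 1.17)
rng = np.random.default_rng(0)

# Same mean and standard deviation as the scatter-plot example in this article
mu, sigma = 0.5, 0.1

# Draw a large sample so the empirical statistics are stable
samples = rng.normal(mu, sigma, 100_000)

sample_mean = samples.mean()
sample_std = samples.std()

# Fraction of samples that lie within one standard deviation of the mean
within_one_sigma = np.mean(np.abs(samples - mu) <= sigma)

print(f"sample mean:    {sample_mean:.3f}")
print(f"sample std:     {sample_std:.3f}")
print(f"within 1 sigma: {within_one_sigma:.3f}")
```

For a normal distribution, roughly 68% of values lie within one standard deviation of the mean, so the last figure should come out close to 0.68.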

Normal distribution graph:

Example:

# import libraries
import numpy as np
import matplotlib.pyplot as plt

# parameters of the normal distribution:
# mean and standard deviation
mu = 0.5
sigma = 0.1

# Seed the random number generator so the results are reproducible.
# If no seed is set, a fresh one is chosen automatically on each run,
# so the values differ between runs.
np.random.seed(0)

# define x coordinates
X = np.random.normal(mu, sigma, (395, 1))

# define y coordinates
Y = np.random.normal(mu * 2, sigma * 3, (395, 1))

# build the scatter plot
plt.scatter(X, Y, color='g')
plt.show()

Output: a scatter plot of the 395 generated points in green, centered near (0.5, 1.0).
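The role of the seed can be checked directly. The sketch below uses NumPy's newer `Generator` interface (`np.random.default_rng`, available from NumPy 1.17) instead of the global `np.random.seed` call used above; this is a swapped-in alternative, not the article's method. The names `rng_a`, `rng_b` and `rng_c` are illustrative.

```python
import numpy as np

mu, sigma = 0.5, 0.1

# Two generators created with the same seed produce identical samples
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)
x_a = rng_a.normal(mu, sigma, (395, 1))
x_b = rng_b.normal(mu, sigma, (395, 1))
same_seed_matches = np.array_equal(x_a, x_b)

# A generator created with a different seed produces a different sample
rng_c = np.random.default_rng(1)
x_c = rng_c.normal(mu, sigma, (395, 1))
different_seed_matches = np.array_equal(x_a, x_c)

print("same seed, same data:     ", same_seed_matches)
print("different seed, same data:", different_seed_matches)
```

Keeping a separate generator object per experiment avoids relying on hidden global state, which can help when several parts of a program draw random numbers.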

Let's look at a more complete example.

We will generate a dataset with 4 feature columns and 100 rows. A 5th column holds the output label, an integer from 0 to 3. A dataset of this shape can be used to exercise a classifier such as logistic regression, a neural network or a support vector machine.

# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# define the columns using normal distributions;
# abs() keeps every value non-negative

# column 1
point1 = abs(np.random.normal(1, 12, 100))

# column 2
point2 = abs(np.random.normal(2, 8, 100))

# column 3
point3 = abs(np.random.normal(3, 2, 100))

# column 4
point4 = abs(np.random.normal(10, 15, 100))

# x contains the features of our dataset;
# np.c_ stacks the four arrays side by side
# to form a 100 x 4 feature matrix
x = np.c_[point1, point2, point3, point4]

# output labels are random integers from 0 to 3
y = [int(np.random.randint(0, 4)) for i in range(100)]

# define a pandas data frame to save
# the data for later use
data = pd.DataFrame()

# define the dataset columns
data['col1'] = point1
data['col2'] = point2
data['col3'] = point3
data['col4'] = point4

# the 5th column holds the output label
data['label'] = y

# plot each feature column against the labels (y)
plt.subplot(2, 2, 1)
plt.title('col1')
plt.scatter(y, point1, color='r', label='col1')

plt.subplot(2, 2, 2)
plt.title('col2')
plt.scatter(y, point2, color='g', label='col2')

plt.subplot(2, 2, 3)
plt.title('col3')
plt.scatter(y, point3, color='b', label='col3')

plt.subplot(2, 2, 4)
plt.title('col4')
plt.scatter(y, point4, color='y', label='col4')

# save the figure
plt.savefig('data_visualization.jpg')

# display the figure
plt.show()

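Output: a 2 x 2 grid of scatter plots, one per feature column, each plotting the column values against the labels 0 to 3.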
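In the example above the labels are drawn independently of the features, so a classifier trained on this data has nothing to learn beyond chance. One possible variation, sketched below, draws each row around a mean that depends on its label. The per-class means (`3 * k` for class `k`), the unit standard deviation and all variable names are hypothetical choices for illustration, and `np.random.default_rng` assumes NumPy 1.17 or later.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 100, 4, 4

# One label per row, drawn uniformly from 0..3
labels = rng.integers(0, n_classes, size=n_samples)

# Hypothetical per-class means: class k is centered at 3 * k in every feature
class_means = 3.0 * np.arange(n_classes)

# Each row is drawn around the mean of its own class;
# the (n_samples, 1) array of means broadcasts across the feature columns
features = rng.normal(
    loc=class_means[labels][:, None],
    scale=1.0,
    size=(n_samples, n_features),
)

# Combine 4 feature columns and 1 label column into a single array
dataset = np.c_[features, labels]
print(dataset.shape)  # (100, 5)

# The average feature value per class should sit near its class mean
for k in np.unique(labels):
    print(k, round(float(features[labels == k].mean()), 2))
```

Because the class means are well separated relative to the standard deviation, the per-class averages printed at the end should land close to 0, 3, 6 and 9. Moving the means closer together or raising `scale` makes the classes overlap more and the classification task harder.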