 # Python | Generating test datasets for machine learning

In this article, we will create random datasets using the NumPy library in Python.

Libraries required:

- NumPy: `sudo pip install numpy`
- Pandas: `sudo pip install pandas`
- Matplotlib: `sudo pip install matplotlib`

## Normal distribution:

In probability theory, the normal (or Gaussian) distribution is a very common continuous probability distribution. It is symmetric about the mean, so data near the mean are more common than data far from it. Normal distributions are widely used in statistics, often to represent real-valued random variables.

The normal distribution is the most common type of distribution in statistical analysis. It has two parameters: the mean and the standard deviation. The mean is the central tendency of the distribution. The standard deviation is a measure of variability: it defines the width of the distribution and determines how far from the mean the values tend to fall, representing the typical distance between an observation and the mean. Many natural phenomena, such as height, blood pressure, measurement error, and IQ scores, approximately follow a normal distribution.
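The roles of the two parameters can be checked numerically. The following is a minimal sketch, assuming illustrative values `mu = 0.5` and `sigma = 0.1` and using NumPy's `default_rng` generator (a newer alternative to `np.random.seed`). It draws a large sample and measures its mean, its standard deviation, and the share of values within one standard deviation of the mean, which for a normal distribution is about 68%.

```python
import numpy as np

# Assumed values, chosen only for illustration
mu, sigma = 0.5, 0.1

# Seeded generator so that repeated runs give the same sample
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, 100_000)

sample_mean = samples.mean()
sample_std = samples.std()

# Fraction of samples within one standard deviation of the mean
within_one_sigma = np.mean(np.abs(samples - mu) < sigma)

print(f"mean={sample_mean:.3f}")
print(f"std={sample_std:.3f}")
print(f"within one sigma={within_one_sigma:.3f}")
```

With a sample this large, the measured mean and standard deviation should land close to the values passed to `rng.normal`.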

[Figure: normal distribution graph]

Example:

```python
# import libraries
import numpy as np
import matplotlib.pyplot as plt

# parameters of the normal distribution:
# mean and standard deviation
mu = 0.5
sigma = 0.1

# Seed the random number generator so the same values
# are produced on every run. Without a seed, each run
# gives different values.
np.random.seed(0)

# define x coordinates
X = np.random.normal(mu, sigma, (395, 1))

# define y coordinates
Y = np.random.normal(mu * 2, sigma * 3, (395, 1))

# build the scatter plot
plt.scatter(X, Y, color='g')
plt.show()
```

Output: a scatter plot of `X` against `Y`.

Let's look at a more complete example.

We will generate a dataset with 4 feature columns. Each column represents one feature. A 5th column holds the output label, an integer from 0 to 3. This dataset can be used to train a classifier such as logistic regression, a neural network, or a support vector machine.


```python
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# define the columns using normal distributions;
# abs() keeps every value non-negative

# column 1
point1 = abs(np.random.normal(1, 12, 100))
# column 2
point2 = abs(np.random.normal(2, 8, 100))
# column 3
point3 = abs(np.random.normal(3, 2, 100))
# column 4
point4 = abs(np.random.normal(10, 15, 100))

# x contains the features of our dataset;
# np.c_ stacks the 1-D arrays side by side as columns
x = np.c_[point1, point2, point3, point4]

# output labels range from 0 to 3
y = [int(np.random.randint(0, 4)) for i in range(100)]

# define a pandas DataFrame to save the data for later use
data = pd.DataFrame()

# define the dataset columns
data['col1'] = point1
data['col2'] = point2
data['col3'] = point3
data['col4'] = point4
# 5th column: the output label described above
data['label'] = y

# plot each feature against the labels (y)
plt.subplot(2, 2, 1)
plt.title('col1')
plt.scatter(y, point1, color='r', label='col1')

plt.subplot(2, 2, 2)
plt.title('col2')
plt.scatter(y, point2, color='g', label='col2')

plt.subplot(2, 2, 3)
plt.title('col3')
plt.scatter(y, point3, color='b', label='col3')

plt.subplot(2, 2, 4)
plt.title('col4')
plt.scatter(y, point4, color='y', label='col4')

# save the figure
plt.savefig('data_visualization.jpg')

# display the figure
plt.show()
```


Output: a 2×2 grid of scatter plots, one per feature column, saved as `data_visualization.jpg`.
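To reuse such a dataset later, the generation step can be wrapped in a function and the result written out as CSV. The following is a rough sketch, not part of the original example: `make_dataset` is a hypothetical helper name, the `label` column name is an assumption, and an in-memory buffer stands in for a real file. It uses NumPy's `default_rng` generator so that a given seed always yields the same data.

```python
import io

import numpy as np
import pandas as pd


def make_dataset(n_rows=100, seed=0):
    """Hypothetical helper: four non-negative feature columns plus a label."""
    rng = np.random.default_rng(seed)
    # (mean, standard deviation) pairs taken from the example above
    params = [(1, 12), (2, 8), (3, 2), (10, 15)]
    data = pd.DataFrame({
        f"col{i}": np.abs(rng.normal(mean, std, n_rows))
        for i, (mean, std) in enumerate(params, start=1)
    })
    # integer labels drawn uniformly from 0, 1, 2, 3
    data["label"] = rng.integers(0, 4, n_rows)
    return data


data = make_dataset()

# Round-trip through CSV text; the buffer stands in for a file on disk
buffer = io.StringIO()
data.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape)
print(restored["label"].value_counts().sort_index())
```

Passing a file path to `to_csv` and `read_csv` instead of the buffer would store the dataset on disk. Because the labels are drawn independently of the features, a classifier trained on this data should not do better than chance; the dataset is useful for testing a pipeline, not for measuring accuracy.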