
# Python | Generating test datasets for machine learning

In this article, we will create random datasets using the NumPy library in Python.

Libraries required:

- Numpy: `sudo pip install numpy`
- Pandas: `sudo pip install pandas`
- Matplotlib: `sudo pip install matplotlib`

## Normal distribution:

In probability theory, the normal (or Gaussian) distribution is a very common continuous probability distribution. It is symmetric about the mean, so values near the mean occur more often than values far from it. Normal distributions are widely used in statistics and often serve as a model for real-valued random variables.
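For reference, the density of a normal distribution with mean $\mu$ and standard deviation $\sigma$ is usually written as follows. The symbols are chosen here to match the `mu` and `sigma` variables used in the code in this article:

$$
f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
$$

The curve peaks at $x = \mu$ and is symmetric around it. A larger $\sigma$ gives a wider, flatter curve.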

The normal distribution is the most common type of distribution in statistical analysis. It has two parameters: the mean and the standard deviation. The mean is the central tendency of the distribution. The standard deviation is a measure of variability: it defines the width of the distribution and determines how far from the mean the values tend to fall. It represents the typical distance between an observation and the mean. Many natural phenomena, such as height, blood pressure, measurement error, and IQ scores, approximately follow a normal distribution.
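The role of the two parameters can be checked empirically. The sketch below is illustrative only: the values of `mu` and `sigma`, the sample size, and the seed are assumptions. It draws samples and compares the sample mean and standard deviation with the parameters. It also measures the fraction of samples within one standard deviation of the mean, which for a normal distribution is roughly 68%.

```python
import numpy as np

# Assumed parameters for illustration
mu, sigma = 0.5, 0.1

# A seeded generator keeps the sketch reproducible
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, 100_000)

# Sample statistics should be close to the parameters
sample_mean = samples.mean()
sample_std = samples.std()

# Fraction of samples within one standard deviation of the mean
within_one_sigma = np.mean(np.abs(samples - mu) <= sigma)

print(f"sample mean: {sample_mean:.3f}")
print(f"sample std: {sample_std:.3f}")
print(f"within one sigma: {within_one_sigma:.3f}")
```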

Normal distribution graph:

Example:

```python
# import libraries
import numpy as np
import matplotlib.pyplot as plt

# initialize the parameters of the normal
# distribution: mean and standard deviation

# mean
mu = 0.5
# standard deviation
sigma = 0.1

# The random module uses a seed as the starting point
# for generating random numbers. If no seed is set,
# NumPy picks one from an unpredictable source (OS
# entropy or the system clock), so results differ
# between runs.
np.random.seed(0)

# define the x coordinates
X = np.random.normal(mu, sigma, (395, 1))

# define the y coordinates
Y = np.random.normal(mu * 2, sigma * 3, (395, 1))

# plot the points
plt.scatter(X, Y, color='g')
plt.show()
```

Output: a scatter plot of the 395 generated points, drawn in green.
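The effect of the seed can be seen in a small sketch. Here `sample_points` is a hypothetical helper introduced only for this illustration. It resets the global NumPy seed and draws a few values, so two calls with the same seed should return identical arrays.

```python
import numpy as np

def sample_points(seed, n=5, mu=0.5, sigma=0.1):
    """Hypothetical helper: reset the global seed, then draw n normal samples."""
    np.random.seed(seed)
    return np.random.normal(mu, sigma, n)

first = sample_points(0)
second = sample_points(0)
other = sample_points(1)

# Same seed: the two arrays match element for element
print(np.array_equal(first, second))
# Different seed: the arrays differ
print(np.array_equal(first, other))
```

Fixing the seed is useful for test datasets because every run of a script then produces the same data.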

Let’s see a better example.

We will generate a dataset with 4 feature columns. Each column in the dataset represents a feature. The 5th column of the dataset is the output label, an integer from 0 to 3. This dataset can be used to train a classifier such as logistic regression, a neural network, or a support vector machine.

```python
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# define the columns using normal distributions

# column 1
point1 = abs(np.random.normal(1, 12, 100))
# column 2
point2 = abs(np.random.normal(2, 8, 100))
# column 3
point3 = abs(np.random.normal(3, 2, 100))
# column 4
point4 = abs(np.random.normal(10, 15, 100))

# x contains the features of our dataset;
# the columns are stacked horizontally
# with numpy to form the feature matrix
x = np.c_[point1, point2, point3, point4]

# the output labels range from 0 to 3
y = [int(np.random.randint(0, 4)) for i in range(100)]

# define a pandas data frame to store
# the data for later use
data = pd.DataFrame()

# define the dataset columns
data['col1'] = point1
data['col2'] = point2
data['col3'] = point3
data['col4'] = point4

# plot each feature (x)
# against the labels (y)
plt.subplot(2, 2, 1)
plt.title('col1')
plt.scatter(y, point1, color='r', label='col1')

plt.subplot(2, 2, 2)
plt.title('Col2')
plt.scatter(y, point2, color='g', label='col2')

plt.subplot(2, 2, 3)
plt.title('Col3')
plt.scatter(y, point3, color='b', label='col3')

plt.subplot(2, 2, 4)
plt.title('Col4')
plt.scatter(y, point4, color='y', label='col4')

# save the figure
plt.savefig('data_visualization.jpg')

# display the figure
plt.show()
```

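Output: a 2×2 grid of scatter plots, one per column, each plotting the column values against the labels 0 to 3. The figure is also saved as `data_visualization.jpg`.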
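The example keeps the labels in a separate list `y`. One possible way to attach them as a 5th column and prepare the data for training is sketched below. The column name `label`, the 80/20 split ratio, and the seeds are assumptions made for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100

# Four feature columns, using the same distribution
# parameters as the example in this article
data = pd.DataFrame({
    "col1": np.abs(rng.normal(1, 12, n_rows)),
    "col2": np.abs(rng.normal(2, 8, n_rows)),
    "col3": np.abs(rng.normal(3, 2, n_rows)),
    "col4": np.abs(rng.normal(10, 15, n_rows)),
})

# The upper bound of integers() is exclusive, so labels are 0, 1, 2 or 3
data["label"] = rng.integers(0, 4, n_rows)

# Shuffle the rows, then hold out 20% for testing (assumed ratio)
shuffled = data.sample(frac=1, random_state=0).reset_index(drop=True)
split = int(0.8 * n_rows)
train, test = shuffled.iloc[:split], shuffled.iloc[split:]

print(train.shape, test.shape)
```

Because the labels are drawn independently of the features, a classifier trained on this data should not do better than chance. The dataset is mainly useful for testing that a training pipeline runs end to end.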
