30 minutes to machine learning



1. Downloading, Installing and Running Python SciPy

If you haven't already, install Python and the SciPy stack on your system by following the installation guide for your platform.
1.1 Install SciPy libraries

This tutorial assumes Python 2.7 or 3.5+.
You will need to install 5 key libraries. The Python SciPy libraries required for this tutorial are:

  • SciPy
  • NumPy
  • Matplotlib
  • pandas
  • scikit-learn (imported as sklearn)

1.2 Start Python and check versions
It's a good idea to make sure your Python environment was installed successfully and is working as expected.
The script below will help you check the environment. It imports every library needed in this tutorial and prints the version.
Type or copy and paste the following script:

# Check library versions

# Python version
import sys
print('Python: {}'.format(sys.version))

# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))

# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))

# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))

# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))

# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

If you get an error, stop. Now is the time to fix that.
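
If the version script fails partway, it can be hard to tell which library is missing. The following is a minimal sketch, using only the standard library, that reports which of the required packages cannot be found. The helper name `find_missing` is hypothetical and not part of the original tutorial.

```python
import importlib.util

# Import names of the five libraries this tutorial relies on
REQUIRED = ["scipy", "numpy", "matplotlib", "pandas", "sklearn"]

def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
if missing:
    print("Missing libraries: {}".format(", ".join(missing)))
else:
    print("All required libraries were found.")
```

This only checks that each package can be located; it does not import it or check its version.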

2. Load the data
Dataset: Iris flowers data
This well-known dataset is used by almost everyone as the "hello world" dataset of machine learning and statistics.
The dataset contains 150 observations of iris flowers. Four columns hold flower measurements in centimeters. The fifth column is the species of the observed flower. Every observed flower belongs to one of three species.

2.1 Importing Libraries
First, let's import all the modules, functions and objects that will be used.

# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

A working SciPy environment is required to continue.

2.2 Load the dataset

The data can be loaded directly from a URL; the dataset itself comes from the UCI Machine Learning Repository.
We use pandas to load the data, and later to explore it with descriptive statistics and data visualization.

Note: the name of each column is specified when the data is loaded. This will help later during data exploration.

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)

If you have network problems, you can download the iris.csv file to your working directory and load it in the same way, changing the URL to the name of the local file.
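
The following is a sketch of the local-file route. To keep the example self-contained, an in-memory buffer holding three rows in the iris.csv layout stands in for the file. With a real local copy you would pass the file name 'iris.csv' instead. The variable names `sample_csv` and `sample` are illustrative only.

```python
import io
import pandas

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

# Stand-in for a local iris.csv: the file has no header row
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# With a real local copy this would be: pandas.read_csv('iris.csv', names=names)
sample = pandas.read_csv(sample_csv, names=names)
print(sample.shape)
print(list(sample.columns))
```

Because the file has no header row, passing `names` is what gives the columns their labels.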

3. Summarize the dataset
Now it's time to look at the data.

We will look at the data in several different ways:

  • Dimensions of the dataset.
  • Look at the data itself.
  • A statistical summary of all attributes.
  • Data breakdown by class variable.

3.1 Dimensions of the dataset

# shape
print(dataset.shape)

(150, 5)

3.2 Look at the data

# head
print(dataset.head(20))

    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa

3.3 Statistical summary
This includes the count, mean, minimum and maximum values, as well as some percentiles.

# descriptions
print(dataset.describe())

All numerical values share the same scale (centimeters) and fall in similar ranges, between 0 and 8 centimeters.

       sepal-length  sepal-width  petal-length  petal-width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.0000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

3.4 Class distribution

# class distribution
print(dataset.groupby('class').size())

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
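
The counts can also be expressed as proportions, which makes it easy to see that the classes are balanced. The following is a small sketch with the standard library. It rebuilds the label column from the class counts reported above rather than from the loaded dataset, so the `labels` list is a stand-in.

```python
from collections import Counter

# Label column rebuilt from the class counts reported by groupby().size()
labels = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50

counts = Counter(labels)
total = sum(counts.values())
proportions = {name: count / float(total) for name, count in counts.items()}

for name, share in sorted(proportions.items()):
    print("{}: {:.1%}".format(name, share))
```

With the loaded DataFrame, `dataset['class'].value_counts(normalize=True)` should give the same proportions.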

4. Data visualization
We will use two types of graphs:

  1. One-dimensional graphs to better understand each attribute.
  2. Multidimensional graphs to better understand the relationships between attributes.

4.1 One-dimensional plots
One-dimensional plots are graphs of each individual variable.
Given that the input variables are numeric, we can create a box-and-whisker plot of each.

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)
plt.show()

We can also generate a histogram of each input variable to get an idea of its distribution.

# histograms
dataset.hist()
plt.show()

It looks like two of the input variables may have a Gaussian distribution. This is useful to note, as we can use algorithms that exploit this assumption.

4.2 Multidimensional plots
Now we can look at the interactions between the variables.
First, let's take a look at the scatter plots of all attribute pairs. This can be useful for spotting structured relationships between input variables.

# scatter plot matrix
scatter_matrix(dataset)
plt.show()

Notice the diagonal grouping of some of the attribute pairs. This indicates a high correlation and predictable relationship.
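
The visual impression can be checked numerically with a correlation coefficient. The sketch below is an addition to the tutorial. It assumes scikit-learn's bundled copy of the iris data (`sklearn.datasets.load_iris`) as an offline stand-in for the CSV, so values may differ marginally from the downloaded file.

```python
import numpy
from sklearn.datasets import load_iris

# Bundled copy of the iris measurements (assumption: used instead of the CSV)
# Columns: sepal length, sepal width, petal length, petal width
data = load_iris().data

# Pearson correlation matrix between the four measurement columns
corr = numpy.corrcoef(data, rowvar=False)

petal_corr = corr[2, 3]  # petal length vs. petal width
print("petal-length vs petal-width correlation: {:.2f}".format(petal_corr))
```

With the DataFrame loaded earlier, `dataset.iloc[:, 0:4].corr()` computes the same kind of matrix for the four numeric columns.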

5. Evaluate some algorithms
We create some models of the data and estimate their accuracy on unseen data.

  1. Separate out a validation dataset.
  2. Set up a test harness that uses 10-fold cross-validation.
  3. Build 6 different models to predict species from flower measurements.
  4. Choose the best model.

5.1 Creating a validation dataset
We use statistical techniques to estimate the accuracy of the models we create on unseen data. We also want a more concrete estimate of the accuracy of the best model, obtained by evaluating it on data it has actually never seen.
To that end, some of the data is held back where the algorithms cannot see it, and this data gives a second, independent idea of how accurate the best model might be.

The loaded dataset is split into two parts: 80% is used to train our models, and 20% is held back as the validation dataset.

# Split-out validation dataset
array = dataset.values
X = array[:, 0:4]
Y = array[:, 4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)

X_train and Y_train are the training data used to prepare the models, and the X_validation and Y_validation sets are kept for later use.
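
To see what the split produces, the sketch below runs the same call on stand-in arrays with 150 rows and prints the resulting sizes. The arrays `X_demo` and `Y_demo` are hypothetical placeholder data, not the iris values.

```python
import numpy
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the iris dataset: 150 rows, 4 features
X_demo = numpy.arange(600).reshape(150, 4)
Y_demo = numpy.repeat(["a", "b", "c"], 50)

X_tr, X_val, Y_tr, Y_val = train_test_split(
    X_demo, Y_demo, test_size=0.20, random_state=7)

# 80% of the rows go to training, 20% to validation
print(X_tr.shape, X_val.shape)
```

With 150 rows, a `test_size` of 0.20 leaves 120 rows for training and 30 for validation.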

5.2 Test harness

We will use 10-fold cross-validation to estimate accuracy. This splits our dataset into 10 parts, trains on 9 and tests on 1, and repeats for every combination of train-test splits.
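
The splitting scheme can be illustrated without any library. This is a simplified sketch of the idea, not scikit-learn's implementation. The function name `k_fold_indices` is hypothetical, and the sketch assumes the number of rows divides evenly by the number of folds.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs for contiguous, equal-sized folds."""
    fold_size = n_samples // n_splits
    indices = list(range(n_samples))
    for fold in range(n_splits):
        start = fold * fold_size
        stop = start + fold_size
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

# 120 rows is the size of the training set after the 80/20 split
folds = list(k_fold_indices(120, 10))
for number, (train, test) in enumerate(folds, start=1):
    print("fold {}: train on {} rows, test on {} rows".format(
        number, len(train), len(test)))
```

Every row is used for testing exactly once and for training in the other nine rounds.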

# Test options and evaluation metric
seed = 7
scoring = 'accuracy'

The accuracy metric is used to evaluate models. It is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (for example, 95% accurate). Note that scikit-learn reports accuracy as a fraction between 0 and 1 rather than as a percentage.
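
The calculation is simple enough to write out by hand. The sketch below uses made-up labels purely for illustration, and the helper name `accuracy` is hypothetical. In the tutorial itself, scikit-learn computes the metric.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / float(len(y_true))

# Hypothetical labels, made up for illustration: 4 of 5 predictions are right
y_true = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]

score = accuracy(y_true, y_pred)
print("accuracy: {:.2f} ({:.0f}%)".format(score, score * 100))
```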

5.3 Build models

We evaluate 6 different algorithms:

  • Logistic Regression (LR).
  • Linear Discriminant Analysis (LDA).
  • K-Nearest Neighbors (KNN).
  • Classification and Regression Trees (CART).
  • Gaussian Naive Bayes (NB).
  • Support Vector Machines (SVM).

The selected algorithms are a mixture of linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms. A random seed is reset before each run to ensure that each algorithm is evaluated using exactly the same data. This ensures that the results are directly comparable.
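
The effect of fixing the seed can be checked directly. The sketch below assumes shuffled folds (`shuffle=True`) and uses a plain index array as a stand-in for the training rows. The helper name `fold_test_indices` is hypothetical.

```python
import numpy
from sklearn.model_selection import KFold

rows = numpy.arange(120)  # stand-in for the 120 training rows

def fold_test_indices(seed):
    """Return the test indices of each fold for a shuffled 10-fold split."""
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    return [test.tolist() for _, test in kfold.split(rows)]

# Two splitters built with the same seed produce identical folds
same_seed_matches = fold_test_indices(7) == fold_test_indices(7)
print("same seed gives same folds:", same_seed_matches)
```

Because each algorithm sees identical folds, differences in the scores reflect the algorithms rather than the luck of the split.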

Model building and evaluation:

# Spot-check algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

# evaluate each model in turn
results = []
names = []
for name, model in models:
    # shuffle=True is needed for random_state to take effect; recent
    # scikit-learn releases raise an error if it is omitted
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(
        model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

5.4 Choose the best model
We now compare the models with each other and choose the most accurate. Running the example above produced the following raw results (exact figures can vary with library versions and with how the folds are drawn):

LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)
NB: 0.975000 (0.053359)
SVM: 0.991667 (0.025000)

Support Vector Machines (SVM) has the highest estimated accuracy score.
We can also plot the model evaluation results and compare the spread and mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10-fold cross-validation).
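
Each line of the results is the mean of the 10 fold scores, with their standard deviation in parentheses. As an illustration, the fold scores below are a hypothetical set that is consistent with the SVM line (nine perfect folds and one fold with 11 of 12 correct). The actual per-fold scores are not shown in the tutorial.

```python
import statistics

# Hypothetical fold scores: 9 folds fully correct, 1 fold with 11 of 12 correct
fold_scores = [1.0] * 9 + [11 / 12.0]

mean_score = statistics.mean(fold_scores)
# pstdev is the population standard deviation, the same convention
# that numpy's std() uses by default
spread = statistics.pstdev(fold_scores)

print("SVM: %f (%f)" % (mean_score, spread))
```

A small spread means the model scored similarly on every fold. A large one means its accuracy depends heavily on which rows it was tested on.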

# Compare algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

The box-and-whisker plots are squashed at the top of the range, with many samples reaching 100% accuracy.

6. Make predictions
The KNN algorithm is very simple and was an accurate model based on our tests.
We now run the KNN model directly on the validation set and summarize the results as a final accuracy score, a confusion matrix, and a classification report.

# Make predictions on the validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

The accuracy is 0.9, or 90%. The confusion matrix shows the three errors made. Finally, the classification report breaks down each class by precision, recall, f1-score and support, showing excellent results (granted, the validation dataset was small).

0.9
[[ 7  0  0]
 [ 0 11  1]
 [ 0  2  9]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.85      0.92      0.88        12
 Iris-virginica       0.90      0.82      0.86        11

      micro avg       0.90      0.90      0.90        30
      macro avg       0.92      0.91      0.91        30
   weighted avg       0.90      0.90      0.90        30
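
The figures in the report follow directly from the confusion matrix. The sketch below recomputes accuracy, precision, recall and support by hand from the matrix printed above, relying on scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are illustrative only.

```python
# Confusion matrix from the output: rows are true classes, columns are predictions
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
matrix = [
    [7, 0, 0],
    [0, 11, 1],
    [0, 2, 9],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
overall_accuracy = correct / float(total)
print("accuracy: {:.2f}".format(overall_accuracy))

report = {}
for i, label in enumerate(labels):
    predicted_as_label = sum(row[i] for row in matrix)  # column sum
    actually_label = sum(matrix[i])  # row sum, reported as support
    precision = matrix[i][i] / float(predicted_as_label)
    recall = matrix[i][i] / float(actually_label)
    report[label] = (precision, recall, actually_label)
    print("{}: precision {:.2f}, recall {:.2f}, support {}".format(
        label, precision, recall, actually_label))
```

The three off-diagonal entries are the three errors: one versicolor flower predicted as virginica and two virginica flowers predicted as versicolor.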