Data classification using support vector machines (SVM) in Python



Introduction to SVM:
In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane that categorizes new examples.

What is a support vector machine?

An SVM represents the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
In addition to performing linear classification, SVMs can efficiently perform nonlinear classification by implicitly mapping their inputs into high-dimensional feature spaces (the kernel trick).
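
As a quick illustration of the kernel trick, here is a minimal sketch, assuming scikit-learn's make_circles toy dataset and an RBF kernel (neither is part of the walkthrough below):

# illustration: nonlinear classification with an RBF kernel
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# two concentric rings that no straight line can separate
X, Y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# the RBF kernel implicitly maps the inputs into a
# higher-dimensional feature space where they become separable
clf = SVC(kernel='rbf', gamma='scale')
clf.fit(X, Y)

# near-perfect training accuracy on this toy dataset
print(clf.score(X, Y))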

What does SVM do?

Given a set of training examples, each marked as belonging to one of two categories, the SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.

Prerequisites: basic knowledge of NumPy, Matplotlib, and scikit-learn.
Let's take a quick look at support vector classification. First we need to create a dataset:

# import make_blobs from scikit-learn
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# create a dataset X of n_samples points
# with labels Y containing two classes
X, Y = make_blobs(n_samples=500, centers=2,
                  random_state=0, cluster_std=0.40)

# plot the points, colored by class
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
plt.show()

Output: a scatter plot of the 500 points in two clearly separated classes.

What support vector machines do is not only draw a line between the two classes, but also consider a margin of some given width around that line. Here's an example of how this might look:

# create evenly spaced values between -1 and 3.5
import numpy as np
xfit = np.linspace(-1, 3.5)

# build the scatter plot
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')

# draw several candidate separating lines, each
# surrounded by a margin of width d
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
                     color='#AAAAAA', alpha=0.4)

plt.xlim(-1, 3.5)
plt.show()

Importing datasets

This is the intuition behind support vector machines: they optimize a linear discriminant model that maximizes the perpendicular distance (the margin) between the two classes. Now let's train the classifier on our training data. Before training, we need to import the cancer dataset from a csv file, from which we will use two features for training.

# import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# read the csv file and extract the class column into y
x = pd.read_csv("C: ... cancer.csv")
a = np.array(x)
y = a[:, 30]  # classes 0 and 1

# extract two features
x = np.column_stack((x.malignant, x.benign))
print(x.shape)  # (569, 2): 569 samples and 2 features

print(x, y)

[[ 122.8  1001.  ]
 [ 132.9  1326.  ]
 [ 130.   1203.  ]
 ...
 [ 108.3   858.1]
 [ 140.1  1265.  ]
 [  47.92  181.  ]]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 1.,
       ..., 1.])
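
If the csv file is not available, here is an alternative sketch that loads the same 569-sample breast cancer dataset bundled with scikit-learn; it assumes that columns 2 and 3 (mean perimeter and mean area) correspond to the two features printed above:

# alternative: load the dataset bundled with scikit-learn
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
x = data.data[:, 2:4]  # two features: mean perimeter and mean area
y = data.target        # classes 0 (malignant) and 1 (benign)
print(x.shape)         # (569, 2)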

Support Vector Fitting

Now let's fit a support vector classifier to these points. While the mathematical details of the underlying model are interesting, we'll cover them elsewhere. Instead, we will simply treat scikit-learn's algorithm as a black box that accomplishes the task above.

# import the support vector classifier
from sklearn.svm import SVC  # "Support Vector Classifier"

clf = SVC(kernel='linear')

# fit the classifier on the samples x and classes y
clf.fit(x, y)

After fitting, the model can be used to predict new values:

clf.predict([[120, 990]])

clf.predict([[85, 550]])

array([0.])
array([1.])
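
As a quick sanity check, here is a sketch that evaluates the same kind of classifier on a held-out split of x and y; the 25% test size is an arbitrary choice for the example:

# evaluate the classifier on held-out data
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

clf = SVC(kernel='linear')
clf.fit(x_train, y_train)
print(clf.score(x_test, y_test))  # fraction of correct test predictions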

Let's look at the graph to see how the fitted classifier separates the data.


This plot is obtained by analyzing the data, fitting the optimal hyperplane, and drawing it with matplotlib.
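
One common way to draw such a graph, sketched here under the assumption that clf, x, and y come from the steps above:

# visualize the decision boundary of the fitted clf
import numpy as np
import matplotlib.pyplot as plt

# scatter the training points, colored by class
plt.scatter(x[:, 0], x[:, 1], c=y, s=50, cmap='spring')
ax = plt.gca()

# evaluate the decision function on a grid covering the plot
xx = np.linspace(*ax.get_xlim(), 30)
yy = np.linspace(*ax.get_ylim(), 30)
YY, XX = np.meshgrid(yy, xx)
grid = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(grid).reshape(XX.shape)

# level 0 is the separating hyperplane; levels -1 and 1 are the margins
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1],
           alpha=0.5, linestyles=['--', '-', '--'])

# circle the support vectors
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
           s=100, facecolors='none', edgecolors='k')
plt.show()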

This article is courtesy of Afzal Ansari. If you like Python.Engineering and would like to contribute, you can write an article via contribute.python.engineering or email your article to contribute@python.engineering. See your article appearing on the Python.Engineering homepage and help other geeks.

Please post comments if you find anything wrong or if you would like to share more information on the topic discussed above.