ML | Kaggle Breast Cancer Wisconsin Diagnosis Using KNN



Implementation of the KNN (K-Nearest Neighbors) algorithm for classification.
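Before jumping into scikit-learn, it helps to recall what KNN actually does: to classify a query point, it computes the distance from that point to every training sample (Euclidean distance by default), takes the k closest samples, and assigns the majority label among them. A minimal NumPy sketch of the idea (function and variable names here are illustrative, not part of the article's code):

import numpy as np

def knn_predict(X_train, y_train, x, k=13):
    # Euclidean distance from the query point x to every training sample
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # indices of the k nearest training samples
    nearest = np.argsort(distances)[:k]
    # majority vote among the neighbors' labels (labels assumed to be 0/1 integers)
    return np.bincount(y_train[nearest]).argmax()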

Code: Loading libraries

# linear algebra
import numpy as np

# data processing
import pandas as pd

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

Code: Load dataset

df = pd.read_csv ( ".. breast-cancer-wisconsin-data data.csv" )

  

print (data.head)

Output: (first five rows of the DataFrame)

Code: Dataset information

df.info()

Output:

RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
id                         569 non-null int64
diagnosis                  569 non-null object
radius_mean                569 non-null float64
texture_mean               569 non-null float64
perimeter_mean             569 non-null float64
area_mean                  569 non-null float64
smoothness_mean            569 non-null float64
compactness_mean           569 non-null float64
concavity_mean             569 non-null float64
concave points_mean        569 non-null float64
symmetry_mean              569 non-null float64
fractal_dimension_mean     569 non-null float64
radius_se                  569 non-null float64
texture_se                 569 non-null float64
perimeter_se               569 non-null float64
area_se                    569 non-null float64
smoothness_se              569 non-null float64
compactness_se             569 non-null float64
concavity_se               569 non-null float64
concave points_se          569 non-null float64
symmetry_se                569 non-null float64
fractal_dimension_se       569 non-null float64
radius_worst               569 non-null float64
texture_worst              569 non-null float64
perimeter_worst            569 non-null float64
area_worst                 569 non-null float64
smoothness_worst           569 non-null float64
compactness_worst          569 non-null float64
concavity_worst            569 non-null float64
concave points_worst       569 non-null float64
symmetry_worst             569 non-null float64
fractal_dimension_worst    569 non-null float64
Unnamed: 32                0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
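A quick check (illustrative, using standard pandas calls) confirms why the next step drops two columns: 'Unnamed: 32' contains only NaN values and 'id' is just a unique row identifier, so neither carries predictive information.

print(df['Unnamed: 32'].isnull().sum())   # 569: every entry is NaN
print(df['id'].nunique())                 # 569: one unique id per row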

Code: Dropping the columns 'id' and 'Unnamed: 32', as they play no role in prediction.

df.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)

print(df.shape)

Output:

 (569, 31) 

Converting the diagnosis values M and B to numerical values:
M (Malignant) = 1
B (Benign) = 0

def diagnosis_value(diagnosis):
    if diagnosis == 'M':
        return 1
    else:
        return 0

df['diagnosis'] = df['diagnosis'].apply(diagnosis_value)
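As an aside, the same conversion can be written in one line with pandas' Series.map, as an alternative to the function above (use one approach or the other, not both):

# equivalent one-liner: map 'M' to 1 and 'B' to 0
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})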

Code:

sns.lmplot(x='radius_mean', y='texture_mean', hue='diagnosis', data=df)

Output: (regression scatter plot of radius_mean vs. texture_mean, colored by diagnosis)

Code:

sns.lmplot(x='smoothness_mean', y='compactness_mean',
           data=df, hue='diagnosis')

Output: (regression scatter plot of smoothness_mean vs. compactness_mean, colored by diagnosis)

Code: Input and output data

X = np.array(df.iloc[:, 1:])

y = np.array(df['diagnosis'])

Code: Splitting the data into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
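Because the classes are imbalanced (357 benign vs. 212 malignant samples), an optional variant is to pass stratify=y so that both splits keep the same class proportions. A sketch of that alternative, not used in the steps below:

# optional variant: preserve the benign/malignant ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)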

Code: Using sklearn

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=13)

knn.fit(X_train, y_train)

Output:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=13, p=2,
                     weights='uniform')

Code: Prediction score

knn.score(X_test, y_test)

Output:

 0.9627659574468085  
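Accuracy alone does not show which class the errors fall on. As an optional check (not part of the original walkthrough, but built from standard scikit-learn calls), a confusion matrix and per-class report give a fuller picture:

from sklearn.metrics import confusion_matrix, classification_report

y_pred = knn.predict(X_test)

# rows are actual classes, columns are predicted (0 = benign, 1 = malignant)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))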

Code: Performing cross-validation

neighbors = []

cv_scores = []

from sklearn.model_selection import cross_val_score

# perform 10-fold cross-validation
for k in range(1, 51, 2):
    neighbors.append(k)
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(
        knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

Code: classification error compared to k

MSE = [ 1 - x for x in cv_scores]

 
# determining the best k

optimal_k = neighbors [MSE.index ( min (MSE))]

print ( ` The optimal number of neighbors is% d ` % optimal_k)

 
# error of misclassification of the graph compared to k

plt.figure (figsize = ( 10 , 6 ))

plt.plot (neighbors, MSE)

plt.xlabel ( `Number of neighbors` )

plt.ylabel ( ` Misclassification Error` )

plt.show ()

Output:

 The optimal number of neighbors is 13
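As an alternative sketch, the same search over k can be written more compactly with scikit-learn's GridSearchCV, which cross-validates every candidate value and refits the best model on the whole training set. This is an optional rewrite of the loop above, not the article's code:

from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': list(range(1, 51, 2))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid,
                    cv=10, scoring='accuracy')
grid.fit(X_train, y_train)

print(grid.best_params_)             # best k found by cross-validation
print(grid.score(X_test, y_test))    # accuracy of the refitted best model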