Python | CAP — cumulative accuracy profile analysis


The CAP, short for Cumulative Accuracy Profile, is used to evaluate the performance of a classification model and to judge how robust that model is. To visualize it, the CAP graph depicts three different curves:

  1. Random plot
  2. Plot generated using SVM classifier or random forest classifier
  3. Ideal plot (ideal line)

We will work through a dataset to understand the concept.

Code: load dataset.

# import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# loading the dataset
data = pd.read_csv(r"C:\Users\DELL\Desktop\Social_Network_Ads.csv")

print("Data Head:", data.head())

Output:

 Data Head:
        User ID  Gender  Age  EstimatedSalary  Purchased
 0     15624510    Male   19            19000          0
 1     15810944    Male   35            20000          0
 2     15668575  Female   26            43000          0
 3     15603246  Female   27            57000          0
 4     15804002    Male   19            76000          0

Code: data input / output.

# input and output
x = data.iloc[:, 2:4]
y = data.iloc[:, 4]

print("Input:", x.iloc[0:10, :])

Output:

 Input:
    Age  EstimatedSalary
 0   19            19000
 1   35            20000
 2   26            43000
 3   27            57000
 4   19            76000
 5   27            58000
 6   27            84000
 7   32           150000
 8   25            33000
 9   35            65000

Code: split the dataset for training and testing.

# data splitting
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)

Code: random forest classifier

# classifier
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=400)

# training
classifier.fit(x_train, y_train)

# prediction
pred = classifier.predict(x_test)

Code: finding the accuracy of the classifier.

# model performance
from sklearn.metrics import accuracy_score

print("Accuracy:", accuracy_score(y_test, pred) * 100)

Output:

 Accuracy: 91.66666666666666 

Random model

The random plot takes the x-axis as the number of people contacted, ranging from 0 to the total number of data points in the dataset, and the y-axis as the total number of points for which the dependent variable has the outcome 1. The random plot can be understood as a linearly increasing relationship. An example is a model that predicts whether a product is purchased (positive outcome) by each person in a group (the classifying parameter) based on factors such as gender, age, income, etc. If group members are contacted at random, the cumulative number of products sold grows linearly toward a maximum equal to the total number of buyers in the group. This distribution is called the "random" CAP.

Code: random model

# code for the random plot
import matplotlib.pyplot as plt
import numpy as np

# length of the test data
total = len(y_test)

# counting '1' labels in the test data
one_count = np.sum(y_test)

# counting '0' labels in the test data
zero_count = total - one_count

plt.figure(figsize=(10, 6))

# x-axis ranges from 0 to the total number of people contacted
# y-axis ranges from 0 to the total number of positives
plt.plot([0, total], [0, one_count], c='b',
         linestyle='--', label='Random Model')
plt.legend()

Output:

Random forest classifier line

Code: the trained random forest classifier's predictions are used to draw the classifier's line on the plot.

lm = [y for _, y in sorted(zip(pred, y_test), reverse=True)]
x = np.arange(0, total + 1)
y = np.append([0], np.cumsum(lm))
plt.plot(x, y, c='b', label='Random Forest Classifier', linewidth=2)

Output:

Explanation: pred holds the predictions made by the random forest classifier. We zip the predicted and test values together and sort them in reverse order, so that higher predictions come first. We then extract only the y_test values from the sorted pairs and store them in lm. np.cumsum() builds an array in which each element is the sum of all preceding values plus the current one. The x values range from 0 to total + 1; we add one to total because np.arange() excludes the stop value and we want the x-axis to span 0 to total.
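To make the sort-then-cumulative-sum step concrete, here is a minimal self-contained sketch. The names pred and y_test mirror the article's variables, but the values are made up for illustration (here pred holds scores rather than 0/1 predictions):

```python
import numpy as np

# made-up scores and true labels (hypothetical, for illustration)
pred = [0.9, 0.2, 0.8, 0.4]
y_test = [1, 0, 1, 0]

# sort the labels by descending prediction, as in the plot code
lm = [y for _, y in sorted(zip(pred, y_test), reverse=True)]
print(lm)             # [1, 1, 0, 0]
print(np.cumsum(lm))  # [1 2 2 2]
```

Because both positives have high scores, the cumulative sum rises quickly and then flattens, which is exactly the shape a good classifier traces on the CAP plot.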

Ideal model

Then we build the ideal plot (the ideal line). A perfect prediction determines exactly which group members will buy the product, so the maximum number of products sold is reached with the minimum number of calls. This produces a line on the CAP plot that rises steeply and then stays flat after reaching the maximum (contacting everyone else in the group will not increase sales), which is the "perfect" CAP.

plt.plot([0, one_count, total], [0, one_count, one_count],
         c='gray', linewidth=2, label='Perfect Model')

Output:

Explanation: the ideal model finds all positive results in a number of attempts equal to the number of positives. Our test set contains only 41 positives, so the maximum is reached after exactly 41 attempts.

FINAL ANALYSIS:

In any case, our classifier's line should never fall below the random line; a model that does is considered really bad. Since the plotted line of our classifier is close to the ideal line, we can say that our model fits very well. Take the area between the ideal line and the random line and call it aP. Take the area between the prediction model's line and the random line and call it aR. The ratio aR / aP is called the accuracy ratio: the closer this value is to 1, the better the model. This is one way to analyze it.

Another way to analyze the model: draw a vertical line at 50% of the x-axis up to the prediction model's curve, and project the intersection onto the y-axis. Let's say this projection gives a value of X%.

  - X < 60%: a really bad model
  - 60% < X < 70%: still a poor model, but obviously better than the first case
  - 70% < X < 80%: a good model
  - 80% < X < 90%: a very good model
  - 90% < X < 100%: extraordinarily good, and possibly a case of overfitting
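The 50% projection can also be read off numerically. This sketch reuses an invented classifier curve (hypothetical values, not the article's actual plot) just to show the mechanics:

```python
import numpy as np

# hypothetical CAP data: cumulative positives captured after
# contacting the first k people (illustration only)
total, one_count = 120, 41
x = np.arange(0, total + 1)
model_y = one_count * (1 - (1 - x / total) ** 2)

# index at 50% of the x-axis
half = total // 2
# fraction of all positives captured at that point, as a percentage
X = 100 * model_y[half] / one_count
print(f"X = {X:.1f}%")
```

Here X lands at 75%, which the scale above would rate as a good model.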

Thus, according to this analysis, we can determine how accurate our model is. 
Reference: wikipedia.org




