ML | Logistic Regression vs Decision Tree Classification



We can compare the two algorithms on several criteria:

Criteria | Logistic Regression | Decision Tree Classification
--- | --- | ---
Interpretability | Less interpretable | More interpretable
Decision Boundaries | A single, linear decision boundary | Divides the space into progressively smaller regions
Ease of Decision Making | A decision threshold has to be set | Handles decision making automatically
Overfitting | Not prone to overfitting | Prone to overfitting
Robustness to noise | Robust to noise | Strongly affected by noise
Scalability | Requires a sufficiently large training set | Can be trained on a small training set
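
To make the decision-boundary row concrete, here is a minimal sketch (not part of the walkthrough below) on a small synthetic dataset built with make_classification: logistic regression learns one linear boundary described by its coefficients, while the tree stacks axis-aligned splits, printed with export_text. All the _demo names are illustrative only.

# Sketch on synthetic 2-D data; everything here is illustrative only
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

lr_demo = LogisticRegression().fit(X_demo, y_demo)
dt_demo = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# One linear boundary: w1*x1 + w2*x2 + b = 0
print(lr_demo.coef_, lr_demo.intercept_)

# A hierarchy of axis-aligned splits, each cutting the space into smaller regions
print(export_text(dt_demo, feature_names=['x1', 'x2']))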

As a simple experiment, we run two models on the same dataset and compare their characteristics.

Step 1: Import the required libraries

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

Step 2: Read and clean the dataset

# IPython magic: change the working directory to the dataset location
cd C:\Users\Dev\Desktop\Kaggle\Sinking Titanic

df = pd.read_csv('train.csv')
y = df['Survived']

# Drop the target and the columns that are not used as features
X = df.drop('Survived', axis=1)
X = X.drop(['Name', 'Ticket', 'Cabin', 'Embarked'], axis=1)

# Encode the categorical Sex column as numbers
X = X.replace(['male', 'female'], [2, 3])

# Handle missing values by forward-filling
X.fillna(method='ffill', inplace=True)
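
The replace call above maps Sex to the arbitrary integers 2 and 3. A more conventional alternative is one-hot encoding with pandas.get_dummies; the sketch below is optional and not part of the original steps (X_alt and the median fill for Age are illustrative choices):

# Alternative sketch: one-hot encode Sex and fill missing Age values with the median
X_alt = df.drop(['Survived', 'Name', 'Ticket', 'Cabin', 'Embarked'], axis=1)
X_alt = pd.get_dummies(X_alt, columns=['Sex'], drop_first=True)
X_alt['Age'] = X_alt['Age'].fillna(X_alt['Age'].median())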

Step 3: Train and evaluate the logistic regression model

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

lr = LogisticRegression()
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))
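
The score above uses the default probability threshold of 0.5. As noted in the comparison table, logistic regression lets us choose that threshold ourselves; a small sketch (the 0.4 cutoff is an arbitrary illustration):

# Sketch: choose a custom decision threshold from the predicted probabilities
proba = lr.predict_proba(X_test)[:, 1]        # P(Survived = 1)
y_pred_custom = (proba >= 0.4).astype(int)    # 0.4 is an arbitrary cutoff
print((y_pred_custom == y_test).mean())       # accuracy at that threshold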

Step 4: Train and evaluate the decision tree classifier model

criteria = ['gini', 'entropy']
scores = {}

for c in criteria:
    dt = DecisionTreeClassifier(criterion=c)
    dt.fit(X_train, y_train)
    test_score = dt.score(X_test, y_test)
    scores[c] = test_score

print(scores)
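
Before comparing the two models, the overfitting row of the table can also be checked directly. The hedged sketch below contrasts training and test accuracy for an unrestricted tree against a depth-limited one (max_depth=3 is an arbitrary choice):

# Sketch: an unrestricted tree tends to memorise the training set;
# limiting its depth usually narrows the train/test gap
for depth in [None, 3]:
    dt_demo = DecisionTreeClassifier(max_depth=depth, random_state=0)
    dt_demo.fit(X_train, y_train)
    print(depth, dt_demo.score(X_train, y_train), dt_demo.score(X_test, y_test))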

Comparing the scores, we see that the logistic regression model performed better on this dataset, but this may not always be the case.
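
Since a single train/test split can be noisy, a steadier comparison is to cross-validate both models on the full feature matrix; this is a hedged sketch, with 5 folds and the raised max_iter as arbitrary choices:

# Sketch: 5-fold cross-validation gives a steadier estimate than one split
from sklearn.model_selection import cross_val_score

for name, model in [('Logistic Regression', LogisticRegression(max_iter=1000)),
                    ('Decision Tree', DecisionTreeClassifier(random_state=0))]:
    cv_scores = cross_val_score(model, X, y, cv=5)
    print(name, cv_scores.mean())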