ML | Extra Trees Classifier for Feature Selection


The Extra Trees Classifier is an ensemble learning technique that aggregates the results of multiple de-correlated decision trees collected in a "forest" to produce its classification result. It is conceptually very similar to the Random Forest Classifier and differs from it only in the way the decision trees in the forest are constructed.

Each decision tree in the Extra Trees forest is built from the original training sample. At each test node, each tree is given a random sample of k features from the full feature set, and it must choose the best of these features to split the data on, according to some mathematical criterion (typically the Gini index). This random sampling of features yields many de-correlated decision trees.

To perform feature selection with the forest structure above, the total decrease in the mathematical criterion used to decide the splits (the Gini index, if the Gini index was used when building the forest) is computed for every feature during the construction of the forest. This value is called the Gini importance of the feature. To select features, each feature is ordered in descending order of its Gini importance and the user picks the top k features of their choice.
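
For illustration, here is a minimal sketch (not part of the original article, and using a synthetic dataset) of how the top k features could be kept after ranking them by Gini importance in scikit-learn:

    # Minimal sketch: rank features by Gini importance and keep the top k (synthetic data)
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier

    # Synthetic classification data, purely for illustration
    X, y = make_classification(n_samples=200, n_features=8, random_state=0)

    forest = ExtraTreesClassifier(n_estimators=100, criterion='gini', random_state=0)
    forest.fit(X, y)

    k = 3  # number of features to keep (chosen by the user)
    top_k = np.argsort(forest.feature_importances_)[::-1][:k]  # indices of the k most important features
    X_selected = X[:, top_k]                                   # data restricted to those features
    print(top_k, X_selected.shape)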

Consider the following data, with the features Outlook, Temperature, Humidity and Wind and the output label Play Tennis:

Let us build a hypothetical Extra Trees forest for the above data with five decision trees and with k, the number of features sampled at random for each tree, equal to two. Information gain will be used as the decision criterion. First, we calculate the entropy of the data. The formula for entropy is

    E(S) = - Σ p_i * log2(p_i),   summed over i = 1, ..., c

where c is the number of unique class labels and p_i is the proportion of rows whose output label is i.

Plugging the class proportions of the data above into this formula gives the entropy of the data, which the decision trees below use as the starting point for their information-gain calculations.
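
As a quick illustration of this formula (not part of the original article), the short sketch below computes the entropy of a collection of class labels; the 9-to-5 label split used here is only a hypothetical example:

    # Entropy of a collection of class labels: E = -sum(p_i * log2(p_i))
    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()          # proportion of rows with each label
        return -np.sum(p * np.log2(p))

    # Hypothetical example: 9 positive and 5 negative rows
    labels = ['Yes'] * 9 + ['No'] * 5
    print(round(entropy(labels), 3))       # 0.94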

Suppose the five decision trees are built so that:

  • The 1st decision tree receives data with the Outlook and Temperature features.

    Note that the formula for information gain is

        IG(S, A) = E(S) - Σ_v (|S_v| / |S|) * E(S_v)

    i.e. the entropy of the parent set S minus the weighted entropy of the subsets S_v produced by splitting S on feature A (a code sketch of this calculation appears after this list). So, for this tree:

        Gain(S, Outlook) ≈ 0.246

    and likewise:

        Gain(S, Temperature) ≈ 0.029
  • The 2nd decision tree receives data with the Temperature and Wind features.

    Using the same formula:

        Gain(S, Temperature) ≈ 0.029
        Gain(S, Wind) ≈ 0.048
  • The 3rd decision tree receives data with the Outlook and Humidity features:

        Gain(S, Outlook) ≈ 0.246
        Gain(S, Humidity) ≈ 0.151

  • The 4th decision tree receives data with the Temperature and Humidity features:

        Gain(S, Temperature) ≈ 0.029
        Gain(S, Humidity) ≈ 0.151

  • The 5th decision tree receives data with the Wind and Humidity features:

        Gain(S, Wind) ≈ 0.048
        Gain(S, Humidity) ≈ 0.151

    Now calculate the total information gain for each feature across the whole forest:

      Total Info Gain for Outlook     = 0.246 + 0.246         = 0.492
      Total Info Gain for Temperature = 0.029 + 0.029 + 0.029 = 0.087
      Total Info Gain for Humidity    = 0.151 + 0.151 + 0.151 = 0.453
      Total Info Gain for Wind        = 0.048 + 0.048         = 0.096

  • Thus, according to the Extra Trees forest built above, the most important feature for determining the output label is Outlook.
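
    To make the per-tree gains above concrete, here is a small self-contained sketch of the information-gain calculation; the three-way split it uses is hypothetical and is not taken from the article's data:

    # Information gain of splitting a label set by a feature: IG = E(parent) - weighted E(children)
    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(parent, children):
        n = len(parent)
        weighted_child_entropy = sum(len(c) / n * entropy(c) for c in children)
        return entropy(parent) - weighted_child_entropy

    # Hypothetical 14-row label set split into three subsets by some feature
    parent = ['Yes'] * 9 + ['No'] * 5
    children = [['Yes', 'Yes', 'No', 'No', 'No'],
                ['Yes', 'Yes', 'Yes', 'Yes'],
                ['Yes', 'Yes', 'Yes', 'No', 'No']]
    print(round(information_gain(parent, children), 3))   # ~0.247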

    The code below demonstrates how to perform feature selection using the Extra Trees Classifier.

    Step 1: Importing the required libraries

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import ExtraTreesClassifier

    Step 2: Loading and cleaning the data

    # Change the working directory to the file location
    cd C:\Users\Dev\Desktop\Kaggle

    # Loading the data
    df = pd.read_csv('data.csv')

    # Separating the dependent and independent variables
    y = df['Play Tennis']
    X = df.drop('Play Tennis', axis=1)

    X.head()

    Step 3: Building the Extra Trees forest and computing the individual feature importances

    # Building the model
    extra_tree_forest = ExtraTreesClassifier(n_estimators=5,
                                             criterion='entropy',
                                             max_features=2)

    # Training the model
    extra_tree_forest.fit(X, y)

    # Computing the importance of each feature
    feature_importance = extra_tree_forest.feature_importances_

    # Normalizing the individual importances (here: the standard deviation of
    # each feature's importance across the trees in the forest)
    feature_importance_normalized = np.std([tree.feature_importances_
                                            for tree in extra_tree_forest.estimators_],
                                           axis=0)
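
    As a possible follow-up (not part of the original article), the importances computed above can be used to keep only the top k columns of X, continuing from the variables defined in Step 3:

    # Hypothetical continuation of Step 3: keep the two most important columns of X
    k = 2
    top_k_columns = X.columns[np.argsort(feature_importance)[::-1][:k]]
    X_selected = X[top_k_columns]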

    Step 4: Visualizing and comparing the results

    # Plotting a bar chart to compare the feature importances
    plt.bar(X.columns, feature_importance_normalized)
    plt.xlabel('Feature Labels')
    plt.ylabel('Feature Importances')
    plt.title('Comparison of different Feature Importances')
    plt.show()

    Thus, the above output confirms our reasoning about feature selection with the Extra Trees Classifier. The computed feature importances may differ from run to run because of the random sampling of features.
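
    If reproducible importances are wanted, a fixed random_state can be passed when building the model, for example (a minimal variation of the Step 3 code):

    # Fixing the seed so repeated runs produce the same feature importances
    extra_tree_forest = ExtraTreesClassifier(n_estimators=5, criterion='entropy',
                                             max_features=2, random_state=0)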




