ML | Active learning

Where should we apply active learning?

  1. We have very little data or a huge amount of data.
  2. Annotating the dataset is costly in terms of human effort, time, and money.
  3. We have access to limited computing power.

Example

A certain planet has fruits of different sizes (1-5); some of them are poisonous, while others are not. The only criterion for deciding whether a fruit is poisonous is its size. Our task is to prepare a classifier that predicts whether a given fruit is poisonous. The only information we have is that fruits of size 1 are not poisonous, fruits of size 5 are poisonous, and beyond a certain size all fruits are poisonous.

The first approach is to test a fruit of every size, which is time- and resource-consuming.
The second approach is to apply binary search to find the transition point (decision boundary). This approach uses less data and gives the same result as linear search.
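
As a minimal sketch of this second approach, binary search can locate the transition size with only a handful of label queries. Here `is_poisonous` is a hypothetical oracle standing in for the human annotator (its hard-coded boundary at size 3 is an assumption for illustration):

    def is_poisonous(size):
        # hypothetical oracle: in a real setting this is one query to a
        # human annotator; here the (unknown) boundary is fixed at size 3
        return size >= 3

    def find_boundary(lo=1, hi=5):
        # precondition: is_poisonous(lo) is False and is_poisonous(hi) is True
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if is_poisonous(mid):  # one label query
                hi = mid
            else:
                lo = mid
        return hi  # smallest poisonous size

    print(find_boundary())  # -> 3, found in O(log n) queries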

  General algorithm:
  1. Train the classifier on the initial training dataset.
  2. Calculate the accuracy.
  3. While accuracy < desired accuracy:
    4. Select the most valuable data points (in general, points close to the decision boundary).
    5. Query those data points' labels from the human oracle.
    6. Add those data points to the initial training dataset.
    7. Re-train the model.
    8. Re-calculate the accuracy.
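
A minimal sketch of this loop, assuming a scikit-learn-style binary classifier and a hypothetical `ask_oracle` callback standing in for the human annotator (both names are assumptions, not part of any library):

    import numpy as np

    def active_learning_loop(model, x_train, y_train, pool,
                             x_test, y_test, desired_accuracy, ask_oracle):
        # steps 1-2: initial training and accuracy check
        model.fit(x_train, y_train)
        while model.score(x_test, y_test) < desired_accuracy:
            # step 4: pick the pool point closest to the decision boundary
            proba = model.predict_proba(pool)[:, 1]
            idx = int(np.argmin(np.abs(proba - 0.5)))
            # step 5: ask the human oracle for its label
            y_new = ask_oracle(pool[idx])
            # step 6: add the point to the training set and drop it from the pool
            x_train = np.vstack([x_train, pool[idx:idx + 1]])
            y_train = np.append(y_train, y_new)
            pool = np.delete(pool, idx, axis=0)
            # steps 7-8: re-train; accuracy is re-checked by the loop condition
            model.fit(x_train, y_train)
        return model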

Approaches to active learning
1. Query synthesis

  • Usually this approach is used when we have a very small dataset.
  • In this approach, we select any uncertain point from the given n-dimensional space; we don't care whether that point actually exists.
  • For example, query synthesis can select any (valuable) point from a 3 × 3 2-D plane.

  • Sometimes it is difficult for a human oracle to annotate the queried data point.
  • For example, some queries generated by query synthesis for a handwriting-recognition model resemble no real character and are very difficult to annotate.
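
A minimal sketch of the idea, assuming a fitted scikit-learn-style binary classifier over the 3 × 3 2-D plane mentioned above; note that the synthesized point is simply a coordinate and need not exist in any real dataset:

    import numpy as np

    def synthesize_query(model, low=0.0, high=3.0, n_candidates=1000):
        # propose arbitrary points in the 3 x 3 plane -- they do not
        # have to correspond to any real observation
        candidates = np.random.uniform(low, high, size=(n_candidates, 2))
        proba = model.predict_proba(candidates)[:, 1]
        # return the candidate the model is least certain about,
        # i.e. the one closest to the decision boundary
        return candidates[int(np.argmin(np.abs(proba - 0.5)))]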

2. Sampling

  • This approach is used when we have a large dataset.
  • In this approach, we split our dataset into three parts: a training set (5%), a test set (25%), and an unlabeled pool (70%).
  • The training set is our initial dataset and is used for the initial training of the model.
  • The approach selects valuable/uncertain points from the unlabeled pool, which ensures that the human oracle can recognize every query.
  • In the figure, the black dots represent the unlabeled pool, and the red and green dots together represent the training dataset.

    Below is an active-learning model that queries valuable points based on the probability of a point lying in a class, using the spambase dataset.

    import numpy as np
    import pandas as pd
    from statistics import mean
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split


    # split the dataset into train set, test set and unlabeled pool
    def split(dataset, train_size, test_size):
        x = dataset[:, :-1]
        y = dataset[:, -1]
        x_train, x_pool, y_train, y_pool = train_test_split(
            x, y, train_size=train_size)
        unlabel, x_test, label, y_test = train_test_split(
            x_pool, y_pool, test_size=test_size)
        return x_train, y_train, x_test, y_test, unlabel, label


    if __name__ == '__main__':
        # read the dataset
        dataset = pd.read_csv("./spambase.csv").values[:, :]

        # impute missing data
        imputer = SimpleImputer(missing_values=0, strategy="mean")
        imputer = imputer.fit(dataset[:, :-1])
        dataset[:, :-1] = imputer.transform(dataset[:, :-1])

        # feature scaling
        sc = StandardScaler()
        dataset[:, :-1] = sc.fit_transform(dataset[:, :-1])

        # run both models 100 times and take the average of their accuracy
        ac1, ac2 = [], []  # lists to store the accuracy of the two models

        for i in range(100):
            # split dataset into train (5%), test (25%), unlabeled pool (70%)
            x_train, y_train, x_test, y_test, unlabel, label = split(
                dataset, 0.05, 0.25)

            # train the model with active learning
            for j in range(5):
                classifier1 = LogisticRegression()
                classifier1.fit(x_train, y_train)
                y_probab = classifier1.predict_proba(unlabel)[:, 0]
                p = 0.47  # uncertainty range is 0.47 to 0.53
                uncrt_pt_ind = []
                for k in range(unlabel.shape[0]):
                    if y_probab[k] >= p and y_probab[k] <= 1 - p:
                        uncrt_pt_ind.append(k)
                x_train = np.append(unlabel[uncrt_pt_ind, :], x_train, axis=0)
                y_train = np.append(label[uncrt_pt_ind], y_train)
                unlabel = np.delete(unlabel, uncrt_pt_ind, axis=0)
                label = np.delete(label, uncrt_pt_ind)
            classifier2 = LogisticRegression()
            classifier2.fit(x_train, y_train)
            ac1.append(classifier2.score(x_test, y_test))

            '''split dataset into train (same size as produced by the
            active-learning model), test (25%), unlabeled pool (rest)'''
            train_size = x_train.shape[0] / dataset.shape[0]
            x_train, y_train, x_test, y_test, unlabel, label = split(
                dataset, train_size, 0.25)

            # train the model without active learning (random sampling)
            classifier3 = LogisticRegression()
            classifier3.fit(x_train, y_train)
            ac2.append(classifier3.score(x_test, y_test))

        print("Accuracy by active model:", mean(ac1) * 100)
        print("Accuracy by random sampling:", mean(ac2) * 100)

        # This code is provided by Raghav Dalmia
        # https://github.com/raghav-dalmia

 Output:

    Accuracy by active model: 80.7
    Accuracy by random sampling: 79.5

There are several models for selecting the most valuable points. Some of them are:

  1. Query by committee (sketched below)
  2. Query synthesis and nearest-neighbour search
  3. Large margin heuristic
  4. Posterior probability heuristic
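
As an illustration of the first strategy, here is a minimal query-by-committee sketch for binary labels, assuming a list of fitted scikit-learn-style classifiers; the pool point on which committee members disagree most is the one to query:

    import numpy as np

    def query_by_committee(committee, pool):
        # each committee member votes on every pool point (labels 0/1)
        votes = np.array([member.predict(pool) for member in committee])
        positive = votes.mean(axis=0)        # fraction voting for class 1
        disagreement = np.minimum(positive, 1 - positive)
        return int(np.argmax(disagreement))  # index of the most contested point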

Reference: Burr Settles, Active Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning.