  # ML | Active learning


Where should we apply active learning?

1. We have very little labeled data, or a huge amount of unlabeled data.
2. Annotating the dataset costs human effort, time, and money.

**Example**

On a certain planet there are fruits of different sizes (1-5); some of them are poisonous and others are not. The only criterion for deciding whether a fruit is poisonous is its size. Our task is to prepare a classifier that predicts whether a given fruit is poisonous. The only information we have is that fruits of size 1 are not poisonous, fruits of size 5 are poisonous, and above a certain size all fruits are poisonous.

The first approach is to check every fruit size one by one, which is time- and resource-consuming.
The second approach is to apply binary search and find the transition point (decision boundary). This approach uses far fewer queries and gives the same result as the linear search.
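The binary-search approach above can be sketched as follows. The oracle function and the boundary value 4 are assumptions for illustration, not part of the original example:

```python
# Hypothetical sketch: find the smallest poisonous fruit size with binary
# search, assuming a single transition point (sizes below it are safe,
# sizes at or above it are poisonous).

def is_poisonous(size, boundary=4):
    """Stand-in for the human oracle: fruits of size >= boundary are poisonous."""
    return size >= boundary

def find_boundary(low=1, high=5):
    """Binary-search the transition point, querying the oracle O(log n) times."""
    queries = 0
    while low < high:
        mid = (low + high) // 2
        queries += 1
        if is_poisonous(mid):
            high = mid        # boundary is at mid or below
        else:
            low = mid + 1     # boundary is above mid
    return low, queries

boundary, queries = find_boundary()
print(boundary, queries)  # boundary 4 found with only 2 oracle queries
```

With five sizes, linear search needs up to five oracle queries; binary search needs at most three.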

```
General Algorithm:

1. Train the classifier on the initial training dataset
2. Calculate the accuracy
3. While (accuracy < desired accuracy):
4.     Select the most valuable data points (in general, points close to the decision boundary)
5.     Query the label(s) for those data point(s) from the human oracle
6.     Add the data point(s) to the initial training dataset
7.     Re-train the model
8.     Re-calculate the accuracy
```

Approaches for active learning
1. Query Synthesis

• Usually this approach is used when we have a very small dataset.
• In this approach, we select any point from the given n-dimensional feature space; we don't care whether that point actually exists in our data.
• Query synthesis can select any (valuable) point from the 3 × 3 2-D plane.

• Sometimes it is difficult for a human oracle to annotate the synthesized data point.
• For example, queries generated by the query synthesis approach for a handwriting-recognition model can be very hard to annotate, since the synthesized images need not resemble any real character.
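The idea of query synthesis can be sketched as follows: after fitting a classifier on a few labeled points, we synthesize a brand-new query anywhere in the 2-D feature space; the queried point need not exist in any dataset. The tiny training set and grid below are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# tiny initial training set (assumed for illustration)
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 2.0], [3.0, 3.0]])
y_train = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_train, y_train)

# candidate points synthesized over the whole plane, not drawn from a pool
xs, ys = np.meshgrid(np.linspace(0, 3, 4), np.linspace(0, 3, 4))
candidates = np.column_stack([xs.ravel(), ys.ravel()])

# query the synthetic point whose predicted probability is closest to 0.5,
# i.e. the point the model is least certain about
proba = clf.predict_proba(candidates)[:, 1]
query = candidates[np.argmin(np.abs(proba - 0.5))]
print(query)  # a point near the decision boundary, sent to the oracle for a label
```

Because `query` is generated rather than drawn from real data, a human oracle may find it meaningless to label; that is the weakness noted above.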

2. Sampling

• This approach is used when we have a large dataset.
• In this approach, we split our dataset into three parts: a training set, a test set, and an unlabeled pool [5%, 25%, 70%].
• This training dataset is our initial dataset and is used for the initial training of our model.
• This approach selects valuable/uncertain points from the unlabeled pool, which guarantees that every queried point is a real example the human oracle can recognize and annotate.
• In the figure, the black dots represent the unlabeled pool, and the red and green dots together represent the training dataset.
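The three-way split described above can be sketched with two calls to `train_test_split`. The synthetic data and the exact proportions are assumptions matching the [5%, 25%, 70%] figures in the text:

```python
# Hypothetical sketch of the pool-based split: 5% labeled training data,
# 25% test data, 70% unlabeled pool.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] > 0).astype(int)

# first carve out the 5% training set ...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.05, random_state=0)
# ... then split the remaining 95% so the test set is 25% of the whole
X_pool, X_test, y_pool, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25 / 0.95, random_state=0)

print(len(X_train), len(X_pool), len(X_test))  # 50 700 250
```

In practice the pool labels (`y_pool`) are hidden from the model and only revealed, one query at a time, by the oracle.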

Here is an active learning model that selects valuable points based on the predicted class probability of each point.

```python
import numpy as np
import pandas as pd
from statistics import mean
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


# split the dataset into a training set, a test set and an unlabeled pool
def split(dataset, train_size, test_size):
    x = dataset[:, :-1]
    y = dataset[:, -1]
    x_train, x_pool, y_train, y_pool = train_test_split(
        x, y, train_size=train_size)
    unlabel, x_test, label, y_test = train_test_split(
        x_pool, y_pool, test_size=test_size)
    return x_train, y_train, x_test, y_test, unlabel, label


if __name__ == '__main__':
    # read the dataset
    dataset = pd.read_csv("./spambase.csv").values[:, :]

    # impute missing data
    imputer = SimpleImputer(missing_values=0, strategy="mean")
    imputer = imputer.fit(dataset[:, :-1])
    dataset[:, :-1] = imputer.transform(dataset[:, :-1])

    # feature scaling
    sc = StandardScaler()
    dataset[:, :-1] = sc.fit_transform(dataset[:, :-1])

    # run both models 100 times and take the average of their accuracy
    ac1, ac2 = [], []  # lists to store the accuracy of the two models

    for i in range(100):

        # split dataset into train (5%), test (25%), unlabeled (70%)
        x_train, y_train, x_test, y_test, unlabel, label = split(
            dataset, 0.05, 0.25)

        # train the model with active learning
        for j in range(5):
            classifier1 = LogisticRegression()
            classifier1.fit(x_train, y_train)
            y_probab = classifier1.predict_proba(unlabel)[:, 0]
            p = 0.47  # uncertainty range from 0.47 to 0.53
            uncrt_pt_ind = []
            for k in range(unlabel.shape[0]):
                if p <= y_probab[k] <= 1 - p:
                    uncrt_pt_ind.append(k)
            # move the uncertain points from the pool into the training set
            x_train = np.append(unlabel[uncrt_pt_ind, :], x_train, axis=0)
            y_train = np.append(label[uncrt_pt_ind], y_train)
            unlabel = np.delete(unlabel, uncrt_pt_ind, axis=0)
            label = np.delete(label, uncrt_pt_ind)
        classifier2 = LogisticRegression()
        classifier2.fit(x_train, y_train)
        ac1.append(classifier2.score(x_test, y_test))

        # split dataset into train (same size as produced by the active
        # learning loop), test (25%), unlabeled (rest)
        train_size = x_train.shape[0] / dataset.shape[0]
        x_train, y_train, x_test, y_test, unlabel, label = split(
            dataset, train_size, 0.25)

        # train the model without active learning (random sampling)
        classifier3 = LogisticRegression()
        classifier3.fit(x_train, y_train)
        ac2.append(classifier3.score(x_test, y_test))

    print("Accuracy by active model:", mean(ac1) * 100)
    print("Accuracy by random sampling:", mean(ac2) * 100)

"""
This code is provided by Raghav Dalmia
https://github.com/raghav-dalmia
"""
```

```
Output:
Accuracy by active model: 80.7
Accuracy by random sampling: 79.5
```

There are several strategies for choosing the most valuable points. Some of them are: