Machine learning for anomaly detection



Anomalies can be broadly divided into three categories:

  1. Point anomaly: a single record (tuple) in a dataset is a point anomaly if it lies far from the rest of the data.
  2. Contextual anomaly: an observation is a contextual anomaly if it is anomalous only in a specific context (for example, a temperature reading that is normal in summer but anomalous in winter).
  3. Collective anomaly: a set of related data instances is a collective anomaly if the collection as a whole is anomalous with respect to the entire dataset, even though each individual instance may appear normal on its own.
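A point anomaly can be illustrated with a simple z-score check. This is only a minimal sketch, not part of the PyOD example below; the helper name and threshold are illustrative choices:

```python
import numpy as np

def point_anomalies(values, z_thresh=2.0):
    """Flag values whose z-score exceeds the threshold (point anomalies)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > z_thresh

data = [10, 11, 9, 10, 12, 10, 11, 95]  # 95 lies far from the rest
print(point_anomalies(data))  # only the last value is flagged
```

Note that a single extreme value inflates the standard deviation, so the threshold here is deliberately modest; robust statistics (e.g. the median absolute deviation) are a common refinement.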

Anomaly detection can be performed with machine learning in the following ways:

  1. Supervised anomaly detection. This method requires a labeled dataset containing both normal and anomalous samples, from which a predictive model is built to classify future data points. Commonly used algorithms for this purpose include supervised neural networks and k-nearest neighbours. The example below uses the KNN detector from the PyOD library.

    Step 1: Import libraries

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt
    import matplotlib.font_manager

    from pyod.models.knn import KNN
    from pyod.utils.data import generate_data, get_outliers_inliers

    Step 2: Create Synthetic Data

    # generate a random dataset with two features
    X_train, y_train = generate_data(n_train=300, train_only=True,
                                     n_features=2)

    # set the fraction of outliers
    outlier_fraction = 0.1

    # store outliers and inliers in separate arrays
    X_outliers, X_inliers = get_outliers_inliers(X_train, y_train)
    n_inliers = len(X_inliers)
    n_outliers = len(X_outliers)

    # separate the two features
    f1 = X_train[:, [0]].reshape(-1, 1)
    f2 = X_train[:, [1]].reshape(-1, 1)

    Step 3: Data Visualization

    # plot the dataset
    # create a grid
    xx, yy = np.meshgrid(np.linspace(-10, 10, 200),
                         np.linspace(-10, 10, 200))

    # scatter plot
    plt.scatter(f1, f2)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')

    Step 4: Train and evaluate the model

    # train the classifier
    clf = KNN(contamination=outlier_fraction)
    clf.fit(X_train)

    # raw anomaly scores (print these to inspect all prediction scores)
    scores_pred = clf.decision_function(X_train) * -1

    # predicted labels (0: inlier, 1: outlier)
    y_pred = clf.predict(X_train)

    # count the number of prediction errors
    n_errors = (y_pred != y_train).sum()
    print('The number of prediction errors are ' + str(n_errors))

    Step 5: Visualize the predictions

    # threshold that decides whether a
    # datapoint is an inlier or an outlier
    threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)

    # the classifier computes the raw
    # anomaly score for each grid point
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
    Z = Z.reshape(xx.shape)

    # fill a blue colormap from the minimum anomaly
    # score to the threshold
    subplot = plt.subplot(1, 2, 1)
    subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 10),
                     cmap=plt.cm.Blues_r)

    # draw a red contour line where the anomaly
    # score equals the threshold
    a = subplot.contour(xx, yy, Z, levels=[threshold],
                        linewidths=2, colors='red')

    # fill orange contours where the anomaly score
    # ranges from the threshold to the maximum
    subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')

      
    # scatter plot of inliers with white dots
    b = subplot.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1],
                        c='white', s=20, edgecolor='k')

    # scatter plot of outliers with black dots
    c = subplot.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1],
                        c='black', s=20, edgecolor='k')
    subplot.axis('tight')

    subplot.legend(
        [a.collections[0], b, c],
        ['learned decision function', 'true inliers', 'true outliers'],
        prop=matplotlib.font_manager.FontProperties(size=10),
        loc='lower right')

    subplot.set_title('K-Nearest Neighbors')
    subplot.set_xlim((-10, 10))
    subplot.set_ylim((-10, 10))
    plt.show()

    Link: https://www.analyticsvidhya.com/blog/2019/02/outlier-detection-python-pyod/