Anomalies can be broadly divided into three categories —
- Point anomaly: a single data point is a point anomaly if it lies far from the rest of the data.
- Contextual anomaly: an observation is a contextual anomaly if it is anomalous only within a specific context (for example, a temperature that is normal in summer but anomalous in winter).
- Collective anomaly: a set of related data instances is a collective anomaly if the collection as a whole is anomalous with respect to the entire dataset, even when the individual instances are not.
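As a minimal illustration of the point-anomaly case, the sketch below flags values lying more than two standard deviations from the mean of a small made-up reading set (both the data and the 2-sigma cutoff are arbitrary choices for this example, not part of the tutorial's method):

```python
# Hypothetical 1-D sensor readings: most cluster near 10, one lies far away.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0, 10.0, 9.7]

mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
std = variance ** 0.5

# Flag values more than 2 standard deviations from the mean as point anomalies.
point_anomalies = [x for x in data if abs(x - mean) / std > 2]
print(point_anomalies)
```

Here only the reading 25.0 is flagged; every other value sits well within two standard deviations of the mean.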
Anomaly detection can be performed with machine learning, in the following ways —
- Supervised anomaly detection: this approach requires a labeled dataset containing both normal and anomalous samples, from which a predictive model is built to classify future data points. Commonly used algorithms for this purpose include supervised neural networks.
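To make the supervised setting concrete, here is a minimal sketch (not part of the original tutorial) that labels new points with a plain 1-nearest-neighbour rule over a tiny hand-labeled dataset; the data, and the choice of 1-NN as the classifier, are illustrative assumptions:

```python
# Toy labeled dataset: (features, label) pairs, where label 1 = anomalous.
train = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((0.1, 0.3), 0),
         ((5.0, 5.0), 1), ((5.2, 4.8), 1)]

def classify(point, samples):
    # Predict the label of the single nearest training sample (1-NN).
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    nearest = min(samples, key=lambda s: dist2(point, s[0]))
    return nearest[1]

print(classify((0.1, 0.1), train))  # near the normal cluster
print(classify((4.9, 5.1), train))  # near the anomalous cluster
```

A real supervised detector would use a richer model (e.g. a neural network), but the workflow is the same: fit on labeled normal/anomalous samples, then classify new points.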
Step 1: Import the required libraries
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
from pyod.models.knn import KNN
from pyod.utils.data import generate_data, get_outliers_inliers
Step 2: Create Synthetic Data
# generate a random dataset with two features
X_train, y_train = generate_data(n_train=300, train_only=True, n_features=2)
# set the fraction of outliers
outlier_fraction = 0.1
# store the outliers and inliers in separate arrays
X_outliers, X_inliers = get_outliers_inliers(X_train, y_train)
n_inliers = len(X_inliers)
n_outliers = len(X_outliers)
# separate the two features
f1 = X_train[:, [0]].reshape(-1, 1)
f2 = X_train[:, [1]].reshape(-1, 1)
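A quick aside on the indexing used above: selecting a column with X[:, [0]] already yields a 2-D array, and reshape(-1, 1) simply makes the single-column shape explicit. A small self-contained check, using a made-up 3x2 array in place of the generated training data:

```python
import numpy as np

# Made-up stand-in for X_train: 3 samples, 2 features.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# [:, [0]] keeps a 2-D column slice; reshape(-1, 1) makes the
# one-column shape explicit regardless of the number of rows.
f1 = X[:, [0]].reshape(-1, 1)
f2 = X[:, [1]].reshape(-1, 1)
print(f1.shape, f2.shape)
```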
Step 3: Data Visualization
# create a grid, used later to plot the decision surface
xx, yy = np.meshgrid(np.linspace(-10, 10, 200), np.linspace(-10, 10, 200))
# scatter plot of the two features
plt.scatter(f1, f2)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
Step 4: Train and Evaluate the Model
# train the kNN detector
clf = KNN(contamination=outlier_fraction)
clf.fit(X_train, y_train)
# you can print this to see all the prediction scores
scores_pred = clf.decision_function(X_train) * -1
y_pred = clf.predict(X_train)
# count the number of prediction errors
n_errors = (y_pred != y_train).sum()
print('The number of prediction errors are ' + str(n_errors))
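PyOD's KNN detector scores each point by its distance to nearby neighbours. As a rough sketch of the idea (not PyOD's exact implementation, which supports several aggregation modes), the following computes the distance to the k-th nearest neighbour with plain NumPy on a small made-up dataset; the point with the largest score is the most outlying:

```python
import numpy as np

# Small 2-D dataset with one obvious outlier at (8, 8).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [8.0, 8.0]])
k = 2

# Pairwise Euclidean distances between all points (shape 5 x 5).
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Score each point by the distance to its k-th nearest neighbour
# (column 0 after sorting is the point itself, at distance 0).
scores = np.sort(d, axis=1)[:, k]
print(scores.argmax())
```

The last point gets by far the largest score, which is exactly what lets a distance-based detector separate it from the tight cluster.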
Step 5: Visualize the Predictions
# threshold that decides whether a datapoint
# is an inlier or an outlier
threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)
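The call to stats.scoreatpercentile picks the threshold so that roughly outlier_fraction of the training scores fall below it. The same semantics can be checked with np.percentile on synthetic scores (the random data here is chosen only for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=100)
outlier_fraction = 0.1

# 100 * outlier_fraction = 10, so this is the 10th percentile:
# about 10% of the scores lie below the returned threshold.
threshold = np.percentile(scores, 100 * outlier_fraction)
frac_below = (scores < threshold).mean()
print(frac_below)
```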
# the detector computes the raw
# anomaly score for each grid point
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
Z = Z.reshape(xx.shape)
# fill with a blue colormap from the minimum
# anomaly score to the threshold
subplot = plt.subplot(1, 2, 1)
subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 10),
                 cmap=plt.cm.Blues_r)
# draw a red contour line where the anomaly
# score equals the threshold
a = subplot.contour(xx, yy, Z, levels=[threshold],
                    linewidths=2, colors='red')
# fill orange contours where the anomaly score
# ranges from the threshold to the maximum score
subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')
# scatter plot of the inliers with white dots
b = subplot.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1],
                    c='white', s=20, edgecolor='k')
# scatter plot of the outliers with black dots
c = subplot.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1],
                    c='black', s=20, edgecolor='k')
subplot.axis('tight')
subplot.legend(
    [a.collections[0], b, c],
    ['learned decision function', 'true inliers', 'true outliers'],
    prop=matplotlib.font_manager.FontProperties(size=10),
    loc='lower right')
subplot.set_title('K-Nearest Neighbors')
subplot.set_xlim((-10, 10))
subplot.set_ylim((-10, 10))
plt.show()
Link: https://www.analyticsvidhya.com/blog/2019/02/outlier-detection-python-pyod/