Anomaly detection is the identification of data points that deviate markedly from the rest of the data; in other words, points that do not conform to the expected pattern. Such abnormal data are also called outliers or exceptions. Anomaly detection matters across many domains because it surfaces valuable and actionable insights. For example, abnormalities in an MRI scan may indicate a tumor region in the brain, while abnormal readings from factory sensors may indicate a damaged component.

After completing this tutorial, you will be able to:

- Define and understand anomaly detection.
- Implement an anomaly detection algorithm, then analyze and interpret its results.
- Spot hidden patterns in data that may signal anomalous behavior.

Let's start.

## What is anomaly detection?

An outlier is simply a data point that deviates considerably from the other data points in a particular dataset. Likewise, anomaly detection is the process of identifying such outliers: points that differ markedly from the bulk of the data.

Large datasets may contain complex patterns that cannot be detected by simply eyeballing the data. This is why anomaly detection has become a significant area of research and a key component of many machine learning applications.
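As a minimal illustration with made-up numbers (not part of the tutorial's dataset), the simplest global-outlier check flags points whose z-score exceeds a chosen cutoff:

```python
import numpy as np

# Toy 1-D data (hypothetical): six ordinary readings and one extreme one.
data = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2])

# z-score: how many standard deviations each point lies from the mean.
z = (data - data.mean()) / data.std()

# Flag points more than 2 standard deviations from the mean.
outliers = data[np.abs(z) > 2]
```

Real detectors are more robust than this, but the idea of scoring each point and thresholding the score recurs in both methods below.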

## Types of anomalies

In data science, anomalies are commonly classified in three ways, and understanding the distinction can significantly affect how you handle them.

- Point (global) anomalies: data points that differ significantly from all other data points; this is considered the most common form of anomaly. Global anomalies are typically found far from the mean or median of the data distribution.
- Contextual (conditional) anomalies: points whose values differ markedly from other data points in the same context. A point that is anomalous in one dataset or context may be perfectly normal in another.
- Collective anomalies: a group of data points that is anomalous as a whole because its members share the same unusual characteristics, even if each point looks ordinary on its own. For example, your server is not under attack every day, so a sustained burst of unusual traffic would be treated as a collective anomaly.
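The contextual case can be sketched with a hypothetical example: a 30-degree temperature reading is ordinary among summer readings but extreme among winter readings, so only the winter occurrence is flagged. Here the season label plays the role of the context:

```python
import numpy as np

# Hypothetical temperature readings with a season label as context.
temps = np.array([28.0, 30.0, 29.0, 31.0, 2.0, 1.0, 30.0, 3.0])
season = np.array(["summer"] * 4 + ["winter"] * 4)

# Score each reading only against other readings from the same season.
flags = np.zeros(len(temps), dtype=bool)
for s in ("summer", "winter"):
    m = season == s
    z = (temps[m] - temps[m].mean()) / temps[m].std()
    flags[m] = np.abs(z) > 1.5
```

A global check on `temps` alone would not single out the winter 30.0, since 30.0 also occurs in summer; conditioning on the context is what exposes it.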

There are many techniques for anomaly detection; let's implement a couple of them to see how they work in practice.

## Isolation Forest

Like a random forest, an Isolation Forest is built from decision trees. It is trained in an unsupervised fashion, since there are no predefined labels. The design idea behind Isolation Forest is that anomalies are the "few and different" data points in a dataset.

Recall that decision trees are built using information criteria such as the Gini index or entropy. Clearly distinct groups separate near the root of the tree, while subtler distinctions emerge deeper in the branches. Based on randomly selected features, an Isolation Forest partitions randomly subsampled data in a tree structure. Samples that travel deep into a tree, requiring many cuts to isolate, are unlikely to be anomalies; conversely, samples that end up on short branches are more likely to be anomalies, because the trees found it easy to separate them from the rest of the data.
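This path-length intuition can be checked numerically. In the sketch below (synthetic data, not the tutorial's dataset), `score_samples()` assigns the lowest score to the one point that is easiest to isolate:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# A tight cluster around the origin plus one far-away point.
rng = np.random.RandomState(0)
X = np.concatenate([rng.normal(0, 0.5, size=(100, 2)), [[6.0, 6.0]]])

model = IsolationForest(n_estimators=100, random_state=0).fit(X)
# Lower score = shorter average path = easier to isolate = more anomalous.
scores = model.score_samples(X)
```

The isolated point `[6.0, 6.0]` receives the minimum score, matching the "short branch" argument above.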

In this section, we will implement Isolation Forest in Python to see how it detects anomalies in a dataset. The well-known scikit-learn API provides a ready-made implementation, so we will use it to apply an Isolation Forest and demonstrate its effectiveness for anomaly detection.

First, let's load the necessary libraries and packages.

```python
from sklearn.datasets import make_blobs
from numpy import quantile, random, where
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt
```

### Data preparation

We will use the make_blobs() function to create a dataset of random data points.

```python
random.seed(3)
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=.3, center_box=(20, 5))
```

Let's plot the dataset to see how the data points are scattered across the sample space.

```python
plt.scatter(X[:, 0], X[:, 1], marker="o", c=_, s=25, edgecolor="k")
```

### Define and fit the Isolation Forest model for prediction

As mentioned earlier, we will define our model with the IsolationForest class from the scikit-learn API. In the class parameters, we set the number of estimators and the contamination value. We then use the fit_predict() function to fit the model to the dataset and obtain its predictions.

```python
IF = IsolationForest(n_estimators=100, contamination=.03)
predictions = IF.fit_predict(X)
```
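As a side note, the contamination parameter directly controls the fraction of training points that end up labelled -1. A hedged sketch (using a fixed `random_state` for reproducibility, separate from the tutorial's pipeline):

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

# With contamination=.03 on 300 samples, the decision threshold is placed
# so that roughly 3% of the points (about 9 of them) receive the label -1.
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=.3, random_state=3)
pred = IsolationForest(n_estimators=100, contamination=.03,
                       random_state=0).fit_predict(X)
n_outliers = (pred == -1).sum()
```

Choosing contamination is therefore a statement about how much of your data you believe is anomalous, not something the model estimates for you.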

Now let's extract the samples predicted as -1 as outliers and plot the result, highlighting the outliers in color.

```python
outlier_index = where(predictions == -1)
values = X[outlier_index]
plt.scatter(X[:, 0], X[:, 1])
plt.scatter(values[:, 0], values[:, 1], color='y')
plt.show()
```

Putting it all together, here is the complete code:

```python
from sklearn.datasets import make_blobs
from numpy import quantile, random, where
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt

random.seed(3)
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=.3, center_box=(20, 5))
plt.scatter(X[:, 0], X[:, 1], marker="o", c=_, s=25, edgecolor="k")

IF = IsolationForest(n_estimators=100, contamination=.03)
predictions = IF.fit_predict(X)

outlier_index = where(predictions == -1)
values = X[outlier_index]
plt.scatter(X[:, 0], X[:, 1])
plt.scatter(values[:, 0], values[:, 1], color='y')
plt.show()
```

## Kernel density estimation

If we assume that the bulk of the dataset follows some probability distribution, then anomalies are the observations we should see only rarely, i.e., with very low probability. Kernel density estimation (KDE) is a technique for estimating the probability density function of the data points in the sample space. Using the estimated density, we can flag low-density points as anomalies.
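Before building the full example, here is a minimal sketch of that idea on synthetic 1-D data (the numbers are illustrative): a KDE fitted to a well-behaved sample assigns a far lower log-density to a distant point than to a typical one.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# A well-behaved 1-D sample drawn from a standard normal distribution.
rng = np.random.RandomState(0)
X = rng.normal(0, 1, size=(200, 1))

kde = KernelDensity(bandwidth=0.5).fit(X)
# Compare the log-density of a typical point (0.0) and a remote point (8.0).
log_dens = kde.score_samples(np.array([[0.0], [8.0]]))
```

The remote point's log-density is dramatically lower, which is exactly the signal we will threshold on below.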

To do this, we will prepare the data by generating a roughly uniform distribution, and then apply the KernelDensity class from scikit-learn to detect the outliers.

First, we will load the necessary libraries and packages.

```python
from sklearn.neighbors import KernelDensity
from numpy import where, random, array, quantile
import matplotlib.pyplot as plt
```

### Prepare and plot data

Let's write a simple function to prepare the dataset; the randomly generated data will serve as our target dataset.

```python
random.seed(135)

def prepData(N):
    X = []
    for i in range(N):
        A = i / 1000 + random.uniform(-4, 3)
        R = random.uniform(-5, 10)
        if R >= 8.6:
            R = R + 10
        elif R < -4.6:
            R = R - 9
        X.append([A + R])
    return array(X)

n = 500
X = prepData(n)
```

Let's plot the data to examine the dataset.

```python
x_ax = range(n)
plt.plot(x_ax, X)
plt.show()
```

### Prepare and fit the kernel density function for prediction

We will use the scikit-learn API to define and fit the model, then use the score_samples() function to get the log-density score of each sample in the dataset. Next, we use the quantile() function to obtain a threshold.

```python
kern_dens = KernelDensity()
kern_dens.fit(X)
scores = kern_dens.score_samples(X)
threshold = quantile(scores, .02)
print(threshold)
```

```
-5.676136054971186
```
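The role of quantile() here can be checked in isolation. A toy sketch with stand-in scores (not the KDE scores above): picking the 0.02 quantile as the cutoff flags roughly 2% of the samples.

```python
import numpy as np

scores = np.arange(100, dtype=float)   # stand-in for 100 log-density scores
threshold = np.quantile(scores, .02)   # value below which ~2% of scores fall
flagged = scores[scores <= threshold]  # about 2 of the 100 points
```

So, like contamination in the Isolation Forest example, the quantile level encodes an assumption about what fraction of the data is anomalous.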

Samples whose score is at or below this threshold are flagged as anomalies and highlighted in color in the plot:

```python
idx = where(scores <= threshold)
values = X[idx]
plt.plot(x_ax, X)
plt.scatter(idx, values, color='r')
plt.show()
```

Putting it all together, here is the complete code:

```python
from sklearn.neighbors import KernelDensity
from numpy import where, random, array, quantile
import matplotlib.pyplot as plt

random.seed(135)

def prepData(N):
    X = []
    for i in range(N):
        A = i / 1000 + random.uniform(-4, 3)
        R = random.uniform(-5, 10)
        if R >= 8.6:
            R = R + 10
        elif R < -4.6:
            R = R - 9
        X.append([A + R])
    return array(X)

n = 500
X = prepData(n)

x_ax = range(n)
plt.plot(x_ax, X)
plt.show()

kern_dens = KernelDensity()
kern_dens.fit(X)
scores = kern_dens.score_samples(X)
threshold = quantile(scores, .02)
print(threshold)

idx = where(scores <= threshold)
values = X[idx]
plt.plot(x_ax, X)
plt.scatter(idx, values, color='r')
plt.show()
```