We know what outliers are – the data points which lie outside of where most of our data lies.
When dealing with Outliers, it is relatively straightforward to find outliers in a uni-dimensional setting where we could do a box plot and find potential outliers.
Things get a little complicated when we go multi-dimensional. In multi-dimensional data, if we perform a uni-dimensional outlier analysis using a box-plot on each individual dimension, we might not have a concrete answer because the same data point might be a potential outlier in a particular dimension and not an outlier in another dimension.
To overcome this we do something like Mahalanobis or MCD which would hold good a lot of times.
But now, in the big data era where companies are storing huge chunks of multi-dimensional data, the traditional multivariate outlier techniques start to show their drawbacks.
Most of the outlier techniques rely on distance computations, If the data points distance is large from rest of the data points then it is deemed to be an outlier, for high-dimensional data the concept of distance becomes hazy.
The haziness is due to the well know bottleneck in machine learning called “The curse of dimensionality”, as the number of dimensions increases the concept of distance makes lesser sense as all data points are so far off in the euclidean space that the relative difference in the distance doesn’t matter much.
This can be mathematically described by the below formula where the relative difference between the farthest point and the nearest point converges to 0 for increasing dimensionality d:
This means the discrimination between the nearest and the farthest neighbour becomes rather poor in high dimensional space.
This forms as the basis for the algorithm that we are going to discuss called ABOD which stands for Angle Based Outlier Detection, this algorithm finds potential outliers by considering the variances of the angles between the data points. It still takes distances into account, but only as a secondary measure to normalize the results.
The general idea is pretty simple, let us consider a simple data set with these values.
Let’s plot the points
We can see that the “11th” point with index 10 is clearly an outlier.
Now let us visualize the angles for a datapoint which is an “inlier” and observe the variance of the angles between the inlier point and all the remaining points:
Now, Let’s visualize angles another point:
This even though is not at the centre like the previous one but is at the border of where most points are lying. The angles for this point too is spread in all directions but the variance is not as large as the previous one.
Lastly coming to an outlier’s angles:
We can clearly notice that the angles are much narrower as all of them pointing in the same direction. Hence here the variance of the angles is much smaller.
Hence, the spectrum of angles to pairs of points remains rather small for an outlier whereas the variance of angles is higher for border points of a cluster and very high for inner points in a cluster of points.
When the spectrum of angles from a point to all other points is broad that means it is surrounded by points in all possible directions, so it is inside a cluster of points
When the spectrum of angles from a point to all other points is small that means other points are positioned in certain directions only.
This means that the point is positioned outside of some sets of points that are grouped together. Thus it is an outlier.
How is it measured quantitatively?
ABOD uses ABOF(Angle Based Outlier Factor) to determine the outlierness of a point.
Let’s assume 3 points (A, B, C ) in the database D.
The angle-based outlier factor of a point A i.e. ABOF(A) is the variance of the angles between the point A to all other points in the dataset D
In the formula we can see that we not only calculate the angle between the vectors in the numerator but also that they are normalized by the quadratic product of the length of the difference vectors in the denominator, i.e. the angle is weighted less if the corresponding points are far from the query point.
With this weighting factor in the denominator, the distance does influence the ABOF value, but only to a minor extent. This weighting factor is important because the angle to a pair of points varies naturally stronger for bigger distances. If we compute the variance of this value over all pairs of points from the query point A then we get angle-based outlier factor i.e. ABOF of A.
As the main criterion is the variance of angles and distance only acts as the weight, ABOD is able to detect outliers even in high dimensional data where other purely distance-based approaches fail.
The python implementation is pretty straightforward, most of the outlier detection algorithms are available in this neat package called pyod.
Scale, Fit and Predict to find the outliers
We can notice that the 11th point is the only one detected as an outlier by ABOD which matches our intuition.