Anomalies, a.k.a. outliers, are patterns in data that do not conform to the data points near them. Statistically, they are data points that are deemed to belong to a different population. Timely detection of anomalies is very important because anomalies in data translate to significant and actionable information in a wide variety of application domains. Anomaly detection can be very useful in flight safety, intrusion detection, fraud detection, healthcare, detection of stock market manipulation, event detection systems in sensor data, etc. From the business perspective, outliers can be detected by analysing unexpected spikes, drops, trend changes and level shifts.
Time series data is one of the most important kinds of data in today’s world. Computer network traffic, healthcare reports such as ECG, flight safety data, the sales data of a company, the economic growth of a country, banking transactions and stock prices are some important examples of time series data. Typically, time series data is made up of seasonality, trend and random components. Of these, seasonality and trend are comparatively easy to forecast, but it is the random component that makes time series data a tough nut to crack. These random points are the major cause of outliers.
Normal time series do have low and high points. Not all of these highs and lows are outliers; it is the underlying nature of a point that raises a red flag. In the stock market, a stock price hitting its high or low because of insider trading or a fraudulent transaction makes that point an outlier/anomaly. Sometimes the gap between two highs or two lows can raise a red flag. This type of outlier is called an anomalous sub-sequence. For example, the sales for a particular store are always low in the JFM (January to March) quarter, but in one particular year the low was hit in December and lasted till June. This extended period of lows makes it an outlier. In another case, a data point can be anomalous in a specific context but not otherwise, e.g. a stock’s average price over the past 52 weeks is Rs. 20, which is normal, but the same stock was traded at Rs. 200 before this period. These types of data points are called contextual outliers.
Both supervised and unsupervised models can be used to detect anomalies or outliers. In supervised learning, you must label your data with the known cases of outliers and then train your model using machine learning or deep learning algorithms like SVM, LSTM, etc. to predict the outliers in unseen data. In unsupervised learning, you look for patterns in your data that behave differently from the rest of the data or from the data points close to them. In this blog, I will discuss Median Absolute Deviation (MAD) and clustering techniques, especially DBScan, which can be used to detect outliers.
For supervised learning we need labelled data, i.e. we need a target column with each data point labelled as an ‘outlier’ or an ‘inlier’. Such labelled data is very rare because (a) labelling data is very costly and requires investigation by auditors, and (b) the number of positive cases (outliers) constitutes a very tiny percentage of the total number of samples, i.e. it is an imbalanced class problem. In an outlier detection problem you can handle the lack of labelled data by selecting data that is known to be free of manipulation, injecting artificial outlier points at random intervals and labelling them accordingly. Once you have created this data with artificially injected outliers, it can be used by any machine learning or deep learning algorithm to train a model, and the trained model can then be used to predict on unseen test data points of a similar time series. Models like SVM, RNN and LSTM are said to work better on this type of data [1].
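The injection step described above could be sketched roughly as follows. This is only an illustration under my own assumptions: the function name `inject_outliers`, the `scale` factor and the choice to shift points by a multiple of the series range are all hypothetical, and the sample prices are made up.

```python
import random

def inject_outliers(series, n_outliers=3, scale=5.0, seed=0):
    """Return a copy of `series` with artificial outliers and 0/1 labels.

    Hypothetical helper: each chosen point is shifted up or down by
    `scale` times the range of the clean series and labelled 1.
    """
    rng = random.Random(seed)          # seeded so the result is repeatable
    values = list(series)
    labels = [0] * len(values)         # 0 = inlier, 1 = injected outlier
    spread = max(values) - min(values)
    for idx in rng.sample(range(len(values)), n_outliers):
        direction = rng.choice([-1, 1])
        values[idx] += direction * scale * spread
        labels[idx] = 1
    return values, labels

# Made-up "clean" prices, assumed to be free of manipulation
original = [100.0, 101.5, 99.8, 100.2, 102.1, 100.9, 99.5, 101.1]
noisy, labels = inject_outliers(original, n_outliers=2)
print(list(zip(noisy, labels)))
```

The resulting `labels` list can serve as the target column when training a supervised model on `noisy`.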
Unsupervised learning is free from the labelling problem, i.e. models of this type do not need labels. These models try to learn patterns in your data and bucket the data points into clusters based on some similarity or distance metric. For the problem of outlier detection, you can use both statistical and machine learning models to separate the points into inliers and outliers. In this blog, I will explain two widely used methods for outlier detection, Median Absolute Deviation (MAD) and DBScan.
MAD (Median Absolute Deviation)
In statistics, the median absolute deviation is a robust measure of the variability of a univariate sample of quantitative data.
For a univariate data set X1, X2, …, Xn, the MAD is defined as the median of the absolute deviations from the data’s median:
MAD = median(|Xi − median(X)|)
Example: if you have the data [1, 3, 5, 7, 20], the median is 5 and the absolute deviations from the median are [4, 2, 0, 2, 15], so MAD = median([4, 2, 0, 2, 15]) = 2.
Now you can select a tolerance level, say 3: if a point’s absolute deviation from the median is more than 3 times the MAD, you can classify that point as an outlier. In this example the cut-off is 3 × 2 = 6, and the point 20 has a deviation of 15, so 20 is an outlier in the original dataset.
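The worked example above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the function name `mad_outliers` is my own, and it assumes the tolerance is applied as a multiple of the MAD.

```python
from statistics import median

def mad_outliers(values, threshold=3.0):
    """Return (median, MAD, outliers) for a list of numbers.

    A point is treated as an outlier when its absolute deviation
    from the median exceeds `threshold` times the MAD.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    outliers = [v for v, d in zip(values, abs_dev) if d > threshold * mad]
    return med, mad, outliers

data = [1, 3, 5, 7, 20]
med, mad, outliers = mad_outliers(data)
print(med, mad, outliers)  # 5 2 [20]
```

Note that if more than half of the values are identical the MAD is 0, and this rule would then flag every point that differs from the median, so such data needs separate handling.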
In this figure, I used the daily closing price of a scrip traded on NSE, which is alleged to have some outliers, from 1st Jan 2018 to 31st Jan 2020. I calculated the MAD based on the above formula with a sliding window of 5 and plotted it on a timeline. The red dots are the outliers detected in the stock price, the blue dots are the inliers and the shaded area is the tolerance level of ±3 from the MAD. Out of 503 data points, 54 points were detected as outliers.
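A sliding-window version of the MAD check could look like the sketch below. The exact windowing used for the figure is not stated, so this is one possible reading: each price is compared against the median and MAD of the 5 prices before it, and the tolerance is applied as a multiple of the MAD. The function name and the sample prices are made up for illustration.

```python
from statistics import median

def rolling_mad_flags(prices, window=5, threshold=3.0):
    """Flag prices that deviate strongly from the trailing window.

    Assumption: the window holds the `window` prices *before* the
    current one; the first `window` prices are never flagged.
    """
    flags = [False] * len(prices)
    for t in range(window, len(prices)):
        recent = prices[t - window:t]
        med = median(recent)
        mad = median([abs(p - med) for p in recent])
        # Skip windows with MAD == 0 to avoid flagging every tiny move
        if mad > 0 and abs(prices[t] - med) > threshold * mad:
            flags[t] = True
    return flags

# Made-up closing prices with one obvious spike at index 6
prices = [100, 101, 99, 100, 102, 101, 130, 100, 99, 101]
flags = rolling_mad_flags(prices)
print([i for i, f in enumerate(flags) if f])  # [6]
```

Because the median is robust, the spike at index 6 does not distort the windows that follow it, so the normal prices after the spike are not flagged.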
Clustering Based Analysis
DBScan (Density-Based Spatial Clustering of Applications with Noise)
A centroid-based clustering algorithm like KMeans is an iterative clustering algorithm in which the notion of similarity is derived from how close a data point is to the centroid of its cluster. Outliers are points that contain extreme values and do not conform to the points near them. Because centroid-based clustering algorithms are driven by distance measurements between points, such extreme values can pull the centroids towards themselves. Another drawback of these algorithms is that you have to specify the value of k, i.e. the number of clusters. If a suitable value of k is not specified, the outlier points may be clubbed together with the inlier points. Though there are various ways to find the optimum value of k, it still has to be decided by you. In density-based clustering, by contrast, the points that behave similarly are grouped together, and the outliers, which are only loosely coupled to the inliers, are thrown into different buckets. Also, the algorithm itself decides how many clusters the data should be divided into.
Here I used the same data to perform clustering with DBScan and plotted the price on the y-axis and time on the x-axis. DBScan returned three clusters, namely:
cluster 1 with 405 points,
cluster 0 with 92 points, and
cluster -1 with 6 points.
The -1 cluster clearly contains the outliers, but you can further revisit the 92 data points in cluster 0.
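To make the idea of density-based clustering concrete, here is a small pure-Python sketch of the DBScan algorithm on (time, price) points. It is not the code used for the figure; in practice you would normally use a library implementation such as scikit-learn's `DBSCAN` class. The parameter values `eps=2.5` and `min_samples=3` and the sample prices are assumptions chosen for this toy data.

```python
import math
from collections import deque

def dbscan(points, eps, min_samples):
    """Minimal DBScan: returns one label per point, -1 meaning noise."""
    n = len(points)
    labels = [None] * n                      # None = not visited yet

    def neighbours(i):
        # All points within distance eps of point i (including i itself)
        return [j for j in range(n)
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbours(i)
        if len(nb) < min_samples:
            labels[i] = -1                   # noise, may become a border point
            continue
        cluster += 1                         # i is a core point: new cluster
        labels[i] = cluster
        queue = deque(nb)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster          # former noise becomes border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbours(j)
            if len(nb_j) >= min_samples:     # j is also core: keep expanding
                queue.extend(nb_j)
    return labels

# Made-up prices; each point is (time index, price)
prices = [10.0, 10.2, 10.1, 10.3, 10.2, 25.0, 10.1, 10.4, 10.2, 10.3]
points = [(float(t), p) for t, p in enumerate(prices)]
labels = dbscan(points, eps=2.5, min_samples=3)
print(labels)  # [0, 0, 0, 0, 0, -1, 0, 0, 0, 0]
```

The price spike at time index 5 has no neighbours within `eps`, so it ends up with the label -1. Since the distance mixes time and price, real data would usually need both axes brought to a comparable scale before clustering.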
In this blog, you got to know about anomalies, a.k.a. outliers, in time series and the approaches to find them. You also saw how MAD and DBScan can be used to detect outliers. As further reading, I recommend that you find out more about anomaly detection and its implications in business.
[1] Seyed Koosha Golmohammadi, Time Series Contextual Anomaly Detection for Detecting Stock Market Manipulation, University of Alberta, 2016.
Sujit Dhanuka is a Senior Data Scientist at INSOFE, Mumbai campus. He has worked in the fields of NLP, agent-based economic simulation, level reduction techniques for multi-level categorical variables, end-to-end ML pipelines and front-end development for ML projects using Python and Flask. Before joining INSOFE, he founded a company under the name and style Dhanuka Info Systems Pvt. Ltd. to provide IT support in the North Eastern region of India.