Macrobase - Prioritising Attention in Data
Some time ago I played around with Macrobase, as I have been interested in anomaly detection for analysing monitoring metrics. It is an academic project that produced both a tool and a methodology. As is often the case, the tool itself is more of a proof of concept, useful mainly for exploring the methodology. So I will focus on the method and the problems it can address.
How do you find interesting data-points?
This obviously depends heavily on the domain, but as a gross generalisation you would like to find the unexpected. Because of my job I am personally interested in metrics from production web servers, so I will use them as an example.
It is Easier Once You Know What to Expect
A typical web service targeted at consumers follows the typical daily activity pattern of humans - not much going on at night, then some peaks around daily milestones like lunch, coming back from work etc. Like everything human, these patterns are messy, but they still carry some signal. An example of the unexpected - or an anomaly - in this case would be a burst of activity in the middle of the night. Once you understand the usage pattern you can probably find some big anomalies just by looking at the charts with your own eyes. In short, an expert who knows the system and has some basic charts can already be an anomaly detection system.
But You Don’t Have to Rely On Your Own Eyes Only
It is pretty labor intensive to look at every chart. It is not exactly scalable or cheap either. Humans can be pretty good at it, but they are still humans: inconsistent, and they get bored. Boredom comes easily, as anomalies by their very definition don’t happen often.
This is exactly what anomaly detection systems are designed to solve. Macrobase is one such system (both in the software sense and also as a methodology). On a macro-level (forgive the pun) Macrobase provides the following tools:
- finding outliers (univariate and multivariate), batch or streaming
- providing explanation for the outliers (commonalities between them)
- wrapping it all in a custom SQL operator
It employs two different metrics for detecting the outliers: MCD (Minimum Covariance Determinant) and MAD (Median Absolute Deviation). The latter is used for univariate data (a single dimension) and the former for multivariate data (more than one dimension). These two metrics are responsible for giving a score in Macrobase: if a point gets a high enough score, it is classified as an outlier. Both of these measures are considered robust, which means the score itself is not easily distorted by noisy data or by the very outliers it is trying to detect, as opposed to the better-known z-score, which relies on the mean and standard deviation.
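To make the robustness point concrete, here is a minimal sketch of MAD-based scoring next to a plain z-score. It is not Macrobase's code: the function names, the sample readings and the threshold of 3 are my own assumptions, and I leave out the 1.4826 scaling constant that is often applied to the MAD.

```python
from statistics import median


def mad_scores(values):
    """Score each value as |x - median| / MAD (unscaled MAD)."""
    med = median(values)
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        raise ValueError("MAD is zero: more than half of the values are identical")
    return [abs(x - med) / mad for x in values]


def z_scores(values):
    """Classic z-score magnitude, using the mean and population std deviation."""
    mean = sum(values) / len(values)
    std = (sum((x - mean) ** 2 for x in values) / len(values)) ** 0.5
    return [abs(x - mean) / std for x in values]


# Hypothetical requests-per-minute readings with one night-time burst.
requests = [10, 12, 11, 13, 12, 11, 10, 12, 500]

THRESHOLD = 3.0  # assumed cut-off, picked only for this illustration
robust = mad_scores(requests)
classic = z_scores(requests)

outliers = [x for x, score in zip(requests, robust) if score > THRESHOLD]
print("MAD-based outliers:", outliers)
print("z-score of the burst:", round(classic[-1], 2))
```

In this toy series the burst inflates the mean and the standard deviation so much that its own z-score stays below 3, while the MAD-based score flags it easily.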
One additional trick the Macrobase authors employ is the Adaptable Damped Reservoir (ADR), a novel contribution from their paper. It makes it possible to apply the outlier metrics to data in real time, for instance while streaming it from some source. That is convenient for detecting anomalies in production web applications, but also in many other cases.
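The idea behind the reservoir can be sketched in a few lines. This is a simplified version based on my reading of the paper, not the authors' implementation: the class name, the parameters and the decay schedule (halving the weight every 100 points) are all my own assumptions. The reservoir keeps a fixed-size sample, and decaying the running weight makes newer points more likely to replace older ones.

```python
import random


class DampedReservoir:
    """Fixed-size sample of a stream that is biased towards recent items."""

    def __init__(self, capacity, decay_rate, seed=None):
        self.capacity = capacity
        self.decay_rate = decay_rate  # fraction of weight forgotten per decay
        self.items = []
        self.weight = 0.0  # running, decayed count of observed weight
        self._rng = random.Random(seed)

    def observe(self, item, weight=1.0):
        self.weight += weight
        if len(self.items) < self.capacity:
            # Still filling up: always keep the item.
            self.items.append(item)
        elif self._rng.random() < self.capacity / self.weight:
            # Replace a random resident with probability capacity / weight.
            self.items[self._rng.randrange(self.capacity)] = item

    def decay(self):
        # Shrinking the weight raises the insertion probability for new items.
        self.weight *= 1.0 - self.decay_rate


# Hypothetical stream: the integers 0..9999, with a decay every 100 points.
reservoir = DampedReservoir(capacity=50, decay_rate=0.5, seed=42)
for i in range(10_000):
    reservoir.observe(i)
    if (i + 1) % 100 == 0:
        reservoir.decay()

print(len(reservoir.items), min(reservoir.items))
```

An outlier score such as MAD could then be computed over `reservoir.items` instead of the full history, so the notion of "normal" follows the recent data.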
The Apriori algorithm helps to make sense of the outliers MAD and MCD gave us. It can be used for finding commonalities between the outliers, for the explanation part of the job. I won’t go into detail about Apriori, but in short, it is a method to find the combinations of classes (categorical values) that are most frequently associated with a given series of data. A good example of what the output looks like comes from one of the case studies the authors presented in their paper, see the next section.
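For a feel of how Apriori works, here is a minimal textbook-style sketch that finds frequent combinations of categorical values among a handful of outliers. It is not the variant used inside Macrobase, and the attribute names, rows and support threshold are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item seen in the data.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical outlier requests, each described by categorical attributes.
outliers = [
    {"device=android", "app=v2.1", "region=eu"},
    {"device=android", "app=v2.1", "region=us"},
    {"device=android", "app=v2.1", "region=eu"},
    {"device=ios", "app=v2.0", "region=eu"},
]

frequent = apriori(outliers, min_support=0.5)
for itemset, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

The pruning step is what keeps the search manageable: a combination is only counted if all of its smaller parts were already frequent.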
Apart from the ADR, most of these algorithms and measures are not novel. What is new in Macrobase is the integration of all these techniques into a consistent methodology that has been implemented as a working tool and applied to several use cases.
In the original paper (I encourage you to read it - it is quite accessible) the authors mention several use cases. I personally had a look at the one using Italian Telecom data. It was a good demonstration of detecting outliers and finding interesting clusters among them in multi-dimensional data (the telecom data contains multiple columns of measurements of differing resolution and type, such as the number of texts or the start of a GSM call).
Have a look at this Kaggle notebook: it presents the very same data that was analysed with Macrobase. In short, the authors gathered some aggregated and anonymised data about the use of GSM in Milan over a period of a few months. There are a few regions that can be clustered, like the city centre, the university campus and the park, each with its own distinct GSM use pattern. The campus, for instance, is quieter on the weekend.
If you feed Macrobase a series of data that comes from one such cluster (let’s say the campus) and use its custom SQL operator DIFF to compare it to the rest, it will return the categorical values (labels) that are more frequently associated with the campus than with the rest. Let’s say I map the time column to a categorical variable, day/night, and then add one more, weekend/week. The Apriori algorithm employed by the DIFF operator will yield weekend and night as “explanations” for the outliers in the campus data (unusually low GSM use).
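The comparison can be sketched as follows. This is only my approximation of the idea, not the DIFF operator itself: it looks at single labels rather than combinations, ranks them by risk ratio (the relative risk of being an outlier with versus without the label), and the rows and thresholds are made up for illustration.

```python
from collections import Counter


def risk_ratio(a_out, a_in, total_out, total_in):
    """Risk of being an outlier with the label divided by the risk without it."""
    b_out = total_out - a_out  # outliers without the label
    b_in = total_in - a_in     # inliers without the label
    if b_out == 0:
        return float("inf")    # every outlier carries the label
    exposed = a_out / (a_out + a_in)
    unexposed = b_out / (b_out + b_in)
    return exposed / unexposed


def explain(outliers, inliers, min_support, min_ratio):
    """Return (label, support, risk ratio) for labels over-represented in outliers."""
    out_counts = Counter(label for row in outliers for label in row)
    in_counts = Counter(label for row in inliers for label in row)
    results = []
    for label, a_out in out_counts.items():
        support = a_out / len(outliers)
        ratio = risk_ratio(a_out, in_counts[label], len(outliers), len(inliers))
        if support >= min_support and ratio >= min_ratio:
            results.append((label, support, ratio))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Hypothetical labelled rows: outliers are periods of unusually low GSM use.
outliers = [
    ("period=night", "day=weekend"),
    ("period=night", "day=weekend"),
    ("period=night", "day=week"),
    ("period=day", "day=weekend"),
]
inliers = (
    [("period=day", "day=week")] * 5
    + [("period=night", "day=week")] * 2
    + [("period=day", "day=weekend")]
)

explanations = explain(outliers, inliers, min_support=0.5, min_ratio=2.0)
for label, support, ratio in explanations:
    print(label, support, round(ratio, 2))
```

With these made-up rows the weekend and night labels come out on top, which matches the intuition from the campus example.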
The example above is quite trivial, but I hope it gives you an idea of how the Macrobase methodology can be useful for finding interesting data points and for surfacing the commonalities the outliers share.
I find the Macrobase approach interesting and would like to see a production version of such a tool that could be readily applied to finding anomalies in real time. If I have time I might even try to implement a lightweight version of the Macrobase approach in a script / Jupyter notebook just to understand it better.