Outlier Identification in Continuous Data Streams With Z-Score and Modified Z-Score in a Moving Window
How do you spot outliers in continuous data streams? Learn to catch network performance hiccups effortlessly using z-score and modified z-score.
My name is Maksim Kupriianov, and for the past few years, I have been actively involved in network monitoring. This means sending network probes from different locations within an organization’s network, analyzing the responses with regard to packet loss percentage and response times, and identifying places in the network where something has gone wrong. The probes are sent and the results are analyzed continuously, with new data coming in every second. The analytics are conducted on a sliding window of several tens of seconds.
During this process, local anomalies often arise. For instance, a server in one of the racks might start showing increased packet loss due to the server's high CPU utilization. On the one hand, this is already a problem that should be highlighted, but on the other hand, it is not caused by any network equipment failure and doesn’t affect network performance in general. A similar situation occurs with the response times: for various reasons, responses could be delayed even by seconds, whereas typically, the round-trip time of the requests does not exceed a few milliseconds.
How can we automatically detect such isolated anomalies and exclude them from the overall analysis? Many methods exist for this purpose, including Kalman filters, clustering, and machine learning algorithms. However, in this article, I want to focus on one of the methods of standardized evaluation: z-score filtering. The reason is that it is very simple to understand and implement, cheap in terms of additional CPU and memory load, and quite effective.
Z-Score
A z-score, also known as a standard score, is a metric that quantifies how far an observation deviates from the mean of the population. Essentially, it indicates the number of standard deviations below or above the mean that an observation falls. It is calculated for each observation using the formula:
$$z = \frac{x_i - \mu}{\sigma}$$
Where:
- $x_i$ is the individual data point
- $\mu$ is the mean (average) of the dataset
- $\sigma$ is the standard deviation of the dataset
Let's look at an example. Consider a set of measured network latencies between two groups of servers located in the same building:
Latency (µs): 25, 26, 26, 26, 26, 26, 27, 27, 27, 28, 28, 28, 29, 30, 30, 32, 35, 40, 52, 97
If we calculate the standard scores for these values, we get the following results:
| Latency, µs | Z-Score | Outlier? |
|-------------|---------|----------|
| 25          | -0.520  | no       |
| 26          | -0.457  | no       |
| 26          | -0.457  | no       |
| 26          | -0.457  | no       |
| 26          | -0.457  | no       |
| 26          | -0.457  | no       |
| 27          | -0.394  | no       |
| 27          | -0.394  | no       |
| 27          | -0.394  | no       |
| 28          | -0.331  | no       |
| 28          | -0.331  | no       |
| 28          | -0.331  | no       |
| 29          | -0.268  | no       |
| 30          | -0.205  | no       |
| 30          | -0.205  | no       |
| 32          | -0.079  | no       |
| 35          | 0.110   | no       |
| 40          | 0.425   | no       |
| 52          | 1.182   | no       |
| 97          | 4.018   | YES      |
In this example, a typical threshold value of 3 was selected to identify anomalies. Clearly, the method detected only the most extreme of the anomalous values (97 µs) and missed the others. This occurred because the distribution of latencies is heavily skewed to the right: a latency significantly lower than the median is physically impossible, whereas delays due to network queues or similar issues routinely produce much higher values.
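As a quick check, here is a minimal Python sketch (mine, not part of the original article) that reproduces the scores in the table above:

```python
import statistics

latencies = [25, 26, 26, 26, 26, 26, 27, 27, 27, 28,
             28, 28, 29, 30, 30, 32, 35, 40, 52, 97]

mean = statistics.mean(latencies)
# The table's scores correspond to the population standard deviation
# (dividing by N); statistics.stdev (N - 1) would give slightly smaller scores.
stdev = statistics.pstdev(latencies)

for x in latencies:
    z = (x - mean) / stdev
    print(f"{x:3d}  z = {z:+.3f}  {'OUTLIER' if abs(z) > 3 else ''}")
```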
The advantage of the z-score technique lies in its efficiency and exceptional convenience when used in sliding time windows, requiring only three running counters:
- $SumX = \sum_{i=1}^{N} x_i$, the sum of all values in the window
- $SsqX = \sum_{i=1}^{N} x_i^2$, the sum of the squares of all values in the window
- $N$, the number of values in the window
Thus, as new data arrives, we simply increment these counters, and as old data leaves the window, we decrement them accordingly. The standard score for each element $x_i$ can then be calculated using the following formula:

$$z = \frac{x_i - \frac{SumX}{N}}{\sqrt{\dfrac{SsqX - \frac{SumX^2}{N}}{N - 1}}}$$
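To illustrate, a minimal sliding-window implementation of these running counters might look like this. It is a sketch under my own naming, and for simplicity it uses a window of a fixed number of samples rather than a time span:

```python
import math
from collections import deque

class RollingZScore:
    """Maintains SumX, SsqX, and N over a fixed-size window so that
    each new z-score costs O(1) instead of a full pass over the data."""

    def __init__(self, window_size: int):
        self.window = deque(maxlen=window_size)
        self.sum_x = 0.0   # SumX, the sum of all values in the window
        self.ssq_x = 0.0   # SsqX, the sum of squares of all values

    def score(self, x: float) -> float:
        """Add x to the window and return its z-score."""
        if len(self.window) == self.window.maxlen:
            old = self.window[0]       # value about to be evicted
            self.sum_x -= old
            self.ssq_x -= old * old
        self.window.append(x)          # deque drops the oldest value itself
        self.sum_x += x
        self.ssq_x += x * x

        n = len(self.window)
        if n < 2:
            return 0.0                 # not enough data for a deviation
        mean = self.sum_x / n
        var = (self.ssq_x - self.sum_x ** 2 / n) / (n - 1)
        return (x - mean) / math.sqrt(var) if var > 0 else 0.0
```

A value would then be treated as an outlier when the absolute score exceeds the chosen threshold, for example 3.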
Modified Z-Score
But can we do better? In our case, yes, and the modified z-score method will help with this. It performs better, especially on skewed datasets. The modified score is calculated as follows:
$$z_{mod} = \frac{0.6745\,(x_i - M)}{MAD}$$
Where:
- $x_i$ is the original sample element
- $M$ is the sample median
- $MAD = \operatorname{median}(|x_i - M|)$ is the median absolute deviation of the dataset
In cases where the median absolute deviation equals zero, another formula is used:

$$z_{mod} = \frac{x_i - M}{1.253314\,\mu_{AD}}$$
Where:
- $\mu_{AD} = \frac{1}{N}\sum_{i=1}^{N} |x_i - M|$ is the mean (average) absolute deviation
Applying the modified z-score to the same dataset gives the following results:

| Latency, µs | Modified Z-Score | Outlier? |
|-------------|------------------|----------|
| 25          | -1.009           | no       |
| 26          | -0.673           | no       |
| 26          | -0.673           | no       |
| 26          | -0.673           | no       |
| 26          | -0.673           | no       |
| 26          | -0.673           | no       |
| 27          | -0.336           | no       |
| 27          | -0.336           | no       |
| 27          | -0.336           | no       |
| 28          | 0.000            | no       |
| 28          | 0.000            | no       |
| 28          | 0.000            | no       |
| 29          | 0.336            | no       |
| 30          | 0.673            | no       |
| 30          | 0.673            | no       |
| 32          | 1.346            | no       |
| 35          | 2.355            | no       |
| 40          | 4.038            | YES      |
| 52          | 8.075            | YES      |
| 97          | 23.217           | YES      |
A threshold value of 3.5 is recommended for use with the modified z-score. In our example, employing this threshold successfully identified all three anomalous results. However, in practice, it is advisable to initially use higher threshold values, such as 5.
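For reference, a direct (batch) computation of the modified z-score, including the zero-MAD fallback described above, might look like this; the function name and return shape are my own:

```python
import statistics

def modified_z_scores(values, threshold=3.5):
    """Return (value, score, is_outlier) triples using the modified z-score."""
    m = statistics.median(values)
    abs_dev = [abs(x - m) for x in values]
    mad = statistics.median(abs_dev)

    if mad > 0:
        scores = [0.6745 * (x - m) / mad for x in values]
    else:
        # MAD is zero when more than half of the values equal the median;
        # fall back to the mean absolute deviation.
        mean_ad = statistics.mean(abs_dev)
        scores = [(x - m) / (1.253314 * mean_ad) if mean_ad > 0 else 0.0
                  for x in values]

    return [(x, s, abs(s) > threshold) for x, s in zip(values, scores)]
```

Running it on the latencies above flags 40, 52, and 97, in line with the table.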
One significant drawback of the modified z-score algorithm is its requirement to recalculate the median (and median absolute deviation) for each incoming value. This necessitates maintaining all values in the moving window in sorted order and performing calculations anew each time, which can be challenging and resource-intensive for large datasets.
This challenge can be mitigated by employing a combination of both moving and sliding windows simultaneously. To distinguish between the two:
- Moving Time Window: This spans a fixed duration, such as ten seconds.
- Sliding Time Window: This moves continuously over the data, for example, every five seconds.
(Figure: a moving time window advancing in sliding time window steps.)
For instance, using a moving time window of ten seconds alongside a sliding window of five seconds, we recalculate the median and the median absolute deviation over the previous interval each time the sliding window advances. This enables analysis over the full moving window while keeping the reference statistics stable between recalculations.
While this method may occasionally produce false negatives, filtering out the first values of a genuine trend as outliers, these effects are typically short-lived, as subsequent sliding windows adjust accordingly.
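A sketch of this two-window scheme, using the ten-second/five-second example above (the class and its API are my own invention), could cache the median and MAD and refresh them only when the sliding window advances:

```python
import statistics
import time
from collections import deque

class WindowedModifiedZScore:
    """Scores samples against median/MAD statistics that are refreshed
    once per slide interval instead of on every incoming value."""

    def __init__(self, window_seconds=10.0, slide_seconds=5.0):
        self.window_seconds = window_seconds
        self.slide_seconds = slide_seconds
        self.samples = deque()      # (timestamp, value) pairs
        self.median = None
        self.mad = None
        self.last_refresh = 0.0

    def _refresh(self, now):
        # Drop samples that have fallen out of the moving window.
        while self.samples and now - self.samples[0][0] > self.window_seconds:
            self.samples.popleft()
        values = [v for _, v in self.samples]
        if values:
            self.median = statistics.median(values)
            self.mad = statistics.median(abs(v - self.median) for v in values)
        self.last_refresh = now

    def score(self, value, now=None):
        """Add a sample and return its modified z-score against the
        statistics of the previous interval."""
        now = time.monotonic() if now is None else now
        self.samples.append((now, value))
        if self.median is None or now - self.last_refresh >= self.slide_seconds:
            self._refresh(now)
        if not self.mad:   # no statistics yet, or MAD == 0; this sketch
            return 0.0     # skips the mean-AD fallback for brevity
        return 0.6745 * (value - self.median) / self.mad
```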
Conclusion
The z-score method of standardized evaluation performs well at filtering occasional anomalous outliers out of a continuous stream of measurements. It remains computationally cheap even when applied within a sliding window framework. However, its effectiveness may be compromised on datasets that exhibit significant skewness or distributions departing from normal.
In such instances, the modified z-score method proves more suitable due to its capability to handle skewed data distributions effectively. Nevertheless, applying this method to continuous data streams presents challenges, as recalculating the median absolute deviation with each new sample necessitates accessing and potentially sorting all values. Despite these challenges, recalculating at smaller intervals may be feasible in scenarios where the sample median remains stable.