Hey guys, welcome back. In this video we'll be talking about a pretty interesting topic in time series analysis called anomaly detection. As you probably know, an anomaly is something that's out of the norm or unexpected, and with a time series we'll see what that means visually. We'll see how to first detect it and then how to account for it, so that we're not getting really bad results in our predictions. We'll be looking at the same catfish sales data we looked at in some previous videos. The data and this notebook will both be linked in the description, so I won't go through all the stuff we went over before. This is how the data looks. We've taken a snapshot of the data from January 1996 until January 2000, so we have four years of data. We see there's a clear seasonal pattern between these gray lines, which separate the years, and we should be able to exploit that seasonality. We're going to build a SARIMA model (a seasonal autoregressive integrated moving average model) to predict the last six months given the first three and a half years. The code coming up here is basically just the prediction, which is all covered in other videos, but suffice to say, when we use a rolling forecast origin to predict our six months, this is what our prediction looks like. The orange line is the prediction and the blue line is the data, so we see it's not bad; pretty good, actually. The average percent error is three percent, and the root mean squared error is around 800, so I'd say this is a success. Now here's where the trouble starts. I'm going to artificially introduce an anomaly: I'm going to set the December 1st, 1998 data point to a value we would not expect. So I do that first.
Let's look at how the data looks now. It's very obvious where the anomaly is: right here at December 1st, 1998. We would expect this value to be maybe around the December value here and the December value here, perhaps a little higher, because we see the series going up over time, but we would not expect it to be this low. And as weird as this anomaly looks, it's not something that's impossible in the real world. As much as you expect something to be seasonal at all times, sometimes a big event happens in the world that just breaks that seasonal pattern. In fact, we're in the middle of one of those big events right now with the coronavirus crisis. I would imagine many, many time series that people are looking at have been disrupted by the current crisis, by crises that came before, and by crises that will come after. So it's up to us as statisticians or economists to figure out, first, how to detect that there's an anomaly here, and second, how to correct for it so our predictions aren't skewed. The first thing I want to show, though, is what happens if we fail to detect the anomaly. What if I just run the same code below and assume that everything is going to be fine? Then our predictions look like this: they've been highly skewed. The first couple of months are actually okay, but as soon as we get to the December time point, it's been severely under-predicted. That makes sense, because the model is using the December data from this point, this point, and this point, so it thinks it should predict something really, really low. But in reality that was just an anomaly in that one year, and after predicting that really low value, the model overcorrects and predicts the next month really high. So the anomaly propagates the problem forward, and that's not something we want. First we need to detect the anomaly: before we can remove it, we need to know it's there.
We'll look at two attempts, one of which I think is stronger than the other. The first attempt, which is somewhat naive, is the deviation method. I can explain this pretty easily. Here's the time series again. Let's say we start looking at the beginning of the data and use an ever-expanding window: first we look at just the first data point, then the first two, then the first three, and so on, and for each window size we calculate the standard deviation of all the numbers in that window. So if our window is currently the whole first year, we calculate the standard deviation of those twelve data points; if our window is the first two years, we calculate the standard deviation of those first 24 data points, and so on. The reason we're doing this is that when the window reaches the anomaly, the standard deviation is going to jump, because we've found a value that's far from the norm. So if we plot the standard deviation over time, we should be able to use it as a simple anomaly detection method. That's what I've done down here: this is the standard deviation over time. One thing to note is that we have to ignore the first couple of points, because the standard deviation of just two or three numbers is unstable; there simply aren't many numbers yet. So although the highest point on our graph is actually the first one, it's not high because of an anomaly, it's high because there's not a lot of data. That's one of the drawbacks of this method: it's difficult to use if the anomaly occurs very early on. But if we burn off the first few data points, say we start looking from this gray bar at the second year, then the highest point is clearly December 1st, 1998, which is exactly where the anomaly is.
Again, the intuition is that when we reach the anomaly, the standard deviation of the series jumps because we've found a value that is very out of the ordinary. Now, as intuitive as this method is, it obviously has drawbacks. One is that the first couple of data points can't really be tested for anomalies with this method. There are other drawbacks too, because anomalies aren't necessarily just very extreme values; sometimes they're unexpected values for other reasons. So let's look at a slightly more robust method: the seasonal method. This exploits the fact that we know there's seasonality in the data, and it's also fairly intuitive to explain. The chart I'm about to show you gives the standard deviation when I look only at the data points within a given month. If I come back up to our main plot, say I look at just January: I'm looking at this data point, this data point, this data point, and this data point, so I'm taking the four January observations I have and calculating the standard deviation between them. I would expect this deviation to be rather low, because if there's a seasonal pattern, Januaries should be pretty correlated with each other, Februaries should be pretty correlated with each other, and every month should be similar to the same month in other years. If I plot that deviation by month, we see that most of the months have a fairly ordinary deviation, but when we get to December, the deviation jumps way up. That makes sense, because we're taking the standard deviation of this December, this December, this December, and this December, and there's a lot of variability due to the anomaly.
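The per-month deviation described above could be sketched as follows, again on a synthetic stand-in series with an assumed anomaly value; the names here are illustrative, not the video's code.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with trend and seasonality, anomaly at Dec 1998.
idx = pd.date_range("1996-01-01", "1999-12-01", freq="MS")
n = np.arange(len(idx))
series = pd.Series(20000 + 50 * n + 3000 * np.sin(2 * np.pi * n / 12), index=idx)
series[pd.Timestamp("1998-12-01")] = 5000  # injected anomaly

# Standard deviation of the observations within each calendar month:
# all Januaries together, all Februaries together, and so on.
std_by_month = series.groupby(series.index.month).std()

# With a clean seasonal series each month's std stays small; the anomaly
# makes December (month 12) stand out.
suspect_month = std_by_month.idxmax()
```

Because the seasonal pattern makes same-month values similar across years, a large within-month deviation is a strong hint that one of that month's observations is off.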
Based on this plot, we know that if we're looking for an anomaly, we should look in a December. Stage two of this method is to figure out which December the anomaly occurs in. What I've done here is subset the data to look at just the four Decembers. Obviously the anomaly is this very, very low number over here, but finding it in code is also simple: I eliminate one observation at a time and calculate the standard deviation of the remaining ones. At the step where I've eliminated this observation and calculated the standard deviation of the other three, which are the usual Decembers, I get a very small standard deviation. So I find the case with the smallest standard deviation, and the number I've left out in that case is the anomaly. In that way we identify that the anomaly occurs at December 1st, 1998. That's our more robust anomaly detection method. Now, before we go on to what to do about the anomaly, I should note that there are many, many even more robust anomaly detection methods, but I wanted to expose you to a couple of options you have. If you're interested, let me know in the comments and I'll make more videos on robust anomaly detection. To close out this video, let's see what to do about the anomaly. At this point we've figured out that the anomaly is at December 1st, 1998. Here's the simple idea we'll use to close this video: we're just going to use the mean of the other Decembers. Looking again at the graph, I've figured out the anomaly is here, so I'm going to use the average of the other Decembers: this one and this one. I'm not using this last December, because it's part of my testing set and it wouldn't really be fair to look at it. But let's say I was looking at, say, ten years of data.
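The leave-one-out step could be sketched like this, continuing with the same synthetic stand-in series and assumed anomaly value rather than the video's exact data.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with trend and seasonality, anomaly at Dec 1998.
idx = pd.date_range("1996-01-01", "1999-12-01", freq="MS")
n = np.arange(len(idx))
series = pd.Series(20000 + 50 * n + 3000 * np.sin(2 * np.pi * n / 12), index=idx)
series[pd.Timestamp("1998-12-01")] = 5000  # injected anomaly

# Subset to just the Decembers, since the per-month deviation flagged December.
decembers = series[series.index.month == 12]

# Leave each December out in turn: dropping the anomaly leaves only the
# "usual" Decembers, so the remaining std is smallest in that case.
loo_std = {left_out: decembers.drop(left_out).std() for left_out in decembers.index}
anomaly_date = min(loo_std, key=loo_std.get)
```

The observation whose removal minimizes the remaining spread is the one that doesn't belong with the others.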
Then I would take the average of all the December values in those ten years and use it as the new December value for this anomaly. The intuition is that this is a more usual value for a December than the very strange value we're seeing. If I do that, this is what my data looks like. The blue is the adjusted data series, where the only difference is that the December value that was all the way down here before has now been adjusted to be up here, which makes the whole series look a little more usual than before. Now if we run the predictions on the adjusted data, we get this prediction instead. What I've drawn here in blue is the actual series, including the anomaly, and the orange is our predictions. These predictions look a lot better than if we had done nothing to correct the anomaly. Just to look at some numbers: before, we had seven percent error in the predictions, and after doing our anomaly detection and correction we're down to about 4.5 percent, so this approach got us better predictions. That's it, guys. This was our intro video on anomaly detection: why you should care about it, how to detect anomalies, and how to correct for them. If you have any questions at all, please leave them in the comments below, and I'll see you next time.
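The correction step could be sketched as follows, on the same synthetic stand-in series; the test-set cutoff date and all names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with trend and seasonality, anomaly at Dec 1998.
idx = pd.date_range("1996-01-01", "1999-12-01", freq="MS")
n = np.arange(len(idx))
series = pd.Series(20000 + 50 * n + 3000 * np.sin(2 * np.pi * n / 12), index=idx)
series[pd.Timestamp("1998-12-01")] = 5000  # injected anomaly

anomaly_date = pd.Timestamp("1998-12-01")
test_start = pd.Timestamp("1999-07-01")  # last six months held out for testing

# Replace the anomaly with the mean of the other Decembers, using only
# Decembers in the training portion so the test set stays untouched.
train_decembers = series[(series.index.month == 12)
                         & (series.index < test_start)
                         & (series.index != anomaly_date)]
adjusted = series.copy()
adjusted[anomaly_date] = train_decembers.mean()
```

After this replacement, the model is refit and the rolling forecast is rerun on `adjusted` instead of the raw series.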