Transcript:

Hi, everyone, thanks for coming. What I'm going to talk about today is two algorithms that we actually use in production to do time series forecasting. The first one is forecasting with the fast Fourier transformation, and that sounds like a mouthful, but what I want to convince you of is this: if you've just started in this business, if you don't know much about forecasting, this is a really good algorithm to start with. In fact, when I started my forecasting project, I knew very little about forecasting, maybe nothing at all. When my boss asked me to do this kind of forecasting, I was panicking, so I had to find, within two weeks, something simple, something easy to understand, and something easy to implement, and this is what we chose as the first iteration of our forecasting pipeline. So let's dive in.

The key idea is this: we have time series data. Let's look at a time series as a single-variable function: time is your variable, and there are the values. The key question is, how do we decompose it? If we can decompose a time series into something simple, then we can forecast easily. Before I go into details, here is, I promise, the only equation I have, which is a sine function. Just as a refresher: a sine function has an amplitude, it has a phase, and it has a frequency, or period. That's it. So why does this matter? Because a great person, Fourier, proved this theorem: a reasonably continuous and periodic function can be expressed as a sum of a number of sine functions. That's the key to the algorithm we use. Let's look at an example. Looking at this time series data, where the x-axis is time and the y-axis is the value, can you predict what's going to happen?
It looks like a pretty irregular time series, right? But is it? If you decompose it into a series of sine functions, it turns out to be regular. This is the first component, the one with the largest amplitude and smallest period. Then you can decompose into the second one, the third one, the fourth one, and when you combine all of them, just by summation, you get the original time series back. Now, every single sine function is very regular, very periodic, so you can apply forecasting to it, which is quite trivial, high school math, and then when you combine your forecasts, you get your forecasting result. That's the idea behind it.

I'll give you the reverse example. Here I have two sine functions that I take from the previous example. When I sum them up, I get an approximation of the original series, and you can see that it is already quite similar to the original time series, just with a much smoother curve. If you apply more and more sine functions, you get a more and more accurate approximation. Okay, with that example in mind, I can claim that FFT is actually simple. The algorithm is: you run the FFT decomposition on your input data, and then you filter out low-amplitude or high-frequency components. If a component has a very high frequency and a very low amplitude, it's most likely noise, because it happens very frequently and irregularly. Once you have this decomposition, you have a bunch of sine functions. You pick the first few that are most significant, and you apply forecasting to them; basically, you move the phase forward to get your forecast, recombine the components, and you get your result. As simple as that. Here's the final example: the black curve is the original data, the blue curve is the smoothed and forecast result, and you can see that they are very close to each other.
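The steps just described can be sketched in a few lines of code. This is only a minimal illustration of the idea, not our production pipeline; the function name and the choice of keeping the top few components are assumptions for the example, and the Nyquist edge case is ignored for simplicity.

```python
import numpy as np

def fft_forecast(series, horizon, n_components=3):
    """Decompose a periodic series with the FFT, keep only the
    strongest sine components, and extrapolate them forward."""
    n = len(series)
    coeffs = np.fft.rfft(series)       # complex spectrum
    freqs = np.fft.rfftfreq(n)         # cycles per sample
    # Keep only the n_components largest-amplitude terms;
    # everything else is treated as noise and dropped.
    keep = np.argsort(np.abs(coeffs))[-n_components:]
    t = np.arange(n + horizon)         # past plus future timestamps
    result = np.zeros(n + horizon)
    for k in keep:
        amp = np.abs(coeffs[k]) / n
        phase = np.angle(coeffs[k])
        if k == 0:
            result += amp * np.cos(phase)  # DC term: the mean
        else:
            # Evaluating the sinusoid at future t *is* the
            # "move the phase forward" step from the talk.
            result += 2 * amp * np.cos(2 * np.pi * freqs[k] * t + phase)
    return result[n:]                  # forecast horizon only
```

For a purely periodic input, such as a cosine that completes full cycles within the window, the extrapolation simply continues the wave.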
This works as long as the original data has a certain periodicity. But what if there's an outage? We live in an imperfect world; when there's an outage, our metric has this huge drop, as in the highlighted box, and all of a sudden our forecasting will be off. If you look at these boxes, you can see that around the peaks of the next few tops there is a big chunk of error. But there's a way to fix that automatically. The solution is to iteratively apply differences to the input: we adjust the input based on the output of our algorithm. Let me illustrate. If you look at these two boxes, this is where there is a huge difference, and at the very bottom there is a red line; that's the error. You can see that whenever there is a huge difference between your forecasting result and the original curve, the error spikes. So what if you subtract this spike, the error, from your original input and then reapply your algorithm? What you get is closer and closer results; your results become more and more accurate over time, until you hit a certain threshold and you can stop. When you combine these two ideas, the decomposition and this iterative approximation, we get our first production algorithm to predict really simple time series.

So where should we use FFT? When there is periodicity. On the right-hand side, we divide our cities into large areas. Each area has a large enough number of trips, a large enough number of riders and drivers, supply and demand, and so forth. When we can aggregate data to a certain quantity, we get the right-hand side, which is a periodic function, and we can apply this algorithm to achieve a certain level of accuracy, and it does a great job. There are two advantages to this algorithm. One is that it's really simple to implement: the fast Fourier transformation is a standard algorithm.
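The iterative error-subtraction loop just described can be sketched as follows. This is a hypothetical illustration: the function names are made up, and a simple moving average stands in for the real smoother (the talk uses the FFT pipeline).

```python
import numpy as np

def iterative_clean(series, smooth, threshold=1.0, max_iter=10):
    """Repeatedly smooth, measure the error spike, subtract it from
    the input, and refit, until no residual exceeds the threshold."""
    cleaned = series.astype(float).copy()
    for _ in range(max_iter):
        fitted = smooth(cleaned)
        residual = cleaned - fitted
        spikes = np.abs(residual) > threshold
        if not spikes.any():
            break                             # converged: stop iterating
        cleaned[spikes] -= residual[spikes]   # remove the spike, keep the rest
    return cleaned

def moving_average(x, window=5):
    """Stand-in smoother for the example (assumed, not the real algorithm)."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")
```

Fed a flat series with one outage-like spike, the loop pulls the spike down toward the smoothed curve within a couple of iterations, which is exactly the "closer and closer" behavior described above.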
It's been studied intensively by so many people for so many years that you can pretty much get prepackaged libraries for this kind of algorithm in any language, and you can just use them. It's also really fast to run, and it's parallelizable. As in our previous case, we divide our cities into multiple regions, and when we multiply that by six-hundred-plus cities, we get thousands and thousands of time series to forecast. But each time series is independent of the others, so we can easily run in a distributed, parallelized environment and have each core handle only a subset of the time series without interfering with the others.

I also want to emphasize that decomposition is really powerful. I'll give you one more example, not of FFT but of a different decomposition; it's called STL. Again, this is the original series, which I take from a reference that you can check out later. You can decompose it into the periodic part, the trend, and the noise. Once you have this, you can predict with the trend, you can predict with the periodic functions, and then you can either decide to ignore the noise or simulate noise for the future, and recombine the results to get your forecast. Actually, that's the idea behind the non-seasonal ARIMA methods. ARIMA stands for autoregressive integrated moving average; it's a very popular traditional time series forecasting algorithm, and it's also based on this essential idea of decomposition.

But we decided to move on, so there's got to be a bottleneck, right? The real bottleneck is that it's really not easy to incorporate new signals. For example, what if I want to incorporate the weather? What if there is a huge game in the city that will jack up the supply and the demand? Those kinds of things.
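The trend/seasonal/noise split can be illustrated with a simplified stand-in for STL. Real STL uses iterated loess smoothing; this sketch uses a moving-average trend and per-phase means, which is closer to classical decomposition, and its edge handling is deliberately rough. The function name is illustrative.

```python
import numpy as np

def decompose(series, period):
    """Split a series into trend + seasonal + noise (STL-like sketch)."""
    n = len(series)
    # Trend: moving average over one full period (edges are approximate).
    kernel = np.ones(period) / period
    trend = np.convolve(series, kernel, mode="same")
    detrended = series - trend
    # Seasonal: average the detrended values at each phase of the cycle,
    # then tile that one-cycle pattern across the whole series.
    pattern = np.array([detrended[p::period].mean() for p in range(period)])
    seasonal = np.tile(pattern, n // period + 1)[:n]
    # Noise: whatever the trend and seasonal parts don't explain.
    noise = series - trend - seasonal
    return trend, seasonal, noise
```

By construction the three components sum back to the original series, which is what lets you forecast the trend and the periodic part separately and then recombine them.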
What if I want to incorporate patterns that are not purely periodic but still regular, patterns that I want to decode and feed in? In that case, we need more powerful methods, which brings us to the next algorithm we tried: forecasting with deep learning. Now, deep learning is a huge topic, so I'm only going to talk about the intuition. Let's look at an example. Here is some time series data, some kind of time series for a number of queries. It looks regular enough, but how do we deal with it? First of all, even though it looks continuous, it's not. All the time series data we collect is discrete by nature; it looks continuous only because we perform interpolation. If we don't do any interpolation, it's just a number of discrete data points. Which brings us to the key idea: time series are actually sequences. Why does that matter? Because once we discretize a time series into a sequence, we can apply a very powerful technique called sequence-to-sequence. It was first published back in 2014 by Google to solve machine translation problems, but it turns out the sequence-to-sequence technique is very good at modeling time series forecasting as well.

I'll start with an example. Let's say we discretize time into steps, with the time axis on the bottom, and we have a time series like this. If you look at this picture, it's the unfolded structure of a recurrent neural network. By unfolded, I mean that for every single input, input 1, input 2, and so forth, each one is a data point for a given timestamp, and each one is processed by a neural cell that generates some kind of hidden state. With this, we can perform a forecast per input, and then the forecast itself can be an input to the next forecast.
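The unrolled recurrence can be sketched in a few lines of numpy. The weights here are untrained placeholders, so this shows only the data flow through the cells, not a useful forecaster.

```python
import numpy as np

def unroll(inputs, W_x, W_h, W_out, h0):
    """Unrolled RNN: each input updates the hidden state, and a readout
    turns each hidden state into a one-step forecast."""
    h = h0
    forecasts = []
    for x in inputs:
        # New hidden state from the current input and the previous state.
        h = np.tanh(W_x @ np.atleast_1d(x) + W_h @ h)
        # Per-step forecast read out from the hidden state.
        forecasts.append(W_out @ h)
    return np.array(forecasts), h
```

Running it with random weights confirms the shapes: one forecast per time step, and a single hidden-state vector carried forward, which is what makes feeding a forecast back in as the next input possible.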
So now we can combine some kind of history, some kind of sequence, with this neural network to forecast something. But we're not done yet. The magic of this approach, its real power, is that not only can we input the time series data, we can also encode a lot of different signals. In this particular case, the time of week: if we have a certain periodicity, most likely the same time of the same day of the week will give similar results, right? So what if we encode this signal into our forecasting pipeline? The same goes for weather data, which contains temperature, humidity, precipitation, wind, weather type. Some of those fields may be relevant to our forecast, some may not, but the key is that we want to feed in everything, so that we can train our model to assign weights that signify the most impactful factors in the weather. The beautiful part is that the weather itself is just a vector, which is naturally an input to our neural machine.

Then there is another key piece: what about recent context? Usually my data does not consist of independent data points; the data points depend on each other, especially along the time axis. That's why, if you read the literature, there is a key concept called autoregression, meaning that the value at the current time may depend on the values at past times, and you can extend the past however far back you like. So how can we encode this kind of information into our algorithm? Well, some very smart people have already come up with a really elegant solution. Essentially there are two components, and they're both neural networks. The first piece, as highlighted by the box, is used to encode just the past: we go back M time units, where M can be any unit, for example M months. In our particular case we use hours, so it could be M hours.
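One way to sketch building the per-step input vector is below. The cyclic sin/cos encoding of the hour of week and the specific weather fields are assumptions for illustration, not the talk's actual feature set.

```python
import numpy as np

def encode_step(value, hour_of_week, weather):
    """Build one model input vector: the observed value plus encoded
    time-of-week and weather signals."""
    # Cyclic encoding keeps hour 167 adjacent to hour 0 (168 hours/week).
    angle = 2 * np.pi * hour_of_week / 168
    time_feats = np.array([np.sin(angle), np.cos(angle)])
    # weather is already a vector, e.g. [temperature, humidity, precipitation],
    # so it can be concatenated directly.
    return np.concatenate(([value], time_feats, weather))
```

The model then learns its own weights over this vector, which is how the most impactful weather factors get emphasized without us hand-picking them.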
If it's a week, then it's 168 hours, times however many weeks you want. So we can use the historical data, the most recent part, train this particular component to get certain weights, and then use its result as the input to the original neural network that we discussed. When we combine these, we get a much more accurate result. This is the idea called the encoder-decoder architecture. Just one minute left, so that's it. The summary, if you have to have two takeaways: one, decomposition is a really powerful tool in time series forecasting, use it; the other, time series forecasting can be modeled as a sequence-to-sequence problem. Thank you! [APPLAUSE]