RNNs differ substantially from other common neural network architectures in the way they take in and produce data. Think, for example, of an image classification problem where you input an image and output the estimated class. Input and output are each a single fixed-size vector. RNNs, on the other hand, can take sequences of vectors as input and produce sequences of vectors as output.
This means you could, for example, input the past development of a stock price and the RNN will give you an estimate for the coming years. It detects patterns in the input sequence and learns when they will probably recur. Stock prices can of course fluctuate heavily if something unusual happens. This is where the RNN will most likely fail. RNNs are unable to learn something that has rarely or never happened before, or that does not recur at some kind of interval.
But they are really good at showing a trend, which is what we aim for. Let me show you how this works.
Time series forecasting
Here you can see the visualization of an example time series and the first 20 rows as a table. I got this dataset from here. It shows sunspot activities over the last couple of hundred years. If you want to know more about this, have a look here.
You can definitely see a trend in the data but how will it actually develop further on?
It’s all about sequences!
The aim of RNNs is to detect dependencies in sequential data. This means they intend to find correlations between different points within a sequence. There are two kinds of dependencies. Short-term dependencies describe a dependence in the recent past. Long-term dependencies on the other hand are correlations between points in time that are far away from each other. Note that there is no distinct boundary between these two so you can’t say for sure when short-term ends and long-term starts. But finding such dependencies makes it possible for RNNs to recognize patterns in sequential data and use this information to predict a trend.
A certain point in a sequence is called a time step. Their total number is the sequence length. For every time step in the sequence we have a feature vector that consists of the values we want to track. We can have several different features but for simplicity we will stick with just one. Therefore a sequence is two dimensional with the shape [time step, feature value].
We can set up RNNs in different ways, depending on whether the input, the output or both are sequences, as you can see here:
The “Normal Net” type shows an architecture we know from, for example, feed-forward nets. The circles are vectors, so you can see that the input and the output are each one fixed vector. From the second type on we can see RNNs. The orange circles combined are the input sequence we feed into the net. Every circle itself stands for a feature vector at a distinct time step. The hidden layer in blue processes the input and outputs the prediction for the next time step(s). The output can either be the overall model output or the input to another hidden layer.
Looking at “Sequence to Sequence Synced” you can see how we input the features at time step 1 and output a prediction for the following time step (in this case step 2). We repeat this for every time step until we reach the end of the sequence. The twist of RNNs is that we input the predicted values from previous time steps as well. In the image this is illustrated by the horizontal arrows in the hidden layer. This is the recurrent part of the net but more on this soon.
Don’t get the image wrong! The “Sequence Input” and “Sequence To Sequence” variants also make a prediction at every time step. The difference is that they don’t pass every one of these predictions on to the next layer or use it as the overall model output. They just pass the predictions on to the next time step. This means that you can freely choose which variant to use, depending on your needs.
Getting time series data in shape
To train an RNN we need to restructure the time series into a number of shorter sequences to turn it into a supervised learning problem. Because each sequence is two-dimensional, the data we feed into the network is three-dimensional, with the shape [sequence, time step, feature value].
Let’s say we want to create sequences that are 10 time steps long. In practice this would possibly be too short but we want to keep it simple. The following image shows how to split up the data for the first three sequences:
The orange tables are the sequences we train the network with. As you can see every following sequence is shifted forward by one time step. The labels are the values that immediately follow their respective sequence and are the actual values we want to predict during training.
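The splitting described above can be sketched in a few lines of NumPy. This is only an illustration: the function name make_windows and the toy series are made up for this example and do not come from the sunspot dataset.

```python
import numpy as np

def make_windows(series, window_length):
    """Split a 1-D series into overlapping input sequences and next-step labels."""
    series = np.asarray(series, dtype=float)
    n_windows = len(series) - window_length
    if n_windows <= 0:
        raise ValueError("series must be longer than window_length")
    # Every following sequence is shifted forward by one time step.
    X = np.stack([series[i:i + window_length] for i in range(n_windows)])
    # The label is the value that immediately follows its sequence.
    y = series[window_length:]
    # Add a trailing feature axis: [sequence, time step, feature value].
    return X[..., np.newaxis], y

series = np.arange(15)                       # toy stand-in for a real time series
X, y = make_windows(series, window_length=10)
print(X.shape, y.shape)                      # (5, 10, 1) (5,)
```

With 15 values and a sequence length of 10 we get 5 training sequences. The first one holds the values 0 to 9 and its label is 10.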
What does Recurrence mean?
The hidden units in an RNN layer are different from those in conventional neural networks, so we call them recurrent units. Typically RNN layers only consist of a few units and it is also possible to use just one unit per layer. I will now explain how the recurrent units work. Have a look at this image:
The recurrent unit computes an output vector for every time step. The vector depends on time, so we call the output for the current time step y_t. At the bottom you can see the input feature vector x_t. We feed the feature vectors of the input sequence to the net one time step after another. In an RNN we call the feature vector for the current time step the present input. Additionally, the RNN feeds its own output back in as an input at the following time step (where it becomes y_{t-1}). We call this vector the recurrent input.
We compute the output vector from the input vector x_t, the recurrent input y_{t-1} and an activation function g:
y_t = g ( W * x_t + R * y_{t-1} )
For time series problems it is common to use a tanh function for g. Both the present and recurrent input vectors are multiplied by a weight matrix (W and R respectively).
The way the net uses its previous output as an input is what we call Recurrence. It allows the net to remember what it learned from previous time steps. This is very important for learning long and short term dependencies.
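To make the recurrence concrete, here is a minimal NumPy sketch of one recurrent layer running over a sequence. It assumes tanh for g, no bias terms, random untrained weights and a recurrent input of zeros at the first time step. None of these details are fixed by the text above, and the name rnn_forward is made up for this example.

```python
import numpy as np

def rnn_forward(x_seq, W, R, y_init=None):
    """Run one recurrent layer over a sequence of shape [time step, feature value]."""
    n_units = W.shape[0]
    # Assumption: the recurrent input at the first time step is all zeros.
    y_prev = np.zeros(n_units) if y_init is None else y_init
    outputs = []
    for x_t in x_seq:
        # Present input and recurrent input are weighted, added and squashed.
        y_t = np.tanh(W @ x_t + R @ y_prev)
        outputs.append(y_t)
        y_prev = y_t          # the output becomes the next recurrent input
    return np.stack(outputs)

rng = np.random.default_rng(0)
n_features, n_units, n_steps = 1, 3, 10
W = rng.normal(scale=0.5, size=(n_units, n_features))   # weights for the present input
R = rng.normal(scale=0.5, size=(n_units, n_units))      # weights for the recurrent input
x_seq = rng.normal(size=(n_steps, n_features))

all_outputs = rnn_forward(x_seq, W, R)   # one output per time step
last_output = all_outputs[-1]            # only the output of the final time step
print(all_outputs.shape, last_output.shape)   # (10, 3) (3,)
```

Keeping all_outputs corresponds to passing on a prediction at every time step, while keeping only last_output corresponds to using just the final one.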
Training and Backpropagation through time (BPTT)
The inputs to a recurrent unit are weighted which means they are multiplied by a weight matrix. The weights describe the importance the model gives to certain values. The net learns by adjusting the weight matrices to values that lead to a better prediction.
After a full pass over a sequence the net can evaluate how good the predicted values are in comparison to the actual ones (the labels) and calculate the error. The net now goes back through the whole sequence and adjusts the individual weight matrices so that the error becomes smaller. This process is called backpropagation.
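As a small illustration of the error calculation, the sketch below compares predictions with labels using the mean squared error. The text does not name a specific error measure, so the mean squared error is an assumption here (it is a common choice for forecasting), and the numbers are invented.

```python
import numpy as np

def mean_squared_error(predictions, labels):
    """Average of the squared differences between predictions and labels."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((predictions - labels) ** 2))

labels = [3.0, 5.0, 4.0]          # actual values (made up)
predictions = [2.5, 5.5, 4.0]     # model outputs (made up)
error = mean_squared_error(predictions, labels)
print(error)                      # about 0.167
```

Backpropagation then adjusts the weights in the direction that makes this number smaller.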
Because of the additional time dimension RNNs need to use a special form of backpropagation. Normally (for example in a feed forward network) backpropagation goes back through the different hidden layers where an optimizer function adjusts the weight matrices. For RNNs we also need to go back in time adjusting all the weights of previous time steps.
BPTT can become a problem if the sequence is very long and we have to go back all the way after every prediction. Truncated BPTT (TBPTT) solves this issue by splitting up the sequence. Every time backpropagation is applied we only have to go back the length of the truncated subsequence we are currently in.
The downside of this is that the net can only learn dependencies within these subsequences. Keep that in mind when you choose their length while defining or tuning your RNN.
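The splitting step of TBPTT can be sketched as follows. This only shows how a long sequence is cut into subsequences. The backpropagation itself is left out, and the function name split_for_tbptt and the chunk length are made up for this example.

```python
import numpy as np

def split_for_tbptt(sequence, chunk_length):
    """Cut a sequence into consecutive, non-overlapping subsequences."""
    sequence = np.asarray(sequence)
    return [sequence[start:start + chunk_length]
            for start in range(0, len(sequence), chunk_length)]

sequence = np.arange(10)                  # toy sequence with 10 time steps
chunks = split_for_tbptt(sequence, chunk_length=4)
for chunk in chunks:
    # During training, backpropagation would only go back to the start of this chunk.
    print(chunk)
```

Here the last chunk is shorter than the others because 10 is not a multiple of 4. A dependency between time step 1 and time step 9 could not be learned with this chunk length, because the two steps never end up in the same subsequence.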
Problems of RNNs
The problem with vanilla RNNs as described above is that in practice they only manage to learn short-term dependencies. The reason for this is the so-called vanishing gradient problem.
In RNNs the vanishing gradient problem appears when we backpropagate through a sequence. At every time step we go back, the error signal is multiplied by the recurrent weights and the derivative of the activation function, so it typically shrinks a little each time. The further we go back in the sequence, the less influence the values from those early time steps have on the weight updates. This prevents the model from learning long-term dependencies and makes it ineffective. Therefore we need to find a way to avoid the vanishing gradient problem.
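A tiny numerical experiment can make this visible. The sketch below uses a recurrent unit with a single scalar weight for each input and tracks how strongly the output at time step t still depends on the very first recurrent input. The weights w = 1.0 and r = 0.9, the random input and the function name are all assumptions chosen for illustration.

```python
import numpy as np

def gradient_through_time(x_seq, w, r):
    """Return |d y_t / d y_0| for every time step t of a scalar tanh unit."""
    y_prev, grad = 0.0, 1.0
    magnitudes = []
    for x_t in x_seq:
        y_t = np.tanh(w * x_t + r * y_prev)
        # Chain rule for one step back: d y_t / d y_{t-1} = r * (1 - y_t^2)
        grad *= r * (1.0 - y_t ** 2)
        magnitudes.append(abs(grad))
        y_prev = y_t
    return magnitudes

rng = np.random.default_rng(0)
x_seq = rng.normal(size=50)               # random stand-in for a real sequence
magnitudes = gradient_through_time(x_seq, w=1.0, r=0.9)
print(magnitudes[0], magnitudes[9], magnitudes[49])
```

Because every factor is at most 0.9 in magnitude, the product can only shrink. After 50 time steps it is at most 0.9 to the power of 50, which is roughly 0.005.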
If you want to know more about this, I recommend this answer.
Long Short-Term Memory networks (LSTMs)
LSTMs are a more complex variation of RNNs that are able to learn long-term dependencies. They were designed to overcome the vanishing gradient problem.
So let’s have a closer look. In an LSTM the recurrent unit is called an LSTM block or just block. The block works like a normal recurrent unit but has an additional cell and gates. The gates help to determine the long-term dependencies by controlling the data flow inside the block. The cell gives the net some sort of memory and retains the long term dependencies.
Like in a vanilla RNN the input data consists of the present input and the recurrent input. The task of the gates is to control which information is important for the prediction at every time step and which is not. There are three different gates which we invoke in the following order at every time step:
- Forget Gate: Determines the information we want to remove from the cell (and therefore forget)
- Update Gate: Determines the information from the input data that we want to add to the cell
- Output Gate: Determines which information of the cell is useful for the current prediction
The training and backpropagation process works the same way here, with the addition that the gates also have weights the model needs to learn. That way the model gets better at determining the right information that leads to good predictions.
In the next section I will go deeper into how the LSTM block works and won’t shy away from mathematical equations.
The Math behind LSTMs
The following image shows the architecture and the workflow of an LSTM block:
I will describe it part by part so in the end the whole picture will make sense to you.
Let’s start by having a look at the block input. The data consists of the present input x_t (orange) and the recurrent input y_{t-1} (yellow). The block directs both vectors to the three different gates and the input activation function. There we multiply the input vectors with weight matrices which are denoted by an encircled “W”. Note that these are four different weight matrices, all trained individually during the training process.
Activation functions make training faster and less error-prone by normalizing values into a certain range. This is also called squashing.
tanh, for example, squashes the values into the range from -1 to 1.
The input activation function uses tanh and controls the data input flow like in a normal RNN:
i_t = g ( W_i * x_t + R_i * y_{t-1} )
The gates, however, use Sigmoid activation functions (denoted by σ) that squash the values into the range from 0 to 1. The idea of the gates is to let a specific amount of information through and we achieve this with Sigmoid. A value of 1 means “let everything through” and a value of 0 means “let nothing through”.
The gates have different tasks in modifying the data flow in the block to further determine which information is useful for the current prediction and which is not. Each gate outputs a vector. Mathematically they all work exactly the same:
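The squashing and the “let through” idea can be tried out directly. In this sketch the input values and the data vector are made up, and the sigmoid function is written out by hand so that the example only needs NumPy.

```python
import numpy as np

def sigmoid(z):
    """Logistic Sigmoid: maps any real number into the range between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, 0.0, 10.0])
print(np.tanh(z))      # values between -1 and 1
print(sigmoid(z))      # values between 0 and 1

# A gate value multiplies the data element by element.
data = np.array([2.0, 2.0, 2.0])
gate = sigmoid(z)      # roughly "closed", "half open" and "open"
gated = gate * data
print(gated)
```

The first entry of the data is almost completely blocked, the second is halved and the third passes through nearly unchanged.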
f_t = σ ( W_f * x_t + R_f * y_{t-1} ) (Forget)
u_t = σ ( W_u * x_t + R_u * y_{t-1} ) (Update)
o_t = σ ( W_o * x_t + R_o * y_{t-1} ) (Output)
The gates add together the weighted inputs and squash the values with a Sigmoid function. The net trains the weights to better determine which information is important and consequently make a more precise prediction. Therefore the gates only differ in the trained weights and the task they have.
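Since all three gates share the same formula, a single helper function is enough to compute them. The sketch below uses random, untrained weights and random input vectors purely as placeholders. The function name gate and the sizes are assumptions for this example, and bias terms are left out as in the equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(x_t, y_prev, W, R):
    """Same computation for every gate: add the weighted inputs, then squash."""
    return sigmoid(W @ x_t + R @ y_prev)

rng = np.random.default_rng(0)
n_features, n_units = 1, 4
x_t = rng.normal(size=n_features)        # present input (random placeholder)
y_prev = rng.normal(size=n_units)        # recurrent input (random placeholder)

# Every gate gets its own pair of weight matrices (W, R).
weights = {name: (rng.normal(size=(n_units, n_features)),
                  rng.normal(size=(n_units, n_units)))
           for name in ("forget", "update", "output")}

gates = {name: gate(x_t, y_prev, W, R) for name, (W, R) in weights.items()}
for name, values in gates.items():
    print(name, values)
```

All three results are vectors with one value between 0 and 1 per unit. They differ only because the weight matrices differ.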
In the center of the block we can see the cell. This is basically just another vector that serves as the memory. At the start of a new time step we only have the cell state from the previous step, c_{t-1}. Therefore our goal is to compute the present cell state c_t. We do this in two steps, namely forgetting/removing and updating/adding.
First we remove information from the cell state that is no longer important for the current prediction. We do this by multiplying the forget gate vector f_t element by element with the old cell state c_{t-1}.
Next we want to add new information to this. For that we need to define what we want to add. As you may remember, this was the task of the update gate. We therefore multiply the update gate vector u_t element by element with the squashed input data i_t.
Everything combined this formula shows how we calculate the present state vector ct:
c_t = ( c_{t-1} * f_t ) + ( i_t * u_t )
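A worked example with hand-picked numbers shows what the two steps do. All values below are invented. The multiplications are element by element, which is what * does for NumPy arrays.

```python
import numpy as np

c_prev = np.array([0.8, -0.5, 0.3])   # old cell state
f_t = np.array([1.0, 0.0, 0.5])       # forget gate: keep, drop, keep half
i_t = np.array([0.2, 0.9, -0.4])      # squashed input data
u_t = np.array([0.0, 1.0, 0.5])       # update gate: add nothing, add all, add half

kept = c_prev * f_t                   # step 1: forgetting / removing
added = i_t * u_t                     # step 2: what we add
c_t = kept + added                    # present cell state
print(c_t)                            # approximately [0.8, 0.9, -0.05]
```

The first entry is kept untouched, the second is completely replaced by new input and the third is a mix of old memory and new input.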
We now use the cell state c_t and the output gate vector o_t to compute the output y_t.
y_t = h ( c_t ) * o_t
Where h is the output activation function that squashes the cell state vector. This is usually tanh again.
We also use the output as the recurrent input for the next time step. Now the computation cycle of the current time step is concluded and we move on to the next one. Consequently y_t then becomes y_{t-1} and c_t becomes c_{t-1}. This flow is the reason why we call these neural nets recurrent.
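Putting all the equations together, one full time step of the block and the loop over a sequence could look like the following NumPy sketch. It assumes tanh for both g and h, no bias terms, random untrained weights and zero vectors for the first recurrent input and cell state. The names lstm_step, lstm_forward and params are made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, c_prev, params):
    """One time step of the LSTM block, following the equations in this section."""
    i_t = np.tanh(params["Wi"] @ x_t + params["Ri"] @ y_prev)   # input data
    f_t = sigmoid(params["Wf"] @ x_t + params["Rf"] @ y_prev)   # forget gate
    u_t = sigmoid(params["Wu"] @ x_t + params["Ru"] @ y_prev)   # update gate
    o_t = sigmoid(params["Wo"] @ x_t + params["Ro"] @ y_prev)   # output gate
    c_t = c_prev * f_t + i_t * u_t                              # present cell state
    y_t = np.tanh(c_t) * o_t                                    # block output
    return y_t, c_t

def lstm_forward(x_seq, params, n_units):
    """Run the block over a whole sequence of shape [time step, feature value]."""
    # Assumption: output and cell state start as zero vectors.
    y_t, c_t = np.zeros(n_units), np.zeros(n_units)
    outputs = []
    for x_t in x_seq:
        # The previous output and cell state flow into the next time step.
        y_t, c_t = lstm_step(x_t, y_t, c_t, params)
        outputs.append(y_t)
    return np.stack(outputs), c_t

rng = np.random.default_rng(0)
n_features, n_units, n_steps = 1, 4, 10
params = {}
for name in "ifuo":   # input data, forget, update and output weights
    params["W" + name] = rng.normal(scale=0.5, size=(n_units, n_features))
    params["R" + name] = rng.normal(scale=0.5, size=(n_units, n_units))

x_seq = rng.normal(size=(n_steps, n_features))
outputs, final_cell = lstm_forward(x_seq, params, n_units)
print(outputs.shape, final_cell.shape)   # (10, 4) (4,)
```

In a real model the weights would of course be learned during training, and you would normally use a deep learning library instead of writing the loop by hand.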
Now you know the theoretical basics of LSTMs and what is important when building such a model for time series forecasting. In my opinion it is always a good idea to understand this before applying it practically. That way LSTMs appear less like a magical black box and you can use them to their full potential and get better results.
Do you still have questions or want to know more? Please feel free to leave a comment!
In another upcoming post I will describe a more practical approach to LSTMs for time series forecasting, so stay tuned!