A recurrent neural network consists of multiple fixed activation-function blocks, one for each time step. Each block has an internal state called its hidden state. The hidden state represents the knowledge of the past that the network holds at a given time step, and it is updated at every time step to reflect changes in that knowledge. The hidden state is updated using the following recurrence relation:

h_t = f_W(h_{t-1}, x_t)

- h_t: the new hidden state
- h_{t-1}: the previous hidden state
- x_t: the current input
- f_W: a fixed function with trainable weights
Note: to explain the concepts of a recurrent neural network, it is usually illustrated in its unrolled form, and this post will follow that convention.
At each time step, the new hidden state is computed by applying the recurrence relation above. This newly generated hidden state is then used to compute the hidden state at the next time step, and so on.
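The unrolled computation can be sketched as a simple loop. Here `f_W` is only a stand-in for the parameterized update described later in this post; its body is an illustrative assumption:

```python
import numpy as np

def f_W(h_prev, x_t):
    # Stand-in for the fixed parameterized function; a real cell
    # uses trainable weights on both h_prev and x_t.
    return np.tanh(h_prev + x_t)

def unroll(x_seq, h0):
    """Apply the same recurrence h_t = f_W(h_{t-1}, x_t) at every step."""
    h = h0
    states = []
    for x_t in x_seq:      # one "block" per time step
        h = f_W(h, x_t)    # new hidden state from old state + current input
        states.append(h)
    return states

x_seq = [np.array([0.5]), np.array([-0.2]), np.array([1.0])]
states = unroll(x_seq, h0=np.zeros(1))
print(len(states))  # one hidden state per time step -> 3
```

The same block (the same `f_W`) is reused at every step; only the state and input change.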
The main workflow of the recurrent neural network is as follows:
Note that h_0 is the initial hidden state of the network. It is usually a vector of zeros, but it can take other values as well. One technique is to encode assumptions about the data into the initial hidden state. For example, for the problem of determining the tone of a speech given by a particular speaker, the tones of that person's past speeches could be encoded into the initial hidden state. Another technique is to make the initial hidden state a trainable parameter. While these techniques add some nuance to the network, initializing the hidden state vector to zeros is usually an efficient choice.
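The three initialization options mentioned above can be illustrated as follows; the hidden size, the averaging scheme, and the scale factor are all illustrative assumptions:

```python
import numpy as np

hidden_size = 4

# 1. Zero initialization: the usual, efficient default.
h0_zeros = np.zeros(hidden_size)

# 2. Encoding prior knowledge: e.g. averaging feature vectors of
#    past data into the initial state (the averaging is hypothetical).
past_features = np.random.randn(10, hidden_size)
h0_encoded = past_features.mean(axis=0)

# 3. Treating h0 as a trainable parameter: initialize it with small
#    random values and update it by gradient descent like any weight.
h0_learned = np.random.randn(hidden_size) * 0.01
```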
How each recurrent block works:
- Take as input the previous hidden state vector and the current input vector.
Note that since the hidden state and the current input are treated as vectors, each element of a vector lies in a dimension orthogonal to all the other dimensions. Thus, the product of two elements is non-zero only when both elements are non-zero and both lie in the same dimension.
- Multiply the hidden state vector element-wise by the hidden-state weights, and similarly multiply the current input vector element-wise by the input weights. This produces the parameterized hidden state vector and the parameterized current input vector.
Note that the weights for the different vectors are stored in a trainable weight matrix.
- Perform vector addition of the two parameterized vectors, then apply the element-wise hyperbolic tangent (tanh) to produce the new hidden state vector.
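The steps above can be sketched as follows. The text describes element-wise multiplication by weight vectors; common implementations instead use weight matrices and matrix-vector products, which is what this sketch does (the sizes and the initialization scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden-state weights
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input weights

def rnn_cell(h_prev, x_t):
    # Parameterize the previous hidden state and the current input.
    param_h = W_hh @ h_prev
    param_x = W_xh @ x_t
    # Add the two vectors, then apply element-wise tanh.
    return np.tanh(param_h + param_x)

h = rnn_cell(np.zeros(hidden_size), np.ones(input_size))
print(h.shape)  # (4,)
```

Because tanh squashes every element into (-1, 1), the hidden state stays bounded no matter how many steps the cell is applied.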
While training a recurrent network, the network also generates an output at each time step. This output is used to train the network via gradient descent.
The backpropagation used here is essentially the same as that used in a typical artificial neural network, with some minor modifications. Those modifications are noted below.
Let the predicted output of the network at time step t be ŷ_t and the actual output be y_t. Then the error at each time step is defined (here, using the cross-entropy loss) as:

E_t = −y_t · log(ŷ_t)
The total error is determined by summing the errors at all time steps:

E = Σ_t E_t
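As a small illustration, assuming the cross-entropy per-step loss above, the total error is just the sum over time steps (the target and prediction values are made up for the example):

```python
import numpy as np

# Per-step targets y_t and predictions ŷ_t (illustrative values).
y = np.array([1.0, 1.0, 0.0])
y_hat = np.array([0.9, 0.8, 0.4])

# E_t = -y_t * log(ŷ_t); clip avoids log(0) for numerical safety.
E_t = -y * np.log(np.clip(y_hat, 1e-12, None))
E = E_t.sum()   # total error: sum over all time steps
print(E)
```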
Similarly, the gradient ∂E/∂W can be calculated as the sum of the gradients at each time step:

∂E/∂W = Σ_t ∂E_t/∂W
Using the chain rule of calculus, and the fact that the output at time step t is a function of the current hidden state of the recurrent unit, the following expression arises:

∂E_t/∂W = ∂E_t/∂ŷ_t · ∂ŷ_t/∂h_t · ∂h_t/∂h_{t-1} · ∂h_{t-1}/∂h_{t-2} · … · ∂h_0/∂W
Note that the weight matrix W used in the expression above is actually different for the input vector and the hidden state vector; a single W is written for notational convenience only.
This produces the following expression:

∂E/∂W = Σ_t ∂E_t/∂ŷ_t · ∂ŷ_t/∂h_t · ∂h_t/∂h_{t-1} · … · ∂h_0/∂W
Thus, backpropagation through time differs from typical backpropagation only in that the errors at each time step are summed up to compute the total error.
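The summation over time steps can be made concrete with a minimal sketch of backpropagation through time for the tanh cell described earlier. The squared-error readout layer `W_hy`, the sizes, and the random data are all illustrative assumptions, not part of the original derivation:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_in, n_h = 3, 2, 3
W_xh = rng.standard_normal((n_h, n_in)) * 0.1   # input weights
W_hh = rng.standard_normal((n_h, n_h)) * 0.1    # hidden-state weights
W_hy = rng.standard_normal((1, n_h)) * 0.1      # readout weights (assumed)

xs = [rng.standard_normal(n_in) for _ in range(T)]
ys = [rng.standard_normal(1) for _ in range(T)]

# Forward pass: store every hidden state for the backward pass.
hs = [np.zeros(n_h)]
y_hats = []
for x in xs:
    hs.append(np.tanh(W_xh @ x + W_hh @ hs[-1]))
    y_hats.append(W_hy @ hs[-1])

# Backward pass: E = Σ_t 0.5·(ŷ_t − y_t)²; the weight gradients are
# summed over time steps, and dh_next carries ∂h_t/∂h_{t-1} backwards.
dW_xh = np.zeros_like(W_xh)
dW_hh = np.zeros_like(W_hh)
dh_next = np.zeros(n_h)
for t in reversed(range(T)):
    dy = y_hats[t] - ys[t]              # ∂E_t/∂ŷ_t
    dh = W_hy.T @ dy + dh_next          # into h_t from output and future steps
    dpre = (1 - hs[t + 1] ** 2) * dh    # through the tanh nonlinearity
    dW_xh += np.outer(dpre, xs[t])      # summed across all time steps
    dW_hh += np.outer(dpre, hs[t])
    dh_next = W_hh.T @ dpre             # pass the gradient back to h_{t-1}

print(dW_hh.shape)
```

The only structural difference from ordinary backpropagation is the accumulation: each weight gradient is a running sum of per-time-step contributions.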
While the basic recurrent neural network is quite effective, it can suffer from a significant problem. For deep networks, the backpropagation process can lead to the following issues:
- Vanishing gradients: this happens when the gradients become very small and tend toward zero.
- Exploding gradients: this happens when the gradients become too large as they are propagated backwards.
The exploding gradient problem can be addressed with gradient clipping, that is, by setting a threshold on the gradients propagated back through time. But thresholding is not regarded as a solution to the underlying problem and may also reduce the effectiveness of the network. To cope with these problems, two main variants of recurrent neural networks have been developed: long short-term memory (LSTM) networks and gated recurrent unit (GRU) networks.
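Gradient clipping by norm can be sketched as follows; the threshold value is an arbitrary choice:

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient if its L2 norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = clip_gradient(np.array([30.0, 40.0]))  # norm 50 -> rescaled to norm 5
print(np.linalg.norm(g))
```

Rescaling by norm (rather than clipping each element independently) keeps the gradient's direction unchanged and only limits its magnitude.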