8.2. Computing the gradients for Back Propagation Through Time

8.2.1. Loss function

We will use the mean squared error (MSE) as our loss function. The MSE is defined as follows:

$$ L = \frac{1}{2} (y^{(T)} - Y^{(T)})^{2} \tag{8.3} $$
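
As a minimal sketch of Eq. (8.3), the loss and its derivative with respect to the prediction can be written in NumPy as follows. The function names `mse_loss` and `mse_grad` are hypothetical, and summing over output components is an assumption for the case where $y^{(T)}$ is a vector.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """L = 1/2 * (y - Y)^2, summed over output components (assumption)."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return 0.5 * float(np.sum(diff ** 2))

def mse_grad(y_pred, y_true):
    """dL/dy = y - Y; the factor 1/2 cancels the exponent."""
    return np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)

loss = mse_loss([0.8], [1.0])
grad = mse_grad([0.8], [1.0])
```

The derivative $y^{(T)} - Y^{(T)}$ is the value that enters the dense layer at the start of the backward pass.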

8.2.2. Forward Computational Graph

Fig.8-4 illustrates the forward computational graph of the LSTM.

Fig.8-4: Forward Computational Graph of Many-to-One LSTM

For an explanation of computational graphs, see the Appendix.

8.2.3. Backward Computational Graph

To build the backward computational graph, we will calculate the derivatives between nodes of the above graph.

Since the derivatives of the dense layer have been calculated in Section 4.1, we focus on calculating the derivatives of the LSTM.

$$ \begin{align}
\frac{\partial h^{(t)}}{\partial o^{(t)}} &= \frac{\partial (o^{(t)} \odot \tanh(C^{(t)})) }{\partial o^{(t)}} = \tanh(C^{(t)}) \\
\frac{\partial o^{(t)}}{\partial W_{o}} &= \frac{\partial \sigma(W_{o} x^{(t)} + U_{o} h^{(t-1)} + b_{o})}{\partial W_{o}} = \sigma'(W_{o} x^{(t)} + U_{o} h^{(t-1)} + b_{o}) \ {}^t x^{(t)} \\
\frac{\partial o^{(t)}}{\partial U_{o}} &= \frac{\partial \sigma(W_{o} x^{(t)} + U_{o} h^{(t-1)} + b_{o})}{\partial U_{o}} = \sigma'(W_{o} x^{(t)} + U_{o} h^{(t-1)} + b_{o}) \ {}^t h^{(t-1)} \\
\frac{\partial o^{(t)}}{\partial b_{o}} &= \frac{\partial \sigma(W_{o} x^{(t)} + U_{o} h^{(t-1)} + b_{o})}{\partial b_{o}} = \sigma'(W_{o} x^{(t)} + U_{o} h^{(t-1)} + b_{o}) \\
\frac{\partial o^{(t)}}{\partial h^{(t-1)}} &= \frac{\partial \sigma(W_{o} x^{(t)} + U_{o} h^{(t-1)} + b_{o})}{\partial h^{(t-1)}} = \sigma'(W_{o} x^{(t)} + U_{o} h^{(t-1)} + b_{o}) \ {}^t U_{o} \\
\frac{\partial h^{(t)}}{\partial C^{(t)}} &= \frac{\partial (o^{(t)} \odot \tanh(C^{(t)})) }{\partial C^{(t)}} = o^{(t)} \odot \frac{\partial \tanh(C^{(t)}) }{\partial C^{(t)}} = o^{(t)} \odot \tanh'(C^{(t)}) \\
\frac{\partial C^{(t)}}{\partial i^{(t)}} &= \frac{\partial (f^{(t)} \odot C^{(t-1)} + i^{(t)} \odot \tilde{C}^{(t)} ) }{\partial i^{(t)}} = \tilde{C}^{(t)} \\
\frac{\partial i^{(t)}}{\partial W_{i}} &= \frac{\partial \sigma(W_{i} x^{(t)} + U_{i} h^{(t-1)} + b_{i})} {\partial W_{i}} = \sigma'(W_{i} x^{(t)} + U_{i} h^{(t-1)} + b_{i}) \ {}^t x^{(t)} \\
\frac{\partial i^{(t)}}{\partial U_{i}} &= \frac{\partial \sigma(W_{i} x^{(t)} + U_{i} h^{(t-1)} + b_{i})} {\partial U_{i}} = \sigma'(W_{i} x^{(t)} + U_{i} h^{(t-1)} + b_{i}) \ {}^t h^{(t-1)} \\
\frac{\partial i^{(t)}}{\partial b_{i}} &= \frac{\partial \sigma(W_{i} x^{(t)} + U_{i} h^{(t-1)} + b_{i})} {\partial b_{i}} = \sigma'(W_{i} x^{(t)} + U_{i} h^{(t-1)} + b_{i}) \\
\frac{\partial i^{(t)}}{\partial h^{(t-1)}} &= \frac{\partial \sigma(W_{i} x^{(t)} + U_{i} h^{(t-1)} + b_{i})} {\partial h^{(t-1)}} = \sigma'(W_{i} x^{(t)} + U_{i} h^{(t-1)} + b_{i}) \ {}^t U_{i} \\
\frac{\partial C^{(t)}}{\partial \tilde{C}^{(t)}} &= \frac{\partial (f^{(t)} \odot C^{(t-1)} + i^{(t)} \odot \tilde{C}^{(t)} ) }{\partial \tilde{C}^{(t)}} = i^{(t)} \\
\frac{\partial \tilde{C}^{(t)}} {\partial W_{c}} &= \frac{\partial \tanh(W_{c} x^{(t)} + U_{c} h^{(t-1)} + b_{c})} {\partial W_{c}} = \tanh'(W_{c} x^{(t)} + U_{c} h^{(t-1)} + b_{c}) \ {}^t x^{(t)} \\
\frac{\partial \tilde{C}^{(t)}} {\partial U_{c}} &= \frac{\partial \tanh(W_{c} x^{(t)} + U_{c} h^{(t-1)} + b_{c})} {\partial U_{c}} = \tanh'(W_{c} x^{(t)} + U_{c} h^{(t-1)} + b_{c}) \ {}^t h^{(t-1)} \\
\frac{\partial \tilde{C}^{(t)}} {\partial b_{c}} &= \frac{\partial \tanh(W_{c} x^{(t)} + U_{c} h^{(t-1)} + b_{c})} {\partial b_{c}} = \tanh'(W_{c} x^{(t)} + U_{c} h^{(t-1)} + b_{c}) \\
\frac{\partial \tilde{C}^{(t)}} {\partial h^{(t-1)}} &= \frac{\partial \tanh(W_{c} x^{(t)} + U_{c} h^{(t-1)} + b_{c})} {\partial h^{(t-1)}} = \tanh'(W_{c} x^{(t)} + U_{c} h^{(t-1)} + b_{c}) \ {}^t U_{c} \\
\frac{\partial C^{(t)}}{\partial f^{(t)}} &= \frac{\partial (f^{(t)} \odot C^{(t-1)} + i^{(t)} \odot \tilde{C}^{(t)} ) }{\partial f^{(t)}} = C^{(t-1)} \\
\frac{\partial f^{(t)}}{\partial W_{f}} &= \frac{\partial \sigma(W_{f} x^{(t)} + U_{f} h^{(t-1)} + b_{f}) } {\partial W_{f}} = \sigma'(W_{f} x^{(t)} + U_{f} h^{(t-1)} + b_{f}) \ {}^t x^{(t)} \\
\frac{\partial f^{(t)}}{\partial U_{f}} &= \frac{\partial \sigma(W_{f} x^{(t)} + U_{f} h^{(t-1)} + b_{f}) } {\partial U_{f}} = \sigma'(W_{f} x^{(t)} + U_{f} h^{(t-1)} + b_{f}) \ {}^t h^{(t-1)} \\
\frac{\partial f^{(t)}}{\partial b_{f}} &= \frac{\partial \sigma(W_{f} x^{(t)} + U_{f} h^{(t-1)} + b_{f}) } {\partial b_{f}} = \sigma'(W_{f} x^{(t)} + U_{f} h^{(t-1)} + b_{f}) \\
\frac{\partial f^{(t)}}{\partial h^{(t-1)}} &= \frac{\partial \sigma(W_{f} x^{(t)} + U_{f} h^{(t-1)} + b_{f}) } {\partial h^{(t-1)}} = \sigma'(W_{f} x^{(t)} + U_{f} h^{(t-1)} + b_{f}) \ {}^t U_{f} \\
\frac{\partial C^{(t)}}{\partial C^{(t-1)}} &= \frac{\partial (f^{(t)} \odot C^{(t-1)} + i^{(t)} \odot \tilde{C}^{(t)} ) }{\partial C^{(t-1)}} = f^{(t)}
\end{align} $$
Note

To avoid confusion, we express the transpose of a vector or matrix $ A $ as $ \ {}^tA$, instead of $A^{T}$, in this section.
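
To make the local derivatives of the output gate concrete, the sketch below evaluates them in NumPy. It relies on the standard identities $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and $\tanh'(z) = 1 - \tanh^{2}(z)$. The column-vector convention, the sizes, the random values and the upstream gradient `g` are assumptions made for illustration only. Under that convention, a product written as $\sigma'(\cdot) \ {}^t x^{(t)}$ becomes an outer product, and $\sigma'(\cdot) \ {}^t U_{o}$ becomes a multiplication by the transposed matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def dtanh(z):
    """tanh'(z) = 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                       # illustrative sizes
W_o = rng.normal(size=(n_hid, n_in))
U_o = rng.normal(size=(n_hid, n_hid))
b_o = rng.normal(size=n_hid)
x = rng.normal(size=n_in)                # x^(t)
h_prev = rng.normal(size=n_hid)          # h^(t-1)

z_o = W_o @ x + U_o @ h_prev + b_o       # pre-activation of the output gate
g = rng.normal(size=n_hid)               # hypothetical gradient arriving at o^(t)

delta = g * dsigmoid(z_o)                # g * sigma'(z_o), elementwise
grad_W_o = np.outer(delta, x)            # sigma'(.) {}^t x^(t)
grad_U_o = np.outer(delta, h_prev)       # sigma'(.) {}^t h^(t-1)
grad_b_o = delta                         # sigma'(.)
grad_h_prev = U_o.T @ delta              # sigma'(.) {}^t U_o
```

The input gate, forget gate and candidate cell state follow the same pattern, with `dtanh` replacing `dsigmoid` for $\tilde{C}^{(t)}$.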

Using these derivatives, we can build the backward computational graph shown in Fig.8-5.

Fig.8-5: Backward Computational Graph of Many-to-One LSTM

To simplify the following discussion, we define the following expressions:

$$ \begin{align} \text{grad}_{dense}^{(T)} & \stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial h^{(T)}} \tag{8.4} \\ dh^{(t)} & \stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial h^{(t)}} \tag{8.5} \\ dC^{(t)} & \stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial C^{(t)}} \tag{8.6} \end{align} $$
  • $\text{grad}_{dense}^{(T)}$ is the gradient propagated from the dense layer.
  • $dh^{(t)}$ is the gradient propagated from the hidden state $h^{(t+1)}$.
  • $dC^{(t)}$ is the gradient propagated from the cell state $C^{(t+1)}$.
Notation

In this document, “$d$” denotes the gradient (e.g., $dL, dh$), not the total derivative.

The following relation is satisfied by definition:

$$ dh^{(T)} = \text{grad}_{dense}^{(T)} $$

Next, we will calculate $dh^{(T-1)}$. Fig.8-6 illustrates the relationship between $h^{(T)}$ and $h^{(T-1)}$, which is extracted from Fig.8-5.

Fig.8-6: Relationship Between $h^{(T)}$ and $h^{(T-1)}$

As shown in Fig.8-6, $ dh^{(T-1)} $ can be calculated from $ dh^{(T)} $. There are four paths from $ h^{(T)} $ to $h^{(T-1)} $ (through $o^{(T)}$, $i^{(T)}$, $\tilde{C}^{(T)}$ and $f^{(T)}$), so we add their contributions together.

$$ \begin{align} dh^{(T-1)} &= \frac{\partial L}{\partial h^{(T)}} \left[ \frac{\partial h^{(T)}}{\partial o^{(T)}} \frac{\partial o^{(T)}}{\partial h^{(T-1)}} + \frac{\partial h^{(T)}}{\partial C^{(T)}} \frac{\partial C^{(T)}}{\partial i^{(T)}} \frac{\partial i^{(T)}}{\partial h^{(T-1)}} + \frac{\partial h^{(T)}}{\partial C^{(T)}} \frac{\partial C^{(T)}}{\partial \tilde{C}^{(T)}} \frac{\partial \tilde{C}^{(T)}}{\partial h^{(T-1)}} + \frac{\partial h^{(T)}}{\partial C^{(T)}} \frac{\partial C^{(T)}}{\partial f^{(T)}} \frac{\partial f^{(T)}}{\partial h^{(T-1)}} \right] \\ &= dh^{(T)} \tanh(C^{(T)}) \sigma'(W_{o} x^{(T)} + U_{o} h^{(T-1)} + b_{o}) \ {}^t U_{o} \\ & \quad + \ dh^{(T)} o^{(T)} \odot \tanh'(C^{(T)}) \tilde{C}^{(T)} \sigma'(W_{i} x^{(T)} + U_{i} h^{(T-1)} + b_{i}) \ {}^t U_{i} \\ & \quad + \ dh^{(T)} o^{(T)} \odot \tanh'(C^{(T)}) i^{(T)} \tanh'(W_{c} x^{(T)} + U_{c} h^{(T-1)} + b_{c}) \ {}^t U_{c} \\ & \quad + \ dh^{(T)} o^{(T)} \odot \tanh'(C^{(T)}) C^{(T-1)} \sigma'(W_{f} x^{(T)} + U_{f} h^{(T-1)} + b_{f}) \ {}^t U_{f} \tag{8.7} \\ \end{align} $$
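
The four-path sum of Eq. (8.7) can be sketched in NumPy and compared against a finite-difference estimate. The helper names `lstm_step` and `dh_prev_last_step`, the parameter dictionary `p`, the sizes and the random stand-in for $\text{grad}_{dense}^{(T)}$ are all hypothetical. A column-vector convention is assumed, so each product with ${}^t U$ becomes `U.T @ ...`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(p, x, h_prev, C_prev):
    """One forward step; the dict caches the values needed in the backward pass."""
    o = sigmoid(p["W_o"] @ x + p["U_o"] @ h_prev + p["b_o"])
    i = sigmoid(p["W_i"] @ x + p["U_i"] @ h_prev + p["b_i"])
    f = sigmoid(p["W_f"] @ x + p["U_f"] @ h_prev + p["b_f"])
    Ct = np.tanh(p["W_c"] @ x + p["U_c"] @ h_prev + p["b_c"])
    C = f * C_prev + i * Ct
    h = o * np.tanh(C)
    return h, C, dict(o=o, i=i, f=f, Ct=Ct, C=C, C_prev=C_prev)

def dh_prev_last_step(p, c, dh):
    """Eq. (8.7): gradient w.r.t. h^(T-1), given dh = dL/dh^(T)."""
    tanhC = np.tanh(c["C"])
    dC = dh * c["o"] * (1.0 - tanhC ** 2)               # dh * o * tanh'(C)
    dz_o = dh * tanhC * c["o"] * (1.0 - c["o"])         # path via o
    dz_i = dC * c["Ct"] * c["i"] * (1.0 - c["i"])       # path via i
    dz_c = dC * c["i"] * (1.0 - c["Ct"] ** 2)           # path via C~
    dz_f = dC * c["C_prev"] * c["f"] * (1.0 - c["f"])   # path via f
    return (p["U_o"].T @ dz_o + p["U_i"].T @ dz_i
            + p["U_c"].T @ dz_c + p["U_f"].T @ dz_f)

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
p = {}
for gate in "oifc":
    p["W_" + gate] = rng.normal(scale=0.5, size=(n_hid, n_in))
    p["U_" + gate] = rng.normal(scale=0.5, size=(n_hid, n_hid))
    p["b_" + gate] = rng.normal(scale=0.5, size=n_hid)
x = rng.normal(size=n_in)
h_prev = rng.normal(size=n_hid)
C_prev = rng.normal(size=n_hid)
dh_T = rng.normal(size=n_hid)        # stand-in for grad_dense^(T)

_, _, cache = lstm_step(p, x, h_prev, C_prev)
dh_Tm1 = dh_prev_last_step(p, cache, dh_T)

# Central differences on the scalar dh_T . h^(T) as a function of h^(T-1).
eps = 1e-6
numeric = np.zeros(n_hid)
for k in range(n_hid):
    e = np.zeros(n_hid)
    e[k] = eps
    h_plus, _, _ = lstm_step(p, x, h_prev + e, C_prev)
    h_minus, _, _ = lstm_step(p, x, h_prev - e, C_prev)
    numeric[k] = dh_T @ (h_plus - h_minus) / (2 * eps)
max_err = float(np.max(np.abs(numeric - dh_Tm1)))
```

A small `max_err` indicates that the analytic sum over the four paths agrees with the numerical derivative for this random instance.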

Similarly, $ dh^{(t)} $ can also be calculated recursively. For $ t \lt T-1 $, the cell state $C^{(t+1)}$ receives gradient not only from $h^{(t+1)}$ but also from $C^{(t+2)}$, so the three paths that pass through $C^{(t+1)}$ use the full cell-state gradient $dC^{(t+1)}$ of Eq. (8.6), whose recursion is given in Eq. (8.11). At $ t = T-1 $ the expression reduces to Eq. (8.7), because $dC^{(T)} = dh^{(T)} (o^{(T)} \odot \tanh'(C^{(T)}))$.

$$ dh^{(t)} = \begin{cases} \text{grad}_{dense}^{(t)} & t = T \\ \\ \begin{align} & dh^{(t+1)} \tanh(C^{(t+1)}) \sigma'(W_{o} x^{(t+1)} + U_{o} h^{(t)} + b_{o}) \ {}^t U_{o} \\ & \quad + \ dC^{(t+1)} \tilde{C}^{(t+1)} \sigma'(W_{i} x^{(t+1)} + U_{i} h^{(t)} + b_{i}) \ {}^t U_{i} \\ & \quad + \ dC^{(t+1)} i^{(t+1)} \tanh'(W_{c} x^{(t+1)} + U_{c} h^{(t)} + b_{c}) \ {}^t U_{c} \\ & \quad + \ dC^{(t+1)} C^{(t)} \sigma'(W_{f} x^{(t+1)} + U_{f} h^{(t)} + b_{f}) \ {}^t U_{f} \end{align} & 0 \le t \lt T \end{cases} \tag{8.8} $$

Fig.8-7 illustrates the relationship between $C^{(T)}, C^{(T-1)}, h^{(T)}$ and $h^{(T-1)}$, with $ o^{(T)}, i^{(T)}, \tilde{C}^{(T)}$ and $f^{(T)}$ omitted.

Fig.8-7: Relationship Between $C^{(T)}, C^{(T-1)}, h^{(T)}$ and $h^{(T-1)}$

Using $ dh^{(T)} $, we can calculate $ dC^{(T)} $ as follows:

$$ dC^{(T)} = \frac{\partial L}{\partial C^{(T)}} = \frac{\partial L}{\partial h^{(T)}} \frac{\partial h^{(T)}}{\partial C^{(T)}} = dh^{(T)} (o^{(T)} \odot \tanh'(C^{(T)})) \tag{8.9} $$

As shown in Fig.8-7, $ dC^{(T-1)} $ can be calculated from $ dh^{(T-1)}$ and $dC^{(T)}$ as:

$$ \begin{align} dC^{(T-1)} &= dh^{(T-1)} \frac{\partial h^{(T-1)}}{\partial C^{(T-1)}} + dC^{(T)} \frac{\partial C^{(T)}}{\partial C^{(T-1)}} \\ &= dh^{(T-1)} (o^{(T-1)} \odot \tanh'(C^{(T-1)})) + dC^{(T)} f^{(T)} \tag{8.10} \end{align} $$

Therefore, we can obtain $ dC^{(t)} $ as follows:

$$ dC^{(t)} = \begin{cases} dh^{(t)} (o^{(t)} \odot \tanh'(C^{(t)})) & t = T \\ \\ \begin{align} dh^{(t)} (o^{(t)} \odot \tanh'(C^{(t)})) + dC^{(t+1)} f^{(t+1)} \end{align} & 0 \le t \lt T \end{cases} \tag{8.11} $$
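
The two recursions can be run together from $t = T$ down to $t = 0$, as in the sketch below. The helper names, sizes, random data and the stand-in loss $g \cdot h^{(T)}$ (so that $dh^{(T)} = g$) are hypothetical. The three paths through the cell state use the full $dC^{(t+1)}$ of Eq. (8.6), which equals $dh^{(T)} (o^{(T)} \odot \tanh'(C^{(T)}))$ at the last step, so the first backward step matches Eq. (8.7). A finite-difference check on $dh^{(0)}$ is included as a sanity test.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(p, x, h_prev, C_prev):
    """One forward step; the dict caches the values reused in the backward pass."""
    o = sigmoid(p["W_o"] @ x + p["U_o"] @ h_prev + p["b_o"])
    i = sigmoid(p["W_i"] @ x + p["U_i"] @ h_prev + p["b_i"])
    f = sigmoid(p["W_f"] @ x + p["U_f"] @ h_prev + p["b_f"])
    Ct = np.tanh(p["W_c"] @ x + p["U_c"] @ h_prev + p["b_c"])
    C = f * C_prev + i * Ct
    h = o * np.tanh(C)
    return h, C, dict(o=o, i=i, f=f, Ct=Ct, C=C, C_prev=C_prev, h=h)

def backward_states(p, caches, grad_dense):
    """Run the dh / dC recursions from t = T down to t = 0."""
    T = len(caches) - 1
    dh, dC = [None] * (T + 1), [None] * (T + 1)
    dh[T] = grad_dense
    for t in range(T, -1, -1):
        c = caches[t]
        tanhC = np.tanh(c["C"])
        dC[t] = dh[t] * c["o"] * (1.0 - tanhC ** 2)          # via h^(t)
        if t < T:
            dC[t] = dC[t] + dC[t + 1] * caches[t + 1]["f"]   # via C^(t+1)
        if t > 0:
            dz_o = dh[t] * tanhC * c["o"] * (1.0 - c["o"])
            dz_i = dC[t] * c["Ct"] * c["i"] * (1.0 - c["i"])
            dz_c = dC[t] * c["i"] * (1.0 - c["Ct"] ** 2)
            dz_f = dC[t] * c["C_prev"] * c["f"] * (1.0 - c["f"])
            dh[t - 1] = (p["U_o"].T @ dz_o + p["U_i"].T @ dz_i
                         + p["U_c"].T @ dz_c + p["U_f"].T @ dz_f)
    return dh, dC

def replay_loss(p, xs, t, h_t, C_t, g):
    """Stand-in scalar loss g . h^(T), restarted from the state at step t."""
    h, C = h_t, C_t
    for x in xs[t + 1:]:
        h, C, _ = lstm_step(p, x, h, C)
    return float(g @ h)

rng = np.random.default_rng(0)
n_in, n_hid, steps = 3, 4, 5             # steps = T + 1
p = {}
for gate in "oifc":
    p["W_" + gate] = rng.normal(scale=0.5, size=(n_hid, n_in))
    p["U_" + gate] = rng.normal(scale=0.5, size=(n_hid, n_hid))
    p["b_" + gate] = rng.normal(scale=0.5, size=n_hid)
xs = rng.normal(size=(steps, n_in))
g = rng.normal(size=n_hid)               # stand-in for grad_dense^(T)

h, C, caches = np.zeros(n_hid), np.zeros(n_hid), []
for x in xs:
    h, C, c = lstm_step(p, x, h, C)
    caches.append(c)

dh, dC = backward_states(p, caches, g)

# Finite-difference estimate of dL/dh^(0), holding C^(0) fixed.
eps = 1e-6
num_dh0 = np.zeros(n_hid)
for k in range(n_hid):
    e = np.zeros(n_hid)
    e[k] = eps
    plus = replay_loss(p, xs, 0, caches[0]["h"] + e, caches[0]["C"], g)
    minus = replay_loss(p, xs, 0, caches[0]["h"] - e, caches[0]["C"], g)
    num_dh0[k] = (plus - minus) / (2 * eps)
max_err = float(np.max(np.abs(num_dh0 - dh[0])))
```

Keeping `dh` and `dC` as per-step lists mirrors the notation of Eqs. (8.5) and (8.6). An implementation that only needs the parameter gradients could keep just the most recent values.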

8.2.4. Gradients

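Using the results above, we finally obtain the gradients shown below. The expressions are consistent with zero initial states, $h^{(-1)} = 0$ and $C^{(-1)} = 0$: the $t = 0$ terms contain no $U h^{(t-1)}$ contribution, and the sums for the $U$ matrices and for the forget-gate parameters start at $t = 1$ because their $t = 0$ terms vanish.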

$$ \begin{align}
dU_{o} &\stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial U_{o}} = \sum_{t=1}^{T} dh^{(t)} \frac{\partial h^{(t)}}{\partial o^{(t)}} \frac{\partial o^{(t)}}{\partial U_{o}} \\
&= \sum_{t=1}^{T} dh^{(t)} \tanh(C^{(t)}) \sigma'(W_{o} x^{(t)} + U_{o} h^{(t-1)} + b_{o}) \ {}^t h^{(t-1)} \tag{8.12} \\
dW_{o} &\stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial W_{o}} = \sum_{t=0}^{T} dh^{(t)} \frac{\partial h^{(t)}}{\partial o^{(t)}} \frac{\partial o^{(t)}}{\partial W_{o}} \\
&= dh^{(0)} \tanh(C^{(0)}) \sigma'(W_{o} x^{(0)} + b_{o}) \ {}^t x^{(0)} + \sum_{t=1}^{T} dh^{(t)} \tanh(C^{(t)}) \sigma'(W_{o} x^{(t)} + U_{o} h^{(t-1)} + b_{o}) \ {}^t x^{(t)} \tag{8.13} \\
db_{o} &\stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial b_{o}} = \sum_{t=0}^{T} dh^{(t)} \frac{\partial h^{(t)}}{\partial o^{(t)}} \frac{\partial o^{(t)}}{\partial b_{o}} \\
&= dh^{(0)} \tanh(C^{(0)}) \sigma'(W_{o} x^{(0)} + b_{o}) + \sum_{t=1}^{T} dh^{(t)} \tanh(C^{(t)}) \sigma'(W_{o} x^{(t)} + U_{o} h^{(t-1)} + b_{o}) \tag{8.14} \\
dU_{i} &\stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial U_{i}} = \sum_{t=1}^{T} dC^{(t)} \frac{\partial C^{(t)}}{\partial i^{(t)}} \frac{\partial i^{(t)}} {\partial U_{i}} \\
&= \sum_{t=1}^{T} dC^{(t)} \tilde{C}^{(t)} \sigma'(W_{i} x^{(t)} + U_{i} h^{(t-1)} + b_{i}) \ {}^t h^{(t-1)} \tag{8.15} \\
dW_{i} &\stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial W_{i}} = \sum_{t=0}^{T} dC^{(t)} \frac{\partial C^{(t)}}{\partial i^{(t)}} \frac{\partial i^{(t)}} {\partial W_{i}} \\
&= dC^{(0)} \tilde{C}^{(0)} \sigma'(W_{i} x^{(0)} + b_{i}) \ {}^t x^{(0)} + \sum_{t=1}^{T} dC^{(t)} \tilde{C}^{(t)} \sigma'(W_{i} x^{(t)} + U_{i} h^{(t-1)} + b_{i}) \ {}^t x^{(t)} \tag{8.16} \\
db_{i} &\stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial b_{i}} = \sum_{t=0}^{T} dC^{(t)} \frac{\partial C^{(t)}}{\partial i^{(t)}} \frac{\partial i^{(t)}} {\partial b_{i}} \\
&= dC^{(0)} \tilde{C}^{(0)} \sigma'(W_{i} x^{(0)} + b_{i}) + \sum_{t=1}^{T} dC^{(t)} \tilde{C}^{(t)} \sigma'(W_{i} x^{(t)} + U_{i} h^{(t-1)} + b_{i}) \tag{8.17} \\
dU_{c} &\stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial U_{c}} = \sum_{t=1}^{T} dC^{(t)} \frac{\partial C^{(t)}}{\partial \tilde{C}^{(t)}} \frac{\partial \tilde{C}^{(t)}}{\partial U_{c}} \\
&= \sum_{t=1}^{T} dC^{(t)} i^{(t)} \tanh'(W_{c} x^{(t)} + U_{c} h^{(t-1)} + b_{c}) \ {}^t h^{(t-1)} \tag{8.18} \\
dW_{c} &\stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial W_{c}} = \sum_{t=0}^{T} dC^{(t)} \frac{\partial C^{(t)}}{\partial \tilde{C}^{(t)}} \frac{\partial \tilde{C}^{(t)}} {\partial W_{c}} \\
&= dC^{(0)} i^{(0)} \tanh'(W_{c} x^{(0)} + b_{c}) \ {}^t x^{(0)} + \sum_{t=1}^{T} dC^{(t)} i^{(t)} \tanh'(W_{c} x^{(t)} + U_{c} h^{(t-1)} + b_{c}) \ {}^t x^{(t)} \tag{8.19} \\
db_{c} &\stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial b_{c}} = \sum_{t=0}^{T} dC^{(t)} \frac{\partial C^{(t)}}{\partial \tilde{C}^{(t)}} \frac{\partial \tilde{C}^{(t)}} {\partial b_{c}} \\
&= dC^{(0)} i^{(0)} \tanh'(W_{c} x^{(0)} + b_{c}) + \sum_{t=1}^{T} dC^{(t)} i^{(t)} \tanh'(W_{c} x^{(t)} + U_{c} h^{(t-1)} + b_{c}) \tag{8.20} \\
dU_{f} &\stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial U_{f}} = \sum_{t=1}^{T} dC^{(t)} \frac{\partial C^{(t)}}{\partial f^{(t)}} \frac{\partial f^{(t)}} {\partial U_{f}} \\
&= \sum_{t=1}^{T} dC^{(t)} C^{(t-1)} \sigma'(W_{f} x^{(t)} + U_{f} h^{(t-1)} + b_{f}) \ {}^t h^{(t-1)} \tag{8.21} \\
dW_{f} &\stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial W_{f}} = \sum_{t=1}^{T} dC^{(t)} \frac{\partial C^{(t)}}{\partial f^{(t)}} \frac{\partial f^{(t)}} {\partial W_{f}} \\
&= \sum_{t=1}^{T} dC^{(t)} C^{(t-1)} \sigma'(W_{f} x^{(t)} + U_{f} h^{(t-1)} + b_{f}) \ {}^t x^{(t)} \tag{8.22} \\
db_{f} &\stackrel{\mathrm{def}}{=} \frac{\partial L}{\partial b_{f}} = \sum_{t=1}^{T} dC^{(t)} \frac{\partial C^{(t)}}{\partial f^{(t)}} \frac{\partial f^{(t)}} {\partial b_{f}} \\
&= \sum_{t=1}^{T} dC^{(t)} C^{(t-1)} \sigma'(W_{f} x^{(t)} + U_{f} h^{(t-1)} + b_{f}) \tag{8.23}
\end{align} $$
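
The sums in Eqs. (8.12) to (8.23) can be accumulated in a single backward loop, as sketched below, and then compared with finite differences of the loss. Several things here are assumptions rather than part of the derivation: the function names, the zero initial states, and a linear dense layer $y = v \cdot h^{(T)}$ with no bias or activation (Section 4.1 is not reproduced here). The three cell-state paths into $h^{(t-1)}$ use the full $dC^{(t)}$ of Eq. (8.6).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(p, xs, n_hid):
    """Run the LSTM over xs from zero initial states; cache every step."""
    h, C, caches = np.zeros(n_hid), np.zeros(n_hid), []
    for x in xs:
        o = sigmoid(p["W_o"] @ x + p["U_o"] @ h + p["b_o"])
        i = sigmoid(p["W_i"] @ x + p["U_i"] @ h + p["b_i"])
        f = sigmoid(p["W_f"] @ x + p["U_f"] @ h + p["b_f"])
        Ct = np.tanh(p["W_c"] @ x + p["U_c"] @ h + p["b_c"])
        h_prev, C_prev = h, C
        C = f * C_prev + i * Ct
        h = o * np.tanh(C)
        caches.append(dict(o=o, i=i, f=f, Ct=Ct, C=C,
                           C_prev=C_prev, h_prev=h_prev, x=x))
    return h, caches

def bptt(p, caches, grad_dense):
    """Accumulate the parameter gradients while running the dh / dC recursions."""
    grads = {k: np.zeros_like(v) for k, v in p.items()}
    dh, dC_next, f_next = grad_dense, 0.0, 0.0
    for c in reversed(caches):
        tanhC = np.tanh(c["C"])
        dC = dh * c["o"] * (1.0 - tanhC ** 2) + dC_next * f_next   # Eq. (8.11)
        dz = {
            "o": dh * tanhC * c["o"] * (1.0 - c["o"]),
            "i": dC * c["Ct"] * c["i"] * (1.0 - c["i"]),
            "c": dC * c["i"] * (1.0 - c["Ct"] ** 2),
            "f": dC * c["C_prev"] * c["f"] * (1.0 - c["f"]),
        }
        dh = np.zeros_like(dh)
        for gate, d in dz.items():
            grads["W_" + gate] += np.outer(d, c["x"])        # ... {}^t x^(t)
            grads["U_" + gate] += np.outer(d, c["h_prev"])   # ... {}^t h^(t-1)
            grads["b_" + gate] += d
            dh += p["U_" + gate].T @ d                       # gradient for h^(t-1)
        dC_next, f_next = dC, c["f"]
    return grads

def loss(p, v, xs, Y, n_hid):
    h, _ = forward(p, xs, n_hid)
    return 0.5 * (v @ h - Y) ** 2

rng = np.random.default_rng(0)
n_in, n_hid, steps = 3, 4, 5
p = {}
for gate in "oifc":
    p["W_" + gate] = rng.normal(scale=0.5, size=(n_hid, n_in))
    p["U_" + gate] = rng.normal(scale=0.5, size=(n_hid, n_hid))
    p["b_" + gate] = rng.normal(scale=0.5, size=n_hid)
v = rng.normal(size=n_hid)               # hypothetical dense-layer weights
xs = rng.normal(size=(steps, n_in))
Y = 0.3                                  # arbitrary target value

h_T, caches = forward(p, xs, n_hid)
grad_dense = (v @ h_T - Y) * v           # dL/dh^(T) for the linear dense layer
grads = bptt(p, caches, grad_dense)

# Gradient check: central differences on every parameter entry.
eps, max_err = 1e-6, 0.0
for name in p:
    for idx in np.ndindex(p[name].shape):
        old = p[name][idx]
        p[name][idx] = old + eps
        l_plus = loss(p, v, xs, Y, n_hid)
        p[name][idx] = old - eps
        l_minus = loss(p, v, xs, Y, n_hid)
        p[name][idx] = old
        max_err = max(max_err, abs((l_plus - l_minus) / (2 * eps) - grads[name][idx]))
```

Because $h^{(-1)}$ and $C^{(-1)}$ are zero in this sketch, the first time step adds nothing to the `U_*` gradients or to the forget-gate gradients, which corresponds to the sums that start at $t = 1$.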