9.5. Back Propagation Through Time in the Many-to-Many Type

The formulation of the Many-to-Many GRU with dense layers is defined as follows:

$$ \begin{cases} z^{(t)} = \sigma(W_{z} x^{(t)} + U_{z} h^{(t-1)} + b_{z}) \\ r^{(t)} = \sigma(W_{r} x^{(t)} + U_{r} h^{(t-1)} + b_{r}) \\ \hat{h}^{(t)} = \tanh(W x^{(t)} + ( r^{(t)} \odot h^{(t-1)}) U + b) \\ h^{(t)} = (1 - z^{(t)}) \odot h^{(t-1)} + z^{(t)} \odot \hat{h}^{(t)} \\ \hat{y}^{(t)} = V h^{(t)} + c \\ y^{(t)} = g(\hat{y}^{(t)}) \end{cases} \tag{9.17} $$
Fig.9-7: Many-to-Many GRU
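
As a concrete reference for expression $(9.17)$, the following is a minimal NumPy sketch of one forward time step. All names and shapes are illustrative assumptions: a column-vector convention is used throughout, so the term $(r^{(t)} \odot h^{(t-1)})U$ appears here as `U @ (r_t * h_prev)` with `U` shaped accordingly, and `g` stands for whatever output activation the dense layer uses.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_forward_step(x_t, h_prev, p, g=lambda a: a):
    """One time step of the many-to-many GRU in expression (9.17).

    x_t    : input vector x^(t)
    h_prev : previous hidden state h^(t-1)
    p      : dict holding the parameters Wz, Uz, bz, Wr, Ur, br, W, U, b, V, c
    g      : output activation of the dense layer (identity by default)
    """
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])            # update gate
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])            # reset gate
    h_hat_t = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * h_prev) + p["b"])   # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_hat_t                           # new hidden state
    y_hat_t = p["V"] @ h_t + p["c"]                                      # dense layer
    y_t = g(y_hat_t)                                                     # output at time t
    return z_t, r_t, h_hat_t, h_t, y_t
```

Running this step for $t = 0, \dots, T$ and caching the returned values is what makes the backward pass described below possible.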

9.5.1. Computing the gradients for Back Propagation Through Time

We use the mean squared error (MSE) as the loss function $L$, where $Y^{(t)}$ denotes the target value at time step $t$:

$$ L = \sum_{t=0}^{T} \frac{1}{2} (y^{(t)} - Y^{(t)})^{2} \tag{9.18} $$

For convenience, we define $L^{(t)}$, the loss value at time step $t$:

$$ L^{(t)} \stackrel{\mathrm{def}}{=} \frac{1}{2} (y^{(t)} - Y^{(t)})^{2} \tag{9.19} $$

Thus, the loss function $L$ can be represented as follows:

$$ L = \sum_{t=0}^{T} L^{(t)} \tag{9.20} $$
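
Expressed as code, $(9.18)$–$(9.20)$ amount to summing a per-step squared-error term over all time steps. A small sketch, assuming the predictions `ys` and targets `Ys` are lists of NumPy arrays indexed by time step (the squared error is summed over output components when $y^{(t)}$ is a vector):

```python
import numpy as np

def loss_per_step(y_t, Y_t):
    """L^(t) of expression (9.19): half the squared error at one time step."""
    return 0.5 * np.sum((y_t - Y_t) ** 2)

def total_loss(ys, Ys):
    """L of expression (9.20): the sum of L^(t) over t = 0..T."""
    return sum(loss_per_step(y_t, Y_t) for y_t, Y_t in zip(ys, Ys))
```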

To simplify the following discussion, we define the following expression:

$$ \text{grad}_{dense}^{(t)} \stackrel{\mathrm{def}}{=} \frac{\partial L^{(t)}}{\partial h^{(t)}} \tag{9.21} $$

$ \text{grad}_{dense}^{(t)}$ is the gradient propagated from the dense layer at time step $t$.
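
For reference, $\text{grad}_{dense}^{(t)}$ can be written out by applying the chain rule through the dense layer in $(9.17)$. Assuming $g$ is an elementwise activation, and using the same shorthand as in $(9.22)$ below (per-component factors written side by side, transposes written as ${}^tA$), a sketch of this gradient is:

$$ \text{grad}_{dense}^{(t)} = \frac{\partial L^{(t)}}{\partial y^{(t)}} \frac{\partial y^{(t)}}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial h^{(t)}} = (y^{(t)} - Y^{(t)}) \ g'(\hat{y}^{(t)}) \ {}^t V $$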

Using these expressions, we can build the backward computational graph. Fig.9-8 illustrates the relationship between $h^{(T)}$ and $h^{(T-1)}$.

Fig.9-8: Computational Graph Showing the Relationship Between $h^{(T)}$, $h^{(T-1)}$, and Dense Layer Outputs

For an explanation of computational graphs, see the Appendix.

As in the SimpleRNN case, we can derive $dh^{(t)}$ for a many-to-many GRU from expression $(9.7)$ as follows:

$$ dh^{(t)} = \begin{cases} \text{grad}_{dense}^{(t)} & t = T \\ \\ \begin{aligned} & \text{grad}_{dense}^{(t)} + dh^{(t+1)} (\hat{h}^{(t+1)} - h^{(t)}) \sigma'(W_{z} x^{(t+1)} + U_{z} h^{(t)} + b_{z}) \, {}^t U_{z} \\ & \quad + dh^{(t+1)} z^{(t+1)} \tanh'(W x^{(t+1)} + ( r^{(t+1)} \odot h^{(t)}) U + b) \, (r^{(t+1)} \, {}^t U) \\ & \quad + dh^{(t+1)} z^{(t+1)} \tanh'(W x^{(t+1)} + ( r^{(t+1)} \odot h^{(t)}) U + b) \, (h^{(t)} \, {}^t U) \, \sigma'(W_{r} x^{(t+1)} + U_{r} h^{(t)} + b_{r}) \, {}^t U_{r} \\ & \quad + dh^{(t+1)} (1 - z^{(t+1)}) \end{aligned} & 0 \le t \lt T \end{cases} \tag{9.22} $$
Note: To avoid confusion, in this section we express the transpose of a vector or matrix $A$ as ${}^tA$ instead of $A^{T}$, since the superscript $T$ already denotes the final time step.
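
To make $(9.22)$ concrete, here is a minimal NumPy sketch of the backward recursion for $dh^{(t)}$. It assumes the forward pass has cached $z^{(t)}$, $r^{(t)}$, $\hat{h}^{(t)}$, $h^{(t)}$, and the per-step gradients $\text{grad}_{dense}^{(t)}$ for $t = 0, \dots, T$, and that a column-vector convention is used, so the transposed matrices ${}^t U$, ${}^t U_z$, ${}^t U_r$ of $(9.22)$ appear as `U.T`, `Uz.T`, `Ur.T`. The derivatives $\sigma'(a) = \sigma(a)(1 - \sigma(a))$ and $\tanh'(a) = 1 - \tanh(a)^2$ are recovered from the cached activations. All names are illustrative.

```python
import numpy as np

def bptt_dh(grad_dense, z, r, h_hat, h, U, Uz, Ur):
    """Backward recursion of expression (9.22).

    grad_dense[t]              : gradient from the dense layer at step t (expression (9.21))
    z[t], r[t], h_hat[t], h[t] : values cached by the forward pass at step t
    U, Uz, Ur                  : recurrent weight matrices (column-vector convention)
    Returns the list dh[0..T].
    """
    T = len(grad_dense) - 1
    dh = [None] * (T + 1)
    dh[T] = grad_dense[T]                                     # base case: t = T
    for t in range(T - 1, -1, -1):
        # gradient reaching the candidate state h_hat^(t+1)
        grad_hhat = dh[t + 1] * z[t + 1] * (1.0 - h_hat[t + 1] ** 2)   # tanh'
        # gradient reaching the argument (r^(t+1) * h^(t)) of U
        grad_arg = U.T @ grad_hhat
        dh[t] = (
            grad_dense[t]
            # path through the update gate z^(t+1)
            + Uz.T @ (dh[t + 1] * (h_hat[t + 1] - h[t]) * z[t + 1] * (1.0 - z[t + 1]))
            # direct path through the candidate state h_hat^(t+1)
            + r[t + 1] * grad_arg
            # path through the reset gate r^(t+1)
            + Ur.T @ (h[t] * grad_arg * r[t + 1] * (1.0 - r[t + 1]))
            # direct carry path (1 - z^(t+1))
            + dh[t + 1] * (1.0 - z[t + 1])
        )
    return dh
```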

Finally, using the $dh^{(t)}$ defined here, we can obtain the gradients defined in expressions $(9.8)$–$(9.16)$.