8.1. Formulation of LSTM

The formulation of the LSTM is defined as follows:

$$ \begin{cases} f^{(t)} = \sigma(W_{f} x^{(t)} + U_{f} h^{(t-1)} + b_{f}) \\ i^{(t)} = \sigma(W_{i} x^{(t)} + U_{i} h^{(t-1)} + b_{i}) \\ \tilde{C}^{(t)} = \tanh(W_{c} x^{(t)} + U_{c} h^{(t-1)} + b_{c}) \\ C^{(t)} = f^{(t)} \odot C^{(t-1)} + i^{(t)} \odot \tilde{C}^{(t)} \\ o^{(t)} = \sigma(W_{o} x^{(t)} + U_{o} h^{(t-1)} + b_{o}) \\ h^{(t)} = o^{(t)} \odot \tanh(C^{(t)}) \end{cases} \tag{8.1} $$

Let the numbers of input nodes and hidden nodes be $ m $ and $ h $, respectively. Then:

  • $ x^{(t)} \in \mathbb{R}^{m} $ is the input at time $t$.
  • $f^{(t)} \in \mathbb{R}^{h} $ is the forget gate at time $t$.
  • $i^{(t)} \in \mathbb{R}^{h} $ is the input gate at time $t$.
  • $C^{(t)} \in \mathbb{R}^{h} $ is the cell state at time $t$.
  • $o^{(t)} \in \mathbb{R}^{h} $ is the output gate at time $t$.
  • $ h^{(t)} \in \mathbb{R}^{h} $ is the hidden state at time $t$.
  • $W_{i}, W_{f}, W_{o} \in \mathbb{R}^{h \times m} $ are the weight matrices for the input gate, forget gate, and output gate, respectively.
  • $U_{i},U_{f},U_{o} \in \mathbb{R}^{h \times h} $ are the recurrent weight matrices for the input gate, forget gate, and output gate, respectively.
  • $b_{i},b_{f},b_{o} \in \mathbb{R}^{h} $ are the bias vectors for the input gate, forget gate, and output gate, respectively.
  • $W_{c} \in \mathbb{R}^{h \times m} $ is the weight matrix for the candidate cell state.
  • $U_{c} \in \mathbb{R}^{h \times h} $ is the recurrent weight matrix for the candidate cell state.
  • $b_{c} \in \mathbb{R}^{h} $ is the bias vector for the candidate cell state.

Fig.8-2 illustrates the LSTM defined above:

Fig.8-2: LSTM Unit
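
To make Eq. (8.1) concrete, here is a minimal NumPy sketch of a single LSTM step. The weight shapes follow the definitions above; the function name `lstm_step` and the dictionary layout of the parameters are our own choices for illustration, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    """One step of Eq. (8.1).

    x_t:     input x^(t), shape (m,)
    h_prev:  hidden state h^(t-1), shape (h,)
    C_prev:  cell state C^(t-1), shape (h,)
    W, U, b: dicts keyed by 'f', 'i', 'c', 'o' holding
             W_* of shape (h, m), U_* of shape (h, h), b_* of shape (h,)
    """
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate
    C_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde                          # cell state update (⊙ is elementwise *)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    h_t = o_t * np.tanh(C_t)                                    # hidden state
    return h_t, C_t
```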

We add a dense layer that maps the final hidden state to the output. This makes it a many-to-one LSTM. See Fig.8-3.

Fig.8-3: Many-to-One LSTM

Let the number of output nodes be $ n $. The dense layer is then defined as follows:

$$ y^{(T)} = g(V h^{(T)} + c) \tag{8.2} $$

where:

  • $ V \in \mathbb{R}^{n \times h} $ is the weight matrix.
  • $ c \in \mathbb{R}^{n} $ is the bias term.
  • $ y^{(T)} \in \mathbb{R}^{n} $ is the output vector.
  • $ g(\cdot) $ is the activation function.
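
As a sketch of the many-to-one forward pass in Fig.8-3, the loop below feeds a sequence through `lstm_step` from the earlier sketch and applies Eq. (8.2) to the final hidden state. The zero initialization of $h^{(0)}$ and $C^{(0)}$ and the choice of softmax for $g(\cdot)$ are assumptions for illustration.

```python
def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def many_to_one_forward(xs, W, U, b, V, c, g=softmax):
    """Run the LSTM over a sequence xs of shape (T, m) and
    return y^(T) = g(V h^(T) + c) per Eq. (8.2)."""
    h_dim = b['f'].shape[0]
    h_t = np.zeros(h_dim)            # h^(0) = 0 (assumed)
    C_t = np.zeros(h_dim)            # C^(0) = 0 (assumed)
    for x_t in xs:                   # unroll over t = 1..T
        h_t, C_t = lstm_step(x_t, h_t, C_t, W, U, b)
    return g(V @ h_t + c)            # dense layer on the final hidden state only
```

Only the last hidden state reaches the dense layer, which is exactly what makes this a many-to-one architecture.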