15.1. Encoder-Decoder Model

The Transformer architecture consists of encoder and decoder modules, each built with multiple layers¹.

Fig.15-2: Transformer Model: Encoder and Decoder

Looking inside the encoder and decoder, you will find residual connections. These connections, explained in Section, help prevent the exploding and vanishing gradient problems that can arise when stacking multiple layers.
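The residual connection described above can be sketched in a few lines. This is a minimal NumPy illustration, not the book's own code; the stand-in sublayer is a simple scaling chosen purely for demonstration.

```python
import numpy as np

def sublayer(x):
    # Stand-in for any sublayer (attention or feed-forward);
    # a simple scaling is used here purely for illustration.
    return 0.5 * x

def with_residual(x, sublayer_fn):
    # The residual (skip) connection adds the layer's input to its output,
    # giving gradients a direct path past the sublayer when many layers
    # are stacked.
    return x + sublayer_fn(x)

x = np.ones(4)
y = with_residual(x, sublayer)
print(y)  # [1.5 1.5 1.5 1.5]
```

Because the identity path bypasses the sublayer, the gradient of the output with respect to the input always contains a direct term of 1, which is what keeps gradients from vanishing or exploding as depth grows.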

Fig.15-3: (Left) Diagram Highlighting Residual Connections. (Right) Diagram Without Residual Connections

By simplifying the diagram and removing residual connections, we can see the core structure of each layer: an encoder layer consists of a multi-head attention layer followed by a feed-forward layer, while a decoder layer has two multi-head attention layers and a feed-forward layer.
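The layer structure just described can be sketched as follows. This is a hedged, single-headed NumPy approximation (real implementations use multiple heads, masking, dropout, and learned per-feature normalization parameters); all function names and dimensions here are illustrative assumptions, not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # Normalize across the feature dimension (the "Normalize" in Add & Normalize).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def attention(q, k, v):
    # Scaled dot-product attention, the core of each multi-head attention
    # layer (shown single-headed here for brevity).
    d = q.shape[-1]
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feed_forward(x, w1, w2):
    # Position-wise feed-forward network: two linear maps with a ReLU between.
    return np.maximum(0.0, x @ w1) @ w2

def encoder_layer(x, w1, w2):
    # Self-attention, then feed-forward, each wrapped in Add & Normalize.
    x = layer_norm(x + attention(x, x, x))
    return layer_norm(x + feed_forward(x, w1, w2))

def decoder_layer(x, memory, w1, w2):
    # Self-attention (causal mask omitted for brevity), cross-attention over
    # the encoder output, then feed-forward.
    x = layer_norm(x + attention(x, x, x))
    x = layer_norm(x + attention(x, memory, memory))
    return layer_norm(x + feed_forward(x, w1, w2))

d_model, d_ff, src_len, tgt_len = 8, 16, 6, 5
w1 = rng.normal(size=(d_model, d_ff))
w2 = rng.normal(size=(d_ff, d_model))
memory = encoder_layer(rng.normal(size=(src_len, d_model)), w1, w2)
out = decoder_layer(rng.normal(size=(tgt_len, d_model)), memory, w1, w2)
print(out.shape)  # (5, 8) -- the sequence and feature shapes are preserved
```

Note how the decoder layer contains exactly two attention calls and one feed-forward call, matching the structure described above, while the encoder layer has one of each.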

Simplified in this way, the Transformer model is less complex than it first appears.

The Add and Normalize layers of the residual connections, as well as the decoder's final linear and Softmax layers, are straightforward and need no further explanation here. The following sections explore the remaining components, the multi-head attention and feed-forward layers, in detail.
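For completeness, the decoder's final linear and Softmax layers amount to projecting each decoder output vector onto the vocabulary and normalizing the scores into probabilities. A minimal sketch, with hypothetical dimensions and weights:

```python
import numpy as np

def softmax(z):
    # Subtracting the row maximum first is for numerical stability.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d_model, vocab_size = 8, 10                  # illustrative sizes, not from the text
w_out = rng.normal(size=(d_model, vocab_size))  # hypothetical projection weights
h = rng.normal(size=(3, d_model))            # decoder outputs for 3 positions
probs = softmax(h @ w_out)                   # one probability row per position
print(probs.sum(axis=-1))  # [1. 1. 1.]
```

Each row of `probs` is a distribution over the vocabulary, from which the next output token can be chosen.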

  1. The original Transformer paper used six encoder layers and six decoder layers in its model. ↩︎