13.2. Implementation

Complete Python code is available at: seq2seq-tf.py

13.2.1. Create Dataset

We create two datasets: one for training and one for validation.

The training dataset includes pairs of source and target language sentences:

source_sentences: Training sentences of the source language.
target_sentences: Training sentences of the target language.

Similarly, the validation dataset includes pairs of source and target language sentences:

source_validate_sentences: Validation sentences of the source language.
target_validate_sentences: Validation sentences of the target language.

# ========================================
# Create Dataset
# ========================================

num_examples = 110000
(
    source_tensor,
    target_tensor,
    source_lang_tokenizer,
    target_lang_tokenizer,
) = lth.load_dataset(num_examples)

(
    source_sentences,
    source_validate_sentences,
    target_sentences,
    target_validate_sentences,
) = train_test_split(source_tensor, target_tensor, test_size=0.1)


BUFFER_SIZE = len(source_sentences)
BATCH_SIZE = 64
N_BATCH = BUFFER_SIZE // BATCH_SIZE

dataset = tf.data.Dataset.from_tensor_slices(
    (source_sentences, target_sentences)
).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
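
The batching arithmetic above can be checked without TensorFlow. The following is a minimal sketch in plain Python, not the tutorial's code: make_batches is a hypothetical helper that imitates shuffling followed by batching with drop_remainder=True on a toy list of sentence pairs.

```python
import random


def make_batches(pairs, batch_size, seed=0):
    """Shuffle (source, target) pairs and keep only full batches."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_batch = len(shuffled) // batch_size  # same arithmetic as N_BATCH above
    return [shuffled[i * batch_size:(i + 1) * batch_size] for i in range(n_batch)]


# Toy data: 10 sentence pairs with a batch size of 4.
# Two full batches are produced and the 2 leftover pairs are dropped.
pairs = [(f"src-{i}", f"tgt-{i}") for i in range(10)]
batches = make_batches(pairs, batch_size=4)
print(len(batches), [len(b) for b in batches])  # prints: 2 [4, 4]
```

By the same arithmetic, if the training split holds 99000 pairs (90% of 110000), then N_BATCH = 99000 // 64 = 1546, and 56 pairs are left out of each pass over the data.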

13.2.2. Create Model

Our model consists of two components: an encoder and a decoder.

# ========================================
# Create model
# ========================================

embedding_dim = 256
units = 1024  # LSTM/GRU dimensionality of the output space.

encoder = Encoder(source_lang_tokenizer.vocab_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(target_lang_tokenizer.vocab_size, embedding_dim, units, BATCH_SIZE)

optimizer = tf.compat.v1.train.AdamOptimizer()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(reduction='none')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0)) # this masks '<pad>'
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)

The loss_function() is explained in Section 13.3.1.2.
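
To make the masking concrete, here is a small worked example in plain Python. It is a sketch under simplifying assumptions rather than the TensorFlow implementation: it handles a single sequence instead of a batch, the probabilities are made up, and masked_loss is a hypothetical name.

```python
import math


def masked_loss(real, pred):
    """real: token ids (0 = '<pad>'); pred: one probability list per position."""
    losses = []
    for token_id, probs in zip(real, pred):
        ce = -math.log(probs[token_id])       # sparse categorical cross-entropy
        mask = 0.0 if token_id == 0 else 1.0  # zero out '<pad>' positions
        losses.append(ce * mask)
    # Like tf.reduce_mean, divide by ALL positions, padded or not.
    return sum(losses) / len(losses)


real = [2, 1, 0, 0]  # two real tokens followed by two '<pad>' tokens
pred = [
    [0.1, 0.1, 0.8],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
    [0.9, 0.05, 0.05],
]
loss = masked_loss(real, pred)  # (-ln 0.8 - ln 0.5) / 4
print(round(loss, 4))
```

Only the first two positions contribute to the sum, so whatever the model predicts at padded positions has no effect on the loss. Note that the mean is still taken over all four positions, as in loss_function() above.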

13.2.2.1. Encoder

The encoder has almost the same architecture as the sentiment analysis model discussed in Section 12.3.2, but it does not include a dense layer.

Word Embedding Layer: The input $x$ is passed through the word embedding layer before being fed into the GRU layer.
Many-to-One GRU Layer: We set $\text{return_sequences}=\text{False}$ and $\text{return_state}=\text{False}$. With these settings, the layer returns only its last output, which for a GRU is the final hidden state ($\text{output}$).

#
# Encoder
#
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(
            self.enc_units,
            return_sequences=False,
            return_state=False,
            recurrent_initializer="glorot_uniform",
        )

    def call(self, x):
        x = self.embedding(x)
        output = self.gru(x)
        return output

13.2.2.2. Decoder

The decoder has almost the same architecture as the language model discussed in Section 11.3.1.1.

The only difference is that it returns both the entire sequence of hidden states ($\text{output}$) and the final state ($\text{state}$), which is used in the translation phase.

Word Embedding Layer: This layer is equivalent to the encoder's word embedding layer.
Many-to-Many GRU Layer: We set $\text{return_sequences}=\text{True}$ and $\text{return_state}=\text{True}$ to return the entire sequence of hidden states ($\text{output}$) and the final state ($\text{state}$).
Dense Layer: Its output size equals the size of the vocabulary, and its softmax activation provides a probability for each word in the vocabulary.
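
The difference between the encoder's many-to-one setting and the decoder's many-to-many setting can be illustrated with a toy recurrence. This is only a sketch: toy_gru is a hypothetical scalar cell with made-up weights, not the Keras GRU implementation, and it reuses the same inputs for both calls purely for brevity.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def toy_gru(inputs, initial_state=0.0, return_sequences=False, return_state=False):
    """Scalar GRU-like recurrence with fixed, made-up weights (illustration only)."""
    h = initial_state
    states = []
    for x in inputs:
        z = sigmoid(0.5 * x + 0.5 * h)                # update gate
        r = sigmoid(0.3 * x - 0.2 * h)                # reset gate
        candidate = math.tanh(0.8 * x + 0.6 * r * h)  # candidate state
        h = z * h + (1.0 - z) * candidate             # new hidden state
        states.append(h)
    output = states if return_sequences else states[-1]
    return (output, h) if return_state else output


xs = [1.0, -0.5, 0.25]

# Encoder setting: a single value per sequence (the final hidden state).
context = toy_gru(xs)

# Decoder setting: one hidden state per time step, plus the final state.
outputs, state = toy_gru(
    xs, initial_state=context, return_sequences=True, return_state=True
)
print(len(outputs), outputs[-1] == state)  # prints: 3 True
```

The last element of the returned sequence is the same value as the final state, which is why the encoder can hand its single output to the decoder as the initial hidden state.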

#
# Decoder
#
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(
            self.dec_units,
            return_sequences=True,
            return_state=True,
            recurrent_initializer="glorot_uniform",
        )
        self.softmax = tf.keras.layers.Dense(vocab_size, activation="softmax")

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(inputs=x, initial_state=hidden)
        output = self.softmax(output)
        return output, state

13.2.3. Training

The training process involves the following steps:

The encoder processes the source-language sentence, and then passes the final hidden state as the context vector to the decoder.
The decoder takes the target-language sentence (excluding the last token, denoted as $\text{dec_input}$) and the context vector as inputs. It predicts the next token in the sequence and continues this process to generate a series of predicted tokens (denoted as $\text{predictions}$).
The loss_function() computes the difference between the predictions, $\text{predictions}$, and the expected output sentences (ground truth labels, denoted as $\text{expected_dec_output}$).
Using the loss computed in the previous step, the optimizer adjusts the model’s internal parameters (weights and biases) to minimize the overall loss value.
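
The one-token shift between the decoder input and the expected output in the second and third steps can be sketched in plain Python. This is an illustration only: it uses one sentence written as word strings, whereas the training code slices a batch of integer token ids.

```python
# One padded target sentence, written as words instead of integer ids.
target = ["<sos>", "this", "is", "a", "pen", ".", "<eos>", "<pad>"]

dec_input = target[:-1]           # everything except the last token
expected_dec_output = target[1:]  # everything except the first token

# At step t the decoder reads dec_input[t] and should predict
# expected_dec_output[t], which is the next token of the sentence.
for step, (given, expected) in enumerate(zip(dec_input, expected_dec_output)):
    print(step, given, "->", expected)
```

Because the decoder always receives the ground-truth previous token rather than its own prediction, all time steps can be processed in a single call during training.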

@tf.function
def train(encoder, decoder, source_sentences, target_sentences, target_lang_tokenizer):

    with tf.GradientTape() as tape:
        context_vector = encoder(source_sentences)

        # Input sentences:           e.g. ['<sos>', 'this', 'is', 'a', 'pen', '.', '<eos>']
        dec_input = target_sentences[:, :-1]
        # Expected output sentences: e.g. ['this', 'is', 'a', 'pen', '.', '<eos>', '<pad>']
        expected_dec_output = target_sentences[:, 1:]

        predictions, _ = decoder(dec_input, context_vector)
        loss = loss_function(expected_dec_output, predictions)
        train_accuracy(expected_dec_output, predictions)

    batch_loss = loss / int(target_sentences.shape[1])
    variables = encoder.variables + decoder.variables
    gradients = tape.gradient(loss, variables)

    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss


#
# Set n_epochs to at least 20 when training.
#
# If n_epochs = 0, this model uses the trained parameters saved in the last checkpoint,
# allowing you to perform machine translation without retraining.
if len(sys.argv) == 2:
    n_epochs = int(sys.argv[1])
else:
    n_epochs = 25


for epoch in range(1, n_epochs + 1):

    total_loss = 0
    train_accuracy.reset_states()

    for (batch, (source_sentences, target_sentences)) in enumerate(dataset):

        batch_loss = train(encoder, decoder, source_sentences, target_sentences, target_lang_tokenizer)
        total_loss += batch_loss

13.2.4. Translation

During the translation phase, we employ a decoding strategy called greedy search.

Similar to the training phase, the encoder processes source sentences and forwards context vectors to the decoder.

However, during translation, the decoder operates in a step-by-step manner:

Initialization: The decoder starts with the context vector as its initial hidden state and the start-of-sequence token <SOS> as its first input.
Prediction: The decoder then predicts the next token by choosing the one with the highest probability, given the current input token and the previous hidden state.
Iteration: The predicted token becomes the new input for the next step. This process repeats until the decoder generates the end-of-sequence token <EOS>.
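
The three steps above can be sketched with a toy model in plain Python. Everything here is an assumption for illustration: TOY_MODEL is a made-up table of next-token probabilities keyed only by the previous token, whereas the real decoder also conditions on its hidden state.

```python
# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL = {
    "<SOS>": {"who": 0.7, "is": 0.2, "<EOS>": 0.1},
    "who": {"is": 0.8, "that": 0.1, "<EOS>": 0.1},
    "is": {"that": 0.6, "who": 0.1, "<EOS>": 0.3},
    "that": {"?": 0.9, "<EOS>": 0.1},
    "?": {"<EOS>": 1.0},
}


def greedy_decode(model, max_length=10):
    token = "<SOS>"  # Initialization: start-of-sequence token
    result = []
    for _ in range(max_length):
        probs = model[token]
        token = max(probs, key=probs.get)  # Prediction: most likely next token
        result.append(token)
        if token == "<EOS>":  # stop at the end-of-sequence token
            break
        # Iteration: `token` becomes the input of the next pass.
    return result


print(" ".join(greedy_decode(TOY_MODEL)))  # prints: who is that ? <EOS>
```

The max_length bound plays the same role as the loop limit in the translation code: it guarantees termination even if <EOS> is never predicted.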

# ========================================
# Translation
# ========================================

def evaluate(sentence, encoder, decoder, source_lang_tokenizer, target_lang_tokenizer):

    sentence = lth.preprocess_sentence(sentence)
    inputs = source_lang_tokenizer.tokenize(sentence)

    inputs = tf.compat.v1.keras.preprocessing.sequence.pad_sequences(
        [inputs], maxlen=source_lang_tokenizer.max_length, padding="post"
    )
    inputs = tf.convert_to_tensor(inputs)

    result = ""

    context_vector = encoder(inputs)
    dec_hidden = context_vector
    dec_input = tf.expand_dims([target_lang_tokenizer.word2idx["<SOS>"]], 0)

    for t in range(target_lang_tokenizer.max_length):
        #
        # Greedy Search
        #
        predictions, dec_hidden = decoder(dec_input, dec_hidden)
        predicted_id = tf.argmax(predictions[0][0]).numpy()
        result += target_lang_tokenizer.idx2word[predicted_id] + " "
        if target_lang_tokenizer.idx2word[predicted_id] == "<EOS>":
            return result

        dec_input = tf.expand_dims([predicted_id], 0)

    return result


def translate(sentence, encoder, decoder, source_lang_tokenizer, target_lang_tokenizer):
    result = evaluate(sentence, encoder, decoder, source_lang_tokenizer, target_lang_tokenizer)
    return result.capitalize()


#
# Translate 10 randomly chosen validation sentences.
#
keys = np.arange(len(source_validate_sentences))
keys = np.random.permutation(keys)[:10]

for i, key in enumerate(keys):
    print("===== [{}] ======".format(i + 1))
    sentence = source_lang_tokenizer.detokenize(source_validate_sentences[key], with_pad=False)
    result = translate(sentence, encoder, decoder, source_lang_tokenizer, target_lang_tokenizer)
    print("Input    : {}".format(sentence))
    print("Predicted: {}".format(result))
    print("Correct  : {}".format(target_lang_tokenizer.detokenize(target_validate_sentences[key], with_pad=False)))

13.2.5. Demonstration

Following 25 epochs of training, here are some examples of our model’s translation outputs:

$ python seq2seq-tf.py

===== [1] ======
Input    :  te dije que tom estaba listo.
Predicted: I told tom that i could . <eos>
Correct  :  i told you tom was ready.
===== [2] ======
Input    :  el no tiene computador.
Predicted: He doesn't have any enemies . <eos>
Correct  :  he doesn't have a computer.
===== [3] ======
Input    :  eso es muy grande.
Predicted: That's very good . <eos>
Correct  :  that's very big.
===== [4] ======
Input    :  ¿ tienes algo que hacer manana a esta hora ?
Predicted: Do you have a pen time to anyone ? <eos>
Correct  :  what will you be doing at this time tomorrow ?
===== [5] ======
Input    :  ¿ hay algo que deberia saber ?
Predicted: Is there anything ? <eos>
Correct  :  is there anything i should know ?
===== [6] ======
Input    :  yo solo queria otra oportunidad.
Predicted: I knew him to read . <eos>
Correct  :  i just wanted another chance.
===== [7] ======
Input    :  ¿ que hay de nosotros ?
Predicted: What is happening ? <eos>
Correct  :  what about us ?
===== [8] ======
Input    :  todos querian que lo hiciera.
Predicted: We should consider pause . <eos>
Correct  :  everybody wanted me to do it.
===== [9] ======
Input    :  ¿ quien es ?
Predicted: Who is that ? <eos>
Correct  :  who is that ?
===== [10] ======
Input    :  ¿ tu no tienes calor ?
Predicted: Aren't your friend ? <eos>
Correct  :  aren't you hot ?

Other examples:

===== [1] ======
Input    :  su voz suena muy bello.
Predicted: Your nose is a cup . <eos>
Correct  :  her voice sounds very beautiful.
===== [2] ======
Input    :  no nos gusta la lluvia.
Predicted: You don't know the truth . <eos>
Correct  :  we don't like rain.
===== [3] ======
Input    :  nos gusta la lluvia.
Predicted: We all the rules . <eos>
Correct  :  we like rain.

Info

This code includes a checkpoint function that saves the training progress. Once the model has been trained, you can therefore run the translation task without retraining by setting the parameter $\text{n_epochs}$ to $0$, or simply by passing $0$ as the command-line argument, as shown below:

$ python seq2seq-tf.py 0

As the examples show, the model makes some attempts to translate the sentences, but the results are not yet accurate. This is partly due to the limitations of greedy search, the simple decoding strategy used here. By today’s standards, this mechanism is overly simplistic and requires further development.
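
One limitation of greedy search is that committing to the most likely token at each step does not necessarily yield the most likely sequence. The sketch below shows this with a hypothetical two-step model whose probabilities are made up for the purpose; it is not derived from the trained model.

```python
from itertools import product

# Hypothetical two-step model: P(first) and P(second | first).
FIRST = {"a": 0.5, "b": 0.4, "c": 0.1}
SECOND = {
    "a": {"x": 0.4, "y": 0.3, "z": 0.3},
    "b": {"x": 0.9, "y": 0.05, "z": 0.05},
    "c": {"x": 0.4, "y": 0.3, "z": 0.3},
}


def sequence_prob(first, second):
    return FIRST[first] * SECOND[first][second]


# Greedy: commit to the best first token, then the best second token.
g1 = max(FIRST, key=FIRST.get)
g2 = max(SECOND[g1], key=SECOND[g1].get)
greedy = (g1, g2)

# Exhaustive: score every two-token sequence and keep the best one.
best = max(product(FIRST, "xyz"), key=lambda pair: sequence_prob(*pair))

print(greedy, round(sequence_prob(*greedy), 2))  # prints: ('a', 'x') 0.2
print(best, round(sequence_prob(*best), 2))      # prints: ('b', 'x') 0.36
```

Exhaustive scoring is only feasible for toy vocabularies like this one; it is shown here solely to expose the gap that greedy search can leave.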

In the next chapter, we will explore the attention mechanism, a key technique in modern AI that can significantly improve translation accuracy. Following that, Part 4 will delve into Transformer models.