10.2. Language translation helper modules

This section introduces two helper modules for language processing tasks in this document:

  • LanguageTranslationHelper.py
    This module is designed for machine translation and provides functionality for both the source and target languages.

  • Tokenizer.py
    This module is a reduced version of LanguageTranslationHelper.py, intended for language models and sequence classification tasks. It provides only the minimal functionality needed for a single language.

These modules encapsulate essential components like dictionaries, preprocessors, and tokenizers.

The following sections provide a brief overview of the LanguageTranslationHelper.py module.

10.2.1. load_dataset() function

The load_dataset() function reads a dataset file and returns four key components:

  • source_sentences: A list of tokenized data in the source language, already converted into a numerical representation.
  • target_sentences: A list of tokenized data in the target language, similarly converted into a numerical format.
  • source_lang_tokenizer: A tokenizer object for the source language. It provides the dictionaries and the tokenizer methods described below.
  • target_lang_tokenizer: Similar to source_lang_tokenizer, but for the target language.

Here’s an example of how to use the function:

from Common import LanguageTranslationHelper as lth

num_examples = 110000
(source_sentences, target_sentences,
 source_lang_tokenizer, target_lang_tokenizer) = lth.load_dataset(num_examples)

10.2.2. source_sentences and target_sentences

As previously explained, source_sentences and target_sentences are lists of tokenized sentences in the source and target languages, respectively.

To ensure that all sentences have the same length, padding tokens (<pad>) are appended to the end of shorter sentences.

Consider the sentence at index 20 of target_sentences: “Wait.”

When the maximum sentence length of target_sentences is 6, this sentence is represented as follows:

target_sentences[20] = [14 12  2 15  0  0] # <SOS> Wait. <EOS> <pad> <pad>
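
The padding step can be sketched as follows. This is a minimal illustration rather than the module's actual implementation: pad_to_max_length is a hypothetical helper, the pad ID of 0 is taken from the vocabulary listing in this section, and the second example sentence is made up so that the maximum length comes out as 6.

```python
PAD_ID = 0  # assumption: <pad> maps to token ID 0, as in the vocabulary listing


def pad_to_max_length(sequences, pad_id=PAD_ID):
    """Right-pad every sequence with pad_id up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


sequences = [
    [14, 12, 2, 15],        # <SOS> wait . <EOS>
    [14, 11, 2, 5, 1, 15],  # <SOS> stop . go ! <EOS>  (made-up sentence of length 6)
]

padded = pad_to_max_length(sequences)
print(padded[0])
```

Sequences that already have the maximum length are returned unchanged, so only shorter sentences receive trailing pad IDs.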

10.2.3. source_lang_tokenizer and target_lang_tokenizer

The objects source_lang_tokenizer and target_lang_tokenizer provide the dictionaries and tokenizer methods for the source and target languages, respectively.

Dictionaries: idx2word[], word2idx[]

The dictionaries source_lang_tokenizer.idx2word and target_lang_tokenizer.idx2word map token IDs to their corresponding tokens in the source and target datasets, respectively.

The following Python code shows the vocabulary of a small target dataset:

>> for i in range(len(target_lang_tokenizer.idx2word)):
>>     print("{:>2} {}".format(i, target_lang_tokenizer.idx2word[i]))
 0 <pad>
 1 !
 2 .
 3 ?
 4 fire
 5 go
 6 help
 7 hi
 8 jump
 9 on
10 run
11 stop
12 wait
13 who
14 <SOS>
15 <EOS>

The dictionaries source_lang_tokenizer.word2idx and target_lang_tokenizer.word2idx map tokens to their corresponding token IDs in the source and target datasets, respectively.
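
One way such a pair of dictionaries could be built is sketched here. The function build_dictionaries is hypothetical, and the ordering rule (<pad> first, the remaining tokens sorted, <SOS> and <EOS> last) is only inferred from the vocabulary listing above; the module may construct them differently.

```python
def build_dictionaries(words):
    """Build idx2word and word2idx from an iterable of tokens.

    Assumed ordering: <pad> gets ID 0, regular tokens follow in sorted
    order, and <SOS> / <EOS> take the last two IDs.
    """
    vocab = ["<pad>"] + sorted(set(words)) + ["<SOS>", "<EOS>"]
    idx2word = dict(enumerate(vocab))
    word2idx = {word: idx for idx, word in idx2word.items()}
    return idx2word, word2idx


words = ["go", ".", "hi", "!", "run", "who", "?", "fire", "help",
         "jump", "stop", "wait", "go", "on", "."]

idx2word, word2idx = build_dictionaries(words)
print(word2idx["hi"], idx2word[word2idx["hi"]])
```

Because the two mappings are inverses of each other, looking a token up in word2idx and then in idx2word returns the original token.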

Below is an example showcasing the use of word2idx and idx2word dictionaries:

>> word = 'hi'
>> idx = target_lang_tokenizer.word2idx[word]
>> print("\ntarget_lang_tokenizer.word2idx['{}'] = {}".format(word, idx))
>> print("target_lang_tokenizer.idx2word[{}] = {}".format(idx, target_lang_tokenizer.idx2word[idx]))
target_lang_tokenizer.word2idx['hi'] = 7
target_lang_tokenizer.idx2word[7] = hi

Tokenizer: tokenize(), detokenize()

The tokenize() method converts a plain text sentence into a numerical sequence.

The detokenize() method converts a numerical sequence back into a plain text sentence.

>> sentence = '<SOS> wait . <EOS> <pad> <pad>'
>> token = target_lang_tokenizer.tokenize(sentence)
>> print("sentence:\"{}\" => {}".format(sentence, token))

sentence:"<SOS> wait . <EOS> <pad> <pad>" => [14, 12, 2, 15, 0, 0]

>> token = [14, 12, 2, 15, 0, 0]
>> sentence = target_lang_tokenizer.detokenize(token, with_pad=False, with_sos=True)
>> print("token:{} => \"{}\"".format(token, sentence))

token:[14, 12, 2, 15, 0, 0] => "<SOS> Wait. <EOS>"
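
To make the round trip concrete, here is a minimal sketch of a tokenizer with the same two method names. SimpleTokenizer is a hypothetical stand-in, not the module's class. The meaning and defaults of with_pad and with_sos are assumptions: here with_pad=False drops <pad> tokens and with_sos=False drops both <SOS> and <EOS>. Unlike the output shown above, the sketch does not restore capitalization or reattach punctuation.

```python
class SimpleTokenizer:
    """Hypothetical whitespace tokenizer backed by an idx2word / word2idx pair."""

    def __init__(self, vocab):
        self.idx2word = dict(enumerate(vocab))
        self.word2idx = {word: idx for idx, word in self.idx2word.items()}

    def tokenize(self, sentence):
        # Split on whitespace and look up each token's ID.
        return [self.word2idx[word] for word in sentence.split()]

    def detokenize(self, token_ids, with_pad=True, with_sos=True):
        words = [self.idx2word[idx] for idx in token_ids]
        if not with_pad:
            words = [w for w in words if w != "<pad>"]
        if not with_sos:  # assumption: this flag covers both <SOS> and <EOS>
            words = [w for w in words if w not in ("<SOS>", "<EOS>")]
        return " ".join(words)


vocab = ["<pad>", "!", ".", "?", "fire", "go", "help", "hi", "jump",
         "on", "run", "stop", "wait", "who", "<SOS>", "<EOS>"]
tokenizer = SimpleTokenizer(vocab)

ids = tokenizer.tokenize("<SOS> wait . <EOS> <pad> <pad>")
text = tokenizer.detokenize(ids, with_pad=False, with_sos=True)
print(ids, "=>", text)
```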

10.2.4. Preprocessor

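This module provides a preprocessor function, preprocess_sentence(), that normalizes raw text before tokenization. It lowercases the text, strips accents, separates punctuation from words, replaces unsupported characters with spaces, and, unless no_tags is set, wraps the result in <SOS> and <EOS> tags.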

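import re
import unicodedata

def preprocess_sentence(w, no_tags=False):
    def __unicode_to_ascii(s):
        # Decompose accented characters and drop the combining marks (category "Mn").
        return "".join(c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn")

    w = __unicode_to_ascii(w.lower().strip())
    # Put a space on both sides of each punctuation mark.
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    # Collapse runs of spaces and double quotes into a single space.
    w = re.sub(r'[" "]+', " ", w)
    # Replace every remaining unsupported character with a space.
    w = re.sub(r"[^a-zA-Z0-9'?.!,¿]+", " ", w)
    w = w.strip()
    if not no_tags:
        # Mark the start and end of the sentence.
        w = "<SOS> " + w + " <EOS>"
    return w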
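
The following walk-through applies the same steps one at a time to a sample sentence, so the effect of each step is visible. The sample sentence and the intermediate variable names are illustrative only, and the punctuation class is assumed to include the inverted question mark (¿).

```python
import re
import unicodedata

raw = "  ¿Cómo estás?  "  # illustrative input, not taken from the dataset

# Step 1: lowercase, trim, and drop combining accent marks (category "Mn").
lowered = raw.lower().strip()
unaccented = "".join(
    c for c in unicodedata.normalize("NFD", lowered)
    if unicodedata.category(c) != "Mn"
)

# Step 2: surround each punctuation mark with spaces.
spaced = re.sub(r"([?.!,¿])", r" \1 ", unaccented)

# Step 3: collapse runs of spaces (and double quotes) into one space.
collapsed = re.sub(r'[" "]+', " ", spaced)

# Step 4: replace any other character with a space, then trim.
cleaned = re.sub(r"[^a-zA-Z0-9'?.!,¿]+", " ", collapsed).strip()

# Step 5: add the start and end tags (the no_tags=False case).
tagged = "<SOS> " + cleaned + " <EOS>"

for name, value in [("unaccented", unaccented), ("spaced", spaced),
                    ("cleaned", cleaned), ("tagged", tagged)]:
    print("{:>10}: {!r}".format(name, value))
```

After these steps every token, including punctuation, is separated by exactly one space, so a whitespace split is enough to recover the tokens.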