10.2. Language translation helper modules
This section introduces two helper modules for language processing tasks in this document:
- LanguageTranslationHelper.py: This module is designed for machine translation and offers functionality for both a source and a target language.
- Tokenizer.py: This module is a subset of the LanguageTranslationHelper.py module, focused on language models and sequence classification tasks. It provides only the minimum functionality needed for a single language.
These modules encapsulate essential components like dictionaries, preprocessors, and tokenizers.
The following sections provide a brief overview of the LanguageTranslationHelper.py module.
10.2.1. load_dataset() function
The load_dataset() function reads a dataset file and returns four key components:
- source_sentences: A list of tokenized data in the source language, already converted into a numerical representation.
- target_sentences: A list of tokenized data in the target language, similarly converted into a numerical format.
- source_lang_tokenizer: A class that contains methods related to the source language tokenizer, such as dictionary and tokenizer methods.
- target_lang_tokenizer: Similar to source_lang_tokenizer, but for the target language.
Here’s an example of how to use the function:
from Common import LanguageTranslationHelper as lth
num_examples = 110000
(
    source_sentences,
    target_sentences,
    source_lang_tokenizer,
    target_lang_tokenizer,
) = lth.load_dataset(num_examples)
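As a quick sanity check (this snippet is not part of the module), you can inspect the returned objects; it only assumes the idx2word dictionaries described in section 10.2.3.1:
print("number of sentence pairs:", len(source_sentences))
print("source vocabulary size  :", len(source_lang_tokenizer.idx2word))
print("target vocabulary size  :", len(target_lang_tokenizer.idx2word))
print("first source sentence   :", source_sentences[0])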
10.2.2. source_sentences and target_sentences
As previously explained, source_sentences and target_sentences are lists of tokenized sentences in the source and target languages, respectively.
To ensure all sentences have the same length, special padding tokens (<pad>) are appended to the end of each sentence as needed.
Example:
Consider the 20th sentence in target_sentences: “Wait .”
When the maximum sentence length in target_sentences is 6, this sentence is tokenized as follows:
target_sentences[20] = [14 12 2 15 0 0] # <SOS> Wait. <EOS> <pad> <pad>
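The padding is performed inside load_dataset(). Conceptually, it can be sketched with a small helper such as the hypothetical pad_to_length() below, where 0 is the ID of <pad> (see the vocabulary in section 10.2.3.1):
def pad_to_length(token_ids, max_len, pad_id=0):
    # Append the <pad> ID until the sentence reaches max_len tokens.
    return token_ids + [pad_id] * (max_len - len(token_ids))

pad_to_length([14, 12, 2, 15], 6)  # -> [14, 12, 2, 15, 0, 0]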
10.2.3. source_lang_tokenizer and target_lang_tokenizer
The source_lang_tokenizer and target_lang_tokenizer classes provide dictionary and tokenizer methods for the source and target languages, respectively.
10.2.3.1. Dictionaries: idx2word[], word2idx[]
The dictionaries source_lang_tokenizer.idx2word and target_lang_tokenizer.idx2word map token IDs to their corresponding tokens in the source and target datasets, respectively.
The following Python code shows the vocabulary of a small target dataset:
>> for i in range(len(target_lang_tokenizer.idx2word)):
>>     print("{:>2} {}".format(i, target_lang_tokenizer.idx2word[i]))
>>
0 <pad>
1 !
2 .
3 ?
4 fire
5 go
6 help
7 hi
8 jump
9 on
10 run
11 stop
12 wait
13 who
14 <SOS>
15 <EOS>
The dictionaries source_lang_tokenizer.word2idx and target_lang_tokenizer.word2idx map tokens to their corresponding token IDs in the source and target datasets, respectively.
Below is an example showcasing the use of word2idx and idx2word dictionaries:
>> word = 'hi'
>> idx = target_lang_tokenizer.word2idx[word]
>> print("\ntarget_lang_tokenizer.word2idx['{}'] = {}".format(word, idx))
>> print("target_lang_tokenizer.idx2word[{}] = {}".format(idx, target_lang_tokenizer.idx2word[idx]))
>>
target_lang_tokenizer.word2idx['hi'] = 7
target_lang_tokenizer.idx2word[7] = hi
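The two dictionaries are inverses of each other: assuming word2idx behaves like a standard Python dict, idx2word can be reconstructed from it as shown in this short sketch:
idx2word = {idx: word for word, idx in target_lang_tokenizer.word2idx.items()}
assert idx2word[7] == 'hi'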
10.2.3.2. Tokenizer: tokenize(), detokenize()
The tokenize() method converts a plain text sentence into a numerical sequence.
The detokenize() method converts a numerical sequence back into a plain text sentence.
>> sentence = '<SOS> wait . <EOS> <pad> <pad>'
>> token = target_lang_tokenizer.tokenize(sentence)
>>
>> print("sentence:\"{}\" => {}".format(sentence, token))
sentence:"<SOS> wait . <EOS> <pad> <pad>" => [14, 12, 2, 15, 0, 0]
>> token = [14, 12, 2, 15, 0, 0]
>> sentence = target_lang_tokenizer.detokenize(token, with_pad=False, with_sos=True)
>> print("token:{} => \"{}\"".format(token, sentence))
>>
token:[14, 12, 2, 15, 0, 0] => "<SOS> Wait. <EOS>"
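Conceptually, these methods are simple lookups in the word2idx and idx2word dictionaries. The following is only a minimal sketch, not the module's actual implementation, which also restores capitalization and spacing as seen in the output above:
def tokenize(sentence, word2idx):
    # Split on whitespace and map each token to its ID.
    return [word2idx[w] for w in sentence.split()]

def detokenize(token_ids, idx2word, with_pad=False, with_sos=True):
    words = [idx2word[t] for t in token_ids]
    if not with_pad:
        words = [w for w in words if w != '<pad>']                   # drop padding tokens
    if not with_sos:
        words = [w for w in words if w not in ('<SOS>', '<EOS>')]    # drop sentence tags
    return " ".join(words)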
10.2.4. Preprocessor
This module provides a preprocessor function named preprocess_sentence() to prepare text for further processing.
import re
import unicodedata

def preprocess_sentence(w, no_tags=False):
    def __unicode_to_ascii(s):
        # Decompose accented characters and drop the combining marks.
        return "".join(c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn")

    w = __unicode_to_ascii(w.lower().strip())
    # Put a space before and after punctuation so it becomes a separate token.
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    # Collapse runs of spaces into a single space.
    w = re.sub(r'[" "]+', " ", w)
    # Replace every character that is not a letter, digit, or allowed punctuation with a space.
    w = re.sub(r"[^a-zA-Z0-9'?.!,¿]+", " ", w)
    w = w.strip()
    # Wrap the sentence in start/end-of-sentence tags unless no_tags is set.
    if not no_tags:
        w = "<SOS> " + w + " <EOS>"
    return w
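Based on this definition, the preprocessor lowercases the input, separates punctuation into its own token, and, unless no_tags=True, wraps the result in <SOS>/<EOS> tags:
>> print(preprocess_sentence("Wait!"))
<SOS> wait ! <EOS>
>> print(preprocess_sentence("Wait!", no_tags=True))
wait !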