10.1. Datasets

This document uses three kinds of datasets:

  • datasets for machine translation
  • datasets for language models
  • datasets for sentence classification

10.1.1. Dataset for Machine Translation

This document utilizes the spa-eng.zip dataset for Spanish-English machine translation.

Info

You do not need to download this file manually. It is downloaded automatically the first time you run the machine translation Python code (seq2seq-tf.py or seq2seq-tf-attention.py).

The repository for this document also contains this file.

The dataset contains 118,964 parallel Spanish-English sentence pairs in plain text format.

$ wc -l ~/.keras/datasets/spa-eng/spa.txt
  118964 /Users/hironobu/.keras/datasets/spa-eng/spa.txt

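Here are 10 lines from the file (lines 35,516 to 35,525). Each line holds an English sentence and its Spanish translation, separated by a tab: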

$ head -35525  ~/.keras/datasets/spa-eng/spa.txt  | tail -10
He traveled on business.	Él hizo un viaje de negocios.
He turned down my offer.	Él ha rechazado mi oferta.
He turned off the light.	Él apagó la luz.
He twirled his mustache.	Él se retorció el bigote.
He unbuttoned his shirt.	Se desabotonó la camisa.
He used to get up early.	Él solía levantarse temprano.
He wanted to destroy it.	Él quería destruirlo.
He wants a book to read.	Quiere un libro para leer.
He wants me to help him.	Él quiere que lo ayude.
He wants to talk to you.	Quiere hablar con usted.
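
As a rough illustration of how such lines could be read into sentence pairs, here is a minimal sketch. It assumes that every line has the form "English, tab, Spanish" seen in the excerpt above. The function name parse_pairs is hypothetical and is not taken from seq2seq-tf.py or seq2seq-tf-attention.py, whose preprocessing may differ.

```python
from typing import Iterable, List, Tuple


def parse_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split tab-separated lines into (english, spanish) tuples.

    Hypothetical helper: assumes the layout seen in the excerpt,
    i.e. the English sentence first, then a tab, then the Spanish one.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip lines that do not contain a tab
        pairs.append((parts[0], parts[1]))
    return pairs


# Two lines copied from the excerpt above, used as sample input.
sample = [
    "He traveled on business.\tÉl hizo un viaje de negocios.\n",
    "He turned off the light.\tÉl apagó la luz.\n",
]
pairs = parse_pairs(sample)
print(pairs[0])
```

To apply the same function to the real file, pass it an open file object instead of the sample list, for example one opened on ~/.keras/datasets/spa-eng/spa.txt with encoding="utf-8".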

10.1.2. Datasets for Language Models

To build the n-gram and RNN-based language models, we use several small English corpora extracted from the spa-eng.zip dataset mentioned earlier.

These corpora are available in the repository associated with this document.

$ ls -1 DataSets/small_vocabulary_sentences/eng-*txt
DataSets/small_vocabulary_sentences/eng-14.txt
DataSets/small_vocabulary_sentences/eng-150.txt
DataSets/small_vocabulary_sentences/eng-200.txt
DataSets/small_vocabulary_sentences/eng-250.txt
DataSets/small_vocabulary_sentences/eng-300.txt
DataSets/small_vocabulary_sentences/eng-41.txt
DataSets/small_vocabulary_sentences/eng-99.txt

For demonstration purposes, the smallest corpus, which has a vocabulary of only 14 words, is shown below.

$ cat DataSets/small_vocabulary_sentences/eng-14.txt
Go on in.
I can go.
Go for it.
I like it.
Tom is in.
Tom is up.
We can go.
I like Tom.
I like him.
I like you.
You like Tom.
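
The vocabulary size can be checked with a few lines of Python. The sketch below assumes a simple tokenization rule: lowercase each sentence and keep only runs of letters, so that "Go" and "go" count as the same word and the final period is dropped. This rule is an assumption made for illustration; the language-model scripts in the repository may tokenize differently.

```python
import re

# The sentences listed in eng-14.txt above.
sentences = [
    "Go on in.",
    "I can go.",
    "Go for it.",
    "I like it.",
    "Tom is in.",
    "Tom is up.",
    "We can go.",
    "I like Tom.",
    "I like him.",
    "I like you.",
    "You like Tom.",
]


def vocabulary(lines):
    """Return the set of distinct words.

    Assumed rule: lowercase the text and treat each run of letters
    (or apostrophes) as one word; punctuation is discarded.
    """
    words = set()
    for line in lines:
        words.update(re.findall(r"[a-z']+", line.lower()))
    return words


vocab = vocabulary(sentences)
print(len(vocab))      # 14
print(sorted(vocab))
```

Under this rule the count matches the 14 in the file name.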

10.1.3. Dataset for Sentence Classification

This document uses sentiment analysis as an example of a sentence classification task.

Section 12.2 explains the dataset in detail; only its first five lines are shown here for reference.

$ head -5 ../DataSets/sentiment_labelled_sentences/yelp_labelled.txt
Wow... Loved this place.	1
Crust is not good.	0
Not tasty and the texture was just nasty.	0
Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.	1
The selection on the menu was great and so were the prices.	1
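
For reference, lines in this layout could be split into a sentence and an integer label as sketched below. The sketch assumes that the label is the last tab-separated field on each line and, judging from the excerpt, that 1 marks a positive sentence and 0 a negative one. The function name parse_labelled is hypothetical; Section 12.2 describes how the document actually loads the data.

```python
from typing import Iterable, List, Tuple


def parse_labelled(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Split lines of the form 'sentence<TAB>label' into (text, label).

    Hypothetical helper: assumes the label is the last tab-separated
    field and is an integer (0 or 1 in the excerpt above).
    """
    examples = []
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        # Split on the last tab only, in case the sentence contains tabs.
        text, label = line.rsplit("\t", 1)
        examples.append((text.strip(), int(label)))
    return examples


# Two lines copied from the excerpt above, used as sample input.
sample = [
    "Wow... Loved this place.\t1\n",
    "Crust is not good.\t0\n",
]
examples = parse_labelled(sample)
print(examples)
```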