11. Language Models

In Natural Language Processing (NLP), a language model is a model that assigns a probability to a sequence of words.

They are trained on a large corpus of text and learn the statistical relationships between words. Once trained, language models can be used for various tasks, such as:

  • Machine translation
  • Sentence classification
  • Text summarization
  • Question answering
  • Text generation

Formally, a language model can be represented as follows:

$$ P(w_{1}, w_{2},\ldots, w_{n}) \tag{11.1} $$

where $w_{i}$ is a word, and the sequence $w_{1}, w_{2},\ldots, w_{n}$ is a sentence.
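
For example, $P(\text{you}, \text{say}, \text{goodbye})$ is the probability that the words "you", "say", and "goodbye" occur in this order, that is, the probability of the sentence "you say goodbye".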

Probability: For an explanation of probability, see the Appendix.

By applying the chain rule of probability, which is explained in Appendix 2.4.3, equation $(11.1)$ can be expressed as follows:

$$ P(w_{1}, w_{2},\ldots, w_{n}) = P(w_{1}) P(w_{2}| w_{1}) P(w_{3}| w_{1},w_{2}) \cdots P(w_{n}| w_{1},w_{2},\ldots,w_{n-1}) \tag{11.2} $$
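
For instance, applying the chain rule to the three-word sentence "you say goodbye" gives:

$$ P(\text{you}, \text{say}, \text{goodbye}) = P(\text{you}) \, P(\text{say} \mid \text{you}) \, P(\text{goodbye} \mid \text{you}, \text{say}) $$

Each factor conditions on all the words that precede it.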

To simplify expressions, we introduce two notations as follows:

$$ \begin{align} w_{m:n} & \stackrel{\mathrm{def}}{=} w_{m},w_{m+1},\ldots,w_{n-1}, w_{n} \qquad (m \lt n) \tag{11.3} \\ w_{\lt n} & \stackrel{\mathrm{def}}{=} w_{1:n-1} = w_{1},w_{2},\ldots,w_{n-1} \tag{11.4} \end{align} $$
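
If it helps, these notations correspond directly to Python list slicing, apart from the shift from the 1-indexed mathematical notation to 0-indexed lists. A minimal sketch, using an illustrative word list:

```python
words = ["you", "say", "goodbye", "and", "i", "say", "hello"]

# w_{m:n} denotes words m through n inclusive (1-indexed), which is
# words[m-1:n] in Python's 0-indexed, half-open slicing.
m, n = 2, 4
w_m_to_n = words[m - 1:n]  # ['say', 'goodbye', 'and'] = w_2, w_3, w_4

# w_{<n} = w_{1:n-1}, i.e., all words before position n.
w_lt_n = words[:n - 1]     # ['you', 'say', 'goodbye'] = w_1, w_2, w_3
```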

Using $(11.3)$ and $(11.4)$, the language model in equation $(11.2)$ can be expressed as follows:

$$ \begin{align} P(w_{1:n}) &= P(w_{1}) P(w_{2}|w_{\lt 2}) P(w_{3}|w_{\lt 3}) P(w_{4}|w_{\lt 4}) \cdots P(w_{n}| w_{\lt n}) \tag{11.5} \\ &= P(w_{1}) \prod_{i=2}^{n} P(w_{i}| w_{\lt i}) \tag{11.6} \end{align} $$
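
As a concrete illustration of equation $(11.6)$, the following Python sketch computes a sentence probability as a product of conditional probabilities. The probability table and the sentence are made up purely for illustration; estimating these conditionals is exactly what the n-gram and RNN models of the following sections do. Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow for long sentences.

```python
import math

# A toy conditional probability table: each entry maps the prefix
# (w_1, ..., w_i) to P(w_i | w_<i). The values are made up for illustration.
cond_probs = {
    ("you",): 0.2,                   # P("you")
    ("you", "say"): 0.5,             # P("say" | "you")
    ("you", "say", "goodbye"): 0.1,  # P("goodbye" | "you", "say")
}

def sentence_log_prob(words):
    """Compute log P(w_1:n) = log P(w_1) + sum_{i=2}^{n} log P(w_i | w_<i)."""
    log_p = 0.0
    for i in range(1, len(words) + 1):
        log_p += math.log(cond_probs[tuple(words[:i])])
    return log_p

words = ["you", "say", "goodbye"]
print(math.exp(sentence_log_prob(words)))  # 0.2 * 0.5 * 0.1 = 0.01
```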

There is no need to fear these abstract expressions. In the following sections, we will explore two major language models, the n-gram language model and the RNN-based language model, using concrete examples.

N-gram models were the mainstream language models before the deep learning era; RNN-based models later replaced them and became the foundation of modern deep learning technology.

The subsequent sections explain the following topics: