# 2.4. Probability

Probability is one of the mathematical concepts that many people are not good at. However, to understand Transformer model, only simple knowledge of discrete probability is required.

### 2.4.1. Probability

We learned in elementary school that the probability of rolling a $1$ on a standard die 🎲 is $1/6$. This is because a die has six sides, and we assume that all outcomes are equally likely.

Now, we represent that $\Omega$ as a finite set that contains all outcomes, and $x$ as an element of the set $\Omega$. In the above example, $\Omega = $ { $ x | x = 1, 2, 3, 4, 5, 6$ }.

Generally, if all outcomes in a finite set $\Omega$ are equally likely, the probability of $x$ is the number of outcomes in $x$ divided by the total number of outcomes:

$$ P(x) = \frac{|x|}{|\Omega|} $$When $x = 1$, $P(x = 1)$ can be obtained as follows:

$$ P(x = 1) = \frac{|\{1\}|}{|\{1,2,3,4,5,6\}|} = \frac{1}{6} $$##### 2.4.1.1. Empirical Probability

In real-world scenarios, outcomes may not always be equally likely. The probability of an event can only be obtained by observation.

The probability based on observation is called the *empirical probability*.
The empirical probability of an event is calculated by dividing the number of times the event occurred by the total number of trials:

I show an example using a virtual dice.

Here is the result of tossing this dice 30,000 times:

```
$ python Dice.py
========================================
P(Event) = |event|/|trial| = Probability
----------------------------------------
P(1) => 5129 / 30000 = 0.17097
P(2) => 4987 / 30000 = 0.16623
P(3) => 5036 / 30000 = 0.16787
P(4) => 5064 / 30000 = 0.16880
P(5) => 4940 / 30000 = 0.16467
P(6) => 4844 / 30000 = 0.16147
----------------------------------------
```

Using the result, we can calculate the probabilities of all possible events.

$$ \begin{align} \text{P(x = 1)} = \frac{5129}{30000} = 0.17097 \\ \text{P(x = 2)} = \frac{4987}{30000} = 0.16623 \\ \text{P(x = 3)} = \frac{5036}{30000} = 0.16787 \\ \text{P(x = 4)} = \frac{5064}{30000} = 0.16880 \\ \text{P(x = 5)} = \frac{4940}{30000} = 0.16467 \\ \text{P(x = 6)} = \frac{4844}{30000} = 0.16147 \end{align} $$### 2.4.2. Conditional Probability

In probability theory, conditional probability, denoted by $P(x_{2} | x_{1})$, represents the probability of event $x_{2}$ occurring given that event $x_{1}$ has already happened.

For a finite set $\Omega $ of equally likely outcomes, and events $x_{1}$ and $x_{2}$ represented by subsets $\Omega$, the conditional probability of $x_{2}$ given $x_{1}$ is:

$$ P(x_{2}|x_{1}) = \frac{P(x_{1} \cap x_{2})}{P(x_{1})} \tag{2.1} $$###### Example:

The simplest example is when we toss an ideal dice and get an odd number, then find the probability that it is $1$. In this case, event $x_{1} = $ {$x|x = 1,3,5 $} and event $x_{2} = $ {$x|x = 1$}, so $P(x_{1})$ and $P(x_{1} \cap x_{2})$ are as follows:

$$ \begin{align} P(x_{1}) &= \frac{|x_{1}|}{|\Omega|} = \frac{|\{1,3,5 \}|}{|\{1,2,3,4,5,6 \}|} = \frac{3}{6} \\ P(x_{1} \cap x_{2}) &= \frac{|x_{1} \cap x_{2}|}{|\Omega|} = \frac{| \{1,3,5\} \cap \{1\} |}{|\{1,2,3,4,5,6 \}|} = \frac{| \{1\} |}{|\{1,2,3,4,5,6 \}|} = \frac{1}{6} \end{align} $$Therefore, $P(x_{2}|x_{1})$ is obtained as follows:

$$ P(x_{2}|x_{1}) = \frac{P(x_{1} \cap x_{2})}{P(x_{1})} = \frac{\frac{1}{6}} {\frac{3}{6}} = \frac{1}{3} $$### 2.4.3. Chain Rule

There is also the chain rule in probability, so let’s derive it.

First, to obtain $P(x_{1} \cap x_{2})$, we multiply both sides of the conditional probability formula $(2.1)$ by $P(x_{1})$, so we get the following expression:

$$ P(x_{1} \cap x_{2}) = P(x_{2}|x_{1}) P(x_{1}) \tag{2.2} $$Next, to extend the expression $(2.2)$ to the three-variable version $P(x_{1} \cap x_{2} \cap x_{3})$, we can replace $x_{2}$ with $x_{3}$ and $x_{1}$ with $x_{1} \cap x_{2}$:

$$ \begin{align} P(x_{1} \cap x_{2} \cap x_{3}) &= P(x_{3}| x_{2} \cap x_{1}) P(x_{2} \cap x_{1}) \\ &= P(x_{3}| x_{2} \cap x_{1}) P(x_{2}|x_{1}) P(x_{1}) \end{align} $$By performing similar operations recursively, we can obtain the chain rule shown below:

$$ \begin{align} P(x_{1} \cap x_{2} \cap \cdots \cap x_{n-1} \cap x_{n}) &= P(x_{n}|x_{1} \cap \cdots \cap x_{n-1}) P(x_{n-1}|x_{1} \cap \cdots \cap x_{n-2}) \cdots P(x_{2}|x_{1}) P(x_{1}) \\ &= P(x_{1}) \prod_{i=2}^{n} P(x_{i} | x_{1} \cap \cdots \cap x_{i-1}) \tag{2.3} \end{align} $$To simplify, the following notation is often used:

$$ x_{m},x_{m+1},\ldots,x_{n-1}, x_{n} = x_{m} \ \cap x_{m+1} \ \cap \cdots \cap \ x_{n-1} \cap \ x_{n} \qquad (m \lt n) $$Using the introduced notation, the chain rule can be expressed as follows:

$$ \begin{align} P(x_{1}, x_{2}, \ldots ,x_{n-1}, x_{n}) &= P(x_{n}|x_{1}, x_{2}, \ldots , x_{n-1}) P(x_{n-1}|x_{1}, \ldots , x_{n-2}) \cdots P(x_{2}|x_{1}) P(x_{1}) \\ &= P(x_{1}) \prod_{i=2}^{n} P(x_{i} | x_{1}, \ldots , x_{i-1}) \tag{2.4} \end{align} $$