# Course:CPSC522/Probability

## Probability

Tossed coin.

Probability is a measure of the likelihood that something is to happen. Many events can’t be predicted with total certainty. The best we can say is how likely they are to happen, using the idea of probability, which is quantified as a number between 0 and 1 (where 0 indicates impossibility and 1 indicates certainty). The higher the probability of an event, the more certain the event will occur. A classic example is the toss of a coin. When a coin is tossed, there are two possible outcomes, heads or tails. Since those two outcomes are equally probable, we can say that the probability of “heads” equals the probability of “tails”, so the probability is ½ chance of either “heads” or “tails”. There are two categories of probability problems, the finite case and infinite case, and this page will only consider finite case.

Principal Author: Junyuan
Collaborators: Samprity, Ke Dai, Arthur Sun

## Abstract

Decision making in AI often involves accounting for the relative importance of various goals and the likelihood that they will be achieved. A logic is often needed in the form of a domain so that different states are accounted for in a logically exhaustive manner. Probability helps summarize the uncertainty the agent has about the world. The page defines the possible world semantics of probability and conditional probability. To help the reader understand the content, this page will first introduce probability theorems, and give an interesting example - the Dutch Books.

### Builds on

Probability is defined in terms of Propositional Logic because it considers the probability of one proposition conditioned on another proposition.

### More general than

Probability - general semantics considers the case where there are infinitely many possible worlds either because of real-valued random variables or infinitely many random variables, including paradoxes of probability and solutions to these paradoxes.

Graphical Models considers models that incorporate useful independence assumptions.

## Content

### Basic Theories

#### Probability Space

In probability theory, a probability space is a mathematical model of a real world experiment. A probability space is constructed based on a specific kind of experiment. Each time a situation of that experiment arises, the outcome and the probabilities will be the same as legacy situation.

The probability space is formed by three elements:

1. A sample space, ${\displaystyle \Omega }$, which is the set of all possible outcomes.
2. A set of events, ${\displaystyle {\mathcal {F}}}$, where each event is a set containing zero or more outcomes.
3. The assignment of probabilities to the events; that is, a function ${\displaystyle P}$ from events to probabilities.

A single execution's result of the mathematical model is called an outcome. Sometimes people groups the outcomes together to form a more complex events, which form may vary, however, we use the ${\displaystyle {\mathcal {F}}}$ to present a set of such events. Finally, there is a need to specify each event's likelihood of happening. This is done using the probability measure function, ${\displaystyle P}$.

Once this mathematical model is established. Every time that we run this model, it will select a single outcome, ${\displaystyle \Omega }$, from the sample space ${\displaystyle {\mathcal {F}}}$. All the event in , which is a set of events, that contain the selected outcome can be notated as “have occurred”. If we perform the former introduced process infinite number of times, the frequencies of the occurrence will be close to the result that calculated by the function ${\displaystyle P}$.

The next three parts will explain those three elements.

#### Sample Space

In probability theory, the sample space of an experiment is the set of all possible results of that experiment. People always use ${\displaystyle \Omega }$ to refer to a sample space. A well-defined sample space is one of three basic elements in a probabilistic model, the other two are a well-defined set of possible events and a probability assigned to every event.

Sample spaces can also be divided into two categories, countable or uncountable, here are 3 examples:

1. A toss of a coin: ${\displaystyle \Omega =\{Head,Tail\}}$
2. The number of success in a sequence of n identical and independent trails.
3. The lifetime of a light bulb[1]: ${\displaystyle \Omega =\{x\in \mathbb {R} |x\geq 0\}}$

For many tests, there maybe more than one sample space available, all of which are depending on what result is more interested to the tester. For instance, when drawing a card from a standard deck of playing cards, one reasonable sample space could be the various rank, for instance (2, 3, 4 …. J, K, A), while another could be the suits (hearts, spades, diamonds, clubs).

#### Event

As introduced before, the well-defined set of possible events is one of the basic elements in a probabilistic model. In probability theory, an event is a set of outcomes of a test, and a probability will be assigned to those outcomes of an experiment.

A Venn diagram of an event. A is the sample space and B is an event. By the ratio of their areas, the probability of B is approximately 0.1

Here is an example to explain the possible events: If we assemble a deck of playing cards, and draw a single card from the deck, then the sample space is 52 elements set, for every card is a possible result. However, an event is a set of outcomes of an experiment (a subset of the sample space), which including any singleton set (elementary event, which only contains a single outcome in the sample space), the empty set (with probability zero) and the sample space itself (with probability one). Therefore, the potential events in this example might be:

1. The 6 of Club (1 element)
2. A Queen (4 elements)
4. A card (52 elements)

Defining all subsets of the sample space as events work well when there are only finitely many outcomes, but problems will raise when the sample space is infinite, and this case will not be discussed in this page.

#### Probability Measure

The third basic elements in a probabilistic model is a probability assigned to every event. The way that people gets the probability is using the probability measure. A probability measure over the spaces is a function ${\displaystyle \mu }$ from sets of spaces into the non-negative real number, and it has 2 properties:

1. ${\displaystyle \mu }$ must return results in the unit interval ${\displaystyle {[0,1]}}$, returning 0 for the empty set and 1 for the entire space.
2. If ${\displaystyle \Omega _{1}}$ and ${\displaystyle \Omega _{2}}$ are disjoint sets of space, then ${\displaystyle \mu (\Omega _{1}\cup \Omega _{2})=\mu (\Omega _{1})+\mu (\Omega _{2})}$

The probability of an event ${\displaystyle A}$ can be detonated by the following equation:

${\displaystyle \mathrm {P} (A)={\frac {|A|}{|\Omega |}}}$

For instance, given three elements 1, 2 and 3 with probabilities ${\displaystyle {\frac {1}{4}}}$, ${\displaystyle {\frac {1}{4}}}$ and ${\displaystyle {\frac {1}{2}}}$, the value assigned to event ${\displaystyle \{1,3\}}$ is:

${\displaystyle {\frac {1}{4}}+{\frac {1}{2}}={\frac {3}{4}}}$.

If we want to measure the probability of proposition ${\displaystyle \alpha }$, denoted as ${\displaystyle \mathrm {P} (\alpha )}$, in the sample space of ${\displaystyle \omega }$, we need to use equation[2] :

${\displaystyle \mathrm {P} (\alpha )=\mu (\{\omega :\omega \models \alpha \})}$

Where ${\displaystyle \omega \models \alpha }$ represents ${\displaystyle \alpha }$ is true in space ${\displaystyle \omega }$.

### Axioms of Probability

The first part gives a general definition of probability. This part we will introduce three axioms for probability. In the probability theory, the probability ${\displaystyle \mathrm {P} }$ of some event ${\displaystyle \mathrm {E} }$, denoted ${\displaystyle \mathrm {P} (E)}$, is usually defined such that ${\displaystyle \mathrm {P} }$ satisfies the Kolmogorov axioms, which are named after Andrey Kolmogorov. These assumptions can be summarised as follows: Let ${\displaystyle {(\Omega ,\mathrm {F} ,\mathrm {P} )}}$ be a measure space with ${\displaystyle \mathrm {P} (\Omega )=1}$[3]. Then ${\displaystyle {(\Omega ,\mathrm {F} ,\mathrm {P} )}}$ is a probability space, with sample space ${\displaystyle \Omega }$, event space ${\displaystyle \mathrm {F} }$ and probability measure ${\displaystyle \mathrm {P} }$.

#### First axiom

The probability of an event cannot be a negative number:

${\displaystyle P(E)\in \mathbb {R} ,P(E)\geq 0\qquad \forall E\in F}$

where ${\displaystyle P}$ is the event space. In particular, ${\displaystyle P(E)}$ is always finite, in contrast with more general measure theory, which will not be discussed in this page.

#### Second axiom

The probability that some elementary event in the entire sample space will occur is 1. More specifically, there are no elementary events outside the sample space.

${\displaystyle P(\Omega )=1.}$

#### Third axiom

Any countable sequence of disjoint sets ${\displaystyle E_{1},E_{2}...}$ satisfies

${\displaystyle P\left(\bigcup _{i=1}^{\infty }E_{i}\right)=\sum _{i=1}^{\infty }P(E_{i}).}$

#### Consequence

Based on those three axioms, one can deduce three useful rules for probability calculations

1. The probability of the empty set: ${\displaystyle P(\varnothing )=0.}$
2. Monotonicity: ${\displaystyle \quad {\text{if}}\quad A\subseteq B\quad {\text{then}}\quad P(A)\leq P(B).}$
3. The numeric bound[4]: ${\displaystyle 0\leq P(E)\leq 1\qquad \forall E\in F.}$

Those axioms are also valid for finite discrete random variables.

### Conditional Probability

In probability theory, a conditional probability measures the probability of an event given that assumption another event has occurred. If the event of ${\displaystyle A}$ and the event ${\displaystyle B}$ is known or assumed to have occurred, the notation of ${\displaystyle P(A|B)}$ represents ‘the conditional probability of ${\displaystyle A}$ given ${\displaystyle B}$’.

For example, the probability that any person has a cough on a given day maybe only 10%, but if we know or assume this person has a fever, then they are much more likely to be coughing, therefore, ‘the conditional probability of coughing given having fever” might be higher than 90%.

According to the Kolmogorov’s definition of conditional probability:

Given two events ${\displaystyle A}$ and ${\displaystyle B}$ from the probability space with ${\displaystyle P(B)>0}$, the conditional probability of ${\displaystyle A}$ given ${\displaystyle B}$ is defined as the quotient of the probability of the joint of events ${\displaystyle A}$ and ${\displaystyle B}$, and the probability of ${\displaystyle B}$.

${\displaystyle P(A|B)={\frac {P({A}\cap {B})}{P(B)}}}$

Here is an example to explain the conditional probability:

Suppose that somebody rolls 1 six-sided dice, and we have to predict the outcome,

1. Let A be the value rolled on the first time
2. Let B be the value rolled on the second time

The probability that ${\displaystyle A=3}$ is ${\displaystyle P(A=3)={\frac {1}{6}}.}$

A\B 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12

The probability that ${\displaystyle A+B\leq {5}}$ is ${\displaystyle P(A+B\leq {5})={\frac {10}{36}}.}$

A\B 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12

The probability that ${\displaystyle A=3}$ given that ${\displaystyle A+B\leq {5}}$ is

${\displaystyle P(A=3|A+B\leq {5})={\frac {P((A=3)\cap {(A+B<=5))}}{P(A+B\leq {5})}}={\frac {\frac {2}{36}}{\frac {10}{36}}}=0.2}$
A\B 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12

#### Chain Rule

One of the important usage of the conditional probability equation is to decompose the conjunctions, which permits the calculation of any member of the joint distribution of a set of variables.

To find the value of a joint distribution of ${\displaystyle A_{1},\ldots ,A_{n}}$, we can use the conditional probability to obtain:

${\displaystyle \mathrm {P} (A_{n},\ldots ,A_{1})=\mathrm {P} (A_{n}|A_{n-1},\ldots ,A_{1})\cdot \mathrm {P} (A_{n-1},\ldots ,A_{1})}$

Repeating this process with each final term creates the product:

${\displaystyle \mathrm {P} \left(\bigcap _{k=1}^{n}A_{k}\right)=\prod _{k=1}^{n}\mathrm {P} \left(A_{k}\,{\Bigg |}\,\bigcap _{j=1}^{k-1}A_{j}\right)}$

For instance, by using the chain rule we can calculate the joint probability of ${\displaystyle \mathrm {P} (A_{4},A_{3},A_{2},A_{1})}$[5]

${\displaystyle \mathrm {P} (A_{4},A_{3},A_{2},A_{1})=\mathrm {P} (A_{4}\mid A_{3},A_{2},A_{1})\cdot \mathrm {P} (A_{3}\mid A_{2},A_{1})\cdot \mathrm {P} (A_{2}\mid A_{1})\cdot \mathrm {P} (A_{1})}$

#### Bayes’ theorem

A geometric visualisation of Bayes' theorem. In the table, the values w, x, y and z give the relative weights of each corresponding condition and case. The figures denote the cells of the table involved in each metric, the probability being the fraction of each figure that is shaded.

In probability theory, Bayes’ theorem describes the probability of an event, as for conditional probability, it cannot be assumed that ${\displaystyle P(A|B)\approx P(B|A)}$. This can be an insidious error, even for those who are highly conversant with statistics. However, the relationship between ${\displaystyle P(A|B)}$ and ${\displaystyle P(B|A)}$ is given by Bayes' theorem:

${\displaystyle P(A|B)={\frac {P(A|B)P(B)}{P(A)}}}$

By using Bayes’ theorem, an agent will know how to update its belief in a proposition based on a new piece of evidence.

For example: Suppose we want to know a person's probability of having cancer, but we know nothing about him or her. Despite not knowing anything about that person, a probability can be assigned based on the general prevalence of cancer, and we suppose it is 1%, ${\displaystyle P(A)}$, in this example. This is known as the base rate or prior probability of having cancer.

In order to apply knowledge of that person's age in conjunction with Bayes' Theorem, two additional pieces of information are needed.

1. The probability of being 65 years old. Suppose it is 0.2%, ${\displaystyle P(B)}$.
2. The probability that a person with cancer is 65 years old. Suppose it is 0.5%, ${\displaystyle P(B|A)}$.

Knowing this, along with the base rate, we can calculate that a person who is age 65 has a probability of having cancer equal to ${\displaystyle P(A|B)={\frac {P(A|B)P(B)}{P(A)}}={\frac {0.5\%*1\%}{0.2\%}}=2.5\%}$

It may come as a surprise that even though being 65 years old increases the risk of having cancer, that person's probability of having cancer is still very low. This is because the base rate of cancer is low. This illustrates the importance of base rate. Base rate negletleads to serious misinterpretation of statistics; therefore, special care should be taken to avoid such mistakes.

### Independence

The definition of Independence in probability theory is: two events are independent if the occurrence of one does not affect the probability of the other. For two events A and B, if and only if their joint probability equals the product of their probabilities:

${\displaystyle P(A\cap B)=P(A)P(B)}$

Why this notation of independence is made clear by rewriting with conditional probabilities:

${\displaystyle P(A\cap B)=P(A)P(B)\Leftrightarrow P(A)={\frac {P(A)P(B)}{P(B)}}={\frac {P(A\cap B)}{P(B)}}=P(A|B)}$

In reality, the concept of independence extends to dealing with collections of more than two events or random variables.

${\displaystyle P(A_{1}\cap A_{2}\cap \dots A_{n})=P(A_{1})P(A_{2})\dots P(A_{n})}$

Example: Suppose that somebody rolls 1 six-sided dice, the event of getting a 6 the first time a die is rolled and the event of getting a 6 the second time are independent. However, the event of getting a 6 the first time and the event that the sum of the numbers seen on the first and second are not independent.

### Conditional Independence

As for conditional independence, two random variables ${\displaystyle X}$ and ${\displaystyle Y}$ are conditionally independent given ${\displaystyle Z}$, which means given knowledge that ${\displaystyle Z}$ occurs, knowledge of ${\displaystyle Y}$ provides no information on provability of ${\displaystyle X}$, and knowledge of ${\displaystyle X}$ will not provide information about the ${\displaystyle Y}$.

In the standard notation, X and Y are conditionally independent given Z if and only if

${\displaystyle P(X\cap Y|Z)=P(X|Z)P(Y|Z)}$ or ${\displaystyle P(X|Y\cap Z)=P(X|Z)}$

Here is an example[6] :

A box contains two coins: a regular coin and one fake two-headed coin (${\displaystyle P(H)=1}$). If a person choose a coin at random and toss it twice, and define the following events.

1. ${\displaystyle A}$: First coin toss results in an ${\displaystyle H}$.
2. ${\displaystyle B}$: Second coin toss results in an ${\displaystyle H}$.
3. ${\displaystyle C}$: Regular coin has been selected.

Note that ${\displaystyle A}$ and ${\displaystyle B}$ are not independent, but they are conditionally independent given ${\displaystyle C}$, for if given ${\displaystyle C}$, the ${\displaystyle P(A)}$ and ${\displaystyle P(B)}$ will be fixed value.

First we can calculate ${\displaystyle P(A|C)=P(B|C)={\frac {1}{2}}}$. Also given the regular coin is selected, we have ${\displaystyle P(A\cap B|C)={\frac {1}{2}}*{\frac {1}{2}}={\frac {1}{4}}}$.

If we want to find ${\displaystyle P(A)}$, ${\displaystyle P(B)}$, and ${\displaystyle P(A\cap B)}$, we have to use the law of total probability.

{\displaystyle {\begin{aligned}P(A)&=P(A|C)P(C)+P(A|\neg C)P(\neg C)\\[8pt]&={\frac {1}{2}}*{\frac {1}{2}}+1*{\frac {1}{2}}\\[8pt]&={\frac {3}{4}}\end{aligned}}}

By using the same method we can calculate the ${\displaystyle P(B)={\frac {3}{4}}}$.

As for ${\displaystyle P(A\cap B)}$

{\displaystyle {\begin{aligned}P(A\cap B)&=P(A\cap B|C)P(C)+P(A\cap B|\neg C)P(\neg C)\\[8pt]&=P(A|C)P(B|C)P(C)+P(A|\neg C)P(B|\neg C)P(\neg C)\\[8pt]&={\frac {1}{2}}*{\frac {1}{2}}*{\frac {1}{2}}+1*1*{\frac {1}{2}}\\[8pt]&={\frac {5}{8}}\end{aligned}}}

Note because ${\displaystyle A}$ and ${\displaystyle B}$ are conditional independent given ${\displaystyle C}$, therefore, ${\displaystyle P(A\cap B|C)P(C)=P(A|C)P(B|C)P(C)}$

### Interesting Example

One of the most interesting usage of probability theory in real life is gambling. Dutch Book is a classic example. Imagining in the gambling game, people can get profit regardless of the outcome of the gamble, and this is the Dutch Book, which is caused by the implied probability is not coherent with the actual probability.

In one example, a bookmaker has offered the following odds and attracted one bet on each horse.

Horse number Offered odds Implied
probability
Bet Price Bookie Pays
if Horse Wins
1 Even ${\displaystyle {\frac {1}{1+1}}=0.5}$ $100$100 stake + $100 2 3 to 1 against ${\displaystyle {\frac {1}{3+1}}=0.25}$$50 $50 stake +$150
3 4 to 1 against ${\displaystyle {\frac {1}{4+1}}=0.2}$ $40$40 stake + $160 4 9 to 1 against ${\displaystyle {\frac {1}{9+1}}=0.1}$$20 $20 stake +$180
Total: 1.05 Total: $210 Always:$200

As we can see from this table, the implied probabilities, i.e. probability of each horse winning, add up to a number greater than 1, which is 105%.

If 4 gamblers bet together to make sure that they can get $200, the total amount of money that they need is$210, hence whichever horse wins in this example, they will loss $10 in the race. But things will be changed, if horse 4 was withdrawn from the race, and the bookmaker forget to adjust the odds, then the implied probability would add up to 0.95. Under this circumstance, a gambler only need to pay$190 to make sure that he can earn $200, always reap a profit of$10.

With competitive fixed-odds gambling being offered electronically, gamblers can sometimes create a Dutch book by selecting the best odds from different bookmakers, which increase their potential to utilize the incoherent probability and get profit regardless of the outcome.

For instance, in the same horse racing game, if another bookmaker offered the following odds:

Horse number Offered odds Implied
probability
1 3 to 1 against ${\displaystyle {\frac {1}{3+1}}=0.25}$
2 Even ${\displaystyle {\frac {1}{1+1}}=0.5}$
3 4 to 1 against ${\displaystyle {\frac {1}{4+1}}=0.2}$
4 9 to 1 against ${\displaystyle {\frac {1}{9+1}}=0.1}$

People can bet horse 1 on bookmaker 2, and bet the reset of horse on bookmaker 1, and the total implied probability will be 0.8, which means if the gambler pays $160, they can earn$200 regardless the final result. To prevent this, the bookmakers should react by adjusting the offered odds in the light of demand, so as to remove the potential profit, and this is the Course:CPSC522/Game Theory.

#### AI: Content-Based Recommendation Service

Overview of a Recommendation System with the two main entities, the user and items. The recommendation system fil-ters through the item list and gives an output item that matches the user’s preferences
##### Probabilistic model

Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector ${\displaystyle \mathbf {x} =(x_{1},\dots ,x_{n})}$ representing some ${\displaystyle n}$ features (independent variables), it assigns to this instance probabilities

${\displaystyle p(C_{k}\vert x_{1},\dots ,x_{n})\,}$

for each of ${\displaystyle K}$ possible outcomes or classes.

The problem with the above formulation is that if the number of features ${\displaystyle n}$ is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible.

We, therefore, reformulate the model to make it more tractable. Using Bayes' theorem, the conditional probability can be decomposed as:

${\displaystyle p(C_{k}\vert \mathbf {x} )={\frac {p(C_{k})\ p(\mathbf {x} \vert C_{k})}{p(\mathbf {x} )}}.\,}$

In plain English, using Bayesian probability terminology, the above equation can be written as:

${\displaystyle {\mbox{posterior}}={\frac {{\mbox{prior}}\times {\mbox{likelihood}}}{\mbox{evidence}}}.\,}$
##### Fruit Example

Let's try it out on an example to increase our understanding:

Let's say that we have data on 1000 pieces of fruit. They happen to be Banana, Orange or some Other Fruit. We know 3 characteristics about each fruit:

1. Whether it is Long
2. Whether it is Sweet and
3. If its color is Yellow.

This is our 'training set.' We will use this to predict the type of any new fruit we encounter.

Type Long Not Long Sweet Not Sweet Yello Not Yellow Total
Banana 400 100 350 150 450 50 500
Orange 0 300 150 150 300 0 300
Other Fruit 100 100 150 50 50 150 200
Total 500 500 650 350 800 200 1000

We can pre-compute a lot of things about our fruit collection.

The so-called Prior probabilities:

${\displaystyle P(Banana)=0.5}$
${\displaystyle P(Orange)=0.3}$
${\displaystyle P(OtherFruit)=0.2}$

Probability of Evidence

${\displaystyle P(Long)=0.5}$
${\displaystyle P(Sweet)=0.65}$
${\displaystyle P(Yellow)=0.8}$

Probability of Likelihood

${\displaystyle P(Long|Banana)=0.8}$
${\displaystyle P(Long|Orange)=0.0}$
${\displaystyle P(Yellow|OtherFruit)=0.25}$
${\displaystyle P(NotYellow|OtherFruit)=0.75}$

, etc

Given a fruit, how can we classify it?

Let's say that we are given the properties of an unknown fruit, and asked to classify it. We are told that the fruit is Long, Sweet and Yellow. Is it a Banana? Is it an Orange? Or Is it some Other Fruit? We can simply run the numbers for each of the 3 outcomes, one by one. Then we choose the highest probability and 'classify' our unknown fruit as belonging to the class that had the highest probability based on our prior evidence (our 1000 fruit training set):

${\displaystyle P(Banana|Long,Sweet,Yellow)={\frac {P(Long|Banana)P(Sweet|Banana)P(Yellow|Banana)P(Banana)}{P(Long)P(Sweet)P(Yellow)}}={\frac {0.8*0.7*0.9*0.5}{P(Evidence)}}={\frac {0.252}{P(Evidence)}}}$
${\displaystyle P(Orange|Long,Sweet,Yellow)=0}$
${\displaystyle P(OtherFruit|Long,Sweet,Yellow)={\frac {0.01875}{P(Evidence)}}}$

By an overwhelming margin (0.252 >> 0.01875), we classify this Sweet, Long, Yellow fruit as likely to be a Banana.

##### Implement

We implemented a simple machine learning algorithm called the Naïve Bayesian Classifier to use the training data set. The Naïve Bayesian Classifier is a data mining classification algorithm that relies on Bayes’ probability theorem. Naïve Bayesian Classifier was chosen due to its simplicity and lack of complicated iterative parameter, making it useful for large datasets. It is surprisingly accurate as well, being able to make accurate predictions.

Naïve Bayesian Classifier has two phases:

1. The first is the training phase, where a finite set of data samples is used to build a classifier.
2. The second phase is the classification phase where the training data is used and Bayes’ theorem is used to classify any new data into ones that have been calculated in the training phase.

Each movie is distinguished by its own movie ID. Probabilistic estimates have to be made for gender, age, occupation and ratings. This can be represented by:

${\displaystyle P(gender=G_{i}|rank=k),P(age=A_{i}|rank=k),P(occupation=O_{i}|rank=k),andP(rank)}$

Set ${\displaystyle G_{i}}$ stands for gender ranging from “Male” and “Female”. Set ${\displaystyle A_{i}}$ stands for age ranging from “1,18,20…65”. Set ${\displaystyle O_{i}}$ stands for occupation ranging from “0” to “20”. Set ${\displaystyle rank}$ stands for the rate of movie ranging from “1” to “5”.

Just like the Fruit Example, we need to calculate all prior probabilities, evidence probability, and likelihood probability, however, for the dataset is huge, therefore we use Apache Hadoop to perform the tesk.

And here is the experiment result:

For movie id=994, here are different probabilities:

id Rate Male Female Age1 Age18 Age25 Age35 Age45 Age50 Age56
994 1 1 0 0 0.33 0.33 0.67 0 0 0
Age65 Occ0 Occ1 Occ2 Occ3 Occ4 Occ5 Occ6 Occ7 Occ8 Occ9
0 0 0 0 0.33 0 0.33 0 0 0 0
Occ10 Occ11 Occ12 Occ13 Occ14 Occ15 Occ16 Occ17 Occ18 Occ19 Occ20
0 0 0 0 0.33 0 0.33 0 0 0 0.01

For example, to calculate the probability that a Male, 25 years old, with Occupation 20, rated movie 994 with a rating of 1, the probability of each of the attributes will be multiplied to get the net attribute.

${\displaystyle P(rate=1)=P(Male)*P(Age25)*P(Occ20)=1*0.33*0.01=0.0033}$

Showing that the probability will be low for a year old male with occupation 20.

Whereas for a 35-year old male with occupation 3, the probability will be:

${\displaystyle P(rate=1)=P(Male)*P(Age35)*P(Occ3)=1*0.67*0.33=0.2211}$

Showing a higher likelihood of having rated the movie such a low rating.

The movie in question is Big Night a 1996 Comedy Drama film.

## Annotated Bibliography

1. "Lifetime of a light bulb", StackExchange
2. David Poole & Alan Mackworth, "Artificial Intelligence: Foundations of Computational Agents", Chapter 6
3. "Probability Axioms", Wikipedia
4. "Probability Axioms", Wikipedia
5. Norman Fenton, "Chain Rule"
6. Hossein Pishro-Nik, "Introduction to Probability", Chapter 1