Course:CPSC522/Probability
Contents
Probability
Probability is a measure of the likelihood that something is to happen. Many events can’t be predicted with total certainty. The best we can say is how likely they are to happen, using the idea of probability, which is quantified as a number between 0 and 1 (where 0 indicates impossibility and 1 indicates certainty). The higher the probability of an event, the more certain the event will occur. A classic example is the toss of a coin. When a coin is tossed, there are two possible outcomes, heads or tails. Since those two outcomes are equally probable, we can say that the probability of “heads” equals the probability of “tails”, so the probability is ½ chance of either “heads” or “tails”. There are two categories of probability problems, the finite case and infinite case, and this page will only consider finite case.
Principal Author: Junyuan
Collaborators: Samprity, Ke Dai, Arthur Sun
Abstract
Decision making in AI often involves accounting for the relative importance of various goals and the likelihood that they will be achieved. A logic is often needed in the form of a domain so that different states are accounted for in a logically exhaustive manner. Probability helps summarize the uncertainty the agent has about the world. The page defines the possible world semantics of probability and conditional probability. To help the reader understand the content, this page will first introduce probability theorems, and give an interesting example  the Dutch Books.
Builds on
Probability is defined in terms of Propositional Logic because it considers the probability of one proposition conditioned on another proposition.
More general than
Probability  general semantics considers the case where there are infinitely many possible worlds either because of realvalued random variables or infinitely many random variables, including paradoxes of probability and solutions to these paradoxes.
Graphical Models considers models that incorporate useful independence assumptions.
Content
Basic Theories
Probability Space
In probability theory, a probability space is a mathematical model of a real world experiment. A probability space is constructed based on a specific kind of experiment. Each time a situation of that experiment arises, the outcome and the probabilities will be the same as legacy situation.
The probability space is formed by three elements:
 A sample space, , which is the set of all possible outcomes.
 A set of events, , where each event is a set containing zero or more outcomes.
 The assignment of probabilities to the events; that is, a function from events to probabilities.
A single execution's result of the mathematical model is called an outcome. Sometimes people groups the outcomes together to form a more complex events, which form may vary, however, we use the to present a set of such events. Finally, there is a need to specify each event's likelihood of happening. This is done using the probability measure function, .
Once this mathematical model is established. Every time that we run this model, it will select a single outcome, , from the sample space . All the event in , which is a set of events, that contain the selected outcome can be notated as “have occurred”. If we perform the former introduced process infinite number of times, the frequencies of the occurrence will be close to the result that calculated by the function .
The next three parts will explain those three elements.
Sample Space
In probability theory, the sample space of an experiment is the set of all possible results of that experiment. People always use to refer to a sample space. A welldefined sample space is one of three basic elements in a probabilistic model, the other two are a welldefined set of possible events and a probability assigned to every event.
Sample spaces can also be divided into two categories, countable or uncountable, here are 3 examples:
 A toss of a coin:
 The number of success in a sequence of n identical and independent trails.
 The lifetime of a light bulb^{[1]}:
For many tests, there maybe more than one sample space available, all of which are depending on what result is more interested to the tester. For instance, when drawing a card from a standard deck of playing cards, one reasonable sample space could be the various rank, for instance (2, 3, 4 …. J, K, A), while another could be the suits (hearts, spades, diamonds, clubs).
Event
As introduced before, the welldefined set of possible events is one of the basic elements in a probabilistic model. In probability theory, an event is a set of outcomes of a test, and a probability will be assigned to those outcomes of an experiment.
Here is an example to explain the possible events: If we assemble a deck of playing cards, and draw a single card from the deck, then the sample space is 52 elements set, for every card is a possible result. However, an event is a set of outcomes of an experiment (a subset of the sample space), which including any singleton set (elementary event, which only contains a single outcome in the sample space), the empty set (with probability zero) and the sample space itself (with probability one). Therefore, the potential events in this example might be:
 The 6 of Club (1 element)
 A Queen (4 elements)
 A Spade (13 elements)
 A card (52 elements)
Defining all subsets of the sample space as events work well when there are only finitely many outcomes, but problems will raise when the sample space is infinite, and this case will not be discussed in this page.
Probability Measure
The third basic elements in a probabilistic model is a probability assigned to every event. The way that people gets the probability is using the probability measure. A probability measure over the spaces is a function from sets of spaces into the nonnegative real number, and it has 2 properties:
 must return results in the unit interval , returning 0 for the empty set and 1 for the entire space.
 If and are disjoint sets of space, then
The probability of an event can be detonated by the following equation:
For instance, given three elements 1, 2 and 3 with probabilities , and , the value assigned to event is:
 .
If we want to measure the probability of proposition , denoted as , in the sample space of , we need to use equation^{[2]} :
Where represents is true in space .
Axioms of Probability
The first part gives a general definition of probability. This part we will introduce three axioms for probability. In the probability theory, the probability of some event , denoted , is usually defined such that satisfies the Kolmogorov axioms, which are named after Andrey Kolmogorov. These assumptions can be summarised as follows: Let be a measure space with ^{[3]}. Then is a probability space, with sample space , event space and probability measure .
First axiom
The probability of an event cannot be a negative number:
where is the event space. In particular, is always finite, in contrast with more general measure theory, which will not be discussed in this page.
Second axiom
The probability that some elementary event in the entire sample space will occur is 1. More specifically, there are no elementary events outside the sample space.
Third axiom
Any countable sequence of disjoint sets satisfies
Consequence
Based on those three axioms, one can deduce three useful rules for probability calculations
 The probability of the empty set:
 Monotonicity:
 The numeric bound^{[4]}:
Those axioms are also valid for finite discrete random variables.
Conditional Probability
In probability theory, a conditional probability measures the probability of an event given that assumption another event has occurred. If the event of and the event is known or assumed to have occurred, the notation of represents ‘the conditional probability of given ’.
For example, the probability that any person has a cough on a given day maybe only 10%, but if we know or assume this person has a fever, then they are much more likely to be coughing, therefore, ‘the conditional probability of coughing given having fever” might be higher than 90%.
According to the Kolmogorov’s definition of conditional probability:
Given two events and from the probability space with , the conditional probability of given is defined as the quotient of the probability of the joint of events and , and the probability of .
Here is an example to explain the conditional probability:
Suppose that somebody rolls 1 sixsided dice, and we have to predict the outcome,
 Let A be the value rolled on the first time
 Let B be the value rolled on the second time
The probability that is
A\B 1 2 3 4 5 6 1 2 3 4 5 6 7 2 3 4 5 6 7 8 3 4 5 6 7 8 9 4 5 6 7 8 9 10 5 6 7 8 9 10 11 6 7 8 9 10 11 12
The probability that is
A\B 1 2 3 4 5 6 1 2 3 4 5 6 7 2 3 4 5 6 7 8 3 4 5 6 7 8 9 4 5 6 7 8 9 10 5 6 7 8 9 10 11 6 7 8 9 10 11 12
The probability that given that is
A\B 1 2 3 4 5 6 1 2 3 4 5 6 7 2 3 4 5 6 7 8 3 4 5 6 7 8 9 4 5 6 7 8 9 10 5 6 7 8 9 10 11 6 7 8 9 10 11 12
Chain Rule
One of the important usage of the conditional probability equation is to decompose the conjunctions, which permits the calculation of any member of the joint distribution of a set of variables.
To find the value of a joint distribution of , we can use the conditional probability to obtain:
Repeating this process with each final term creates the product:
For instance, by using the chain rule we can calculate the joint probability of ^{[5]}
Bayes’ theorem
In probability theory, Bayes’ theorem describes the probability of an event, as for conditional probability, it cannot be assumed that . This can be an insidious error, even for those who are highly conversant with statistics. However, the relationship between and is given by Bayes' theorem:
By using Bayes’ theorem, an agent will know how to update its belief in a proposition based on a new piece of evidence.
For example: Suppose we want to know a person's probability of having cancer, but we know nothing about him or her. Despite not knowing anything about that person, a probability can be assigned based on the general prevalence of cancer, and we suppose it is 1%, , in this example. This is known as the base rate or prior probability of having cancer.
In order to apply knowledge of that person's age in conjunction with Bayes' Theorem, two additional pieces of information are needed.
 The probability of being 65 years old. Suppose it is 0.2%, .
 The probability that a person with cancer is 65 years old. Suppose it is 0.5%, .
Knowing this, along with the base rate, we can calculate that a person who is age 65 has a probability of having cancer equal to
It may come as a surprise that even though being 65 years old increases the risk of having cancer, that person's probability of having cancer is still very low. This is because the base rate of cancer is low. This illustrates the importance of base rate. Base rate negletleads to serious misinterpretation of statistics; therefore, special care should be taken to avoid such mistakes.
Independence
The definition of Independence in probability theory is: two events are independent if the occurrence of one does not affect the probability of the other. For two events A and B, if and only if their joint probability equals the product of their probabilities:
Why this notation of independence is made clear by rewriting with conditional probabilities:
In reality, the concept of independence extends to dealing with collections of more than two events or random variables.
Example: Suppose that somebody rolls 1 sixsided dice, the event of getting a 6 the first time a die is rolled and the event of getting a 6 the second time are independent. However, the event of getting a 6 the first time and the event that the sum of the numbers seen on the first and second are not independent.
Conditional Independence
As for conditional independence, two random variables and are conditionally independent given , which means given knowledge that occurs, knowledge of provides no information on provability of , and knowledge of will not provide information about the .
In the standard notation, X and Y are conditionally independent given Z if and only if
 or
Here is an example^{[6]} :
A box contains two coins: a regular coin and one fake twoheaded coin (). If a person choose a coin at random and toss it twice, and define the following events.
 : First coin toss results in an .
 : Second coin toss results in an .
 : Regular coin has been selected.
Note that and are not independent, but they are conditionally independent given , for if given , the and will be fixed value.
First we can calculate . Also given the regular coin is selected, we have .
If we want to find , , and , we have to use the law of total probability.
By using the same method we can calculate the .
As for
Note because and are conditional independent given , therefore,
Interesting Example
Traditional: Dutch Book
One of the most interesting usage of probability theory in real life is gambling. Dutch Book is a classic example. Imagining in the gambling game, people can get profit regardless of the outcome of the gamble, and this is the Dutch Book, which is caused by the implied probability is not coherent with the actual probability.
In one example, a bookmaker has offered the following odds and attracted one bet on each horse.
Horse number  Offered odds  Implied probability 
Bet Price  Bookie Pays if Horse Wins 

1  Even  $100  $100 stake + $100  
2  3 to 1 against  $50  $50 stake + $150  
3  4 to 1 against  $40  $40 stake + $160  
4  9 to 1 against  $20  $20 stake + $180  
Total: 1.05  Total: $210  Always: $200 
As we can see from this table, the implied probabilities, i.e. probability of each horse winning, add up to a number greater than 1, which is 105%.
If 4 gamblers bet together to make sure that they can get $200, the total amount of money that they need is $210, hence whichever horse wins in this example, they will loss $10 in the race.
But things will be changed, if horse 4 was withdrawn from the race, and the bookmaker forget to adjust the odds, then the implied probability would add up to 0.95. Under this circumstance, a gambler only need to pay $190 to make sure that he can earn $200, always reap a profit of $10.
With competitive fixedodds gambling being offered electronically, gamblers can sometimes create a Dutch book by selecting the best odds from different bookmakers, which increase their potential to utilize the incoherent probability and get profit regardless of the outcome.
For instance, in the same horse racing game, if another bookmaker offered the following odds:
Horse number  Offered odds  Implied probability 

1  3 to 1 against  
2  Even  
3  4 to 1 against  
4  9 to 1 against 
People can bet horse 1 on bookmaker 2, and bet the reset of horse on bookmaker 1, and the total implied probability will be 0.8, which means if the gambler pays $160, they can earn $200 regardless the final result. To prevent this, the bookmakers should react by adjusting the offered odds in the light of demand, so as to remove the potential profit, and this is the Course:CPSC522/Game Theory.
AI: ContentBased Recommendation Service
Probabilistic model
Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector representing some features (independent variables), it assigns to this instance probabilities
for each of possible outcomes or classes.
The problem with the above formulation is that if the number of features is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible.
We, therefore, reformulate the model to make it more tractable. Using Bayes' theorem, the conditional probability can be decomposed as:
In plain English, using Bayesian probability terminology, the above equation can be written as:
Fruit Example
Let's try it out on an example to increase our understanding:
Let's say that we have data on 1000 pieces of fruit. They happen to be Banana, Orange or some Other Fruit. We know 3 characteristics about each fruit:
 Whether it is Long
 Whether it is Sweet and
 If its color is Yellow.
This is our 'training set.' We will use this to predict the type of any new fruit we encounter.
Type  Long  Not Long  Sweet  Not Sweet  Yello  Not Yellow  Total 

Banana  400  100  350  150  450  50  500 
Orange  0  300  150  150  300  0  300 
Other Fruit  100  100  150  50  50  150  200 
Total  500  500  650  350  800  200  1000 
We can precompute a lot of things about our fruit collection.
The socalled Prior probabilities:
Probability of Evidence
Probability of Likelihood
, etc
Given a fruit, how can we classify it?
Let's say that we are given the properties of an unknown fruit, and asked to classify it. We are told that the fruit is Long, Sweet and Yellow. Is it a Banana? Is it an Orange? Or Is it some Other Fruit? We can simply run the numbers for each of the 3 outcomes, one by one. Then we choose the highest probability and 'classify' our unknown fruit as belonging to the class that had the highest probability based on our prior evidence (our 1000 fruit training set):
By an overwhelming margin (0.252 >> 0.01875), we classify this Sweet, Long, Yellow fruit as likely to be a Banana.
Implement
We implemented a simple machine learning algorithm called the Naïve Bayesian Classifier to use the training data set. The Naïve Bayesian Classifier is a data mining classification algorithm that relies on Bayes’ probability theorem. Naïve Bayesian Classifier was chosen due to its simplicity and lack of complicated iterative parameter, making it useful for large datasets. It is surprisingly accurate as well, being able to make accurate predictions.
Naïve Bayesian Classifier has two phases:
 The first is the training phase, where a finite set of data samples is used to build a classifier.
 The second phase is the classification phase where the training data is used and Bayes’ theorem is used to classify any new data into ones that have been calculated in the training phase.
Each movie is distinguished by its own movie ID. Probabilistic estimates have to be made for gender, age, occupation and ratings. This can be represented by:
Set stands for gender ranging from “Male” and “Female”. Set stands for age ranging from “1,18,20…65”. Set stands for occupation ranging from “0” to “20”. Set stands for the rate of movie ranging from “1” to “5”.
Just like the Fruit Example, we need to calculate all prior probabilities, evidence probability, and likelihood probability, however, for the dataset is huge, therefore we use Apache Hadoop to perform the tesk.
And here is the experiment result:
For movie id=994, here are different probabilities:
id  Rate  Male  Female  Age1  Age18  Age25  Age35  Age45  Age50  Age56 

994  1  1  0  0  0.33  0.33  0.67  0  0  0 
Age65  Occ0  Occ1  Occ2  Occ3  Occ4  Occ5  Occ6  Occ7  Occ8  Occ9 

0  0  0  0  0.33  0  0.33  0  0  0  0 
Occ10  Occ11  Occ12  Occ13  Occ14  Occ15  Occ16  Occ17  Occ18  Occ19  Occ20 

0  0  0  0  0.33  0  0.33  0  0  0  0.01 
For example, to calculate the probability that a Male, 25 years old, with Occupation 20, rated movie 994 with a rating of 1, the probability of each of the attributes will be multiplied to get the net attribute.
Showing that the probability will be low for a year old male with occupation 20.
Whereas for a 35year old male with occupation 3, the probability will be:
Showing a higher likelihood of having rated the movie such a low rating.
The movie in question is Big Night a 1996 Comedy Drama film.
Annotated Bibliography
 ↑ "Lifetime of a light bulb", StackExchange
 ↑ David Poole & Alan Mackworth, "Artificial Intelligence: Foundations of Computational Agents", Chapter 6
 ↑ "Probability Axioms", Wikipedia
 ↑ "Probability Axioms", Wikipedia
 ↑ Norman Fenton, "Chain Rule"
 ↑ Hossein PishroNik, "Introduction to Probability", Chapter 1
To Add
Put links and content here to be added. This does not need to be organized, and will not be graded as part of the page. If you find something that might be useful for a page, feel free to put it here.
 Marginalization
