CS345
Machine Learning
Prof. Alvarez
Notes on Probability and Maximum Likelihood
Probability as a
Representational Formalism
Many situations are uncertain:
Will it rain tonight?
Will a senior be continuously employed for five years after he/she graduates?
Probability allows us to assign degrees of certainty to uncertain situations, and to reason about how the level of certainty changes based on new knowledge.
I may believe there is a 10% chance that it will rain tonight.
If I see that it is cloudy outside, I may increase my belief that rain is imminent.
Machine learning systems can use probability to model uncertainty, and to make rational choices in the presence of uncertainty.
Also, probability provides a new way in which to understand machine learning techniques, even ones that were not originally cast in terms of probability.
We will begin by reviewing the basics of probability theory (Bayes’ Theorem in particular), and will then discuss probabilistic ideas in machine learning, including:
Maximum A Posteriori (MAP) Classification
Maximum Likelihood Models
Naïve Bayes Models
Application: Text Classification
Probability
A random event, or Boolean random variable, is an event that is uncertain to occur.
Examples of random events E:
E = it will rain tonight
E = a job applicant will get the job that he/she is applying for
The probability of a random event is the fraction of all possible worlds in which the event does occur.
The probability P(E) of an event E may be represented graphically as the relative area of E within the space U of all possible worlds:

Here, E occupies about 20% of the universe U, so the probability P(E) of E is about 20%.
Probability Axioms
0 <= P(E) <= 1 (since probabilities are fractions)
P(U) = 1 (the universe takes up 100% of itself)
P(E or F) = P(E) + P(F) if E and F cannot occur simultaneously

Some Consequences of the Axioms
P(E – F) = P(E) – P(E and F)
This is because E is the disjoint union of the sets E-F and (E and F),
so P(E) = P(E – F) + P(E and F) by the disjoint union axiom.
P(E or F) = P(E) + P(F) – P(E and F)
Visually, this is because the overlap E and F is counted twice:

Using the axioms:
E or F is the union of the non-overlapping sets E and (F-E),
so P(E or F) = P(E) + P(F-E) by the disjoint union axiom.
Also, we now know that P(F-E) = P(F) – P(E and F), so
finally P(E or F) = P(E) + P(F) – P(E and F).
Conditional Probability
Probabilities may be revised based on contextual information. This idea can be formalized using the concept of conditional probability.
The conditional probability of an event E given an event F is defined as:
P(E | F) = P(E and F)/P(F)
This is equivalent to viewing F as the new universe of possible worlds.
The value P(E|F) is just the fraction of F that is covered by E.
E.g. if 10% of all days are rainy, our initial belief about a given day can be expressed as
P(rain) = 0.10
However, observing clouds may radically alter our belief:
P(rain | clouds) = 0.50
A random event E is independent of a random event F if knowing whether F occurs or not does not affect the probability that E occurs:
P(E | F) = P(E)
(the conditional probability of E given F equals the prior probability of E)
Graphically, this happens when the fraction of F covered by E is the same as the fraction of the universe U covered by E.
Example:
Successive tosses of a coin may be considered to be independent.
Any measurement process that involves resetting the system before observing the result yields independent measurement values.
Chain
Rule
A basic fact about conditional probabilities is the Chain Rule:
P(E and F) = P(F)*P(E|F)
This rule follows directly from the definition of conditional probability.
The chain rule may be applied successively in order to express the joint probability of a collection of events E1…En:
P(E1 and E2 and … and En) =
= P(E1)*P(E2|E1)*P(E3 | E1 and E2)*…*P(En | E1 and E2 and … and En-1)
Bayes’
Theorem
Conditional probability allows one to express the likelihood that certain observations (consequences) will occur under the assumption that a particular hypothesis holds
For example, consider how the results of a lab test might depend on whether the patient has a certain disease or not:
P(high level of X in blood | healthy) = 0.05
P(high level of X in blood | sick) = 0.90
One may be interested in reversing the direction of the above pseudo-rules so as to gauge the likelihood that a patient is sick or healthy given the patient’s lab results.
The chain rule may be re-expressed in a way that does just this: it allows reversing probabilistic implications in order to get at the causes based on the consequences.
This result is known as Bayes’ Theorem:
P(E | F) = P(F | E)*P(E)/P(F)
Bayes’
Theorem Yields Causes From Effects
For a hypothesis (cause) h and an observation (effect) obs, Bayes’ Theorem states:
P(h | obs) = P(obs | h)*P(h) / P(obs)
This yields the posterior probability of the hypothesis given the observation in terms of the prior probability of the hypothesis and the observation.
The prior probability P(h) is assumed to be known for each hypothesis h.
We also assume that the set of all hypotheses covers all of the universe and that no two hypotheses can occur simultaneously. Under these conditions, the set of possible worlds in which obs is observed is a union of the non-overlapping pieces (obs and h’) for all hypotheses h’, so that the prior probability P(obs) may be expressed as a sum:
P(obs) = Σ P(obs and h’) (summed over all hypotheses h’)
Example
using Bayes’ Theorem
In the lab test example, suppose that we have the prior probabilities:
P(sick) = .03 P(healthy) = .97
Recall also the conditional probability values:
P(high level of X in blood | healthy) = 0.05
P(high level of X in blood | sick) = 0.90
If a patient’s lab test comes back positive (level of X in blood is high),
how likely is it that the patient is sick?
P(sick | high level of X in blood) = P(high level of X | sick)*P(sick)/P(high X)
The denominator requires summing over the sick and healthy hypotheses:
P(high X) = P(high X and sick) + P(high X and healthy)
= P(high X | sick)*P(sick) + P(high X | healthy)*P(healthy)
= 0.90*0.03 + 0.05*0.97
= .027 + .0485
= .0755
We can now compute the conditional probability that the patient is sick:
P(sick | high level of X in blood) = 0.90*0.03/0.0755 = 0.027/0.0755
In passing, note that this probability is less than 0.5, so that even with a positive lab result the patient is more likely to be healthy than sick
Maximum
A Posteriori Classification
The Maximum A Posteriori (MAP) classifier chooses the hypothesis with the largest posterior probability as determined via Bayes’ Theorem:
h_MAP = argmax P(h’ | obs)
In the preceding example, the MAP classifier predicts that a patient with positive lab test results is healthy. Further tests should be done in order to finalize a diagnosis.
Maximum
Likelihood Classification
The Maximum Likelihood Classifier chooses the hypothesis for which the conditional probability of the observation given the hypothesis is maximized
h_ML = argmax P(obs | h’)
This is equivalent to a MAP classifier if one assumes that the prior probabilities of all the hypotheses are equal to one another:
P(h) = P(h’) for all hypotheses h and h’
For example, the maximum likelihood classifier labels a patient with a positive lab result as sick (even though he/she is really more likely to be healthy, as MAP analysis shows).
Nonetheless, ML may not be a bad idea if the prior probabilities are unknown and there is no reason to believe that they are unbalanced.
Discrete
Random Variables with Many Possible Values
Random events may be seen as random variables with the two values true (event occurs) and false (event does not occur).
Random variables X with more than two possible values are also common:
X = the number of Heads in 100 tosses of a coin
X = the temperature in front of your house
The main object associated with a random variable X is its probability distribution, which describes the relative likelihood that the various values of X will be observed.
If the set of possible values of X is discrete, the probability distribution of X is just a list P(X=x1)=p1, P(X=x2)=p2, … P(xn=x)=pn whose i-th element is the probability that the i-th value of X will occur in any given random sampling of X.
Example of the distribution of a discrete random variable:
X = sum of the face values on two fair dice
Probability distribution (values of X in top row, probabilities in bottom row):
2 3 4 5 6 7 8 9 10 11 12
1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
Continuous
Random Variables
The set of possible values of a random variable X may also be continuous (real-valued).
X = the precise point where a dart hits a circular target
For a continuous random variable X, the distribution of X cannot give the probability of each possible value of X.
What fraction of the time can you expect a dart to hit a specific point?
Instead, the distribution describes the probability that the observed value of X will fall within a specific range of values.
We assume here that the possible values of X form some set of real numbers.
The probability density of a continuous random variable X is the function f(x) such that the probability that a random sample of X lies in the x-interval a < x < b is the area under the portion of the graph of f(x) comprised between the vertical lines x=a and x=b.
Probability distributions can be described very roughly in terms of summary statistics:
The mean of X is the average observed value: mean(X) = integral x*f(x)*dx
The standard deviation of X measures the spread of X: σ(X) = integral (x-mean(x))2*f(x)*dx
Examples of distributions of continuous random variables:
1) X = the measurement error in a temperature reading due to signal noise
We can model this error as having a Gaussian (“normal”) probability density:
f(x) = (2*pi*σ)^(-1/2) * exp( -x2 / 2σ )
Value of x near 0 (both sides) are most likely; large magnitudes are less likely.
This distribution has mean 0 and standard deviation σ.
2) X = the time spent waiting in line at the checkout counter
This is often modeled in terms of an exponential density for values x >= 0:
f(x) = a*exp(-ax)
Small values of x are more
likely than large values here too (fortunately).
Maximum
Likelihood Interpretation of Least Squares Approximation
As we said, probability provides a way of understanding machine learning techniques
Consider for example neural network learning, which minimizes the mean squared error between the desired and actual network outputs
Consider data with a single attribute x and desired output y.
Assume that the observed output values in the training data are generated by adding noise to a function of the input:
y = f(x) + noise
Suppose furthermore that the noise has a Gaussian distribution with mean 0.
Noise values are randomly chosen so that small values near 0 are most likely, while values with larger magnitudes are less likely, as given by the Gaussian distribution.
Now consider a sequence of labeled training samples:
(x1,f(x1)+noise1), (x2,f(x2)+noise2), … (xn,f(xn)+noisen)
We wish to find a hypothesis (function) h(x) that approximates the true f well.
Let’s choose h to be the maximum likelihood hypothesis.
We must first compute the conditional likelihood of the given training data.
Assuming independence among samples, the likelihood of the training sequence is:
P(data | h) = Product_{i=1…n} P(xi,f(xi)+noisei | h)
But under the hypothesis h, the y and x values are related by the function h, so that
h(xi) = f(xi) +noisei if we assume that hypothesis h holds
Therefore:
P(xi,f(xi)+noisei | h) = P(h(xi)-f(xi) = noisei)
Since we assume that noisei is normally distributed around 0 with standard deviation σ, and since a product of exponentials is the exponential of the sum of the exponents, we have:
P(data | h) = const*exp(-(1/2*σ) sum_{i=1…n} (h(xi)-f(xi))2 )
In order for this to be maximized, the sum of squares in the exponent must be minimized
Thus, the maximum likelihood hypothesis will be the least squares approximation, exactly as say error backpropagation training of a neural network would learn. This shows that a machine learning technique that does not mention probability explicitly (error backpropagation learning for neural nets) can be justified in terms of a probabilistic idea (maximum likelihood classification).
Additional Topics to be Discussed:
Naïve Bayes Models
Application: Text Classification
Bayesian Networks