Intro to Probability
Slides from Professor Pan Yan, SYSU
Probability Theory: Example of a random experiment
• We poll 60 users who are using one of two search engines and record the following:
[Figure: scatter plot. Each point corresponds to one of 60 users. Vertical axis: the two search engines (Y). Horizontal axis: the number of "good hits" returned by the search engine (X = 0, 1, ..., 8).]
Probability Theory: Random variables
• X and Y are called random variables
• Each has its own sample space:
• SX = {0, 1, 2, 3, 4, 5, 6, 7, 8}
• SY = {1, 2}
Probability Theory: Probability
• P(X=i, Y=j) is the probability (relative frequency) of observing X=i and Y=j
• P(X,Y) refers to the whole table of probabilities
• Properties: 0 ≤ P ≤ 1, Σi Σj P(X=i, Y=j) = 1

Joint probabilities P(X=i, Y=j) from the poll:

X:     0     1     2     3     4     5     6     7     8
Y=1:   3/60  6/60  8/60  8/60  5/60  3/60  1/60  0     0
Y=2:   0     0     0     1/60  4/60  5/60  8/60  6/60  2/60
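The following is a minimal sketch (not from the original slides) that builds this joint table in NumPy from the raw counts and checks the two properties:

```python
import numpy as np

# Joint probability table P(X=i, Y=j) from the 60-user poll above.
# Rows: Y = 1, 2 (search engine); columns: X = 0..8 (number of good hits).
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
P_XY = counts / counts.sum()       # relative frequencies; counts.sum() == 60

assert np.all((P_XY >= 0) & (P_XY <= 1))   # 0 <= P <= 1
assert np.isclose(P_XY.sum(), 1.0)         # all entries sum to 1
```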
Probability Theory: Marginal probability
• P(X=i) is the marginal probability that X=i, i.e., the probability that X=i, ignoring Y
[Figure: the joint table with column sums giving P(X) and row sums giving P(Y).]
Probability Theory: Marginal probability (SUM RULE)
• P(X=i) is the marginal probability that X=i, i.e., the probability that X=i, ignoring Y
• From the table: P(X=i) = Σj P(X=i, Y=j)
• Note that Σi P(X=i) = 1 and Σj P(Y=j) = 1

Marginals computed from the joint table (see the sketch below):

X:       0     1     2     3     4     5     6     7     8
P(X=i):  3/60  6/60  8/60  9/60  9/60  8/60  9/60  6/60  2/60

P(Y=1) = 34/60, P(Y=2) = 26/60
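Continuing the NumPy sketch above, the sum rule is just a reduction of the joint table along each axis:

```python
# Sum rule: marginalize the joint table P_XY (defined above) over the other variable.
P_X = P_XY.sum(axis=0)   # P(X=i) = Σj P(X=i, Y=j) -> [3,6,8,9,9,8,9,6,2]/60
P_Y = P_XY.sum(axis=1)   # P(Y=j) = Σi P(X=i, Y=j) -> [34/60, 26/60]

assert np.isclose(P_X.sum(), 1.0) and np.isclose(P_Y.sum(), 1.0)
```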
Probability Theory: Conditional probability
• P(X=i | Y=j) is the probability that X=i, given that Y=j
• From the table: P(X=i | Y=j) = P(X=i, Y=j) / P(Y=j)
[Figure: the Y=1 row of the joint table, rescaled by 1/P(Y=1) to give P(X | Y=1).]
Probability Theory: Conditional probability
• How about the opposite conditional probability, P(Y=j | X=i)?
• P(Y=j | X=i) = P(X=i, Y=j) / P(X=i)
• Note that Σj P(Y=j | X=i) = 1

From the table, dividing each joint entry by the column marginal P(X=i):

X:             0    1    2    3    4    5    6    7    8
P(Y=1 | X=i):  3/3  6/6  8/8  8/9  5/9  3/8  1/9  0/6  0/2
P(Y=2 | X=i):  0/3  0/6  0/8  1/9  4/9  5/8  8/9  6/6  2/2
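As a sketch (reusing P_XY, P_X and P_Y from above), both conditional tables are elementwise divisions of the joint by a marginal:

```python
# P(X=i | Y=j) = P(X=i, Y=j) / P(Y=j): divide each row by its row marginal.
P_X_given_Y = P_XY / P_Y[:, None]
# P(Y=j | X=i) = P(X=i, Y=j) / P(X=i): divide each column by its column
# marginal (every P(X=i) is nonzero in this example, so no guard is needed).
P_Y_given_X = P_XY / P_X[None, :]

# Each conditional distribution sums to 1 over the variable on the left.
assert np.allclose(P_X_given_Y.sum(axis=1), 1.0)
assert np.allclose(P_Y_given_X.sum(axis=0), 1.0)
```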
Summary of types of probability • Joint probability: P(X,Y) • Marginal probability (ignore other variable): P(X) and P(Y) • Conditional probability (condition on the other variable having a certain value): P(X|Y) and P(Y|X)
Probability Theory: Constructing the joint probability (PRODUCT RULE)
• Suppose we know
• the probability that the user will pick each search engine, P(Y=j), and
• for each search engine, the probability of each number of good hits, P(X=i | Y=j)
• Can we construct the joint probability, P(X=i, Y=j)?
• Yes. Rearranging P(X=i | Y=j) = P(X=i, Y=j) / P(Y=j), we get P(X=i, Y=j) = P(X=i | Y=j) P(Y=j), as the sketch below confirms
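In the running sketch, the product rule recovers the joint table exactly:

```python
# Product rule: P(X=i, Y=j) = P(X=i | Y=j) P(Y=j), using the arrays above.
P_XY_rebuilt = P_X_given_Y * P_Y[:, None]
assert np.allclose(P_XY_rebuilt, P_XY)
```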
Summary of computational rules
• SUM RULE: P(X) = ΣY P(X,Y) and P(Y) = ΣX P(X,Y)
• Notation: we write P(X,Y) as shorthand for P(X=i, Y=j) for clarity
• PRODUCT RULE: P(X,Y) = P(X|Y) P(Y) and P(X,Y) = P(Y|X) P(X)
Ordinal variables
• In our example, X has a natural order 0, ..., 8:
• X is a number of hits, and
• for the ordering of the columns in the joint table shown earlier, nearby X's have similar probabilities
• Y does not have a natural order
Probabilities for real numbers
• Can't we treat real numbers as IEEE doubles with 2^64 possible values?
• Hah, hah. No!
• How about quantizing real variables to a reasonable number of values?
• Sometimes works, but...
• we need to carefully account for ordinality, and
• doing so can lead to cumbersome mathematics
Probability theory for real numbers
• Quantize X using bins of width δ
• Then X ∈ {..., −2δ, −δ, 0, δ, 2δ, ...}
• Define Pδ(X=x) = probability that x < X ≤ x+δ
• Problem: Pδ(X=x) depends on the choice of δ
• Solution: Let δ → 0
• Problem: In that case, Pδ(X=x) → 0
• Solution: Define a probability density
  P(x) = lim_{δ→0} Pδ(X=x)/δ = lim_{δ→0} (probability that x < X ≤ x+δ)/δ
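To make the limiting construction concrete, here is a small sketch; the standard normal is my choice of example density, not part of the slides. The ratio Pδ(X=x)/δ visibly approaches the density as δ shrinks:

```python
import math

def std_normal_cdf(x):
    # CDF of the standard normal, written with the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

x = 0.7
for delta in [1.0, 0.1, 0.01, 0.001]:
    # P_delta(X=x) / delta = (F(x + delta) - F(x)) / delta
    ratio = (std_normal_cdf(x + delta) - std_normal_cdf(x)) / delta
    print(f"delta = {delta:7.3f}   ratio = {ratio:.6f}")

print(f"density p({x}) = {std_normal_pdf(x):.6f}")   # the delta -> 0 limit
```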
Probability theory for real numbers: Probability density
• Suppose P(x) is a probability density
• Properties:
• P(x) ≥ 0
• it is NOT necessary that P(x) ≤ 1
• ∫x P(x) dx = 1
• Probabilities of intervals: P(a < X ≤ b) = ∫_a^b P(x) dx
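A quick numerical check of these properties (my own example, not from the slides): the uniform density on [0, 0.5] has height 2, so P(x) > 1 at some points, yet it still integrates to 1:

```python
import numpy as np

xs = np.linspace(0.0, 0.5, 100_001)       # grid over the support of Uniform(0, 0.5)
p = np.full_like(xs, 2.0)                 # p(x) = 2 on [0, 0.5]: exceeds 1 pointwise
dx = xs[1] - xs[0]

print(np.sum(p[:-1]) * dx)                # Riemann sum of the integral of p ~= 1.0
# Interval probability: P(0.1 < X <= 0.3) = integral of p over (0.1, 0.3] ~= 0.4
mask = (xs >= 0.1) & (xs < 0.3)
print(np.sum(p[mask]) * dx)
```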
Probability theory for real numbers: Joint, marginal and conditional densities
• Suppose P(x,y) is a joint probability density
• ∫x ∫y P(x,y) dx dy = 1
• For a region R of the (x,y) plane: P((X,Y) ∈ R) = ∫∫_R P(x,y) dx dy
• Marginal density: P(x) = ∫y P(x,y) dy
• Conditional density: P(x|y) = P(x,y) / P(y)
The Gaussian distribution
N(x | μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)² / (2σ²))
• μ is the mean and σ is the standard deviation
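A direct transcription of this density into NumPy (a sketch; the check that it integrates to 1 uses a simple Riemann sum):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # N(x | mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

xs = np.linspace(-5.0, 5.0, 10_001)
p = gaussian_pdf(xs, mu=0.0, sigma=1.0)
print((p[:-1] * np.diff(xs)).sum())   # ~= 1: the density integrates to 1
```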
Mean and variance
• The mean of X is E[X] = ΣX X P(X) or E[X] = ∫x x P(x) dx
• The variance of X is VAR(X) = ΣX (X − E[X])² P(X) or VAR(X) = ∫x (x − E[X])² P(x) dx
• The std dev of X is STD(X) = √VAR(X)
• The covariance of X and Y is COV(X,Y) = ΣX ΣY (X − E[X])(Y − E[Y]) P(X,Y) or COV(X,Y) = ∫x ∫y (x − E[X])(y − E[Y]) P(x,y) dx dy
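Applied to the discrete search-engine table (reusing P_XY, P_X and P_Y from the earlier sketches), these definitions become plain sums:

```python
x_vals = np.arange(9)          # X takes values 0..8
y_vals = np.array([1, 2])      # Y takes values 1, 2

E_X = np.sum(x_vals * P_X)                        # E[X] = Σ_X X P(X)
VAR_X = np.sum((x_vals - E_X) ** 2 * P_X)         # VAR(X)
STD_X = np.sqrt(VAR_X)                            # STD(X)
E_Y = np.sum(y_vals * P_Y)

# COV(X,Y) = Σ_X Σ_Y (X - E[X]) (Y - E[Y]) P(X,Y); broadcasting does the double sum
COV_XY = np.sum((y_vals[:, None] - E_Y) * (x_vals[None, :] - E_X) * P_XY)
print(E_X, VAR_X, STD_X, COV_XY)
```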
Mean and variance of the Gaussian
• E[X] = μ
• VAR(X) = σ²
• STD(X) = σ
How can we use probability as a framework for machine learning?
Maximum likelihood estimation
• Say we have a density P(x|θ) with parameter θ
• The likelihood of a set of independent and identically distributed (i.i.d.) data x = (x1, ..., xN) is P(x|θ) = Πn=1..N P(xn|θ)
• The log-likelihood is L = ln P(x|θ) = Σn=1..N ln P(xn|θ)
• The maximum likelihood (ML) estimate of θ is θML = argmaxθ L = argmaxθ Σn=1..N ln P(xn|θ)
• Example: For Gaussian likelihood P(x|θ) = N(x | μ, σ²), the ML estimates are μML = (1/N) Σn xn and σ²ML = (1/N) Σn (xn − μML)² (see the sketch below)
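A sketch of ML estimation for the Gaussian example (the true parameters and sample size here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # i.i.d. data; true mu=2.0, sigma=1.5

# For a Gaussian, maximizing the log-likelihood has a closed-form solution:
mu_ml = x.mean()                        # mu_ML      = (1/N) Σn xn
var_ml = ((x - mu_ml) ** 2).mean()      # sigma^2_ML = (1/N) Σn (xn - mu_ML)^2
print(mu_ml, np.sqrt(var_ml))           # close to the true parameters
```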
Comments on notation from now on
• Instead of Σj P(X=i, Y=j), we write ΣY P(X,Y)
• P() and p() are used interchangeably
• Discrete and continuous variables are treated the same, so ΣX, ∫X, Σx and ∫x are interchangeable
• θML and θ ML are interchangeable
• argmaxθ f(θ) is the value of θ that maximizes f(θ)
• In the context of data x1, ..., xN, the symbols x and X (in any typeface) refer to the entire set of data
• N(x | μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)² / (2σ²))
• log() = ln() and exp(x) = e^x
• pcontext(x) and p(x|context) are interchangeable
Maximum likelihood estimation and regression
• Example: For Gaussian likelihood P(t|x, w) = N(t | y(x,w), σ²), maximizing the log-likelihood L over w is equivalent to minimizing the sum-of-squares error
• Objective of regression: minimize the error E(w) = ½ Σn (tn − y(xn, w))²
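To illustrate the equivalence, here is a sketch with a hypothetical quadratic model y(x, w) = w0 + w1 x + w2 x² and synthetic data; minimizing E(w) is ordinary least squares, which is exactly the Gaussian ML solution:

```python
import numpy as np

# Synthetic regression data: t_n = y(x_n, w_true) + Gaussian noise
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
w_true = np.array([0.5, -1.0, 2.0])                    # arbitrary true weights
Phi = np.stack([np.ones_like(x), x, x ** 2], axis=1)   # features [1, x, x^2]
t = Phi @ w_true + rng.normal(scale=0.1, size=x.shape)

# Minimizing E(w) = 1/2 Σn (tn - y(xn, w))^2 is linear least squares,
# i.e. the ML solution under the Gaussian likelihood N(t | y(x,w), σ^2).
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w_ml)   # ~= w_true
```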