
Essential Probability & Statistics

Lecture for CS397-CXZ Algorithms in Bioinformatics, Jan. 23, 2004. ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign.

Presentation Transcript


  1. Essential Probability & Statistics (Lecture for CS397-CXZ Algorithms in Bioinformatics) • Jan. 23, 2004 • ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign

  2. Purpose of Prob. & Statistics • Deductive vs. Plausible reasoning • Incomplete knowledge -> uncertainty • How do we quantify inference under uncertainty? • Probability: models of random process/experiments (how data are generated) • Statistics: draw conclusions on the whole population based on samples (inference on data)

  3. Basic Concepts in Probability • Sample space: all possible outcomes, e.g., • Tossing 2 coins, $S = \{HH, HT, TH, TT\}$ • Event: $E \subseteq S$, E happens iff the outcome is in E, e.g., • $E = \{HH\}$ (all heads) • $E = \{HH, TT\}$ (same face) • Probability of an event: $0 \le P(E) \le 1$, s.t. • $P(S) = 1$ (outcome always in S) • $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$

  4. Basic Concepts of Prob. (cont.) • Conditional probability: $P(B|A) = P(A \cap B)/P(A)$ • $P(A \cap B) = P(A)P(B|A) = P(B)P(A|B)$ • So, $P(A|B) = P(B|A)P(A)/P(B)$ • For independent events, $P(A \cap B) = P(A)P(B)$, so $P(A|B) = P(A)$ • Total probability: if $A_1, \dots, A_n$ form a partition of $S$, then • $P(B) = P(B \cap S) = P(B \cap A_1) + \dots + P(B \cap A_n)$ • So, $P(A_i|B) = P(B|A_i)P(A_i)/P(B)$ (Bayes’ Rule)
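
The total-probability and Bayes' rule steps on this slide can be checked numerically. The sketch below is illustrative only: the function name `posterior` and the three-way partition with its numbers are assumptions, not part of the lecture.

```python
def posterior(priors, likelihoods):
    """Return [P(A_i | B)] given priors [P(A_i)] and likelihoods [P(B | A_i)]."""
    # Total probability: P(B) = sum_i P(B | A_i) * P(A_i)
    p_b = sum(p * l for p, l in zip(priors, likelihoods))
    if p_b == 0:
        raise ValueError("P(B) is zero, so P(A_i | B) is undefined")
    # Bayes' rule applied to each A_i
    return [p * l / p_b for p, l in zip(priors, likelihoods)]


# Hypothetical partition A_1, A_2, A_3 of S (illustrative numbers only)
priors = [0.5, 0.3, 0.2]        # P(A_i), sums to 1
likelihoods = [0.1, 0.4, 0.5]   # P(B | A_i)
post = posterior(priors, likelihoods)
print(post)
```

Although A_1 has the largest prior here, A_2 ends up with the largest posterior because its likelihood is four times higher.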

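  5. Interpretation of Bayes’ Rule • Hypothesis space: $H = \{H_1, \dots, H_n\}$ • Evidence: $E$ • $P(H_i|E) = \frac{P(E|H_i)P(H_i)}{P(E)}$, where $P(H_i|E)$ is the posterior probability of $H_i$, $P(H_i)$ is the prior probability of $H_i$, and $P(E|H_i)$ is the likelihood of the data/evidence if $H_i$ is true • If we want to pick the most likely hypothesis $H^*$, we can drop $P(E)$: $H^* = \arg\max_{H_i} P(E|H_i)P(H_i)$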

  6. Random Variable • $X: S \to \mathbb{R}$ (“measure” of outcome) • Events can be defined according to $X$ • $E(X=a) = \{s_i \mid X(s_i) = a\}$ • $E(X \ge a) = \{s_i \mid X(s_i) \ge a\}$ • So, probabilities can be defined on $X$ • $P(X=a) = P(E(X=a))$ • $P(a \ge X) = P(E(a \ge X))$ ($F(a) = P(a \ge X)$: cumulative dist. func) • Discrete vs. continuous random variable (think of “partitioning the sample space”)
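
As a small illustration of how a random variable induces events and probabilities, the sketch below reuses the two-coin sample space from slide 3. The choice of X as "number of heads", the assumption of fair coins, and the helper names `prob` and `cdf` are all assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing 2 coins: HH, HT, TH, TT
S = ["".join(outcome) for outcome in product("HT", repeat=2)]
# Assumption: fair coins, so every outcome has probability 1/4
P = {s: Fraction(1, 4) for s in S}


def X(s):
    """Random variable: number of heads in outcome s."""
    return s.count("H")


def prob(predicate):
    """P(E), where E = {s in S : predicate(X(s)) holds}."""
    return sum(P[s] for s in S if predicate(X(s)))


def cdf(a):
    """Cumulative distribution function F(a) = P(X <= a)."""
    return prob(lambda x: x <= a)


p_one_head = prob(lambda x: x == 1)   # event {HT, TH}
print(p_one_head, cdf(0), cdf(1), cdf(2))
```

Using `Fraction` keeps the probabilities exact, which makes it easy to see that the events {X = 0}, {X = 1}, {X = 2} partition the sample space.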

  7. An Example • Think of a DNA sequence as results of tossing a 4-face die many times independently • P(AATGC)=p(A)p(A)p(T)p(G)p(C) • A model specifies {p(A),p(C), p(G),p(T)}, e.g., all 0.25 (random model M0) • P(AATGC|M0) = 0.25*0.25*0.25*0.25*0.25 • Comparing 2 models • M1: coding regions • M2: non-coding regions • Decide if AATGC is more likely a coding region
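
The model comparison in this example can be sketched in a few lines. The parameters for `M1` and `M2` below are invented for illustration (the lecture gives values only for the uniform model M0), and working in log space is a convenience added here to avoid underflow on long sequences.

```python
import math


def log_likelihood(seq, model):
    """log P(seq | model), assuming each base is drawn independently."""
    return sum(math.log(model[base]) for base in seq)


M0 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # random model from the slide
# Hypothetical parameters, for illustration only:
M1 = {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20}   # "coding" model
M2 = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}   # "non-coding" model

seq = "AATGC"
p_m0 = math.exp(log_likelihood(seq, M0))             # equals 0.25 ** 5
# Positive log-ratio favors M1 (coding); negative favors M2 (non-coding)
log_ratio = log_likelihood(seq, M1) - log_likelihood(seq, M2)
print(p_m0, log_ratio)
```

With these made-up parameters the log-ratio is negative, so AATGC would be judged more likely non-coding; different parameters could reverse the decision.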

  8. Probability Distributions • Binomial: number of successes out of N trials • Gaussian: approximates the sum of a large number N of independent R.V.’s (central limit theorem) • Multinomial: getting $n_i$ occurrences of outcome $i$
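
For reference, the binomial and multinomial probabilities can be computed directly from their standard formulas. This is a minimal sketch using only the Python standard library (`math.comb` needs Python 3.8 or later); the function names are chosen here and do not come from the lecture.

```python
import math


def binomial_pmf(k, n, p):
    """P(k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def multinomial_pmf(counts, probs):
    """P(outcome i occurs counts[i] times in sum(counts) independent trials)."""
    n = sum(counts)
    # Multinomial coefficient n! / (n_1! * ... * n_k!), kept as an exact integer
    coef = math.factorial(n)
    for c in counts:
        coef //= math.factorial(c)
    prob = 1.0
    for c, p in zip(counts, probs):
        prob *= p**c
    return coef * prob


# Probability of seeing 2 A's and one each of C, G, T in 5 draws from a uniform model
p_counts = multinomial_pmf([2, 1, 1, 1], [0.25] * 4)
print(binomial_pmf(1, 2, 0.5), p_counts)
```

Note that `p_counts` is the probability of the counts in any order, so it is larger than the probability of one specific sequence such as AATGC by the multinomial coefficient.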

  9. Parameter Estimation • General setting: • Given a (hypothesized & probabilistic) model that governs the random experiment • The model gives a probability of any data $p(D|\theta)$ that depends on the parameter $\theta$ • Now, given actual sample data $X = \{x_1, \dots, x_n\}$, what can we say about the value of $\theta$? • Intuitively, take your best guess of $\theta$ -- “best” means “best explaining/fitting the data” • Generally an optimization problem

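  10. Maximum Likelihood Estimator • Data: a sequence $d$ with counts $c(w_1), \dots, c(w_N)$, and length $|d|$ • Model: multinomial $M$ with parameters $\{p(w_i)\}$ • Likelihood: $p(d|M) = \prod_{i=1}^{N} p(w_i)^{c(w_i)}$ • Maximum likelihood estimator: $\hat{M} = \arg\max_M p(d|M)$ • We’ll tune $p(w_i)$ to maximize the log-likelihood $l(d|M) = \sum_{i=1}^{N} c(w_i) \log p(w_i)$ • Use the Lagrange multiplier approach • Set partial derivatives to zero • ML estimate: $p(w_i) = c(w_i)/|d|$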
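
The Lagrange-multiplier step can be written out as follows. This is a sketch that assumes the sequence likelihood is the product of $p(w_i)^{c(w_i)}$ over all $N$ outcomes, with the constraint that the $p(w_i)$ sum to 1; the multiplier $\lambda$ and the symbol $\mathcal{L}$ are notation introduced here.

```latex
\begin{align*}
l(d|M) &= \log p(d|M) = \sum_{i=1}^{N} c(w_i)\log p(w_i) \\
\mathcal{L} &= \sum_{i=1}^{N} c(w_i)\log p(w_i)
              + \lambda\Big(\sum_{i=1}^{N} p(w_i) - 1\Big) \\
\frac{\partial \mathcal{L}}{\partial p(w_i)} &= \frac{c(w_i)}{p(w_i)} + \lambda = 0
  \;\Rightarrow\; p(w_i) = -\frac{c(w_i)}{\lambda} \\
\sum_{i=1}^{N} p(w_i) = 1 &\;\Rightarrow\; \lambda = -\sum_{i=1}^{N} c(w_i) = -|d| \\
\hat{p}(w_i) &= \frac{c(w_i)}{|d|}
\end{align*}
```

The ML estimate is therefore just the relative frequency of each outcome in the data.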

  11. Maximum Likelihood vs. Bayesian • Maximum likelihood estimation • “Best” means “data likelihood reaches maximum” • Problem: small sample • Bayesian estimation • “Best” means being consistent with our “prior” knowledge and explaining data well • Problem: how to define prior?

  12. Bayesian Estimator • ML estimator: $\hat{M} = \arg\max_M p(d|M)$ • Bayesian estimator: • First consider the posterior: $p(M|d) = p(d|M)p(M)/p(d)$ • Then, consider the mean or mode of the posterior dist. • $p(d|M)$: sampling distribution (of data) • $p(M) = p(\theta_1, \dots, \theta_N)$: our prior on the model parameters • Conjugate = prior can be interpreted as “extra”/“pseudo” data • The Dirichlet distribution is a conjugate prior for the multinomial sampling distribution, with parameters $\alpha_i$ acting as “extra”/“pseudo” counts, e.g., $\alpha_i = \mu\,p(w_i|REF)$

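  13. Dirichlet Prior Smoothing (cont.) • Posterior distribution of parameters: $p(\theta|d) = Dir(\theta \mid c(w_1)+\alpha_1, \dots, c(w_N)+\alpha_N)$ • The predictive distribution is the same as the mean: $p(w_i|\hat{\theta}) = \frac{c(w_i)+\alpha_i}{|d| + \sum_{j=1}^{N}\alpha_j}$ • Bayesian estimate (what happens as $|d| \to \infty$?)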
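
To make the pseudo-count interpretation concrete, the sketch below compares the ML estimate with the posterior-mean estimate under a Dirichlet prior. The sequence, the alphabet, and the choice of one pseudo-count per base are assumptions made for illustration.

```python
from collections import Counter


def ml_estimate(seq, alphabet):
    """Relative frequencies: p(w) = c(w) / |d|."""
    counts = Counter(seq)
    return {w: counts[w] / len(seq) for w in alphabet}


def dirichlet_estimate(seq, alphabet, alpha):
    """Posterior mean under a Dirichlet prior: (c(w) + alpha_w) / (|d| + sum(alpha))."""
    counts = Counter(seq)
    total = len(seq) + sum(alpha[w] for w in alphabet)
    return {w: (counts[w] + alpha[w]) / total for w in alphabet}


alphabet = "ACGT"
seq = "AATG"                              # small sample: C never occurs
alpha = {w: 1.0 for w in alphabet}        # hypothetical prior: one pseudo-count per base

ml = ml_estimate(seq, alphabet)           # assigns C probability 0
bayes = dirichlet_estimate(seq, alphabet, alpha)
# With much more data, the prior's influence fades
bayes_long = dirichlet_estimate(seq * 1000, alphabet, alpha)
print(ml, bayes, bayes_long)
```

The smoothed estimate gives the unseen base C a nonzero probability, and as the data grow the Bayesian estimate approaches the ML estimate.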

  14. Likelihood: p(D|) D=(c1,…,cN) Prior: p() : prior mode : posterior mode ml: ML estimate Illustration of Bayesian Estimation Posterior: p(|D) p(D|)p() 

  15. Basic Concepts in Information Theory • Entropy: Measuring uncertainty of a random variable • Kullback-Leibler divergence: comparing two distributions • Mutual Information: measuring the correlation of two random variables

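  16. Entropy • Entropy $H(X)$ measures the average uncertainty of random variable $X$: $H(X) = -\sum_{x} p(x)\log_2 p(x)$, with $0 \log 0 = 0$ • Example: a fair coin, $p(H) = p(T) = 0.5$, has $H(X) = 1$ bit; a completely biased coin, $p(H) = 1$, has $H(X) = 0$ • Properties: $H(X) \ge 0$; min = 0; max = $\log M$, where $M$ is the total number of values of $X$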

  17. Interpretations of H(X) • Measures the “amount of information” in X • Think of each value of X as a “message” • Think of X as a random experiment (20 questions) • Minimum average number of bits to compress values of X • The more random X is, the harder it is to compress • A fair coin has the maximum information, and is hardest to compress • A biased coin has some information, and can be compressed to <1 bit on average • A completely biased coin has no information, and needs 0 bits
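
The three coins above can be checked with a short entropy function. This is a minimal sketch: the function name and the 0.9/0.1 bias are choices made here, and entropy is measured in bits (log base 2).

```python
import math


def entropy(probs):
    """Entropy in bits; outcomes with p = 0 contribute nothing (0 log 0 = 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


fair = entropy([0.5, 0.5])       # maximum for two outcomes: 1 bit
biased = entropy([0.9, 0.1])     # strictly between 0 and 1 bit
certain = entropy([1.0, 0.0])    # no uncertainty: 0 bits
uniform4 = entropy([0.25] * 4)   # log2(4) = 2 bits, e.g. a uniform DNA model
print(fair, biased, certain, uniform4)
```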

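  18. Cross Entropy H(p,q) • What if we encode X with a code optimized for a wrong distribution q? Expected # of bits: $H(p,q) = -\sum_{x} p(x)\log q(x)$ • Intuitively, $H(p,q) \ge H(p)$, and mathematically, $H(p,q) - H(p) = \sum_{x} p(x)\log\frac{p(x)}{q(x)} \ge 0$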

  19. Kullback-Leibler Divergence D(p||q) • What if we encode X with a code optimized for a wrong distribution q? How many bits would we waste? $D(p||q) = H(p,q) - H(p) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}$ (relative entropy) • Properties: • $D(p||q) \ge 0$ • $D(p||q) \ne D(q||p)$ in general • $D(p||q) = 0$ iff $p = q$ • KL-divergence is often used to measure the distance between two distributions, although it is not symmetric and so is not a true metric • Interpretation: • Fix p, D(p||q) and H(p,q) vary in the same way • If p is an empirical distribution, minimizing D(p||q) or H(p,q) is equivalent to maximizing likelihood
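
The properties listed on this slide can be verified numerically. The sketch below assumes distributions are given as aligned lists of probabilities and uses log base 2; the function names and the example distributions are assumptions made for illustration.

```python
import math


def cross_entropy(p, q):
    """H(p, q) in bits: expected code length when data from p is coded for q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf   # q gives zero probability to a possible outcome
            total -= pi * math.log2(qi)
    return total


def entropy(p):
    """H(p) is the cross entropy of p with itself."""
    return cross_entropy(p, p)


def kl_divergence(p, q):
    """D(p || q) = H(p, q) - H(p): the bits wasted by coding for q instead of p."""
    return cross_entropy(p, q) - entropy(p)


p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(d_pq, d_qp, kl_divergence(p, p))
```

The two directions give different values, which is why KL-divergence is a divergence and not a symmetric distance.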

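  20. Cross Entropy, KL-Div, and Likelihood • Likelihood: $p(d|M) = \prod_{i=1}^{N} p(w_i|M)^{c(w_i)}$ • Log-likelihood: $\log p(d|M) = \sum_{i=1}^{N} c(w_i)\log p(w_i|M) = -|d|\,H(\tilde{p}, p_M)$, where $\tilde{p}(w_i) = c(w_i)/|d|$ is the empirical distribution • Criterion for estimating a good model: $\hat{M} = \arg\max_M \log p(d|M) = \arg\min_M H(\tilde{p}, p_M) = \arg\min_M D(\tilde{p}\,\|\,p_M)$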
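
The link between likelihood and cross entropy can be checked on a toy sequence: the log-likelihood of a sequence equals minus its length times the cross entropy between the empirical distribution and the model. The candidate models and their names below are invented for illustration.

```python
import math
from collections import Counter


def log_likelihood(seq, model):
    """log2 P(seq | model), assuming independent draws."""
    return sum(math.log2(model[w]) for w in seq)


def cross_entropy(p, q):
    """H(p, q) in bits for distributions stored as dicts."""
    return -sum(p[w] * math.log2(q[w]) for w in p if p[w] > 0)


seq = "AATGC"
counts = Counter(seq)
empirical = {w: c / len(seq) for w, c in counts.items()}

# Hypothetical candidate models (illustrative parameters only)
candidates = {
    "uniform": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "at_rich": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "gc_rich": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
}

best_by_likelihood = max(candidates, key=lambda m: log_likelihood(seq, candidates[m]))
best_by_cross_entropy = min(candidates, key=lambda m: cross_entropy(empirical, candidates[m]))
print(best_by_likelihood, best_by_cross_entropy)
```

Both criteria select the same candidate, as the identity predicts.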

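  21. Mutual Information I(X;Y) • Comparing two distributions, p(x,y) vs p(x)p(y): $I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} = H(X) - H(X|Y) = H(Y) - H(Y|X)$ • Conditional entropy: $H(Y|X) = -\sum_{x,y} p(x,y)\log p(y|x)$ • Properties: $I(X;Y) \ge 0$; $I(X;Y) = I(Y;X)$; $I(X;Y) = 0$ iff X & Y are independent • Interpretations: • Measures how much the uncertainty of X is reduced given info. about Y • Measures correlation (statistical dependence) between X and Y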
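
Mutual information can be computed directly from a joint distribution table. The sketch below assumes the joint is given as a list of rows, with `joint[i][j]` standing for p(x_i, y_j); the function name and the two example tables are assumptions made for illustration.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits, where joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi


independent = [[0.25, 0.25], [0.25, 0.25]]     # p(x,y) = p(x)p(y)
identical = [[0.5, 0.0], [0.0, 0.5]]           # Y always equals X
mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
print(mi_independent, mi_identical)
```

Independent variables share no information, while knowing one of two identical fair bits removes the full 1 bit of uncertainty about the other.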

  22. What You Should Know • Probability concepts: • sample space, event, random variable, conditional prob., multinomial distribution, etc. • Bayes formula and its interpretation • Statistics: know how to compute a maximum likelihood estimate • Information theory concepts: • entropy, cross entropy, relative entropy, conditional entropy, KL-div., mutual information, and their relationship
