Parameter Estimation using likelihood functions Tutorial #1

Parameter Estimation using likelihood functions Tutorial #1. This class has been cut and slightly edited from Nir Friedman’s full course of 12 lectures which is available at www.cs.huji.ac.il/~pmai . Changes made by Dan Geiger and Ydo Wexler. Example: Binomial Experiment.

Presentation Transcript


  1. Parameter Estimation using likelihood functions, Tutorial #1. This class has been cut and slightly edited from Nir Friedman’s full course of 12 lectures, which is available at www.cs.huji.ac.il/~pmai. Changes made by Dan Geiger and Ydo Wexler.

  2. Example: Binomial Experiment • When tossed, a thumbtack can land in one of two positions: Head or Tail • We denote by $\theta$ the (unknown) probability $P(H)$. Estimation task: • Given a sequence of toss samples $x[1], x[2], \ldots, x[M]$, we want to estimate the probabilities $P(H) = \theta$ and $P(T) = 1 - \theta$

  3. Statistical Parameter Fitting • Consider instances $x[1], x[2], \ldots, x[M]$ such that • The set of values that $x$ can take is known • Each is sampled from the same distribution • Each is sampled independently of the rest (together, the last two conditions make these i.i.d. samples) • The task is to find the vector of parameters $\theta$ that generated the given data. This parameter vector $\theta$ can be used to predict future data.

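  4. The Likelihood Function • How good is a particular $\theta$? It depends on how likely it is to generate the observed data • The likelihood for the sequence H, T, T, H, H is $L_D(\theta) = \theta \cdot (1-\theta) \cdot (1-\theta) \cdot \theta \cdot \theta = \theta^3 (1-\theta)^2$ [Figure: plot of $L(\theta)$ against $\theta$ on $[0, 1]$]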
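
The likelihood of the sequence H, T, T, H, H can also be evaluated numerically. The following is a minimal sketch, not part of the original slides; the function name `likelihood` and the 11-point grid are illustrative choices.

```python
def likelihood(theta, tosses):
    """Probability of an i.i.d. toss sequence when P(H) = theta."""
    result = 1.0
    for toss in tosses:
        # Each toss contributes theta for a head and (1 - theta) for a tail.
        result *= theta if toss == "H" else 1.0 - theta
    return result


tosses = ["H", "T", "T", "H", "H"]
grid = [i / 10 for i in range(11)]  # theta = 0.0, 0.1, ..., 1.0
values = [likelihood(theta, tosses) for theta in grid]

for theta, value in zip(grid, values):
    print(f"theta={theta:.1f}  L(theta)={value:.5f}")
```

On this grid the largest value occurs at theta = 0.6, which is where the plotted curve peaks.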

  5. Sufficient Statistics • To compute the likelihood in the thumbtack example we only require $N_H$ and $N_T$ (the number of heads and the number of tails), since $L_D(\theta) = \theta^{N_H} (1-\theta)^{N_T}$ • $N_H$ and $N_T$ are sufficient statistics for the binomial distribution

  6. Sufficient Statistics • A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood • Formally, $s(D)$ is a sufficient statistic if for any two datasets $D$ and $D'$: $s(D) = s(D') \Rightarrow L_D(\theta) = L_{D'}(\theta)$ [Figure: mapping from datasets to statistics]
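
The definition can be checked on a small case: two toss sequences with the same counts should have the same likelihood for every value of the parameter. This is an illustrative sketch; the helper names and the two example datasets are assumptions, not taken from the slides.

```python
from collections import Counter


def sufficient_stats(tosses):
    """Return (number of heads, number of tails) for a toss sequence."""
    counts = Counter(tosses)
    return counts["H"], counts["T"]


def likelihood_from_counts(theta, n_heads, n_tails):
    """Binomial likelihood written only in terms of the counts."""
    return theta ** n_heads * (1.0 - theta) ** n_tails


d1 = "HTTHH"  # the sequence used on the likelihood slide
d2 = "HHHTT"  # a different ordering with the same counts

stats1 = sufficient_stats(d1)
stats2 = sufficient_stats(d2)
print(stats1, stats2)

# Equal statistics imply equal likelihoods at every theta we try.
for theta in (0.1, 0.5, 0.6, 0.9):
    l1 = likelihood_from_counts(theta, *stats1)
    l2 = likelihood_from_counts(theta, *stats2)
    print(theta, l1, l2)
```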

  7. Maximum Likelihood Estimation • MLE Principle: Choose parameters that maximize the likelihood function • This is one of the most commonly used estimators in statistics • Intuitively appealing • One usually maximizes the log-likelihood function, defined as $l_D(\theta) = \log_e L_D(\theta)$

  8. Example: MLE in Binomial Data • Applying the MLE principle we get $\hat{\theta} = \frac{N_H}{N_H + N_T}$ (which coincides with what one would expect) • Example: $(N_H, N_T) = (3, 2)$; the ML estimate is $3/5 = 0.6$ [Figure: plot of $L(\theta)$ against $\theta$ on $[0, 1]$]
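
For readers who want the intermediate step, the binomial ML estimate follows from setting the derivative of the log-likelihood to zero. This derivation is a standard one added for convenience, under the assumption that $N_H > 0$ and $N_T > 0$ so the maximum lies strictly inside $(0, 1)$.

```latex
\begin{align*}
l_D(\theta) &= \log L_D(\theta) = N_H \log \theta + N_T \log (1 - \theta) \\
\frac{d\, l_D}{d\theta} &= \frac{N_H}{\theta} - \frac{N_T}{1 - \theta} = 0 \\
N_H (1 - \theta) &= N_T\, \theta \\
\hat{\theta} &= \frac{N_H}{N_H + N_T}
\end{align*}
```

With $(N_H, N_T) = (3, 2)$ this gives $\hat{\theta} = 3/5 = 0.6$.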

  9. From Binomial to Multinomial • Suppose $X$ can have the values $1, 2, \ldots, K$ (for example, a die has 6 sides) • We want to learn the parameters $\theta_1, \theta_2, \ldots, \theta_K$ • Sufficient statistics: $N_1, N_2, \ldots, N_K$, the number of times each outcome is observed • Likelihood function: $L_D(\theta) = \prod_{k=1}^{K} \theta_k^{N_k}$ • MLE: $\hat{\theta}_k = \frac{N_k}{\sum_{l=1}^{K} N_l}$

  10. Example: Multinomial • Let $x_1 x_2 \ldots x_n$ be a protein sequence • We want to learn the parameters $q_1, q_2, \ldots, q_{20}$ corresponding to the frequencies of the 20 amino acids • $N_1, N_2, \ldots, N_{20}$: the number of times each amino acid is observed in the sequence • Likelihood function: $L_D(q) = \prod_{k=1}^{20} q_k^{N_k}$ • MLE: $\hat{q}_k = \frac{N_k}{n}$
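
The multinomial ML estimate is just the relative frequency of each symbol. The sketch below illustrates this on a short made-up string; the sequence `protein` and the function name `multinomial_mle` are hypothetical and chosen only for the example.

```python
from collections import Counter


def multinomial_mle(sequence):
    """ML estimate of symbol frequencies: count of each symbol / sequence length."""
    counts = Counter(sequence)
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}


# Made-up 15-letter sequence, not a real protein.
protein = "MKVLAAGIVALLLAA"
estimates = multinomial_mle(protein)

for symbol, q in sorted(estimates.items()):
    print(f"q[{symbol}] = {q:.3f}")
```

Note that any amino acid that does not occur in the sequence gets no entry at all, i.e. an estimated frequency of zero, which is one reason to ask whether MLE is all we need.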

  11. Is MLE all we need? • Suppose that after 10 observations, ML estimates $P(H) = 0.7$ for the thumbtack. Would you bet on heads for the next toss? • Suppose now that after 10 observations, ML estimates $P(H) = 0.7$ for a coin. Would you place the same bet? • Solution: the Bayesian approach, which incorporates your subjective prior knowledge. E.g., you may know a priori that some amino acids have the same frequencies and some have low frequencies. How would one use this information?

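  12. Bayes’ rule • Bayes’ rule: $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$ • where $P(B) = \sum_{a} P(B \mid A = a)\, P(A = a)$ • It holds because $P(A, B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$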

  13. Example: Dishonest Casino • A casino uses 2 kinds of dice: 99% are fair; 1% are loaded, so that 6 comes up 50% of the time • We pick a die at random and roll it 3 times • We get 3 consecutive sixes • What is the probability that the die is loaded?

  14. Dishonest Casino (cont.) The solution is based on Bayes’ rule and the fact that, while $P(\text{loaded} \mid \text{3 sixes})$ is not known, the other three terms in Bayes’ rule are known, namely: • $P(\text{3 sixes} \mid \text{loaded}) = (0.5)^3$ • $P(\text{loaded}) = 0.01$ • $P(\text{3 sixes}) = P(\text{3 sixes} \mid \text{loaded})\, P(\text{loaded}) + P(\text{3 sixes} \mid \text{not loaded})\, (1 - P(\text{loaded}))$

  15. Dishonest Casino (cont.)
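
Plugging the three known terms into Bayes' rule gives the posterior probability that the die is loaded. The following sketch does the arithmetic with exact fractions; it assumes, as the word "fair" implies, that a fair die shows a six with probability 1/6. Variable names are illustrative.

```python
from fractions import Fraction

p_loaded = Fraction(1, 100)     # 1% of the dice are loaded
p_six_loaded = Fraction(1, 2)   # a loaded die shows six half of the time
p_six_fair = Fraction(1, 6)     # assumption: a fair die shows six 1/6 of the time
n_rolls = 3

# Likelihood of three consecutive sixes under each hypothesis.
like_loaded = p_six_loaded ** n_rolls
like_fair = p_six_fair ** n_rolls

# P(3 sixes) by summing over the two kinds of dice.
evidence = like_loaded * p_loaded + like_fair * (1 - p_loaded)

# Bayes' rule: P(loaded | 3 sixes).
posterior = like_loaded * p_loaded / evidence
print(posterior, float(posterior))
```

The result is 3/14, roughly 0.21: even after three sixes in a row, the die is still more likely fair than loaded, because loaded dice are rare to begin with.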

  16. Biological Example: Proteins • Extracellular proteins have a slightly different amino acid composition than intracellular proteins. • From a large enough protein database (SWISS-PROT), we can get the following: • $p(a_i \mid \text{int})$: the frequency of amino acid $a_i$ for intracellular proteins • $p(a_i \mid \text{ext})$: the frequency of amino acid $a_i$ for extracellular proteins • $p(\text{int})$: the probability that any new sequence is intracellular • $p(\text{ext})$: the probability that any new sequence is extracellular

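  17. Biological Example: Proteins (cont.) • What is the probability that a given new protein sequence $x = x_1 x_2 \ldots x_n$ is extracellular? • Assuming that every sequence is either extracellular or intracellular (but not both) we can write $p(\text{int}) = 1 - p(\text{ext})$. • Thus, $p(x) = p(\text{ext})\, p(x \mid \text{ext}) + p(\text{int})\, p(x \mid \text{int})$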

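  18. Biological Example: Proteins (cont.) • Using conditional probability we get $p(x \mid \text{ext}) = \prod_{i=1}^{n} p(x_i \mid \text{ext})$ and $p(x \mid \text{int}) = \prod_{i=1}^{n} p(x_i \mid \text{int})$ • By Bayes’ theorem, $P(\text{ext} \mid x) = \frac{p(\text{ext}) \prod_i p(x_i \mid \text{ext})}{p(\text{ext}) \prod_i p(x_i \mid \text{ext}) + p(\text{int}) \prod_i p(x_i \mid \text{int})}$ • The probabilities $p(\text{int})$, $p(\text{ext})$ are called the prior probabilities. • The probability $P(\text{ext} \mid x)$ is called the posterior probability.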
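
A posterior of this kind can be computed in a few lines. The sketch below is only an illustration under stated assumptions: it treats the positions of the sequence as independent given the class, uses a toy three-letter alphabet instead of 20 amino acids, and all frequency values are invented rather than taken from SWISS-PROT. The function name `posterior_ext` is hypothetical.

```python
import math


def posterior_ext(x, p_ext, freq_ext, freq_int):
    """P(ext | x) by Bayes' rule, assuming independent positions given the class."""
    p_int = 1.0 - p_ext
    # Work in log space so long sequences do not underflow.
    log_ext = math.log(p_ext) + sum(math.log(freq_ext[a]) for a in x)
    log_int = math.log(p_int) + sum(math.log(freq_int[a]) for a in x)
    # Subtract the larger term before exponentiating, then normalise.
    shift = max(log_ext, log_int)
    w_ext = math.exp(log_ext - shift)
    w_int = math.exp(log_int - shift)
    return w_ext / (w_ext + w_int)


# Invented frequencies over a toy alphabet {A, C, D}; each table sums to 1.
freq_ext = {"A": 0.5, "C": 0.3, "D": 0.2}
freq_int = {"A": 0.3, "C": 0.2, "D": 0.5}
p_ext = 0.4  # prior probability of "extracellular"

for x in ("", "A", "AAC", "DDD"):
    print(repr(x), round(posterior_ext(x, p_ext, freq_ext, freq_int), 4))
```

With an empty sequence the posterior equals the prior; sequences rich in letters that are more frequent under one class move the posterior toward that class.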

  19. Bayesian Parameter Estimation (step by step) • View the unknown parameter $\theta$ as a random variable. • Use probability to quantify the uncertainty about the unknown parameter. • Updating the parameter follows from the rules of probability, using Bayes’ rule: $P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$, where $P(\theta \mid D)$ is the posterior of $\theta$, $P(D \mid \theta)$ is the likelihood and $P(\theta)$ is the prior of $\theta$ • The updated parameter is set to be the expected value of the posterior.

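  20. Example: Binomial Data Revisited • Prior: say, a uniform distribution for $\theta$ in $[0, 1]$: $P(\theta) = 1$ (say, due to seeing one head, $h = 1$, and one tail, $t = 1$, before seeing the data) • Then for data $(N_H, N_T) = (4, 1)$: • MLE for $\theta$ is $4/5 = 0.8$ • Bayesian estimation gives $\frac{\int_0^1 \theta \cdot \theta^{4} (1-\theta)\, d\theta}{\int_0^1 \theta^{4} (1-\theta)\, d\theta} = \frac{N_H + 1}{N_H + N_T + 2} = \frac{5}{7} \approx 0.714$, where each integral is a Dirichlet integral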
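
The two estimates for $(N_H, N_T) = (4, 1)$ can be compared directly. In this sketch the closed form $(N_H + 1)/(N_H + N_T + 2)$ is the standard posterior mean under a uniform prior; the numerical integration is only a sanity check, and the function names and step count are illustrative.

```python
from fractions import Fraction


def ml_estimate(n_heads, n_tails):
    """Maximum likelihood estimate of P(H)."""
    return Fraction(n_heads, n_heads + n_tails)


def bayes_estimate_uniform(n_heads, n_tails):
    """Posterior mean of theta under a uniform prior on [0, 1]."""
    return Fraction(n_heads + 1, n_heads + n_tails + 2)


def posterior_mean_numeric(n_heads, n_tails, steps=10_000):
    """Midpoint-rule approximation of E[theta | data] under a uniform prior."""
    numerator = 0.0
    denominator = 0.0
    for i in range(steps):
        theta = (i + 0.5) / steps
        weight = theta ** n_heads * (1.0 - theta) ** n_tails  # likelihood x prior (= 1)
        numerator += theta * weight
        denominator += weight
    # The common step width cancels in the ratio.
    return numerator / denominator


print("MLE:  ", ml_estimate(4, 1))
print("Bayes:", bayes_estimate_uniform(4, 1))
print("Bayes (numeric):", posterior_mean_numeric(4, 1))
```

The Bayesian estimate 5/7 sits between the ML estimate 4/5 and the prior mean 1/2, as if one extra head and one extra tail had been observed.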

  21. The multinomial case (Bayesian Estimation vs. MLE) • The hyper-parameters $\alpha_1, \ldots, \alpha_K$ can be thought of as “imaginary” counts from our prior experience • Imaginary sample size $= \alpha_1 + \ldots + \alpha_K$ • The larger the imaginary sample size, the more confident we are in our prior and the more it influences the outcome.
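
The effect of the imaginary sample size can be seen by adding the hyper-parameters to the observed counts before normalising, which is the standard posterior mean under a Dirichlet prior. The sketch below is illustrative: the die rolls, the two priors and the function name `dirichlet_posterior_mean` are assumptions made for the example.

```python
def dirichlet_posterior_mean(counts, alphas):
    """Estimate (N_k + alpha_k) / (N + sum of alphas) for every outcome k."""
    total = sum(counts.get(k, 0) for k in alphas) + sum(alphas.values())
    return {k: (counts.get(k, 0) + alphas[k]) / total for k in alphas}


# Hypothetical data: a die rolled 6 times.
counts = {1: 2, 2: 1, 6: 3}
n_rolls = sum(counts.values())
mle = {k: counts.get(k, 0) / n_rolls for k in range(1, 7)}

# Two priors that both favour a fair die but differ in imaginary sample size.
weak_prior = {k: 1 for k in range(1, 7)}     # imaginary sample size 6
strong_prior = {k: 10 for k in range(1, 7)}  # imaginary sample size 60

weak = dirichlet_posterior_mean(counts, weak_prior)
strong = dirichlet_posterior_mean(counts, strong_prior)

for k in range(1, 7):
    print(k, round(mle[k], 3), round(weak[k], 3), round(strong[k], 3))
```

Unlike the ML estimate, neither Bayesian estimate assigns probability zero to the faces that were never rolled, and the stronger prior keeps every estimate closer to 1/6.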

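  22. Protein Example Continued (estimation of $\theta_{\text{ext}}$) • Let’s assume that from biological literature we know that 40% of the proteins are extracellular and 60% are intracellular • $\theta_{\text{ext}} = p(\text{ext}) = 0.4$ • Recall that we had $P(\text{ext} \mid x) = \frac{p(\text{ext}) \prod_i p(x_i \mid \text{ext})}{p(\text{ext}) \prod_i p(x_i \mid \text{ext}) + p(\text{int}) \prod_i p(x_i \mid \text{int})}$ • Now, given a protein sequence $x$, we have all the information we need to calculate $p(\text{ext} \mid x)$.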

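  23. Protein Example (cont.) (updating $\theta_{\text{ext}}$) • To update $\theta_{\text{ext}}$ after seeing the sequence $x$, we need to decide what weight we should give our prior knowledge. • If our prior knowledge is worth seeing $M$ sequences, then the update will be as follows: $\theta_{\text{ext}} \leftarrow \frac{M\, \theta_{\text{ext}} + p(\text{ext} \mid x)}{M + 1}$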
