Parameter Estimation using likelihood functions, Tutorial #1
This class has been cut and slightly edited from Nir Friedman's full course of 12 lectures, which is available at www.cs.huji.ac.il/~pmai. Changes made by Dan Geiger and Ydo Wexler.
Example: Binomial Experiment
• When a thumbtack is tossed, it can land in one of two positions: Head or Tail.
• We denote by θ the (unknown) probability P(H).
Estimation task:
• Given a sequence of toss samples x[1], x[2], …, x[M] we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ.
Statistical Parameter Fitting
• Consider instances x[1], x[2], …, x[M] such that
  – the set of values that x can take is known,
  – each is sampled from the same distribution,
  – each is sampled independently of the rest.
  These are i.i.d. (independent and identically distributed) samples.
• The task is to find a vector of parameters Θ that has generated the given data. This parameter vector can then be used to predict future data.
The Likelihood Function
• How good is a particular θ? It depends on how likely it is to generate the observed data.
• The likelihood for the sequence H, T, T, H, H is
  L(θ) = θ · (1 − θ) · (1 − θ) · θ · θ = θ^3 (1 − θ)^2
(Figure: plot of L(θ) for θ in [0, 1].)
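A minimal Python sketch (not from the original slides) of this computation: it evaluates L(θ) = θ^3 (1 − θ)^2 on a grid of θ values and locates the maximum; the grid spacing is an arbitrary illustrative choice.

    # Likelihood of an i.i.d. Bernoulli (thumbtack) sample as a function of theta.
    def likelihood(theta, n_heads, n_tails):
        return theta ** n_heads * (1 - theta) ** n_tails

    # Sequence H, T, T, H, H  ->  N_H = 3, N_T = 2
    thetas = [i / 100 for i in range(101)]
    values = [likelihood(t, 3, 2) for t in thetas]
    best = thetas[values.index(max(values))]
    print(best)  # 0.6, the maximizer of theta^3 * (1 - theta)^2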
Sufficient Statistics
• To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):
  L(θ) = θ^{N_H} (1 − θ)^{N_T}
• N_H and N_T are sufficient statistics for the binomial distribution.
Sufficient Statistics
• A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
• Formally, s(D) is a sufficient statistic if for any two datasets D and D′:
  s(D) = s(D′)  ⇒  L_D(θ) = L_D′(θ)
(Figure: mapping from the space of datasets to the space of statistics.)
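To see the definition in action, here is a small illustrative check (an assumed example, not from the slides): two datasets with the same counts (N_H, N_T) but a different ordering of tosses induce exactly the same likelihood function.

    # Two datasets with equal sufficient statistics give identical likelihoods.
    def likelihood(theta, data):
        n_h = data.count("H")
        n_t = data.count("T")
        return theta ** n_h * (1 - theta) ** n_t

    d1 = "HTTHH"
    d2 = "HHHTT"   # same counts (3 heads, 2 tails), different order
    print(all(abs(likelihood(t / 20, d1) - likelihood(t / 20, d2)) < 1e-12
              for t in range(21)))  # True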
Maximum Likelihood Estimation
MLE Principle: Choose parameters that maximize the likelihood function.
• This is one of the most commonly used estimators in statistics.
• Intuitively appealing.
• One usually maximizes the log-likelihood function, defined as l_D(θ) = log_e L_D(θ).
Example: MLE in Binomial Data
• Applying the MLE principle we get
  θ̂ = N_H / (N_H + N_T)
  (which coincides with what one would expect).
• Example: (N_H, N_T) = (3, 2), so the MLE estimate is 3/5 = 0.6.
(Figure: plot of L(θ) for θ in [0, 1], maximized at θ = 0.6.)
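A one-line sketch of the closed-form estimator, assuming the formula θ̂ = N_H / (N_H + N_T) shown above:

    # Closed-form MLE for the binomial/thumbtack model.
    def mle_binomial(n_heads, n_tails):
        return n_heads / (n_heads + n_tails)

    print(mle_binomial(3, 2))  # 0.6, matching the (N_H, N_T) = (3, 2) example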
From Binomial to Multinomial
• For example, suppose X can take the values 1, 2, …, K (for example, a die has 6 sides).
• We want to learn the parameters θ_1, θ_2, …, θ_K.
• Sufficient statistics: N_1, N_2, …, N_K, the number of times each outcome is observed.
• Likelihood function: L_D(θ) = ∏_{k=1..K} θ_k^{N_k}
• MLE: θ̂_k = N_k / ∑_ℓ N_ℓ
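A short illustrative sketch of the multinomial MLE: each probability is estimated by its observed relative frequency. The die rolls below are made-up data.

    # MLE for a multinomial distribution from a list of observed outcomes.
    from collections import Counter

    def mle_multinomial(observations):
        counts = Counter(observations)
        total = sum(counts.values())
        return {outcome: n / total for outcome, n in counts.items()}

    rolls = [1, 6, 3, 6, 2, 6, 4, 6, 5, 6]   # illustrative die rolls
    print(mle_multinomial(rolls))            # e.g. 6 gets probability 0.5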
Example: Multinomial
• Let x_1 x_2 … x_n be a protein sequence.
• We want to learn the parameters q_1, q_2, …, q_20 corresponding to the frequencies of the 20 amino acids.
• N_1, N_2, …, N_20: the number of times each amino acid is observed in the sequence.
• Likelihood function: L_D(q) = ∏_{i=1..20} q_i^{N_i}
• MLE: q̂_i = N_i / n
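The same idea applied to a protein string. The peptide below is a made-up example, and the last line simply evaluates the log-likelihood defined earlier at the fitted parameters.

    # Fit q_i = N_i / n for one protein sequence, then evaluate its log-likelihood.
    import math
    from collections import Counter

    sequence = "MKVLAAGILLLAAGV"                     # hypothetical peptide, for illustration
    counts = Counter(sequence)
    n = len(sequence)
    q_hat = {aa: c / n for aa, c in counts.items()}  # MLE of the amino-acid frequencies

    log_lik = sum(c * math.log(q_hat[aa]) for aa, c in counts.items())
    print(q_hat, log_lik)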
Is MLE all we need?
• Suppose that after 10 observations, ML estimates P(H) = 0.7 for the thumbtack. Would you bet on heads for the next toss?
• Suppose now that after 10 observations, ML estimates P(H) = 0.7 for a coin. Would you place the same bet?
• Solution: the Bayesian approach, which incorporates your subjective prior knowledge. E.g., you may know a priori that some amino acids have the same frequencies and some have low frequencies. How would one use this information?
Bayes' rule
Bayes' rule:
  P(A | B) = P(B | A) P(A) / P(B)
where
  P(B) = P(B | A) P(A) + P(B | not A) P(not A)
It holds because:
  P(A | B) P(B) = P(A, B) = P(B | A) P(A)
Example: Dishonest Casino
• A casino uses two kinds of dice: 99% are fair, 1% are loaded (a 6 comes up 50% of the time).
• We pick a die at random and roll it 3 times.
• We get 3 consecutive sixes.
• What is the probability the die is loaded?
Dishonest Casino (cont.)
The solution is based on using Bayes' rule and the fact that while P(loaded | 3 sixes) is not known, the other three terms in Bayes' rule are known, namely:
• P(3 sixes | loaded) = (0.5)^3
• P(loaded) = 0.01
• P(3 sixes) = P(3 sixes | loaded) P(loaded) + P(3 sixes | not loaded) (1 − P(loaded))
So P(loaded | 3 sixes) = P(3 sixes | loaded) P(loaded) / P(3 sixes).
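A small numeric check of this Bayes'-rule computation (values rounded in the comment):

    # Posterior probability that the die is loaded, given 3 consecutive sixes.
    p_loaded = 0.01
    p_sixes_given_loaded = 0.5 ** 3
    p_sixes_given_fair = (1 / 6) ** 3

    p_sixes = p_sixes_given_loaded * p_loaded + p_sixes_given_fair * (1 - p_loaded)
    p_loaded_given_sixes = p_sixes_given_loaded * p_loaded / p_sixes
    print(round(p_loaded_given_sixes, 3))   # about 0.21: most likely still a fair die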
Biological Example: Proteins
• Extracellular proteins have a slightly different amino acid composition than intracellular proteins.
• From a large enough protein database (SWISS-PROT), we can get the following:
  – p(int): the probability that any new sequence is intracellular
  – p(ext): the probability that any new sequence is extracellular
  – p(a_i | int): the frequency of amino acid a_i for intracellular proteins
  – p(a_i | ext): the frequency of amino acid a_i for extracellular proteins
Biological Example: Proteins (cont.)
• What is the probability that a given new protein sequence x = x_1 x_2 … x_n is extracellular?
• Assuming that every sequence is either extracellular or intracellular (but not both), we can write p(int) = 1 − p(ext).
• Thus, p(x) = p(x | ext) p(ext) + p(x | int) p(int).
Biological Example: Proteins (cont.)
• Using conditional probability (treating the amino acids as independent) we get
  p(x | ext) = ∏_i p(x_i | ext),  p(x | int) = ∏_i p(x_i | int)
• By Bayes' theorem,
  p(ext | x) = p(ext) ∏_i p(x_i | ext) / [ p(ext) ∏_i p(x_i | ext) + p(int) ∏_i p(x_i | int) ]
• The probabilities p(int), p(ext) are called the prior probabilities.
• The probability p(ext | x) is called the posterior probability.
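A sketch of this posterior computation under the independence assumption above. The amino-acid frequency tables and the 0.4 / 0.6 prior split are illustrative placeholders, not SWISS-PROT values.

    # Posterior p(ext | x) for a sequence x, assuming independent amino acids.
    import math

    p_ext, p_int = 0.4, 0.6
    freq_ext = {"A": 0.10, "G": 0.05, "K": 0.07}   # p(a_i | ext), hypothetical numbers
    freq_int = {"A": 0.08, "G": 0.07, "K": 0.06}   # p(a_i | int), hypothetical numbers

    def posterior_ext(x):
        log_ext = math.log(p_ext) + sum(math.log(freq_ext[a]) for a in x)
        log_int = math.log(p_int) + sum(math.log(freq_int[a]) for a in x)
        m = max(log_ext, log_int)                  # normalize in log space for stability
        return math.exp(log_ext - m) / (math.exp(log_ext - m) + math.exp(log_int - m))

    print(posterior_ext("AAGK"))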
Bayesian Parameter Estimation (step by step)
• View the unknown parameter θ as a random variable.
• Use probability to quantify the uncertainty about the unknown parameter.
• The update of the parameter follows from the rules of probability, using Bayes' rule:
  P(θ | D) = P(D | θ) P(θ) / P(D)
  (posterior of θ ∝ likelihood × prior of θ)
• The updated parameter is set to be the expected value of the posterior P(θ | D).
Example: Binomial Data Revisited
• Prior: say, a uniform distribution for θ in [0, 1], i.e., P(θ) = 1 (say, due to seeing one head (h = 1) and one tail (t = 1) before seeing the data).
• Then for data (N_H, N_T) = (4, 1):
  – MLE for θ is 4/5 = 0.8
  – Bayesian estimation gives
    E[θ | D] = ∫ θ P(θ | D) dθ = (N_H + 1) / (N_H + N_T + 2) = 5/7 ≈ 0.71
    (the normalizing constant is a Dirichlet integral)
(Figure: posterior of θ over [0, 1].)
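A sketch of the posterior-mean estimate with a uniform (Beta(1, 1)) prior, i.e., one imaginary head and one imaginary tail added to the observed counts:

    # Posterior mean of theta under imaginary prior counts h and t.
    def bayes_estimate(n_heads, n_tails, h=1, t=1):
        return (n_heads + h) / (n_heads + n_tails + h + t)

    print(bayes_estimate(4, 1))   # 5/7 ~ 0.714, versus the MLE 4/5 = 0.8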
Bayesian Estimation vs. MLE: the multinomial case
• The hyper-parameters α_1, …, α_K can be thought of as "imaginary" counts from our prior experience.
• Imaginary sample size: α = α_1 + … + α_K.
• The Bayesian estimate is θ̂_k = (N_k + α_k) / (N + α), versus the MLE N_k / N.
• The larger the imaginary sample size, the more confident we are in our prior, and the more it influences the outcome.
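A sketch of the smoothed estimate with imaginary counts α_k; the die counts and the choice α_k = 2 below are illustrative.

    # Multinomial estimate with Dirichlet "imaginary" counts alpha_k (posterior mean).
    def bayes_multinomial(counts, alphas):
        n = sum(counts.values())
        a = sum(alphas.values())
        return {k: (counts.get(k, 0) + alphas[k]) / (n + a) for k in alphas}

    counts = {1: 1, 2: 0, 3: 2, 4: 0, 5: 1, 6: 6}   # illustrative die counts (N = 10)
    alphas = {k: 2 for k in range(1, 7)}            # imaginary sample size = 12
    print(bayes_multinomial(counts, alphas))        # 6 gets (6+2)/22 ~ 0.36 vs MLE 0.6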
Protein Example Continued (estimation of θ_ext)
• Let's assume that from the biological literature we know that 40% of the proteins are extracellular and 60% are intracellular, so θ_ext = p(ext) = 0.4.
• Recall that we had
  p(ext | x) = p(ext) ∏_i p(x_i | ext) / [ p(ext) ∏_i p(x_i | ext) + p(int) ∏_i p(x_i | int) ]
• Now, given a protein sequence x, we have all the information we need to calculate p(ext | x).
Protein Example (cont.) (updating θ_ext)
• To update θ_ext after seeing the sequence x, we need to decide what weight we should give our prior knowledge.
• If our prior knowledge is worth seeing M sequences, then the update will be as follows:
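The slide's update formula is not reproduced above. A plausible reading, treating the prior as M imaginary sequences and the new sequence as contributing its posterior probability of being extracellular, is sketched below; the specific form (M · θ_ext + p(ext | x)) / (M + 1) is an assumption, not quoted from the original.

    # Hypothetical update of theta_ext: the prior counts as M imaginary sequences,
    # and the new sequence x contributes p(ext | x). Assumed form, not from the slides.
    def update_theta_ext(theta_ext, posterior_ext_given_x, M):
        return (M * theta_ext + posterior_ext_given_x) / (M + 1)

    print(update_theta_ext(0.4, 0.9, M=10))   # prior 0.4 nudged toward the new evidence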