Tutorial 6
• Bias and variance of estimators
• The score and Fisher information
• Cramer-Rao inequality
236607 Visual Recognition Tutorial
Estimators and their Properties
• Let $\{p(x|\theta) : \theta \in \Theta\}$ be a parametric set of distributions. Given a sample $D = \{x_1, \dots, x_n\}$ drawn i.i.d. from one of the distributions in the set, we would like to estimate its parameter $\theta$ (thus identifying the distribution).
• An estimator for $\theta$ w.r.t. $D$ is any function $\hat\theta(D)$ of the sample. Notice that an estimator is a random variable.
• How do we measure the quality of an estimator?
• Consistency: An estimator $\hat\theta(D)$ for $\theta$ is consistent if $\hat\theta(D) \to \theta$ in probability as $n \to \infty$. This is a (desirable) asymptotic property that motivates us to acquire large samples. But we should emphasize that we are also interested in measures for finite (and small!) sample sizes.
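Estimators and their Properties
• Bias: Define the bias of an estimator $\hat\theta$ to be $b(\hat\theta) = E_\theta[\hat\theta(D)] - \theta$. Here, the expectation is w.r.t. the distribution $p(D|\theta)$. The estimator is unbiased if its bias is zero for every $\theta$.
• Example: the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and a single observation such as $x_1$, as estimators for the mean of a normal distribution, are both unbiased. The estimator $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ for its variance is biased, whereas the estimator $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ is unbiased.
• Variance: another important property of an estimator is its variance $\mathrm{Var}_\theta(\hat\theta) = E_\theta[(\hat\theta - E_\theta[\hat\theta])^2]$. We would like to find estimators with minimum bias and variance.
• Which is more important, bias or variance?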
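The bias of the two variance estimators can be checked by simulation. The following is a minimal sketch, not part of the original slides: the sample size, the number of trials, the normal parameters and the function name `variance_estimator_means` are all illustrative assumptions.

```python
import random

def variance_estimator_means(n=5, trials=20000, mu=0.0, sigma=2.0, seed=0):
    """Average the 1/n and 1/(n-1) variance estimators over many normal samples."""
    rng = random.Random(seed)
    biased_sum = 0.0
    unbiased_sum = 0.0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(sample) / n
        squared_dev = sum((x - mean) ** 2 for x in sample)
        biased_sum += squared_dev / n          # divides by n
        unbiased_sum += squared_dev / (n - 1)  # divides by n - 1
    return biased_sum / trials, unbiased_sum / trials

biased_mean, unbiased_mean = variance_estimator_means()
print(f"true variance             : {2.0 ** 2:.3f}")
print(f"average of 1/n estimator  : {biased_mean:.3f}")
print(f"average of 1/(n-1) estim. : {unbiased_mean:.3f}")
```

With these assumed settings the 1/n estimator should average close to $\frac{n-1}{n}\sigma^2 = 3.2$, while the 1/(n-1) estimator should average close to the true variance 4.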
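Risky Estimators
• Employ our decision-theoretic framework to measure the quality of estimators.
• Abbreviate $\hat\theta = \hat\theta(D)$ and consider the square error loss function $L(\hat\theta, \theta) = (\hat\theta - \theta)^2$.
• The conditional risk associated with $\hat\theta$ when $\theta$ is the true parameter: $R(\hat\theta|\theta) = E_\theta[(\hat\theta - \theta)^2]$.
• Claim: $R(\hat\theta|\theta) = \mathrm{Var}_\theta(\hat\theta) + b(\hat\theta)^2$, where $b(\hat\theta) = E_\theta[\hat\theta] - \theta$ is the bias.
• Proof: Write $\hat\theta - \theta = (\hat\theta - E_\theta[\hat\theta]) + (E_\theta[\hat\theta] - \theta)$. Then $E_\theta[(\hat\theta - \theta)^2] = E_\theta[(\hat\theta - E_\theta[\hat\theta])^2] + 2\,(E_\theta[\hat\theta] - \theta)\,E_\theta[\hat\theta - E_\theta[\hat\theta]] + (E_\theta[\hat\theta] - \theta)^2 = \mathrm{Var}_\theta(\hat\theta) + b(\hat\theta)^2$, since the middle term vanishes.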
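The decomposition of the risk into variance plus squared bias can be verified numerically. The sketch below is an illustration under assumptions of my own: a shrunken sample mean `shrink * mean(x)` as the (biased) estimator, normal data, and the hypothetical function name `risk_decomposition`. For empirical moments taken over the same set of simulated estimates, the identity holds exactly up to rounding.

```python
import random

def risk_decomposition(theta=1.0, sigma=1.0, n=10, shrink=0.8, trials=5000, seed=1):
    """Simulate a biased estimator and return (risk, variance, bias)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        sample = [rng.gauss(theta, sigma) for _ in range(n)]
        estimates.append(shrink * sum(sample) / n)  # shrunken sample mean
    mean_est = sum(estimates) / trials
    risk = sum((t - theta) ** 2 for t in estimates) / trials        # E[(est - theta)^2]
    variance = sum((t - mean_est) ** 2 for t in estimates) / trials  # Var(est)
    bias = mean_est - theta                                          # E[est] - theta
    return risk, variance, bias

risk, variance, bias = risk_decomposition()
print(f"risk              = {risk:.6f}")
print(f"variance + bias^2 = {variance + bias ** 2:.6f}")
```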
Bias vs. Variance
• So, for a given level of conditional risk, there is a tradeoff between bias and variance.
• This tradeoff is among the most important facts in pattern recognition and machine learning.
• Classical approach: Consider only unbiased estimators and try to find those with minimum possible variance.
• This approach is not always fruitful:
  • Unbiasedness only means that the average of the estimator (w.r.t. $p(D|\theta)$) is $\theta$. It doesn't mean the estimate will be near $\theta$ for a particular sample (if the variance is large).
  • In general, an unbiased estimator is not guaranteed to exist.
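A small worked example shows why restricting attention to unbiased estimators can cost risk. Assume, for illustration only, $n$ i.i.d. $N(\theta, \sigma^2)$ observations and the estimator $c\,\bar{x}$; its variance is $c^2\sigma^2/n$ and its bias is $(c-1)\theta$, so the risk is available in closed form. The function name `shrinkage_risk` and the parameter values are hypothetical.

```python
def shrinkage_risk(c, theta, sigma, n):
    """Squared-error risk of the estimator c * mean(x) for N(theta, sigma^2) data."""
    variance = c ** 2 * sigma ** 2 / n
    bias = (c - 1.0) * theta
    return variance + bias ** 2

theta, sigma, n = 1.0, 2.0, 4
unbiased_risk = shrinkage_risk(1.0, theta, sigma, n)  # the plain sample mean
shrunk_risk = shrinkage_risk(0.5, theta, sigma, n)    # biased, but lower variance
print(f"risk of sample mean       : {unbiased_risk:.3f}")
print(f"risk of 0.5 * sample mean : {shrunk_risk:.3f}")
```

Here the biased estimator has half the risk of the unbiased one. The best shrinkage factor depends on the unknown $\theta$, so this illustrates the tradeoff rather than giving a usable recipe.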
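The Score
• The score of the family $p(x|\theta)$ is the random variable $V = V(x, \theta) = \frac{\partial}{\partial\theta}\ln p(x|\theta) = \frac{\partial p(x|\theta)/\partial\theta}{p(x|\theta)}$. It measures the “sensitivity” of $p(x|\theta)$ as a function of the parameter $\theta$.
• Claim: $E_\theta[V] = 0$.
• Proof: $E_\theta[V] = \int p(x|\theta)\,\frac{\partial p(x|\theta)/\partial\theta}{p(x|\theta)}\,dx = \int \frac{\partial}{\partial\theta}p(x|\theta)\,dx = \frac{\partial}{\partial\theta}\int p(x|\theta)\,dx = \frac{\partial}{\partial\theta}1 = 0$ (assuming differentiation and integration can be interchanged).
• Corollary: $\mathrm{Var}_\theta(V) = E_\theta[V^2]$.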
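The Score - Example
• Consider the normal distribution $N(\mu, \sigma^2)$ with $\sigma$ known: $p(x|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
• clearly, $\ln p(x|\mu) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$
• and $V = \frac{\partial}{\partial\mu}\ln p(x|\mu) = \frac{x-\mu}{\sigma^2}$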
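The Score - Vector Form
• In case where $\theta = (\theta_1, \dots, \theta_k)$ is a vector, the score is the vector $V = (V_1, \dots, V_k)$ whose $i$th component is $V_i = \frac{\partial}{\partial\theta_i}\ln p(x|\theta)$.
• Example: for the normal family $N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma)$, $V_1 = \frac{x-\mu}{\sigma^2}$ and $V_2 = \frac{(x-\mu)^2 - \sigma^2}{\sigma^3}$.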
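A vector score can be sanity-checked against a finite-difference gradient of the log-density. The sketch below assumes the normal family parameterized by $\theta = (\mu, \sigma)$, chosen here for illustration; the function names and the evaluation point are hypothetical.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

def score_vector(x, mu, sigma):
    """Analytic score: partial derivatives of the log-density w.r.t. mu and sigma."""
    return ((x - mu) / sigma ** 2, ((x - mu) ** 2 - sigma ** 2) / sigma ** 3)

def numeric_score(x, mu, sigma, h=1e-6):
    """Central finite-difference approximation of the same two derivatives."""
    d_mu = (log_normal_pdf(x, mu + h, sigma) - log_normal_pdf(x, mu - h, sigma)) / (2 * h)
    d_sigma = (log_normal_pdf(x, mu, sigma + h) - log_normal_pdf(x, mu, sigma - h)) / (2 * h)
    return d_mu, d_sigma

analytic = score_vector(2.5, 1.0, 2.0)
numeric = numeric_score(2.5, 1.0, 2.0)
print("analytic:", analytic)
print("numeric :", numeric)
```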
Fisher Information
• Fisher information: Designed to provide a measure of how much information the parametric probability law $p(x|\theta)$ carries about the parameter $\theta$.
• An adequate definition of such information should possess the following properties:
  • The larger the sensitivity of $p(x|\theta)$ to changes in $\theta$, the larger the information should be.
  • The information should be additive: the information carried by the combined law $p(x_1, x_2|\theta) = p(x_1|\theta)\,p(x_2|\theta)$ of independent observations should be the sum of those carried by $p(x_1|\theta)$ and $p(x_2|\theta)$.
  • The information should be insensitive to the sign of the change in $\theta$, and preferably positive.
  • The information should be a deterministic quantity; it should not depend on the specific random observation $x$.
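Fisher Information
• Definition (scalar form): Fisher information (about $\theta$) is the variance of the score $V = \frac{\partial}{\partial\theta}\ln p(x|\theta)$: $J(\theta) = \mathrm{Var}_\theta(V) = E_\theta\left[\left(\frac{\partial}{\partial\theta}\ln p(x|\theta)\right)^2\right]$.
• Example: consider a random variable $x \sim N(\mu, \sigma^2)$ with $\sigma$ known. The score is $V = \frac{x-\mu}{\sigma^2}$, so $J(\mu) = \frac{E[(x-\mu)^2]}{\sigma^4} = \frac{1}{\sigma^2}$.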
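Both the zero-mean property of the score and the definition of Fisher information as the score's variance can be checked by Monte Carlo. This is a rough sketch assuming $x \sim N(\mu, \sigma^2)$ with known $\sigma$, for which the score w.r.t. $\mu$ is $(x-\mu)/\sigma^2$; the function names, parameter values and trial count are illustrative.

```python
import random

def score_mu(x, mu, sigma):
    """Score of N(mu, sigma^2) with respect to mu."""
    return (x - mu) / sigma ** 2

def score_moments(mu=1.0, sigma=2.0, trials=50000, seed=2):
    """Empirical mean and variance of the score under the true parameter."""
    rng = random.Random(seed)
    scores = [score_mu(rng.gauss(mu, sigma), mu, sigma) for _ in range(trials)]
    mean = sum(scores) / trials
    var = sum((s - mean) ** 2 for s in scores) / trials
    return mean, var

score_mean, score_var = score_moments()
print(f"mean of score     : {score_mean:.4f}  (theory: 0)")
print(f"variance of score : {score_var:.4f}  (theory: 1/sigma^2 = 0.25)")
```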
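Fisher Information - Cont.
• Whenever $\theta = (\theta_1, \dots, \theta_k)$ is a vector, Fisher information is the matrix $J(\theta) = [J_{ij}(\theta)]$ where $J_{ij}(\theta) = \mathrm{Cov}_\theta(V_i, V_j) = E_\theta\left[\frac{\partial \ln p(x|\theta)}{\partial\theta_i}\,\frac{\partial \ln p(x|\theta)}{\partial\theta_j}\right]$.
• Reminder: $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$.
• Remark: the Fisher information is only defined when the distributions satisfy some regularity conditions. (For example, they should be differentiable w.r.t. $\theta$, and all the distributions in the parametric family must have the same support set.)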
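The matrix form can be estimated as the empirical covariance of the score vector. The sketch below again assumes, purely for illustration, the normal family with $\theta = (\mu, \sigma)$, for which the exact Fisher information matrix is $\mathrm{diag}(1/\sigma^2,\ 2/\sigma^2)$; function names and settings are hypothetical.

```python
import random

def score_vector(x, mu, sigma):
    """Score of N(mu, sigma^2) with respect to (mu, sigma)."""
    return ((x - mu) / sigma ** 2, ((x - mu) ** 2 - sigma ** 2) / sigma ** 3)

def estimate_fisher_matrix(mu=1.0, sigma=2.0, trials=50000, seed=5):
    """Empirical covariance matrix of the score vector under the true parameter."""
    rng = random.Random(seed)
    scores = [score_vector(rng.gauss(mu, sigma), mu, sigma) for _ in range(trials)]
    means = [sum(s[i] for s in scores) / trials for i in range(2)]
    return [[sum((s[i] - means[i]) * (s[j] - means[j]) for s in scores) / trials
             for j in range(2)]
            for i in range(2)]

J = estimate_fisher_matrix()
for row in J:
    print([round(value, 3) for value in row])
```

With $\sigma = 2$ the estimate should be close to the matrix with diagonal entries 0.25 and 0.5 and zero off-diagonal entries.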
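Fisher Information - Cont.
• Claim: Let $x_1, \dots, x_n$ be i.i.d. random variables, $x_i \sim p(x|\theta)$. The score of $p(x_1, \dots, x_n|\theta)$ is the sum of the individual scores.
• Proof: $V(x_1, \dots, x_n) = \frac{\partial}{\partial\theta}\ln p(x_1, \dots, x_n|\theta) = \frac{\partial}{\partial\theta}\sum_{i=1}^{n}\ln p(x_i|\theta) = \sum_{i=1}^{n} V(x_i)$.
• Example: If $x_1, \dots, x_n$ are i.i.d. $N(\mu, \sigma^2)$, the score is $\sum_{i=1}^{n}\frac{x_i - \mu}{\sigma^2} = \frac{n(\bar{x} - \mu)}{\sigma^2}$.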
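Fisher Information - Cont.
• Based on $n$ i.i.d. samples, the Fisher information about $\theta$ is $J_n(\theta) = \mathrm{Var}_\theta\left(\sum_{i=1}^{n} V(x_i)\right) = \sum_{i=1}^{n}\mathrm{Var}_\theta(V(x_i)) = n\,J(\theta)$, where $J(\theta)$ is the information of a single sample and the variances add because the scores are independent.
• Thus, the Fisher information is additive w.r.t. i.i.d. random variables.
• Example: Suppose $x_1, \dots, x_n$ are i.i.d. $N(\mu, \sigma^2)$. From the previous example we know that the Fisher information about the parameter $\mu$ based on one sample is $J(\mu) = \frac{1}{\sigma^2}$. Therefore, based on the entire sample, $J_n(\mu) = \frac{n}{\sigma^2}$.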
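Additivity can be illustrated by simulating the total score of $n$ observations and checking that its variance is about $n/\sigma^2$. A minimal sketch under assumed settings (normal data with known $\sigma$, hypothetical function name `total_score_variance`):

```python
import random

def total_score_variance(n=8, mu=0.0, sigma=2.0, trials=20000, seed=3):
    """Empirical variance of the summed score of n i.i.d. N(mu, sigma^2) samples."""
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        # The score of the joint law is the sum of the individual scores.
        totals.append(sum((rng.gauss(mu, sigma) - mu) / sigma ** 2 for _ in range(n)))
    mean = sum(totals) / trials
    return sum((t - mean) ** 2 for t in totals) / trials

n, sigma = 8, 2.0
empirical = total_score_variance(n=n, sigma=sigma)
print(f"empirical J_n      : {empirical:.3f}")
print(f"n * J = n/sigma^2  : {n / sigma ** 2:.3f}")
```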
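The Cramer-Rao Inequality
• Theorem: Let $\hat\theta = \hat\theta(D)$ be an unbiased estimator for $\theta$. Then $\mathrm{Var}_\theta(\hat\theta) \ge \frac{1}{J(\theta)}$, where $J(\theta)$ is the Fisher information of the sample $D$.
• Proof: Let $V$ be the score of $p(D|\theta)$. Using $E_\theta[V] = 0$ we have: $\mathrm{Cov}_\theta(V, \hat\theta) = E_\theta[(V - E_\theta[V])(\hat\theta - E_\theta[\hat\theta])] = E_\theta[V\hat\theta] - E_\theta[V]\,E_\theta[\hat\theta] = E_\theta[V\hat\theta]$.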
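The Cramer-Rao Inequality - Cont.
• Now, with $V$ the score of $p(D|\theta)$ and $\hat\theta$ unbiased, $E_\theta[V\hat\theta] = \int \frac{\partial p(D|\theta)/\partial\theta}{p(D|\theta)}\,\hat\theta(D)\,p(D|\theta)\,dD = \frac{\partial}{\partial\theta}\int \hat\theta(D)\,p(D|\theta)\,dD = \frac{\partial}{\partial\theta}E_\theta[\hat\theta] = \frac{\partial\theta}{\partial\theta} = 1$.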
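The Cramer-Rao Inequality - Cont.
• So, $\mathrm{Cov}_\theta(V, \hat\theta) = 1$, where $V$ is the score of the sample.
• By the Cauchy-Schwarz inequality, $\mathrm{Cov}_\theta(V, \hat\theta)^2 \le \mathrm{Var}_\theta(V)\,\mathrm{Var}_\theta(\hat\theta) = J(\theta)\,\mathrm{Var}_\theta(\hat\theta)$.
• Therefore, $\mathrm{Var}_\theta(\hat\theta) \ge \frac{1}{J(\theta)}$.
• For a biased estimator with bias $b(\theta)$ we have $E_\theta[\hat\theta] = \theta + b(\theta)$, so the same argument gives $\mathrm{Var}_\theta(\hat\theta) \ge \frac{(1 + b'(\theta))^2}{J(\theta)}$.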
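The bound for biased estimators can be checked on a case where everything is available in closed form. Assume, for illustration, $n$ i.i.d. $N(\theta, \sigma^2)$ observations and the estimator $c\,\bar{x}$: its bias is $b(\theta) = (c-1)\theta$, so $b'(\theta) = c - 1$, and its variance is $c^2\sigma^2/n$. The function name and parameter values below are hypothetical.

```python
def biased_cramer_rao_bound(bias_derivative, fisher_information):
    """Lower bound (1 + b'(theta))^2 / J(theta) on the variance of a biased estimator."""
    return (1.0 + bias_derivative) ** 2 / fisher_information

# Assumed setup: n i.i.d. N(theta, sigma^2) observations, estimator c * mean(x).
n, sigma, c = 10, 2.0, 0.8
fisher_information = n / sigma ** 2            # J_n(theta) = n / sigma^2
bias_derivative = c - 1.0                      # b(theta) = (c - 1) * theta
estimator_variance = c ** 2 * sigma ** 2 / n   # Var(c * mean) = c^2 * sigma^2 / n

bound = biased_cramer_rao_bound(bias_derivative, fisher_information)
print(f"variance = {estimator_variance:.4f}, bound = {bound:.4f}")
```

In this particular family the shrunken mean attains the biased bound, so the two printed numbers coincide.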
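The Cramer-Rao General Case
• The Cramer-Rao inequality also holds in general (vector) form: the error covariance matrix of an unbiased estimator $\hat\theta$ is bounded as follows: $E_\theta[(\hat\theta - \theta)(\hat\theta - \theta)^T] \succeq J(\theta)^{-1}$, meaning that the difference of the two matrices is positive semidefinite.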
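The Cramer-Rao Inequality - Cont.
• Example: Let $x_1, \dots, x_n$ be i.i.d. $N(\mu, \sigma^2)$. From the previous example, $J_n(\mu) = \frac{n}{\sigma^2}$.
• Now let $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ be an (unbiased) estimator for $\mu$. Then $\mathrm{Var}(\bar{x}) = \frac{\sigma^2}{n} = \frac{1}{J_n(\mu)}$.
• So $\bar{x}$ matches the Cramer-Rao lower bound.
• Def: An unbiased estimator whose covariance meets the Cramer-Rao lower bound is called efficient.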
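A simulation makes the notion of efficiency concrete. The sketch below assumes normal data and compares two unbiased estimators of the mean: the sample mean and, as a contrast chosen here for illustration, the sample median. The sample mean should sit at the bound $\sigma^2/n$ while the median stays above it. Names and settings are hypothetical.

```python
import random
import statistics

def estimator_variances(n=10, mu=3.0, sigma=2.0, trials=20000, seed=4):
    """Empirical variances of the sample mean and sample median over many samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(sample) / n)
        medians.append(statistics.median(sample))
    return statistics.pvariance(means), statistics.pvariance(medians)

n, sigma = 10, 2.0
cr_bound = sigma ** 2 / n  # 1 / J_n(mu) with J_n(mu) = n / sigma^2
mean_var, median_var = estimator_variances(n=n, sigma=sigma)
print(f"Cramer-Rao bound   : {cr_bound:.3f}")
print(f"Var(sample mean)   : {mean_var:.3f}")
print(f"Var(sample median) : {median_var:.3f}")
```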
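Efficiency
• Theorem (Efficiency): The unbiased estimator $\hat\theta$ is efficient, that is, $\mathrm{Var}_\theta(\hat\theta) = \frac{1}{J(\theta)}$, iff $J(\theta)\,(\hat\theta(D) - \theta) = V(D, \theta)$, where $V$ is the score of the sample.
• Proof (If): If $J(\theta)\,(\hat\theta - \theta) = V$ then $J(\theta)^2\,E_\theta[(\hat\theta - \theta)^2] = E_\theta[V^2] = J(\theta)$, meaning $\mathrm{Var}_\theta(\hat\theta) = E_\theta[(\hat\theta - \theta)^2] = \frac{1}{J(\theta)}$.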
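Efficiency
• Only if: Recall the cross covariance between the score $V$ and the unbiased estimator $\hat\theta$: $\mathrm{Cov}_\theta(V, \hat\theta) = E_\theta[V(\hat\theta - \theta)] = 1$. The Cauchy-Schwarz inequality for random variables says $E_\theta[V(\hat\theta - \theta)]^2 \le E_\theta[V^2]\,E_\theta[(\hat\theta - \theta)^2]$, with equality iff $V = c(\theta)\,(\hat\theta - \theta)$ with probability 1 for some $c(\theta)$. If $\hat\theta$ is efficient, the right-hand side is $J(\theta)\cdot\frac{1}{J(\theta)} = 1$, so equality holds; thus $V = c(\theta)\,(\hat\theta - \theta)$, and $1 = E_\theta[V(\hat\theta - \theta)] = \frac{c(\theta)}{J(\theta)}$ gives $c(\theta) = J(\theta)$.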
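Cramer-Rao Inequality and ML - Cont.
• Theorem: Suppose there exists an estimator $\hat\theta$ that is efficient for all $\theta$. Then the ML estimator is $\hat\theta$.
• Proof: By assumption $\mathrm{Var}_\theta(\hat\theta) = \frac{1}{J(\theta)}$. By the previous claim $J(\theta)\,(\hat\theta - \theta) = V$, or $\frac{\partial}{\partial\theta}\ln p(D|\theta) = J(\theta)\,(\hat\theta - \theta)$ for all $\theta$. This holds at $\theta = \hat\theta_{ML}$, and since this is a maximum point of the likelihood the left side is zero, so (for $J(\hat\theta_{ML}) > 0$) $\hat\theta = \hat\theta_{ML}$.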
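For the normal mean, where the sample mean is efficient, the theorem predicts that maximizing the likelihood recovers the sample mean. The following sketch checks this numerically under assumptions of my own: simulated $N(\mu, \sigma^2)$ data with known $\sigma$, and a simple ternary search (valid here because the log-likelihood is concave in $\mu$) standing in for a general optimizer. All names are hypothetical.

```python
import math
import random

def log_likelihood(mu, sample, sigma):
    """Log-likelihood of i.i.d. N(mu, sigma^2) data as a function of mu."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
               for x in sample)

def maximize(f, lo, hi, iters=200):
    """Ternary search for the maximizer of a unimodal function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

rng = random.Random(6)
sigma = 2.0
sample = [rng.gauss(1.5, sigma) for _ in range(20)]

sample_mean = sum(sample) / len(sample)
# The maximizer must lie between the smallest and largest observation.
mu_ml = maximize(lambda mu: log_likelihood(mu, sample, sigma), min(sample), max(sample))
print(f"sample mean  : {sample_mean:.6f}")
print(f"ML estimate  : {mu_ml:.6f}")
```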