Uncertainty and confidence intervals
Statistical estimation methods, Finse
Friday 10.9.2010, 12.45–14.05
Andreas Lindén
Outline
• Point estimates and uncertainty
• Sampling distribution
• Standard error
• Covariation between parameters
• Finding the VC-matrix for the parameter estimates
  • Analytical formulas
  • From the Hessian matrix
  • Bootstrapping
• The idea behind confidence intervals
• General methods for constructing confidence intervals of parameters
  • CI based on the central limit theorem
  • Profile likelihood CI
  • CI by bootstrapping
Point estimates and uncertainty
• The main output of any statistical model fit is the set of parameter estimates
  • Point estimates: one value for each parameter
  • The effect sizes
  • Answer the question “how much”
• Point estimates are of little use without an assessment of uncertainty
  • Standard error
  • Confidence intervals
  • p-values
  • Estimated sampling distribution
  • Bayesian credible intervals
  • Plot of the Bayesian posterior distribution
Sampling distribution
• The probability distribution of a parameter estimate
  • Calculated from a sample
  • Variability due to sampling effects
• Typically depends on sample size or the number of degrees of freedom (df)
• Examples of common sampling distributions
  • Student’s t-distribution
  • F-distribution
  • χ²-distribution
Degrees of freedom
• In a linear regression df = n – 2 (n observations minus two estimated parameters: intercept and slope)
[Figure: Y plotted against X]
Properties of the sampling distribution
• The standard error (SE) of a parameter is the estimated standard deviation of the sampling distribution
  • Square root of the parameter variance
• Parameters are not necessarily unrelated
  • The sampling distribution of several parameters is multivariate
  • Example: regression slope and intercept
Linear regression – simulated data

Param.        a      b      σ²
True value    4.00   1.00   0.80
Estim. 1      4.29   0.96   0.70
Estim. 2      4.13   0.97   0.36
Estim. 3      3.86   0.98   0.83
Estim. 4      3.77   1.04   0.75
Estim. 5      3.63   1.06   0.63
Estim. 6      4.39   0.93   0.72
Estim. 7      3.80   0.98   0.91
Estim. 8      3.78   1.06   0.92
Estim. 9      3.74   1.07   0.69
Estim. 10     4.62   0.84   0.50
…             …      …      …
Estim. 100    3.54   1.06   0.71
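How a table like the one above can arise is sketched below in Python. The slide gives only the true parameter values; the sample size (n = 30), the x-values (evenly spaced on 0 to 10), the seed and the helper name `fit_line` are assumptions made for illustration, so the numbers will not reproduce the slide.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a_hat, b_hat, sigma2_hat)."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b_hat = sxy / sxx
    a_hat = my - b_hat * mx
    rss = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
    return a_hat, b_hat, rss / (n - 2)   # residual variance uses df = n - 2

rng = random.Random(1)                   # assumed seed
a, b, sigma2 = 4.00, 1.00, 0.80          # true values from the slide
n = 30                                   # assumed sample size
x = [10 * i / (n - 1) for i in range(n)] # assumed design

# Repeat the whole experiment 100 times: each fit is one draw
# from the sampling distribution of (a_hat, b_hat, sigma2_hat).
estimates = []
for _ in range(100):
    y = [a + b * xi + rng.gauss(0, sigma2 ** 0.5) for xi in x]
    estimates.append(fit_line(x, y))

a_hats = [e[0] for e in estimates]
b_hats = [e[1] for e in estimates]
print("a: mean", statistics.fmean(a_hats), "sd", statistics.stdev(a_hats))
print("b: mean", statistics.fmean(b_hats), "sd", statistics.stdev(b_hats))
```

The standard deviations printed at the end are simulation-based estimates of the standard errors of the intercept and slope.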
Properties of the sampling distribution: variance-covariance (COV) and correlation (CORR) matrices of the estimates (parameter order a, b, σ²)

COV =
   0.1531  -0.0273   0.0031
  -0.0273   0.0059   0.0002
   0.0031   0.0002   0.0335

CORR =
   1.0000  -0.9085   0.0432
  -0.9085   1.0000   0.0159
   0.0432   0.0159   1.0000
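The link between the two matrices can be checked with a short Python sketch: standard errors are the square roots of the diagonal of COV, and each correlation is a covariance divided by the product of the two standard errors. The parameter order (a, b, σ²) is an assumption, and because the slide values are rounded to four decimals the recomputed correlations agree with CORR only approximately.

```python
import math

# VC-matrix copied from the slide; parameter order a, b, sigma^2 is assumed
cov = [
    [0.1531, -0.0273, 0.0031],
    [-0.0273, 0.0059, 0.0002],
    [0.0031, 0.0002, 0.0335],
]

# Standard errors: square roots of the variances on the diagonal
se = [math.sqrt(cov[i][i]) for i in range(3)]

# Correlation matrix: cov_ij / (se_i * se_j)
corr = [[cov[i][j] / (se[i] * se[j]) for j in range(3)] for i in range(3)]

print([round(s, 4) for s in se])
for row in corr:
    print([round(c, 4) for c in row])
```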
Methods to obtain the VC-matrix (or standard errors) for a set of parameters
• Analytical formulas
• Bootstrap
• The inverse of the Hessian matrix
Parameter variances analytically
• For many common situations the SE and VC-matrix of a set of parameters can be calculated with analytical formulas
  • Standard error of the sample mean: SE = s / √n
  • Standard error of the estimated binomial probability: SE = √(p̂(1 – p̂) / n)
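A minimal Python sketch of the two analytical standard errors named above, using the textbook formulas s/√n and √(p̂(1 − p̂)/n). The function names and the small `sample` list are hypothetical; the binomial call uses the data from the exercise slide (18 infections among 80 hosts).

```python
import math
import statistics

def se_mean(values):
    """Standard error of the sample mean: s / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

def se_binomial(x, n):
    """Standard error of the estimated binomial probability p_hat = x / n."""
    p_hat = x / n
    return math.sqrt(p_hat * (1 - p_hat) / n)

sample = [2.1, 3.4, 1.9, 2.8, 3.0]   # hypothetical measurements
print(se_mean(sample))
print(se_binomial(18, 80))           # data from the exercise slide
```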
Bootstrap
• The bootstrap is a general and common resampling method
  • Used to simulate the sampling distribution
  • Information in the sample itself is used to mimic the original sampling procedure
• Non-parametric bootstrap: sampling with replacement from the observed data
• Parametric bootstrap: simulation from the fitted model, based on the parameter estimates
• The procedure is repeated B times (e.g. B = 1000)
• Inference from the bootstrapped estimates
  • Sample standard deviation = bootstrap estimate of SE
  • Sample VC-matrix = bootstrap estimate of VC-matrix
  • Bootstrap mean minus the original estimate = bootstrap estimate of bias
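The non-parametric variant can be sketched in Python as follows. The function name `bootstrap_mean` and the seed are assumptions; the 0/1 records are built to match the exercise data (18 infected out of 80), and the estimator being bootstrapped is simply the sample mean.

```python
import random
import statistics

def bootstrap_mean(sample, B=1000, seed=0):
    """Non-parametric bootstrap of the sample mean (resampling with replacement)."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(B)]

# 0/1 infection records matching the exercise data (18 infected out of 80)
sample = [1] * 18 + [0] * 62
boot = bootstrap_mean(sample)

estimate = statistics.fmean(sample)
boot_se = statistics.stdev(boot)                # bootstrap estimate of SE
boot_bias = statistics.fmean(boot) - estimate   # bootstrap estimate of bias
print(estimate, boot_se, boot_bias)
```

For this simple estimator the bootstrap SE should land close to the analytical value √(p̂(1 − p̂)/n), which makes it a convenient sanity check before applying the same recipe to estimators without a formula.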
VC-matrix from the Hessian
• The Hessian matrix (H)
  • Matrix of second derivatives of the (multivariate) negative log-likelihood, evaluated at the ML-estimate
  • Typically given as output by software for numerical optimization
• The inverse of the Hessian is an estimate of the parameters’ variance-covariance matrix
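For a one-parameter model the Hessian is a single number, which makes the idea easy to check by hand. The Python sketch below uses the binomial data from the exercise slide and a central finite difference in place of the Hessian that an optimizer would report; the step size `h` and the function names are assumptions.

```python
import math

def nll(p, x=18, n=80):
    """Negative log-likelihood of the binomial model (constant term dropped)."""
    return -(x * math.log(p) + (n - x) * math.log(1 - p))

def second_derivative(f, t, h=1e-4):
    """Central finite-difference approximation of f''(t)."""
    return (f(t + h) - 2 * f(t) + f(t - h)) / h ** 2

p_hat = 18 / 80                          # ML estimate, x / n
hessian = second_derivative(nll, p_hat)  # 1 x 1 "Hessian" at the ML estimate
variance = 1 / hessian                   # inverse of the Hessian
se_hessian = math.sqrt(variance)

# Compare with the analytical SE of a binomial probability
print(se_hessian, math.sqrt(p_hat * (1 - p_hat) / 80))
```

With several parameters the same steps apply, except that the Hessian is a matrix and has to be inverted as one.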
Confidence interval (CI)
• A frequentist interval estimate of one or several parameters
• A fraction α of all correctly produced CIs will fail to include the true parameter value
  • Trust your 95% CI and accept the risk α = 0.05
• NB! Should not be confused with Bayesian credible intervals
  • A CI should not be thought to contain the parameter with 95% probability
  • The CI is based on the sampling distribution, not on an estimated probability distribution for the parameter of interest
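The statement about the fraction α can be illustrated with a small coverage simulation in Python. All settings here (normal data with mean 10 and SD 2, n = 50, 2000 repetitions, the seed) are assumptions chosen for illustration. Because the interval uses 1.96 and an estimated SD, its coverage is expected to be close to, and slightly below, 95%.

```python
import math
import random
import statistics

def coverage(mu=10.0, sd=2.0, n=50, reps=2000, z=1.96, seed=0):
    """Fraction of simulated CIs (mean ± z * SE) that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sd) for _ in range(n)]
        m = statistics.fmean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)
        if m - z * se <= mu <= m + z * se:
            hits += 1
    return hits / reps

print(coverage())
```

The probability statement refers to the procedure over repeated samples: each individual interval either contains the true mean or it does not.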
CI based on the central limit theorem
• The sum/mean of many random values is approximately normally distributed
  • Standardized estimates are actually t-distributed, with df depending on sample size and model complexity
  • Might matter with small sample sizes
• As a rule of thumb, a parameter estimate ± 2*SE produces an approximate 95% confidence interval
  • With infinitely many observations ± 1.96*SE
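A minimal Python sketch of the estimate ± z*SE interval, applied to the binomial data from the exercise slide. The function name `wald_ci` is hypothetical, and the normal quantile is used rather than a t quantile, which corresponds to the large-sample case.

```python
import math
from statistics import NormalDist

def wald_ci(estimate, se, level=0.95):
    """Symmetric normal-approximation CI: estimate ± z * SE."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for level = 0.95
    return estimate - z * se, estimate + z * se

p_hat = 18 / 80
se = math.sqrt(p_hat * (1 - p_hat) / 80)
lower, upper = wald_ci(p_hat, se)
print(lower, upper)
```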
CI from profile likelihood
• The profile deviance
  • The change in −2*log-likelihood, in comparison to the ML-estimate
  • Asymptotically χ²-distributed (exact only in the limit of infinite sample size)
• Confidence intervals can be obtained as the range around the ML-estimate for which the profile deviance is below a critical level
  • The 1 – α quantile of the χ²-distribution
  • One parameter -> df = 1 (e.g. 3.841 for α = 0.05)
  • k-dimensional profile deviance -> df = k
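The recipe above can be sketched in Python for a one-parameter binomial model, using the data from the exercise slide. With a single parameter the profile deviance is just the deviance. The bisection helper, the search brackets and the tolerance are assumptions, and any root finder would do.

```python
import math

X, N = 18, 80            # data from the exercise slide
P_HAT = X / N            # ML estimate
CRIT = 3.841             # 0.95 quantile of the chi-square distribution, df = 1

def nll(p):
    """Negative log-likelihood of the binomial model (constant term dropped)."""
    return -(X * math.log(p) + (N - X) * math.log(1 - p))

def deviance(p):
    """Change in -2 * log-likelihood relative to the ML estimate."""
    return 2 * (nll(p) - nll(P_HAT))

def bisect_root(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) differ in sign."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2

def excess(p):
    """Deviance minus the critical level; zero at the CI limits."""
    return deviance(p) - CRIT

# One limit on each side of the ML estimate
ci_lower = bisect_root(excess, 1e-6, P_HAT)
ci_upper = bisect_root(excess, P_HAT, 1 - 1e-6)
print(ci_lower, ci_upper)
```

Unlike the ± 1.96*SE interval, this interval need not be symmetric around the ML estimate.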
95% CI from profile deviance
[Figure: −2*LL against parameter value, with the levels Fmin and Fmin + 3.841 marked]
2-D confidence regions
[Figure: confidence regions for parameters a and b]
• 95% confidence region: deviance below the χ² (df = 2) critical value 5.991
• 99% confidence region: deviance below the χ² (df = 2) critical value 9.210
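For df = 2 the χ² critical value can be checked without tables, because the χ² distribution with two degrees of freedom is an exponential distribution with mean 2, so its 1 − α quantile is −2·ln(α). A short Python sketch (the function name is hypothetical):

```python
import math

def chi2_crit_df2(alpha):
    """1 - alpha quantile of the chi-square distribution with df = 2."""
    return -2 * math.log(alpha)

for alpha in (0.05, 0.01):
    print(alpha, round(chi2_crit_df2(alpha), 3))
```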
CI by bootstrapping
• A 100*(1 – α)% CI for a parameter can be calculated from the sampling distribution
  • The α/2 and 1 – α/2 quantiles (e.g. 0.025 and 0.975 with α = 0.05)
• In bootstrapping, simply use the sample quantiles of the simulated values
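A percentile interval from a parametric bootstrap can be sketched in Python as follows, again with the binomial data from the exercise slide. The function names, the seed and the simple order-statistic rule for the quantiles are assumptions; other quantile definitions give slightly different limits.

```python
import random

def parametric_bootstrap(p_hat, n, B=10000, seed=0):
    """Simulate B new estimates of p from Binomial(n, p_hat)."""
    rng = random.Random(seed)
    return [sum(rng.random() < p_hat for _ in range(n)) / n for _ in range(B)]

def percentile_ci(values, alpha=0.05):
    """CI from the alpha/2 and 1 - alpha/2 sample quantiles (order statistics)."""
    ordered = sorted(values)
    B = len(ordered)
    lower = ordered[int(B * alpha / 2)]
    upper = ordered[int(B * (1 - alpha / 2)) - 1]
    return lower, upper

boot = parametric_bootstrap(18 / 80, 80)
print(percentile_ci(boot))
```

Because the simulated estimates can only take values k/80, the limits of this interval fall on that grid.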
Exercises
• Data: The prevalence of an infectious disease in a human population is investigated. Infection is recorded with 100% detection efficiency. In a sample of N = 80 humans, X = 18 infections were found.
• Model: Assume that infection (x = 0 or 1) of a host individual is an independent Bernoulli trial with probability p, such that the probability of infection is constant over all hosts.
• (This equals a logistic regression with an intercept only. Host-specific explanatory variables, such as age, condition, etc., could be used to model the infection probability more closely.)
Do the following in R:
• Calculate and plot the profile (log) likelihood of the infection probability p
• What is the maximum likelihood estimate of p (called p̂)?
• Construct 95% and 99% confidence intervals for p based on the profile likelihood
• Calculate the analytic SE for p̂
• Construct a symmetric 95% confidence interval for p based on the central limit theorem and the SE obtained in the previous exercise
• Simulate and plot the sampling distribution of p̂ by parametric bootstrapping (B = 10000)
• Calculate the bootstrap SE of p̂
• Construct a 95% confidence interval for p based on the bootstrap