§❶ Review of Likelihood Inference Robert J. Tempelman
Likelihood Inference

Data Analysts: Don’t throw away that Math Stats text just yet!!! Meaningful computing skills are a plus!

• Necessary prerequisites to understanding Bayesian inference:
• Distribution theory
• Calculus
• Asymptotic theory (e.g., Taylor expansions)
• Numerical methods/optimization
• Simulation-based analyses
• Programming skills
• SAS PROC ???? or R package ???? is only really a start to understanding data analysis.
• I don’t think that SAS PROC MCMC (version 9.3)/WinBUGS is a fix for all of your potential Bayesian inference problems.
The “simplest” model

• Basic mean model: $y_i = \mu + e_i$, $i = 1, \ldots, n$
• Common distributional assumption: $e_i \sim N(0, \sigma^2)$
• What does this mean? Think pdf!!!

pdf: probability density function
$$p(y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)$$

Conditional independence: the joint pdf is the product of the independent pdf’s:
$$p(\mathbf{y} \mid \mu, \sigma^2) = \prod_{i=1}^{n} p(y_i \mid \mu, \sigma^2)$$
Likelihood function

Regard the joint pdf as a function of the parameters and simplify further:
$$L(\mu, \sigma^2 \mid \mathbf{y}) \propto (\sigma^2)^{-n/2} \exp\left(-\frac{\sum_{i=1}^{n}(y_i - \mu)^2}{2\sigma^2}\right)$$
(‘proportional to’: multiplicative constants not involving the parameters can be dropped.)
Maximum likelihood estimation

• Maximize $L(\mu, \sigma^2 \mid \mathbf{y})$ with respect to the unknowns.
• Well actually, we directly maximize the log likelihood $\log L(\mu, \sigma^2 \mid \mathbf{y})$.
• One strategy: use first derivatives, i.e., determine $\partial \log L/\partial \mu$ and $\partial \log L/\partial \sigma^2$ and set them to 0.
• Result? (worked out below)
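A worked derivation of that result (standard for this model, no assumptions beyond the ones above):
$$\frac{\partial \log L}{\partial \mu} = \frac{\sum_{i=1}^{n}(y_i - \mu)}{\sigma^2} = 0 \;\Rightarrow\; \hat{\mu} = \bar{y}, \qquad \frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum_{i=1}^{n}(y_i - \mu)^2}{2\sigma^4} = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n}$$
Note the divisor $n$ rather than $n-1$: the ML estimator of $\sigma^2$ is biased downward in small samples.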
Likelihood inference for discrete data

Consider the binomial distribution: for $y$ successes in $n$ trials,
$$L(\pi \mid y) \propto \pi^{y}(1-\pi)^{n-y}$$
Take the first derivative of the log likelihood, set to zero → solve for $\hat{\pi}$.
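Setting the score to zero gives a closed-form solution; for example, with hypothetical counts $y = 7$ and $n = 10$ (numbers chosen only for illustration):
$$\frac{d \log L}{d\pi} = \frac{y}{\pi} - \frac{n-y}{1-\pi} = 0 \;\Rightarrow\; \hat{\pi} = \frac{y}{n} = \frac{7}{10} = 0.7$$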
Sometimes iterative solutions are required

• First-derivative-based methods can be slow for some problems.
• Second-derivative methods, e.g., Newton-Raphson (see the update below), are often desirable:
• generally faster;
• provide asymptotic standard errors as a useful by-product.
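For a scalar parameter, the Newton-Raphson update iterated to convergence is
$$\theta^{(t+1)} = \theta^{(t)} - \frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}$$
and at convergence $[-l''(\hat{\theta})]^{-1}$ estimates the asymptotic variance of $\hat{\theta}$, which is where the by-product standard errors come from.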
Plant Genetics Example (Rao, 1971)

$y_1$, $y_2$, $y_3$, and $y_4$ are the observed numbers of 4 different phenotypes involving genotypes differing at two loci from the progeny of self-fertilized heterozygotes (AaBb). It is known from genetic theory that the distribution of the four phenotypes (with complete dominance at each locus) is multinomial.
Probabilities

$$\pi_1 = \frac{2+\theta}{4}, \quad \pi_2 = \pi_3 = \frac{1-\theta}{4}, \quad \pi_4 = \frac{\theta}{4}, \qquad 0 \leq \theta \leq 1$$
(the four probabilities sum to 1)

θ → 0: close linkage in repulsion
θ → 1: close linkage in coupling
Genetic Illustration of Coupling/Repulsion

[Figure: chromosome diagrams contrasting the two linkage phases: coupling (A and B carried on the same chromosome, θ = 1) versus repulsion (A with b and a with B, θ = 0).]
Likelihood function

Given the multinomial probabilities above:
$$L(\theta \mid \mathbf{y}) \propto (2+\theta)^{y_1}(1-\theta)^{y_2+y_3}\,\theta^{y_4}$$
$$\log L(\theta \mid \mathbf{y}) = y_1\log(2+\theta) + (y_2+y_3)\log(1-\theta) + y_4\log\theta + \text{constant}$$
First and second derivatives

First derivative:
$$l'(\theta) = \frac{y_1}{2+\theta} - \frac{y_2+y_3}{1-\theta} + \frac{y_4}{\theta}$$
Second derivative:
$$l''(\theta) = -\frac{y_1}{(2+\theta)^2} - \frac{y_2+y_3}{(1-\theta)^2} - \frac{y_4}{\theta^2}$$
Recall the Newton-Raphson algorithm:
$$\theta^{(t+1)} = \theta^{(t)} - \frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}$$
Newton-Raphson: SAS data step and output

data newton;
  y1 = 1997; y2 = 906; y3 = 904; y4 = 32;
  theta = 0.01; /* try starting value of 0.50 too */
  do iterate = 1 to 5;
    loglike = y1*log(2+theta) + (y2+y3)*log(1-theta) + y4*log(theta);
    firstder = y1/(2+theta) - (y2+y3)/(1-theta) + y4/theta;
    secndder = (-y1/(2+theta)**2 - (y2+y3)/(1-theta)**2 - y4/theta**2);
    theta = theta + firstder/(-secndder);
    output;
  end;
  asyvar = 1/(-secndder); /* asymptotic variance of theta_hat at convergence */
  output;
run;

proc print data=newton;
  var iterate theta loglike;
run;
Asymptotic standard errors

Observed information: $I(\hat{\theta}) = -l''(\hat{\theta})$, evaluated at the ML estimate. Given that, the asymptotic variance of $\hat{\theta}$ is $I(\hat{\theta})^{-1}$ and the asymptotic standard error is $\sqrt{I(\hat{\theta})^{-1}}$.

proc print data=newton;
  var asyvar;
run;
Alternative to Newton-Raphson

• Fisher’s scoring.
• Substitute the expected information $I(\theta) = E[-l''(\theta)]$ for the observed information $-l''(\theta)$ in Newton-Raphson.
• The update then becomes
$$\theta^{(t+1)} = \theta^{(t)} + \frac{l'(\theta^{(t)})}{I(\theta^{(t)})}$$
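For the linkage example, the expected counts under the multinomial are $E[y_1] = n(2+\theta)/4$, $E[y_2+y_3] = n(1-\theta)/2$, and $E[y_4] = n\theta/4$, so
$$I(\theta) = E[-l''(\theta)] = \frac{n}{4}\left(\frac{1}{2+\theta} + \frac{2}{1-\theta} + \frac{1}{\theta}\right),$$
which is exactly the secndder expression used in the Fisher-scoring data step on the next slide.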
Fisher scoring: SAS data step and output

In some applications, Fisher’s scoring is easier than Newton-Raphson... but the observed information is probably more reliable than the expected information (Efron and Hinkley, 1978).

data newton;
  y1 = 1997; y2 = 906; y3 = 904; y4 = 32;
  n = y1 + y2 + y3 + y4; /* total count, needed for the expected information */
  theta = 0.01; /* try starting value of 0.50 too */
  do iterate = 1 to 5;
    loglike = y1*log(2+theta) + (y2+y3)*log(1-theta) + y4*log(theta);
    firstder = y1/(2+theta) - (y2+y3)/(1-theta) + y4/theta;
    secndder = (n/4)*(-1/(2+theta) - 2/(1-theta) - 1/theta);
    theta = theta + firstder/(-secndder);
    output;
  end;
  asyvar = 1/(-secndder); /* asymptotic variance of theta_hat at convergence */
  output;
run;

proc print data=newton;
  var iterate theta loglike;
run;
Extensions to multivariate θ

Suppose that $\boldsymbol{\theta}$ is a $p \times 1$ vector, with score vector $\mathbf{g}(\boldsymbol{\theta}) = \partial \log L/\partial \boldsymbol{\theta}$ and Hessian $\mathbf{H}(\boldsymbol{\theta}) = \partial^2 \log L/\partial \boldsymbol{\theta}\,\partial \boldsymbol{\theta}'$.

Newton-Raphson:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \mathbf{H}(\boldsymbol{\theta}^{(t)})^{-1}\,\mathbf{g}(\boldsymbol{\theta}^{(t)})$$

Fisher’s scoring:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \mathbf{I}(\boldsymbol{\theta}^{(t)})^{-1}\,\mathbf{g}(\boldsymbol{\theta}^{(t)}), \qquad \mathbf{I}(\boldsymbol{\theta}) = E[-\mathbf{H}(\boldsymbol{\theta})]$$
Generalized linear models

• For multifactorial analysis of non-normal (binary, count) data.
• Consider the probit link binary model.
• Implies the existence of normally distributed latent (underlying) variables, the liabilities $\ell_i$.
• Could do something similar for the logistic link binary model.
• Consider a simple population mean model: $\ell_i = \mu + e_i$; $e_i \sim N(0, \sigma_e^2)$.
• Let $\mu = 10$ and $\sigma_e = 2$.
The liability (latent variable) concept

[Figure: pdf of $\ell_i$ with mean $\mu = 10$ and $\sigma_e = 2$; the area above the threshold is shaded.]

$Y = 0$ (“failure”) if $\ell_i \leq \tau$; $Y = 1$ (“success”) if $\ell_i > \tau$, with $\tau = 12$ (the “THRESHOLD”).

$$P(Y_i = 1) = P(\ell_i > 12) = 1 - \Phi\!\left(\frac{12 - 10}{2}\right) = 1 - \Phi(1) = 0.1587$$
i.e., probability of “success” = 15.87%.
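A one-line data-step check of that probability (the dataset and variable names here are mine, not the original deck’s):

data liability_prob;
  mu = 10; sigma_e = 2; tau = 12;               /* slide values */
  p_success = 1 - probnorm((tau - mu)/sigma_e); /* 1 - Phi(1) */
  put p_success=;                               /* writes 0.1586552539 to the log */
run;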
Inferential utopia!

• Suppose we were able to measure the liabilities directly.
• Also suppose a more general multi-population (trt) model: $\boldsymbol{\ell} = \mathbf{X}\boldsymbol{\alpha} + \mathbf{e}$; $\mathbf{e} \sim N(\mathbf{0}, \mathbf{R})$; typically $\mathbf{R} = \mathbf{I}\sigma_e^2$.
• Then ML(α) = OLS(α): $\hat{\boldsymbol{\alpha}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\ell}$.
• But (sigh...), we of course don’t generally observe $\boldsymbol{\ell}$.
Suppose there are 3 subclasses

$\boldsymbol{\ell} = \mathbf{X}\boldsymbol{\alpha} + \mathbf{e}$, with mean liabilities $\mu_1 = 9$ (Herd 1), $\mu_2 = 10$ (Herd 2), and $\mu_3 = 11$ (Herd 3).

Use the “corner parameterization”: write each herd mean as an overall intercept plus a herd effect, $\mu_i = \mu + \alpha_i^*$.
Probability of success as a function of effects

(We can’t observe liabilities... just the observed binary data.)

$$P(Y_i = 1) = P(\ell_i > \delta) = 1 - \Phi\!\left(\frac{\delta - \mathbf{x}_i'\boldsymbol{\alpha}}{\sigma_e}\right)$$
(the shaded areas above the threshold in the figure).
Reparameterize the model

For Herds 1, 2, and 3: the threshold $\delta$ and $\mathbf{x}_i'\boldsymbol{\alpha} = \mu + \mathbf{x}_i^{*\prime}\boldsymbol{\alpha}^*$ cannot be estimated separately from $\sigma_e^2$; i.e., $\sigma_e^2$ is not identifiable.
Reparameterize the model again

Consider the remaining parameters as standardized ratios: $\tau = \delta/\sigma_e$, $\xi = \mu/\sigma_e$, and $\boldsymbol{\beta} = \boldsymbol{\alpha}^*/\sigma_e$ → the same as constraining $\sigma_e = 1$.

Notice that the new threshold is now 12/2 = 6, whereas the mean responses for the three herds are now 9/2, 10/2, and 11/2.
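Why this rescaling is harmless: the cell probabilities are unchanged; e.g., for Herd 1,
$$P(Y = 1) = 1 - \Phi\!\left(\frac{12 - 9}{2}\right) = 1 - \Phi(6 - 4.5) = 1 - \Phi(1.5),$$
the same value on either scale, which is exactly why $\sigma_e$ cannot be identified from binary data alone.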
There is still another identifiability problem

• Between $\tau$ and $\xi$: the data only determine their difference.
• One solution? “Zero out” $\tau$.
• Notice that the new threshold is now 0, whereas the mean responses for the three herds are now −1.5, −1, and −0.5.
Note that
$$P(Y_i = 1) = 1 - \Phi(\tau - \mathbf{x}_i'\boldsymbol{\beta}) = \Phi(\mathbf{x}_i'\boldsymbol{\beta} - \tau),$$
so, for a fixed linear predictor $\mathbf{x}_i'\boldsymbol{\beta}$, higher values of the threshold translate into lower probabilities of disease.
Deriving the likelihood function

Given $y_i = 0, 1$ with $P(Y_i = 1) = \Phi(\mathbf{x}_i'\boldsymbol{\beta})$, the Bernoulli pmf is
$$P(Y_i = y_i) = \left[\Phi(\mathbf{x}_i'\boldsymbol{\beta})\right]^{y_i}\left[1 - \Phi(\mathbf{x}_i'\boldsymbol{\beta})\right]^{1-y_i}$$
Suppose you have a second animal ($i'$). If animals $i$ and $i'$ are conditionally independent, then
$$P(Y_i = y_i,\, Y_{i'} = y_{i'}) = P(Y_i = y_i)\,P(Y_{i'} = y_{i'})$$
Example: for $y_i = 1$ and $y_{i'} = 0$, the joint probability is $\Phi(\mathbf{x}_i'\boldsymbol{\beta})\left[1 - \Phi(\mathbf{x}_{i'}'\boldsymbol{\beta})\right]$.
Deriving the likelihood function (continued)

More general case: by conditional independence, the likelihood function for the probit model is
$$L(\boldsymbol{\beta} \mid \mathbf{y}) = \prod_{i=1}^{n}\left[\Phi(\mathbf{x}_i'\boldsymbol{\beta})\right]^{y_i}\left[1 - \Phi(\mathbf{x}_i'\boldsymbol{\beta})\right]^{1-y_i}$$
Alternative, the logistic model: replace $\Phi(\mathbf{x}_i'\boldsymbol{\beta})$ with $\exp(\mathbf{x}_i'\boldsymbol{\beta})/\left[1+\exp(\mathbf{x}_i'\boldsymbol{\beta})\right]$.
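A small data-step sketch of evaluating this log likelihood; the coefficient values and the four records below are mine, purely for illustration:

data probit_ll;
  input x y;
  eta = 7.7 - 0.17*x;             /* hypothetical x_i'beta */
  p = probnorm(eta);              /* Phi(x_i'beta) */
  ll = y*log(p) + (1-y)*log(1-p); /* log-likelihood contribution */
  cards;
40 1
40 0
43 1
47 0
;
run;

proc means data=probit_ll sum;
  var ll; /* sum = log likelihood over these records */
run;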
Small probit regression example

Link function = probit: $P(Y_i = 1) = \Phi(\beta_0 + \beta_1 x_i)$.

Data: 12 binary observations $y$ at covariate values $x$ = 40, 43, 47, and 50 (three records per level; see the program two slides ahead).
Log likelihood

$$\log L(\boldsymbol{\beta} \mid \mathbf{y}) = \sum_{i=1}^{n}\left\{ y_i \log \Phi(\mathbf{x}_i'\boldsymbol{\beta}) + (1-y_i)\log\left[1 - \Phi(\mathbf{x}_i'\boldsymbol{\beta})\right]\right\}$$

Fisher’s scoring / Newton-Raphson equations can be written in weighted-least-squares form (see below).
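One standard way to write the Fisher-scoring update for a probit GLM is as iteratively reweighted least squares; this is the textbook GLM form, not necessarily the exact notation of the original slide:
$$\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{z}, \qquad w_{ii} = \frac{\phi(\eta_i)^2}{p_i(1-p_i)}, \qquad z_i = \eta_i + \frac{y_i - p_i}{\phi(\eta_i)},$$
with $\eta_i = \mathbf{x}_i'\boldsymbol{\beta}^{(t)}$, $p_i = \Phi(\eta_i)$, and $\phi(\cdot)$ the standard normal pdf.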
A SAS program

data example_binary;
  input x y;
  cards;
40 1
40 1
40 0
43 0
43 1
43 0
47 0
47 0
47 1
50 0
50 0
50 0
;

proc genmod data=example_binary descending;
  class y;
  model y = x / dist=bin link=probit;
  contrast 'slope' x 1;
run;
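The same ML fit can be checked with a direct likelihood specification; a minimal PROC NLMIXED sketch (the starting values are my assumption, not from the original deck):

proc nlmixed data=example_binary;
  parms b0=0 b1=0;     /* starting values (assumed) */
  eta = b0 + b1*x;     /* linear predictor */
  p = probnorm(eta);   /* probit link: P(Y=1) = Phi(eta) */
  model y ~ binary(p); /* Bernoulli log likelihood */
run;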
Wald test

• Asymptotic inference: $\text{var}(\hat{\boldsymbol{\beta}}) \approx \mathbf{I}(\hat{\boldsymbol{\beta}})^{-1}$; reported standard errors are square roots of its diagonals.
• Hypothesis test of $H_0$: $\mathbf{K}'\boldsymbol{\beta} = \mathbf{0}$:
$$W = (\mathbf{K}'\hat{\boldsymbol{\beta}})'\left[\mathbf{K}'\,\text{var}(\hat{\boldsymbol{\beta}})\,\mathbf{K}\right]^{-1}(\mathbf{K}'\hat{\boldsymbol{\beta}}) \sim \chi^2_{\text{rank}(\mathbf{K})} \text{ under } H_0 \text{ (asymptotically)}$$
• When is n “large enough” for this to be trustworthy????
Likelihood ratio test

Reduced model ($H_0$: $\beta_1 = 0$):

proc genmod data=example_binary descending;
  class y;
  model y = / dist=bin link=probit;
run;

$$-2(\log L_{\text{reduced}} - \log L_{\text{full}}) = -2(-7.63 - (-6.21)) = 2.84$$

The p-value for $H_0$: $\beta_1 = 0$ is $P(\chi^2_1 > 2.84) = 0.09$. Again... asymptotic.
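A quick data-step check of that p-value (dataset name is mine):

data lrt;
  lr = -2*((-7.63) - (-6.21));  /* LRT statistic = 2.84 */
  p_value = 1 - probchi(lr, 1); /* P(chi-square_1 > 2.84), about 0.09 */
  put lr= p_value=;
run;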
A PROC GLIMMIX “fix” for uncertainty: use asymptotic F-tests rather than χ²-tests (“less asymptotic?”)

proc glimmix data=example_binary;
  model y = x / dist=bin link=probit;
  contrast 'slope' x 1;
run;
Ordinal Categorical Data

• How I learned this:
• “Sire evaluation for ordered categorical data with a threshold model” by Dan Gianola and Jean-Louis Foulley (1983) in Genetics, Selection, Evolution 15:201-224 (GF83).
• See also Harville and Mee (1984), Biometrics (HM84).
• Application:
• Calving ease scores (0 = unassisted, 5 = Caesarean).
• Determined by an underlying continuous liability relative to a set of thresholds:
Liabilities: consider three different herds/subclasses, each with $\sigma_e = 2$.

[Figure: liability densities for the three herds plotted against the thresholds $\tau_1$ and $\tau_2$.]
Underlying normal densities for each of three herds. Probabilities highlighted for Herd 2
Constraints

It is not really possible to separately estimate $\sigma_e$ from $\tau_1$, $\tau_2$, $\mu_1$, $\mu_2$, and $\mu_3$. Define, then, $\ell^* = \ell/\sigma_e$, $\tau_1^* = \tau_1/\sigma_e$, $\tau_2^* = \tau_2/\sigma_e$, $\mu_1^* = \mu_1/\sigma_e$, $\mu_2^* = \mu_2/\sigma_e$, and $\mu_3^* = \mu_3/\sigma_e$.
Yet another constraint required

Suppose we use the corner parameterization, where $\mu$ expressed as a ratio over $\sigma_e$ is $\mu^* = 5.5$. Then $\tau_1^*$ and $\tau_2^*$ are not separately identifiable from $\mu^*$. One solution: “zero out” $\mu^*$:
$$\tau_1^{**} = \tau_1^* - \mu^* = 4.0 - 5.5 = -1.5$$
$$\tau_2^{**} = \tau_2^* - \mu^* = 6.0 - 5.5 = +0.5$$
[Figure: liability scale after zeroing out $\mu^*$: herd means centered around $\mu^{**}$, with thresholds $\tau_1^{**} = -1.5$ and $\tau_2^{**} = +0.5$.]
Alternative constraint

Estimate $\mu$ but “zero out” one of $\tau_1$ or $\tau_2$, say $\tau_1$. Start with $\mu^* = 5.5$, $\tau_1^* = 4.0$, and $\tau_2^* = 6.0$. Then:
$$\mu^{**} = \mu^* - \tau_1^* = 5.5 - 4.0 = 1.5$$
$$\tau_2^{**} = \tau_2^* - \tau_1^* = 6.0 - 4.0 = 2.0$$
One last constraint possibility

Set $\tau_1 = 0$ and $\tau_2$ to an arbitrary value $> \tau_1$, and infer upon $\sigma_e$. Say $\sigma_e = 2$: $\tau_1$ fixed to 0, $\tau_2$ fixed to 4.
Likelihood function for Ordinal Categorical Data

Based on the multinomial ($m$ categories), where
$$P(Y_i = j) = \Phi(\tau_j - \mathbf{x}_i'\boldsymbol{\beta}) - \Phi(\tau_{j-1} - \mathbf{x}_i'\boldsymbol{\beta}), \qquad \tau_0 = -\infty,\; \tau_m = +\infty,$$
and $1(y_i = j)$ indicates that observation $i$ falls in category $j$.

Likelihood:
$$L(\boldsymbol{\tau}, \boldsymbol{\beta} \mid \mathbf{y}) = \prod_{i=1}^{n}\prod_{j=1}^{m} P(Y_i = j)^{1(y_i = j)}$$

Log likelihood:
$$\log L = \sum_{i=1}^{n}\sum_{j=1}^{m} 1(y_i = j)\,\log P(Y_i = j)$$
Hypothetical small example

• Ordinal outcome having 3 possible categories.
• Two subjects in the dataset: the first subject has a response of 1, whereas the second has a response of 3.
• Their contribution to the log likelihood:
$$\log L = \log \Phi(\tau_1 - \mathbf{x}_1'\boldsymbol{\beta}) + \log\left[1 - \Phi(\tau_2 - \mathbf{x}_2'\boldsymbol{\beta})\right]$$
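A numeric sketch of those two contributions; all values below, thresholds and linear predictors alike, are assumed purely for illustration:

data ordinal_ll;
  t1 = -1.5; t2 = 0.5;          /* assumed thresholds */
  eta1 = -1.0;                  /* assumed x1'beta, subject 1 (y = 1) */
  eta2 =  0.0;                  /* assumed x2'beta, subject 2 (y = 3) */
  p1 = probnorm(t1 - eta1);     /* P(Y1 = 1) = Phi(t1 - eta1) */
  p3 = 1 - probnorm(t2 - eta2); /* P(Y2 = 3) = 1 - Phi(t2 - eta2) */
  loglik = log(p1) + log(p3);   /* joint log-likelihood contribution */
  put p1= p3= loglik=;
run;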
Solving for ML

• Let’s use Fisher’s scoring.
• For a problem with three or more categories, collect the unknowns as $\boldsymbol{\theta} = (\tau_1, \ldots, \tau_{m-1}, \boldsymbol{\beta}')'$.
• Now iterate
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \mathbf{I}(\boldsymbol{\theta}^{(t)})^{-1}\,\mathbf{g}(\boldsymbol{\theta}^{(t)})$$
Setting up Fisher’s scoring: second derivatives (see GF83 or HM84 for details).
Setting up Fisher’s scoring: first derivatives (see GF83 for details).