Tests of structural equation models do not work: What to do?

Willem E. Saris, ESADE, Universitat Ramon Llull

Concern about testing
• I have been worried about the testing procedures in SEM since my first contacts with it.
• More than 25 years ago Albert Satorra and I wrote our first paper on the power of the test.
• Our worries were not shared by the SEM community until recently (publication in SEM).
• I am very pleased that today I have the opportunity to convince you of our point of view.

Importance of testing
• The purpose of SEM is to estimate the strength of relationships between variables, correcting for measurement error.
• All estimates are conditional on the specified model.
• Therefore testing the models is essential for SEM.

Content of my lecture
• Brief introduction to SEM and the standard test
• Our criticism
• The alternative direction of the SEM community: fit indices
• The special case of RMSEA
• Why fit indices are not the solution
• Back to the basics
• An illustration

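Introduction to SEM by example
A frequently discussed issue nowadays is whether Social Trust is related to Political Trust. Both latent variables are normally measured by three indicators.
Path analysis suggests, for an indicator $y_i$ of factor $k$ and an indicator $y_j$ of factor $m$, with $\rho$ the correlation between the two latent variables:
• $\sigma_{ij} = \lambda_{ik}\lambda_{jm}$ if $k = m$
• $\sigma_{ij} = \lambda_{ik}\,\rho\,\lambda_{jm}$ if $k \neq m$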

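Estimation of effects
The parameters are estimated by minimizing the following quadratic form:
$f = \sum_{i,j} w_{ij}\,(s_{ij} - \sigma_{ij})^2$
Here $s_{ij}$ are the observed correlations, $\sigma_{ij}$ the correlations expected under the model, and $w_{ij}$ the weights.
The estimates are the values that minimize this function. The value of this function at its minimum is denoted by $f_0$.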

Imagine that this is the observed correlation matrix
Correlation Matrix
          y1       y2       y3       y4       y5       y6
       -------- -------- -------- -------- -------- --------
y1       1.00
y2       0.64     1.00
y3       0.64     0.64     1.00
y4       0.32     0.32     0.32     1.00
y5       0.32     0.32     0.32     0.64     1.00
y6       0.32     0.32     0.32     0.64     0.64     1.00

The estimates
LAMBDA-Y
          F 1      F 2
       -------- --------
y1       0.80      - -
y2       0.80      - -
y3       0.80      - -
y4        - -     0.80
y5        - -     0.80
y6        - -     0.80
Correlation of F1 with F2 = 0.50
We can estimate the relationships between latent variables and observed variables, but also those between latent variables.
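As a minimal sketch of how such estimates can be obtained, the quadratic form can be minimized numerically. The code assumes unit weights (w_ij = 1), fits only the 15 off-diagonal correlations, and uses SciPy's general-purpose BFGS optimizer. The function names and starting values are illustrative choices, not the procedure of any particular SEM program.

```python
import numpy as np
from scipy.optimize import minimize

# Observed correlation matrix from the example
S = np.full((6, 6), 0.32)
S[:3, :3] = 0.64
S[3:, 3:] = 0.64
np.fill_diagonal(S, 1.0)

# Positions of the 15 distinct off-diagonal correlations
rows, cols = np.tril_indices(6, k=-1)

def implied(theta):
    """Correlations implied by a two-factor model with three indicators each."""
    lam, rho = theta[:6], theta[6]
    Lambda = np.zeros((6, 2))
    Lambda[:3, 0] = lam[:3]   # y1..y3 load on F1
    Lambda[3:, 1] = lam[3:]   # y4..y6 load on F2
    Phi = np.array([[1.0, rho], [rho, 1.0]])
    return Lambda @ Phi @ Lambda.T

def fit_function(theta, weights=1.0):
    """f = sum of w_ij * (s_ij - sigma_ij)^2 over the off-diagonal elements."""
    diff = S[rows, cols] - implied(theta)[rows, cols]
    return np.sum(weights * diff ** 2)

start = np.full(7, 0.5)  # illustrative starting values (assumption)
result = minimize(fit_function, start, method="BFGS")
loadings, rho, f0 = result.x[:6], result.x[6], result.fun
```

With this input the optimizer should end up close to loadings of 0.80 and a factor correlation of 0.50, with f0 essentially zero.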

The residuals = differences between observed and expected correlations
Residuals
          y1       y2       y3       y4       y5       y6
       -------- -------- -------- -------- -------- --------
y1       0.00
y2       0.00     0.00
y3       0.00     0.00     0.00
y4       0.00     0.00     0.00     0.00
y5       0.00     0.00     0.00     0.00     0.00
y6       0.00     0.00     0.00     0.00     0.00     0.00
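The expected correlations follow from the path-analysis rules, which in matrix form read Sigma = Lambda Phi Lambda'. The sketch below rebuilds them from the estimates (loadings 0.80, factor correlation 0.50) and subtracts them from the observed matrix. Variable names are illustrative.

```python
import numpy as np

# Estimates from the example: loadings 0.80, factor correlation 0.50
Lambda = np.zeros((6, 2))
Lambda[:3, 0] = 0.8
Lambda[3:, 1] = 0.8
Phi = np.array([[1.0, 0.5], [0.5, 1.0]])

# Expected correlations; the diagonal is set to 1 (standardized variables)
expected = Lambda @ Phi @ Lambda.T
np.fill_diagonal(expected, 1.0)

# Observed correlation matrix from the example
observed = np.full((6, 6), 0.32)
observed[:3, :3] = 0.64
observed[3:, 3:] = 0.64
np.fill_diagonal(observed, 1.0)

residuals = observed - expected
print(np.round(residuals, 2))
```

Within a factor the rule gives 0.8 x 0.8 = 0.64 and across factors 0.8 x 0.5 x 0.8 = 0.32, so all residuals are zero up to rounding.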

Imagine that the model in the population is different

Now the estimates are also different
• These estimates deviate somewhat from the values in the population.
• The deviations are due to the misspecification.
• Can we detect that the hypothesized model is wrong?

The fitted residuals
• Based on these estimates the expected correlations can be calculated.
• The residuals (observed minus expected correlations) can indicate that the model is misspecified.
• In this case the residuals are:

When should the model be rejected?
• Residuals can differ from zero due to misspecification of the model, but also due to sampling fluctuations.
• So when should the model be rejected?

The quality the test should have
MacCallum, Browne and Sugawara (1996: 131): “if the model is truly a good model in terms of its fit in the population, we wish to avoid concluding that the model is a bad one. Alternatively, if the model is truly a bad one, we wish to avoid concluding that it is a good one.”

In statistical terms
Required is:
• A small probability of a Type I error, i.e. the probability of rejecting a good model
• A small probability of a Type II error, i.e. the probability of accepting a bad model

Bad models are misspecified models
Hu and Bentler (1998: 427): “a model is said to be misspecified when (a) one or more parameters are estimated whose population values are zeros (i.e. an over-parameterised misspecified model) (b) one or more parameters are fixed to zeros whose population values are non-zeros (i.e. an under-parameterised misspecified model) (c) or both.”

Definition of the size of a misspecification
• The size of the misspecification is the absolute difference between the true value of the parameter and the value specified in the analysis.
• In the above example the size of the misspecification was .2.

The standard chi2 test
• It can be shown that under very general conditions the test statistic $T = n f_0$ has a $\chi^2(df)$ distribution if the model is correct.
• The model is rejected if $T > C_\alpha$, where $C_\alpha$ is the value for which $\Pr(\chi^2(df) > C_\alpha) = \alpha$.
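The decision rule can be sketched with SciPy's chi-square distribution. The function names are hypothetical; `df` is the number of degrees of freedom of the model and `f0` the minimum of the fitting function.

```python
from scipy.stats import chi2

def critical_value(df, alpha=0.05):
    """C_alpha such that Pr(chi2(df) > C_alpha) = alpha."""
    return chi2.ppf(1.0 - alpha, df)

def reject_model(n, f0, df, alpha=0.05):
    """Standard test: reject the model if T = n * f0 exceeds C_alpha."""
    T = n * f0
    return T > critical_value(df, alpha)

print(round(critical_value(df=1), 2))  # critical value for df = 1, alpha = .05
```

For df = 1 and alpha = .05 the critical value is about 3.84. Note that the rule fixes only the probability of a Type I error.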

Criticism
• The specified test does not test directly for misspecifications in the model.
• The test checks possible consequences of misspecifications present in the residuals.
• The specified test only controls the Type I errors and not the Type II errors.

Can we evaluate Type II errors?
• It is well known that T has a noncentral $\chi^2(df, ncp)$ distribution if the model is incorrect.
• Due to a misspecification in the model, the mean of the distribution of T increases by what is called the noncentrality parameter (NCP).
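Given a value of the NCP, the power (one minus the probability of a Type II error) is the probability that a noncentral chi-square variable exceeds the critical value of the central one. A minimal sketch with SciPy follows; the function name is hypothetical.

```python
from scipy.stats import chi2, ncx2

def power_of_test(ncp, df, alpha=0.05):
    """Probability that T exceeds C_alpha when T ~ noncentral chi2(df, ncp)."""
    crit = chi2.ppf(1.0 - alpha, df)   # critical value under the correct model
    return ncx2.sf(crit, df, ncp)      # upper tail under the incorrect model

for ncp in (2.0, 7.85, 15.0):
    print(ncp, round(power_of_test(ncp, df=1), 3))
```

For df = 1 and alpha = .05 an NCP of roughly 7.85 corresponds to a power of about .80; smaller values of the NCP give a lower power.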

The Central and noncentral chi2 distribution and the power

The noncentrality parameter NCP
• The NCP can be computed, as shown by Satorra and Saris (1985), by generating population data and estimating the parameters with an incorrect model.
• The difference between the two models is the misspecification in the model.
• In that case the value of the test statistic T is equal to the NCP for this misspecification, given that the rest of the model is correct.
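This procedure can be sketched as follows. Everything specific here is an assumption for illustration: the population model has two factors with three indicators each, equal loadings and a factor correlation of .9; the hypothesized model is a one-factor model (factor correlation fixed at 1); and the maximum likelihood discrepancy function is used as the fitting function, minimized with SciPy's BFGS routine.

```python
import numpy as np
from scipy.optimize import minimize

def population_matrix(loading=0.8, rho=0.9, per_factor=3):
    """Population correlation matrix under the true two-factor model M1."""
    p = 2 * per_factor
    Lambda = np.zeros((p, 2))
    Lambda[:per_factor, 0] = loading
    Lambda[per_factor:, 1] = loading
    Phi = np.array([[1.0, rho], [rho, 1.0]])
    Sigma = Lambda @ Phi @ Lambda.T
    np.fill_diagonal(Sigma, 1.0)
    return Sigma

def ml_discrepancy(theta, S):
    """ML fitting function for the hypothesized one-factor model M0."""
    p = S.shape[0]
    lam, log_psi = theta[:p], theta[p:]
    # Unique variances are exp(log_psi), which keeps them positive
    Sigma0 = np.outer(lam, lam) + np.diag(np.exp(log_psi))
    _, logdet0 = np.linalg.slogdet(Sigma0)
    _, logdet_s = np.linalg.slogdet(S)
    return logdet0 + np.trace(S @ np.linalg.inv(Sigma0)) - logdet_s - p

def ncp(n, loading=0.8, rho=0.9, per_factor=3):
    """NCP = n * F0, with F0 the minimum of the fit of M0 to population data."""
    S = population_matrix(loading, rho, per_factor)
    p = S.shape[0]
    start = np.concatenate([np.full(p, 0.7), np.log(np.full(p, 0.5))])
    fit = minimize(ml_discrepancy, start, args=(S,), method="BFGS")
    return n * fit.fun

print(ncp(300, loading=0.8), ncp(300, loading=0.5))
```

Because the input is the population matrix itself, the only source of misfit is the misspecification. Running the same misspecification with lower loadings gives a smaller NCP, so the sensitivity of the test depends on more than the size of the error.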

An illustration

High power (left) and low power (right)
• High power is good for big errors, not for small errors.
• Low power is good for small errors, not for big errors.
• With loadings of .8 the left side applies. With loadings of .5 the right side applies for the same error.

The standard test is not good enough
• The standard test can only detect misspecifications for which the test is sensitive (high power).
• Rejection of the model can be due to very small misspecifications for which the test is very sensitive.
• Not rejecting the model does not mean that the model is correct: the test can be insensitive to the misspecifications.

The reasons for the problems
• Only Type I errors are taken into account.
• It is not a direct test of misspecifications but of consequences of misspecifications.
• These consequences (residuals) are also affected by other characteristics of the model.

This was not the mainstream problem
Hu and Bentler say: “the decision for accepting or rejecting a particular model may vary as a function of sample size, which is certainly not desirable.”
This problem with the chi2 test has led to the development of a plethora of fit indices.

Fit indices with cut-off criteria

Model evaluation with fit indices
• The traditional model evaluation method has been replaced by a similar procedure using fit indices (FIs).
• For fit indices that have a theoretical upper value of 1 for good-fitting models (such as AGFI and GFI), the model is rejected if $FI < C_{fi}$.
• There are, however, also FIs for which a theoretical lower value of 0 indicates a good fit; for them the model is rejected if $FI > C_{fi}$.
• Here $C_{fi}$ is a fixed cut-off value developed specifically for each FI.

Criticism
• For most indices the distribution is unknown.
• Arguments for critical values are made only on the basis of Monte Carlo experiments for specific cases.
• Only consequences for the residuals are evaluated and not the misspecifications themselves.

Goodness of fit by approximation
• Steiger (1990), Browne & Cudeck (1993) and MacCallum et al. (1996) have argued that models are always simplifications of reality and are therefore always misspecified.
• This has led to the most popular fit index nowadays: the Root Mean Square Error of Approximation (RMSEA).
• Although there is truth in this argument, this is not a good reason to completely change the approach to model testing.

This is not necessary
One has to design tests which take into account Type I and Type II errors so that:
• models with substantively relevant misspecifications are rejected, and
• models with substantively irrelevant misspecifications are accepted.

Serious problems
• The fit indices are functions of the fitting function.
• So they have the same serious problems as the standard test.
• Let us show that with very simple but fundamental models.
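To make this dependence concrete, the sample RMSEA is commonly computed directly from the test statistic T, and therefore from the minimum of the fitting function. The formula below is the common textbook version with n - 1 in the denominator; this is an assumption here, since the slides do not state it, and some programs use n instead.

```python
import math

def rmsea(T, df, n):
    """Sample RMSEA: sqrt(max(T - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(T - df, 0.0) / (df * (n - 1)))

# Illustrative values only: T = 16 with df = 8 in a sample of 201 cases
print(round(rmsea(T=16.0, df=8, n=201), 3))
```

Anything that inflates or deflates T for reasons unrelated to the size of the misspecification moves the RMSEA in the same direction.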

A model M0 with a substantively relevant misspecification
(Figure panels: population model M1, hypothesized model M0)
• The misspecification is in the correlated disturbance terms.
• The size of the misspecification is .2.
• If the misspecification is not detected, $\beta_{21}$ = .2 and not .0!
• This model should be rejected.

A model M0 with a substantively irrelevant misspecification
(Figure panels: population model M1, hypothesized model M0)
• The misspecification is in the correlated factors.
• The size of the misspecification is .05.
• For all practical purposes this model should be accepted.

Population data

Population study with different values of $\gamma_{22}$

A model M0 with a substantively irrelevant misspecification
(Figure panels: population model M1, hypothesized model M0)
• The misspecification is in the correlated factors.
• The size of the misspecification is .05.
• For all practical purposes this model should be accepted.

Population study of the factor model
(Table with columns S, SRMR and RMSEA)
• The better the measures are, the more likely it is that the model is rejected.
• This is not a very attractive test.

These examples show
• The model with a substantively relevant misspecification will most likely not be rejected.
• The model with a substantively irrelevant misspecification will most likely be rejected.
• This is the opposite of what all of us would like.

We see what should not happen
In contrast to what MacCallum, Browne and Sugawara (1996: 131) required:
• A bad model will not be rejected.
• A good model will be rejected.

Conclusion
We could say, paraphrasing Hu and Bentler: “the decision for accepting or rejecting a particular model may vary as a function of irrelevant parameters, which is certainly not desirable.”
So there are reasons enough to consider alternative procedures for testing these models.

Can information about the power help?
• We have thought that information about the power of the test can help to test hypotheses about single parameters or small sets of parameters.
• Let me illustrate this with the last example.

We want to test whether the factors measure the same thing, i.e. correlate perfectly
(Figure panels: population model M1, hypothesized model M0)
• The misspecification is in the correlated factors.
• What is the power of the chi2 test if the size of the misspecification is .10?

The power of the test

Now we can design the test
• Given that the loadings are around .8,
• and we accept a Type I error probability of .05 ($\alpha$),
• and we want to have a high power (.8) to detect a deviation of .1 or more,
• then we should have a sample size of at least 300 cases.
• In this case the model should be rejected if T > 3.84.
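The design step can be sketched as a search for the smallest sample size at which the power reaches the target. The sketch assumes the NCP equals n times the minimum of the fitting function obtained by fitting the hypothesized model to population data; the value `f0 = 0.03` used below is purely hypothetical and is not taken from the example.

```python
import numpy as np
from scipy.stats import chi2, ncx2

def required_sample_size(f0, df=1, alpha=0.05, target_power=0.80, n_max=100_000):
    """Smallest n with power >= target_power, assuming NCP = n * f0."""
    crit = chi2.ppf(1.0 - alpha, df)        # 3.84 for df = 1, alpha = .05
    ns = np.arange(2, n_max + 1)
    power = ncx2.sf(crit, df, ns * f0)      # power at every candidate n
    reached = np.nonzero(power >= target_power)[0]
    return int(ns[reached[0]]) if reached.size else None

n_needed = required_sample_size(f0=0.03)    # hypothetical f0
print(n_needed)
```

The result depends entirely on the population value f0, which in turn depends on the loadings and on the size of the deviation one wants to detect.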

Criticism
• The problem with this test is that we have to suppose that there are no other misspecifications in the model.
• If there are other misspecifications, they can be the cause of the rejection of the model.

There are many other possible errors

The situation is even worse
• The model test requires a test for all parameters.
• But the tests are unequally sensitive to misspecifications in different parameters.
• We can only expect that the test detects misspecifications for which the test is sensitive.
• This sensitivity depends on characteristics of the model that have nothing to do with the size of the misspecification.

For example NCP
