Introduction to and Overview of DEf: An R software package for cross-cultural research
E. Anthon Eff, Malcolm M. Dow, Wes Routon
Anthropological Sciences Conference, Albuquerque, March 18, 2014
The two major problems with cross-cultural data analysis addressed by DEf
1. Missing data
• All of the major cross-cultural data sets have substantial missing data.
• Data editing procedures, e.g. listwise deletion, generally result in small samples (loss of power) and also require very strong assumptions about why data are missing. These assumptions are very unlikely to hold.
• Single imputation methods (mean substitution, regression-predicted scores, hot deck, etc.) yield coefficient variance estimates that are biased downward. They are no longer recommended.
• DEf employs the Multiple Imputation by Chained Equations (mice) approach to handling missing data.
2. Non-independence of sample units
• Sample cases in cross-cultural and cross-national data are frequently not independent of one another, owing to various inter-societal network processes: cultural trait borrowing, conquest, emulation, inheritance from ancestral populations, etc.
• This is the classic Galton’s Problem in anthropology, understood more generally as the problem of cultural trait transmission.
• DEf addresses this issue by incorporating networks of relations into regression models and employing instrumental variables procedures to generate consistent and relatively efficient estimates.
First problem: Missing Data
Society          markin  markout  money  commland  sharefood
Nama Hottentot   NA      NA       1      NA        NA
Kung Bushmen     1       4        1      3         6
Thonga           4       4        3      3         6
Lozi             3       3        1      3         NA
Mbundu           NA      NA       4      NA        NA
Suku             2       2        4      2         2
Two solutions:
• Listwise deletion
• Multiple imputation
Listwise deletion
Society          markin  markout  money  commland  sharefood
Nama Hottentot   NA      NA       1      NA        NA
Kung Bushmen     1       4        1      3         6
Thonga           4       4        3      3         6
Lozi             3       3        1      3         NA
Mbundu           NA      NA       4      NA        NA
Suku             2       2        4      2         2
• Lose three observations (Nama Hottentot, Lozi, Mbundu), and with them all of the information in their observed cells (marked in red on the original slide).
• Of 186 societies, 156 would have been dropped using listwise deletion.
• No longer testing against the full range of human societies, which loses the big advantage of the SCCS.
• Probable sample selection bias.
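The cost of listwise deletion on the six-society table can be counted directly. The following is a minimal Python sketch, not DEf code; `None` stands in for NA and the function name `listwise_delete` is made up for illustration.

```python
# Toy data from the slide; None stands for NA.
# Columns: markin, markout, money, commland, sharefood
rows = {
    "Nama Hottentot": [None, None, 1, None, None],
    "Kung Bushmen":   [1, 4, 1, 3, 6],
    "Thonga":         [4, 4, 3, 3, 6],
    "Lozi":           [3, 3, 1, 3, None],
    "Mbundu":         [None, None, 4, None, None],
    "Suku":           [2, 2, 4, 2, 2],
}

def listwise_delete(data):
    """Keep only the societies with no missing value in any column."""
    return {name: vals for name, vals in data.items() if None not in vals}

complete = listwise_delete(rows)

# Observed values that are thrown away along with the incomplete rows.
discarded = sum(
    1
    for name, vals in rows.items()
    if name not in complete
    for v in vals
    if v is not None
)

print(sorted(complete))
print(len(rows) - len(complete), "rows dropped;", discarded, "observed values discarded")
```

Half of the rows survive in this toy table; in the full SCCS example on the slide the loss is far larger (156 of 186 societies).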
Multiple imputation
• Replace missing values with imputed values, drawn from the conditional distribution.
• Create several (5 to 10) new data sets with imputed values.
Imputed data set 1
Society          markin  markout  money  commland  sharefood
Nama Hottentot   3       4        1      2         3
Kung Bushmen     1       4        1      3         6
Thonga           4       4        3      3         6
Lozi             3       3        1      3         6
Mbundu           4       5        4      3         1
Suku             2       2        4      2         2
Imputed data set 2
Society          markin  markout  money  commland  sharefood
Nama Hottentot   2       3        1      1         2
Kung Bushmen     1       4        1      3         6
Thonga           4       4        3      3         6
Lozi             3       3        1      3         4
Mbundu           3       5        4      5         3
Suku             2       2        4      2         2
Imputed data set 3
Society          markin  markout  money  commland  sharefood
Nama Hottentot   3       5        1      2         3
Kung Bushmen     1       4        1      3         6
Thonga           4       4        3      3         6
Lozi             3       3        1      3         5
Mbundu           2       6        4      4         2
Suku             2       2        4      2         2
Step 1 of the DEf approach to multiple imputation of missing data: finding auxiliary variables
• The mice procedure imputes values for missing observations on the variables specified in the structural regression model of interest, using both these variables themselves and a set of auxiliary variables.
• Ideal auxiliary variables are usually a subset of those with no missing values in the full data set.
• Auxiliary variables must be correlated with the variables in the structural regression model that have missing values, since the imputation procedure is designed to “borrow” information from them to help impute the missing values.
• DEf will employ auxiliary variables provided by the user. Alternatively, DEf will identify suitable auxiliary variables as follows:
1. Identify all categorical, ordinal, and interval variables with no missing values in the complete data set.
2. Identify the variables that one wants to impute and, one at a time, treat each as a dependent variable:
i) regress the dependent variable (using binary/ordinal logit, multinomial, or OLS) on the covariate with which it is most highly correlated, and save the residual;
ii) add to the regression model the covariate that correlates most highly with the residual, and calculate the new residual;
iii) repeat the above steps 8 times (or more);
iv) calculate the relative importance of the predictors, drop variables that fall below a given threshold, and recalculate the residual;
v) repeat steps ii to iv.
Step 2: Create m complete data sets
• The mice procedure is repeated m times to create m copies of the data set, each containing a different set of imputed values.
• Since each data set is now complete, each can be analyzed using any of the usual statistical models that require complete data.
• m = 10 to 100 is currently suggested, depending on sample size and the amount of missing data.
Step 3: Analyzing the data and pooling the results: Rubin’s Rules
Analyzing the data and pooling the results (cont.)
Rubin’s pooling procedures can be applied to any statistic generated by the statistical method used to analyze the m imputed data sets.
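For a single coefficient, Rubin’s rules in their standard form can be sketched as follows. This is a minimal illustration, not DEf code: the pooled estimate is the mean of the m estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The estimates and standard errors below are hypothetical numbers chosen only for the example.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed data sets (Rubin's rules).

    estimates: the m point estimates of the coefficient
    variances: the m squared standard errors of that coefficient
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = mean(estimates)              # pooled point estimate
    within = mean(variances)             # average within-imputation variance
    between = variance(estimates)        # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical results from m = 5 imputed data sets.
estimates = [0.50, 0.60, 0.40, 0.55, 0.45]
std_errors = [0.10, 0.12, 0.11, 0.10, 0.12]

q_bar, total_var = pool_rubin(estimates, [se ** 2 for se in std_errors])
print("pooled estimate:", round(q_bar, 4))
print("pooled standard error:", round(total_var ** 0.5, 4))
```

The pooled standard error is larger than any single-data-set standard error whenever the estimates disagree across imputations, which is how multiple imputation avoids the downward bias of single imputation.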
Galton’s Problem
Incorporating inter-societal networks into network autocorrelation effects regression models
Galton’s problem
Observations are not independent:
• Common descent (language phylogeny)
• Cultural borrowing (geographic distance)
In a regression context, Galton’s problem will cause biased coefficients and biased standard errors.
Galton’s problem example
Hypothesis: Drinking alcohol dampens the libido of religious specialists.
Society   alcohol  wives
Ecuador   1        0
Iran      0        2
Ireland   1        0
Morocco   0        3
Spain     1        0
Yemen     0        4
Pearson correlation = -0.9332565, p-value = 0.0065
Adapted from Victor de Munck and Andrey Korotayev. 2000. “Cultural Units in Cross-Cultural Research.” Ethnology 39(4): 335-348.
An observed correlation between a pair of cultural traits across cultures could be due to the borrowing of the traits, as a package, from a common source (“horizontal transmission”), to their transmission, as a package, from a common ancestor (“vertical transmission”), or to a true functional relationship.
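The correlation quoted for the six societies can be recomputed from the table. The sketch below is plain Python written for this illustration; it should reproduce the coefficient on the slide, but it does not compute the p-value.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Order: Ecuador, Iran, Ireland, Morocco, Spain, Yemen
alcohol = [1, 0, 1, 0, 1, 0]
wives = [0, 2, 0, 3, 0, 4]

r = pearson(alcohol, wives)
print(round(r, 7))
```

The strength of the coefficient says nothing about which of the three explanations (horizontal transmission, vertical transmission, functional relationship) produced it, which is the point of the example.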
What processes might be inducing non-independence?
• Spatial diffusion: societies in close proximity have more opportunity to emulate, conform to, adopt, or borrow their neighbors’ behaviors, beliefs, customs, rituals, etc. (horizontal diffusion).
• Language similarity: similarity due to populations splitting off from the same ancestral population (vertical diffusion).
• Religion: marriage practices spread worldwide through the colonization of large swaths of the world by European Christian nations.
• Equivalence: units “similarly situated” in a network and not necessarily proximate, e.g. economic similarity, core/periphery position in the world system, colonial status, ecological setting.
Assessing non-independence: Tobler’s First Law of Geography
“Everything is related to everything else, but near things are more closely related than distant things.”
• This “law” suggests that the score on variable y for the i-th society should be similar to the scores of those societies with which it has the closest relationships. Call these societies i’s “neighborhood set.”
• If so, y_i should be similar to the weighted average of the y scores in i’s neighborhood set, where the weights indicate relative closeness.
• If the N scores on y are significantly correlated with the N weighted-average scores, conclude that the variable y is auto-(self-)correlated.
Weighting sample units
• First, construct an N×N connectivity matrix C of pairwise relatedness scores among sample units.
• Then row-normalize C to unity to get the required weights matrix W. That is, w_ij = c_ij / Σ_j c_ij.
• If a variable y is premultiplied by W, the product Wy is an N×1 vector of weighted averages that are on the same scale as y.
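The row-normalization and the weighted averages Wy can be sketched in a few lines. This is an illustrative Python sketch, not DEf code: the 4×4 connectivity matrix C and the scores y are hypothetical, and the final correlation between y and Wy is only the descriptive quantity behind the autocorrelation check, without a significance test.

```python
from math import sqrt

def row_normalize(c):
    """Return W with w_ij = c_ij / sum_j c_ij (rows with no ties stay all zero)."""
    w = []
    for row in c:
        total = sum(row)
        w.append([v / total for v in row] if total > 0 else [0.0] * len(row))
    return w

def mat_vec(w, y):
    """Matrix-vector product: element i is the weighted average of y over i's neighbors."""
    return [sum(wij * yj for wij, yj in zip(row, y)) for row in w]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Hypothetical relatedness scores among four societies (zero diagonal).
C = [[0, 2, 1, 0],
     [2, 0, 1, 1],
     [1, 1, 0, 2],
     [0, 1, 2, 0]]
y = [1.0, 2.0, 4.0, 5.0]

W = row_normalize(C)
Wy = mat_vec(W, y)

print([round(v, 3) for v in Wy])          # neighborhood averages, same scale as y
print(round(pearson(y, Wy), 3))           # descriptive autocorrelation of y
```

Because every row of W sums to one, each element of Wy necessarily lies between the smallest and largest y score.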
Incorporating autocorrelated variables into multiple regression
• Cross-cultural researchers are usually interested in testing whether hypothesized predictor variables are acting on a dependent variable, as well as in identifying the processes that induce autocorrelation in it.
• The network autocorrelation effects regression models in DEf do just that.
The most commonly used network autocorrelation regression model
Network autocorrelation effects model: y = α + ρWy + Xβ + ε
Where:
• W is a row-normalized N×N weighting matrix with w_ij > 0 if i and j are related, 0 otherwise, and w_ii = 0 for all i;
• ρ is the network autocorrelation coefficient;
• y is an N×1 vector;
• Wy is an N×1 vector in which each element i is a weighted average of the y values for i’s neighborhood set;
• X is an N×k matrix of exogenous variables;
• β is a k×1 vector of coefficients;
• ε is an N×1 vector of error terms.
Also called the network “lag” model, by analogy to time series, since W acts similarly to the lag operator in time series models, except that W lags the y variable in other kinds of social and physical “spaces.”
This is the model currently implemented in DEf.
Estimating the network autocorrelation effects regression model y = α + ρWy + Xβ + ε
• MLE (Maximum Likelihood Estimation): usually the method of choice. But the log-likelihood function contains the term ln|A|, where A = (I − ρW). Since A is asymmetric and usually not sparse, finding the eigenvalues is computationally burdensome for large N. And for more than two endogenous Wy variables, the likelihood function is intractable.
• OLS (Ordinary Least Squares): the basic assumption of OLS is that all right-hand-side variables are uncorrelated with the error term ε. If not, all coefficient estimates (ρ and β) are biased and inconsistent. Here, y is by definition a function of ε, so Wy is also a function of ε. That is, Cov(Wy, ε) ≠ 0, and Wy is thus an endogenous regressor.
• IV (Instrumental Variables): provides a way to obtain consistent parameter estimates for models with endogenous variables. 2SLS is an IV estimation procedure. It can deal with large samples and multiple endogenous variables.
DEf uses IV estimation procedures.
An “intuitive” view of the IV regression approach
OLS model: y = α + ρWy + ε
[Path diagram relating ε, Z, Wy, and y]
• Z is an instrument for Wy if Cov(Z, ε) = 0 (Z is valid) and Cov(Z, Wy) ≠ 0 (Z is relevant).
• So we need to find one or more additional variables Z that are correlated with Wy but uncorrelated with ε to serve as instruments for Wy.
An “intuitive” view of the 2SLS IV estimation procedure
Consider again the network effects model y = α + ρWy + Xβ + ε.
Suppose we use WX, the lagged values of X, as an instrument for Wy.
Step 1. Using OLS, estimate Wy = a + WXc + υ. Save the predicted scores ŷw = â + WXĉ.
Step 2. Again using OLS, estimate y = α + ρŷw + Xβ + ε.
(Note: standard 2SLS also includes X among the first-stage regressors. The standard errors reported by the step 2 regression are incorrect; this is not an issue for the one-step procedures used in all the usual software packages.)
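The two steps can be traced on a toy example. The following Python sketch is illustrative only and is not DEf’s implementation: the six-unit ring network, the X values, and the true parameters are all hypothetical. Two assumptions are worth flagging. First, the first stage regresses Wy on X as well as WX, as standard 2SLS does. Second, the data are generated without noise, purely so that the mechanics can be checked against known values; no standard errors are computed.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(cols, y):
    """OLS coefficients of y on the given regressor columns (normal equations)."""
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    return solve(xtx, xty)

def fitted(cols, coef):
    """Predicted values from regressor columns and their coefficients."""
    return [sum(b * col[i] for b, col in zip(coef, cols)) for i in range(len(cols[0]))]

def mat_vec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

# Hypothetical ring network: each unit is tied equally to its two neighbors.
n = 6
W = [[0.5 if j in ((i - 1) % n, (i + 1) % n) else 0.0 for j in range(n)] for i in range(n)]
X = [0.5, 1.5, -1.0, 2.0, 0.0, 3.0]
alpha, rho, beta = 1.0, 0.4, 2.0          # hypothetical true parameters

# Noise-free data from the reduced form: (I - rho W) y = alpha + X beta.
A = [[(1.0 if i == j else 0.0) - rho * W[i][j] for j in range(n)] for i in range(n)]
y = solve(A, [alpha + beta * x for x in X])

ones = [1.0] * n
Wy = mat_vec(W, y)
WX = mat_vec(W, X)

# Step 1: regress Wy on the instruments (constant, X, WX) and keep the predictions.
stage1 = [ones, X, WX]
yw_hat = fitted(stage1, ols(stage1, Wy))

# Step 2: regress y on the constant, the predicted Wy, and X.
alpha_hat, rho_hat, beta_hat = ols([ones, yw_hat, X], y)
print(round(alpha_hat, 6), round(rho_hat, 6), round(beta_hat, 6))
```

With noisy data the step 2 coefficients would still be the 2SLS estimates, but their standard errors would need the correction that one-step IV routines apply.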
2SLS Estimation of the network autocorrelation effects regression model with IVs: general case y = α + Xβ + ε
Where to get appropriate instruments?
• Usually, it is hard to find additional variables that meet the required conditions. Variables that affect the endogenous variable(s) are often also likely to affect the dependent variable.
• Kelejian and Prucha (1998) show that the set of variables {WX, W²X, W³X, …} is optimal as instruments for Wy, where W², W³, … are the 2-step and 3-step connections between sample units. In practice, the WX variables or some subset of them will usually be sufficient.
Evaluating the quality of the instrumental variables
The quality of 2SLS estimators depends on the quality of the IVs. Requirements:
• IVs must be valid: Cov(Z, ε) = 0. IV estimation is vulnerable on this point. Tests are available only if there are more instruments than endogenous variables (overidentification).
• IVs must be relevant: they should predict the endogenous variables independently of the other exogenous variables. Shea (1997) proposed a partial R² measure of instrument relevance for models with multiple endogenous variables.
• A merely marginal association between the endogenous variable(s) and Z is known as the “weak instruments” problem. Some diagnostics are available.
• There must be no perfect collinearity among the exogenous variables.
Overidentification tests
• If more than one instrumental variable is available for Wy, one can test the null hypothesis that all of the instruments are uncorrelated with the errors, against the alternative that at least one of them is correlated with the errors.
• Sargan (1958) is the best-known test: T_s = N·R²_u ~ χ² (with df = #IVs − #endogenous variables), where R²_u is the R² from the OLS regression of the 2SLS residuals on the IVs.
• Basmann (1960) provides an alternative, though similar, test.
• Kirby and Bollen (2009) discuss additional variants of the Sargan and Basmann tests in the context of SEM.
“Weak” Instruments
• Bound et al. (1995) show that when the instruments are only weakly correlated with the endogenous variables, IV estimates are biased in the same direction as OLS estimates and may be more biased than OLS. In addition, weak IV regression estimates may not be consistent.
• Staiger and Stock (1997) suggest that the partial F-statistic for the increase in the first-stage regression R² after adding the auxiliary instruments to the exogenous variables should be greater than 10.
• Stock and Yogo (2005) provide tables that give some guidance as to how much greater than 10 the F-statistic may have to be.
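The partial F-statistic in the Staiger and Stock rule of thumb is the usual F-test for the change in R² between nested regressions, and can be sketched as below. The function name and the R² values are hypothetical; N = 186 matches the number of SCCS societies mentioned earlier, and the parameter count of 8 is assumed for the example.

```python
def first_stage_partial_f(r2_restricted, r2_full, n_obs, n_instruments, n_params_full):
    """F-test for the excluded instruments in the first-stage regression.

    r2_restricted: R^2 of the endogenous variable on the exogenous X only
    r2_full:       R^2 after adding the instruments
    n_instruments: number of instruments added
    n_params_full: number of coefficients in the full first stage, intercept included
    """
    numerator = (r2_full - r2_restricted) / n_instruments
    denominator = (1.0 - r2_full) / (n_obs - n_params_full)
    return numerator / denominator

# Hypothetical first-stage results.
f_stat = first_stage_partial_f(0.20, 0.35, n_obs=186, n_instruments=3, n_params_full=8)
print(round(f_stat, 2), "passes the F > 10 rule of thumb:", f_stat > 10)
```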
Example: Monogamy in the Pre-industrial World Multiple proposed determinants of the long-term historical shift in marriage preference from polygynous to monogamous unions are tested using data from the Standard Cross-Cultural Sample.
W matrices employed
• Geographical distance: the W_D matrix is described in Dow and Eff (2009), where c_ij = (1/d_ij)². Only the nearest 20 societies are used.
• Language similarity: the W_L matrix is described in Eff (2008), where c_ij = e^(-score(ij)).
If the Ws are collinear, they can be combined into a single matrix: W_DL = π_D·W_D + π_L·W_L, where 0 ≤ π_D, π_L ≤ 1 and π_D + π_L = 1.
Then estimate the model for each candidate combination W_DL and select as “best” the matrix that maximizes R²_iv. This also yields the weights (π_D, π_L) of the “best” combined W.
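The construction of a nearest-neighbor distance matrix and of the convex combinations W_DL can be sketched as follows. This is a hypothetical Python illustration, not DEf code: the distances, the language matrix `WL`, and the use of k = 2 neighbors (instead of 20) are invented for a four-unit toy case, and the model-fitting step that scores each candidate by R²_iv is left out.

```python
def row_normalize(c):
    """Return W with w_ij = c_ij / sum_j c_ij."""
    return [[v / sum(row) for v in row] for row in c]

def distance_weights(d, k):
    """c_ij = (1/d_ij)^2 for the k nearest other units, 0 elsewhere; then row-normalize."""
    n = len(d)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])[:k]
        for j in nearest:
            c[i][j] = (1.0 / d[i][j]) ** 2
    return row_normalize(c)

def combine(wd, wl, pi_d):
    """W_DL = pi_D * W_D + pi_L * W_L with pi_L = 1 - pi_D."""
    pi_l = 1.0 - pi_d
    return [[pi_d * a + pi_l * b for a, b in zip(rd, rl)] for rd, rl in zip(wd, wl)]

# Hypothetical pairwise distances among four societies.
D = [[0, 1, 2, 4],
     [1, 0, 1, 3],
     [2, 1, 0, 2],
     [4, 3, 2, 0]]
WD = distance_weights(D, k=2)

# Hypothetical row-normalized language-similarity weights.
WL = [[0.0, 0.5, 0.5, 0.0],
      [0.5, 0.0, 0.5, 0.0],
      [0.25, 0.25, 0.0, 0.5],
      [0.0, 0.0, 1.0, 0.0]]

# Candidate matrices on a grid of pi_D values; each would be scored by R^2_iv.
candidates = {step / 10: combine(WD, WL, step / 10) for step in range(11)}
for pi_d, w in candidates.items():
    print(pi_d, [round(sum(row), 6) for row in w])
```

Because both inputs are row-normalized and the weights sum to one, every candidate W_DL is itself row-normalized, so no renormalization is needed inside the grid search.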
2SLS estimation of the network autocorrelation regression model using the composite distance/language W matrix
The dependent variable is a Box-Cox transform of the percentage of married females in monogamous marriages: (monofem^λ − 1)/λ.
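The Box-Cox transform named above is simple to state in code. A minimal sketch follows; the λ values are hypothetical (the slide does not report the one used), the input must be strictly positive, and the λ = 0 case uses the natural log, which is the limit of the formula.

```python
from math import log

def box_cox(y, lam):
    """Box-Cox transform (y**lam - 1) / lam, with log(y) as the lam = 0 limit."""
    if y <= 0:
        raise ValueError("Box-Cox requires a strictly positive value")
    if lam == 0:
        return log(y)
    return (y ** lam - 1.0) / lam

# Hypothetical percentages of married females in monogamous marriages.
for pct in (10.0, 50.0, 90.0):
    print(pct, round(box_cox(pct, 0.5), 4), round(box_cox(pct, 0), 4))
```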
Summary
• DEf is a new statistical package designed for cross-cultural and cross-national data sets.
• Given the ubiquity of missing data in such data sets, DEf includes a suite of programs for multiple imputation of missing data.
• Given that sample units in comparative data sets are non-independent due to various processes of cultural trait diffusion, DEf includes a suite of programs to implement network autocorrelation effects models.
• Available as an R workspace and on the XSEDE CoSSci/DEf Science Gateway.