MLwiN a few pointers

MLwiN a few pointers In this session we give a little guidance on how MLwiN works for some generic functions: data import, data manipulation and graphing. Using MLwiN for statistical modelling is not the topic of this session; that will be covered in later practical sessions.

Data import The latest version of MLwiN can directly import and export STATA, MINITAB and SPSS worksheets. These options are located on the File menu (Open and Save worksheet). Sometimes it is convenient to paste data in from the clipboard; see http://www.cmm.bristol.ac.uk/MLwiN/tech-support/support-faqs/data-in/index.shtml TIP: don’t import vast numbers of variables. Concentrate on the ones you need. Usually somewhere between 5 and 50 is appropriate.

Commands and GUI MLwiN has a Graphical User Interface (GUI) and a scripting language. In this workshop we will use the GUI almost exclusively. However, some enthusiasts may want to learn the scripting language. It is documented in the Help system. A slightly out-of-date command manual can be downloaded from http://www.cmm.bristol.ac.uk/MLwiN/download/manuals.shtml Many people use the package of their choice for data management and then import into MLwiN for multilevel modelling and visualisations. Other people use MLwiN for data management and preparation as well as multilevel analysis.

Opening the software The main window shows: menus with options for data manipulation, graphics and modelling; options for controlling model estimation; a work area where windows for specific tasks appear; and progress reporting.

Opening a worksheet When a worksheet is opened, a variable summary window called the “Names window” always appears.

Names window: one row per variable The window's controls let you: toggle the selected variable between categorical and numeric; view data for selected variables; view or edit category names; name or rename variables; enter additional descriptive text for a variable; and view only columns containing data. Defining a variable as categorical or numeric affects how the variable is treated when entered into a model.

Data Manipulation Menu Many common data manipulation, viewing and editing tasks can be achieved via windows brought up from the Data Manipulation menu. For example, selecting Data Manipulation → View or Edit Data brings up the data window, with one row per “case”. Note that these data are arranged as a flat file, with values for school-level variables repeated on each row. But be careful…

MLwiN is not a strict record based system MLwiN has, by default, 1500 columns that can contain variables. This number can be increased by selecting Worksheet from the Options menu. MLwiN is not a record-based system: it allows variables in a worksheet to be of different lengths. It is like a spreadsheet in this respect. Be careful: variables can get out of alignment.

Getting things out of alignment For example, if we sort on the variable normexam alone, it ends up out of alignment with the other variables. Clicking the undo button restores the data.

Carrying: Re-ordering and selection operations When we re-order a variable or change the length of a variable, we must explicitly “carry” all other variables that are linked horizontally in the same record. The sort window has a similar structure to many of the data manipulation windows in MLwiN. So carrying with the sort window, to ensure everything stays lined up, looks like this…

Carrying… The left panel defines the operation: the variables to be carried and the destination columns (the same as the input, or new). Buttons allow removing particular actions, removing all actions, executing the actions, or undoing the actions. The right panel lists the actions requested. NOTE: UNDO is only available while the data manipulation window for the task at hand (in this case sort) is open, so inspect the data before closing the particular data manipulation window you are using.
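
The idea of carrying can be sketched outside MLwiN. The following is a minimal illustration in Python, not MLwiN's own implementation: plain lists stand in for worksheet columns, the values are made up, and `sort_with_carry` is a hypothetical helper name.

```python
# Illustrative only: plain Python lists stand in for MLwiN worksheet columns.
# The values are made up for the example.
school = [1, 1, 2, 2]
normexam = [0.26, -1.51, 0.13, -0.72]


def sort_with_carry(key, carried):
    """Sort `key` ascending and reorder every carried column the same way."""
    order = sorted(range(len(key)), key=lambda i: key[i])
    sorted_key = [key[i] for i in order]
    sorted_carried = [[col[i] for i in order] for col in carried]
    return sorted_key, sorted_carried


# Sorting one column on its own breaks the link between the columns:
# the second row now pairs school 1 with a score that belonged to school 2.
misaligned = list(zip(school, sorted(normexam)))

# Sorting with the other column carried keeps each record intact.
sorted_exam, (carried_school,) = sort_with_carry(normexam, [school])
aligned = list(zip(carried_school, sorted_exam))
```

After the carried sort, every (school, normexam) pair is one of the original records; after the uncarried sort, some pairs never existed in the data.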

Missing Data If you have a single value coding for missing data, you can set this via Options → Numbers. If you are importing a STATA, SPSS or MINITAB worksheet, MLwiN should recognise the system missing codes from these software packages. If you paste data in and a variable has a unique non-numeric code sequence such as “*” or “.” or “???”, that code will be interpreted as missing.
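
As a rough illustration of the pasting rule, the sketch below converts pasted text tokens to numbers and marks a non-numeric code as missing. It is a simplification written in Python, not a description of how MLwiN parses the clipboard: `parse_column` is a hypothetical helper, the data are made up, and it treats any non-numeric token as missing rather than checking that the code is unique within the variable.

```python
import math


def parse_column(tokens):
    """Convert pasted text tokens to floats; non-numeric tokens become NaN."""
    values = []
    for token in tokens:
        try:
            values.append(float(token))
        except ValueError:
            # A code such as "*" cannot be read as a number: treat as missing.
            values.append(math.nan)
    return values


# Made-up pasted column that uses "*" as its missing-data code.
pasted = ["0.26", "*", "-1.51", "*", "0.13"]
column = parse_column(pasted)
n_missing = sum(math.isnan(v) for v in column)
```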

Graphics MLwiN has a range of graphical functionality, including various types of static and interactive visualisation to aid in model interpretation and exploration. For the moment we review some of the standard graphics available via the Graphs → Customised Graphs window.

Three layers of graphics output available They can be specified in the customised graphs window. Firstly, a display. A display can contain multiple graphs (up to 25). Only one display can be viewed at a time, but you can have up to ten displays and switch between them. Secondly, a graph. There are two in this display. Finally, a graph can contain multiple data sets. A data set is a single variable (for histograms) or an x-y pair of variables (for other plots). The left graph is a histogram of the variable normexam. The right graph contains two data sets: Y=normexam, X=standlrt (in blue) and Y=normexam, X=gender (in red).

Filling out the customised graph window The currently selected display is number 1, which contains 3 data sets, plotted in two graphs. The currently selected data set is DS 2. Details of how a particular data set is to be plotted, including which graph it is plotted on, are handled by the right panel.

Save your worksheet If you save a worksheet, it saves the data and any current graphs, the current statistical model and its results, and any tables of results from multiple models you have built. Save your worksheet regularly. Regard anything unsaved as hostage to fortune: either a system crash or a user “mistake” that leads to data being irreversibly re-arranged, e.g. deleting columns or getting variables out of alignment.

Multiple regression: a refresher In this and other sessions we will be using data from the 2002 European Social Survey (ESS). Measures of ten human values have been constructed for 20 countries in the European Union. We will study one of the ten values, hedonism, defined as the ‘pleasure and sensuous gratification for oneself’. The scores on the hedonism variable range from -3.76 to 2.90, where higher scores indicate more hedonistic beliefs. In this session we consider the application of multiple regression to a subset of the data for three countries only: UK, Germany and France. Hedonism is taken as the outcome variable in our analysis. We consider three explanatory variables: • Age in years • Gender (coded 0 for male and 1 for female) • Country (coded 1 for the UK, 2 for Germany and 3 for France)

Regression with a single continuous explanatory variable A line of “best fit” through the data: $y_i = \beta_0 + \beta_1 x_i + e_i$. Intercept ($\beta_0$): height of the line at x = 0. Slope ($\beta_1$): increase in y for a 1-unit increase in x. Residual ($e_i$): departure of a point from the predicted line. Ordinary least squares estimates $\beta_0$ and $\beta_1$ to minimise the sum of the squared values of $e_i$. [Figure: scatter of points y1 to y5 against x (axis from -3 to 3) with the fitted line.]
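
For a single predictor, the least squares estimates have a simple closed form (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x). The sketch below applies it in plain Python to five made-up points; `ols_fit` is a hypothetical helper name, and this is not how MLwiN itself estimates models.

```python
def ols_fit(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sxy: co-variation of x and y about their means; Sxx: variation of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data for illustration only.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.9, -1.2, 0.1, 0.8, 2.2]

b0, b1 = ols_fit(x, y)
# Residual: departure of each point from the fitted line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```

With an intercept in the model, the residuals from a least squares fit always sum to zero, which is a quick check on the calculation.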

Terminology Y : response variable, outcome variable, dependent variable X : explanatory variable, predictor variable, independent variable

Linear regression with a continuous predictor: Research questions Is there an association between y and x? For example, in the values data set, is there an association between hedonism (y) and age (x)?

Interpretation The fitted line is $\hat{y}_i = 0.712 - 0.018\,\mathrm{age}_i$. For every one-year increase in age, hedonism decreases by 0.018 units. At age = 0 (x = 0) the average hedonism level is 0.712. The notion of the hedonism score of a newborn baby, where hedonism is measured by answers to survey questions put to people aged 14 to 98 years, is not very meaningful.

Centering When an x value of 0 is outside the range of x, so that the interpretation of the intercept is not meaningful, people often center the x variable. In our data set we can center age around its average value of 46 years. The intercept is then the predicted hedonism score at age 46 rather than at age 0. Note that centering a predictor variable does not change the estimate of the slope or the position of the regression line through the data.
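
The effect of centering can be checked numerically. The sketch below uses made-up ages and scores (chosen so that the mean age is 46) and a hypothetical helper `ols_fit`; it is not the ESS data and will not reproduce the estimates quoted in these slides.

```python
def ols_fit(x, y):
    """Return (intercept, slope) from the closed-form least squares solution."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up data for illustration only.
age = [20, 35, 46, 60, 69]
hedonism = [0.4, 0.1, -0.2, -0.3, -0.6]

mean_age = sum(age) / len(age)  # 46.0 for these made-up ages
age_centred = [a - mean_age for a in age]

raw_intercept, raw_slope = ols_fit(age, hedonism)
centred_intercept, centred_slope = ols_fit(age_centred, hedonism)

# The slope is unchanged; only the intercept moves, from the prediction
# at age 0 to the prediction at the mean age.
prediction_at_mean_age = raw_intercept + raw_slope * mean_age
```

Here `centred_slope` equals `raw_slope`, and `centred_intercept` equals `prediction_at_mean_age`: the line is the same, only the origin of the x axis has moved.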

Linear regression with a continuous explanatory variable: Assumptions 1. Independence. The residuals ($e_i$) are assumed to be independent of each other. This means that knowing the value of the residual for one person tells us nothing about the value of the residual for any other person. The residuals are also assumed to be independent of x, that is, $\mathrm{cov}(x_i, e_i) = 0$. 2. Normality. The residuals follow a Normal distribution, that is, $e_i \sim N(0, \sigma_e^2)$. 3. Constant variance. The variance of the residuals is constant with respect to x. This is known as homoskedasticity.

Constant variance assumption [Figure: two scatter plots of y against x (x axis from -3 to 3). In one, the residual variance is constant with respect to x: homoskedasticity. In the other, the residual variance is not constant with respect to x: heteroskedasticity.]

Checking the model assumptions We can evaluate the validity of assumptions 2 and 3 by use of diagnostic plots. Assumption 2: Normality. Standardised residuals plotted against Normal scores of the standardised residuals should lie on a straight line. Assumption 3: Constant variance. The vertical scatter of points should be roughly the same for any value of x. Assumption 1: Independence. If we suspect that the residuals are not independent of each other, then we can fit more complex models to test this, for example a multilevel model.
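
The quantities behind the Normality plot can be computed by hand. The sketch below is an assumption-laden illustration in Python: the residuals are made up, they are standardised simply by their own mean and standard deviation, and the Normal scores use the plotting position (rank - 0.5) / n, which is one common convention and not necessarily the one MLwiN uses.

```python
from statistics import NormalDist, mean, pstdev


def standardise(residuals):
    """Rescale residuals to mean 0 and standard deviation 1."""
    m = mean(residuals)
    s = pstdev(residuals)
    return [(r - m) / s for r in residuals]


def normal_scores(n):
    """Standard Normal quantiles for ranks 1..n, using (rank - 0.5) / n."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((rank - 0.5) / n) for rank in range(1, n + 1)]


# Made-up residuals for illustration only.
residuals = [0.3, -1.1, 0.8, -0.2, 0.1, 1.4, -0.9, -0.4]

std_sorted = sorted(standardise(residuals))
scores = normal_scores(len(residuals))

# Points to plot: Normal score on one axis, ordered standardised residual on
# the other. Roughly Normal residuals give points close to a straight line.
pairs = list(zip(scores, std_sorted))
```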

Hypothesis testing: p values Null hypothesis: there is no relationship between hedonism and age in the population ($\beta_1 = 0$), and the relationship we observe in the sample could have arisen by chance. Alternative hypothesis: there is a relationship in the population ($\beta_1 \neq 0$). The standard error is a measure of the imprecision of our estimates (as the standard error gets smaller, the precision of our estimates increases). In our example $SE(\hat{\beta}_1) = 0.001$. We can look at the Z or t ratio: $z = \hat{\beta}_1 / SE(\hat{\beta}_1) = -0.018 / 0.001 = -18$, which yields a p-value < 0.001. This says that if there were no relationship in the population between hedonism and age, we would expect less than 0.1% of samples to produce a slope estimate of magnitude greater than 0.018. Note that the SE decreases with n, so that with large enough samples even a very small non-zero effect becomes statistically significant.

Hypothesis testing: confidence intervals Alternatively, but equivalently, we can construct a 95% CI for $\beta_1$: $\hat{\beta}_1 \pm 1.96 \times SE(\hat{\beta}_1)$, which with the rounded estimates above is roughly $-0.018 \pm 1.96 \times 0.001 = (-0.020, -0.016)$. Zero (the value of $\beta_1$ under the null hypothesis) is well outside the 95% confidence interval, so we reject the null hypothesis and conclude that the relationship is statistically significant at the 5% level. Note that -1.96 and +1.96 are the 2.5% and 97.5% points of a standard Normal distribution.
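
The z ratio, two-sided p-value and 95% confidence interval for the age slope can be reproduced from the rounded estimate and standard error quoted in these slides. This is a minimal sketch using Python's standard library; because the inputs are rounded, the results are approximate.

```python
from statistics import NormalDist

# Rounded values quoted in the slides.
estimate = -0.018   # slope of hedonism on age
std_error = 0.001   # standard error of the slope

std_normal = NormalDist()

# z ratio: how many standard errors the estimate lies from zero.
z = estimate / std_error

# Two-sided p-value: chance of a ratio at least this extreme if the
# population slope were zero.
p_value = 2 * (1 - std_normal.cdf(abs(z)))

# 97.5% point of the standard Normal distribution (about 1.96).
critical = std_normal.inv_cdf(0.975)
ci_low = estimate - critical * std_error
ci_high = estimate + critical * std_error
```

Both ends of the interval are negative, so the test and the interval lead to the same conclusion.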

Comparing groups: regression with a single categorical predictor Suppose we fit the regression model $y_i = \beta_0 + \beta_1 x_i + e_i$, where $y_i$ is the hedonism score of individual i, and $x_i = 1$ if the respondent is female and 0 if the respondent is male. The prediction for men is then $\hat{\beta}_0$ and the prediction for women is $\hat{\beta}_0 + \hat{\beta}_1$, so $\hat{\beta}_1$ is the difference between the female and male means. The estimated difference is -0.156 with standard error 0.025, giving a z-ratio of -0.156/0.025 = -6.24, so we would reject the null hypothesis that the male and female means are equal. The 95% CI for $\beta_1$ is (-0.206, -0.106).

Comparing groups with more than two categories We used two different parameterisations to estimate the two gender means. Generally, the first parameterisation, where the intercept is multiplied by a constant vector of 1’s, is preferred. This is because when we add multiple predictors to the model, interpretation of the coefficients is more straightforward. For every extra category in a predictor variable we need to include an extra indicator or dummy variable in our model. With an n-category variable we need to include n-1 indicator variables, in addition to the intercept term, to model the means of the n groups. For example, to model the three country means (UK, France and Germany) we can fit the model $y_i = \beta_0 + \beta_1\,\mathrm{Germany}_i + \beta_2\,\mathrm{France}_i + e_i$, with the UK as the reference category.

Country difference in hedonism For UK residents (Germany=0, France=0): Predicted hedonism = −0.384 + (0.256 × 0) + (0.492 × 0) = −0.384. For German residents (Germany=1, France=0): Predicted hedonism = −0.384 + (0.256 × 1) + (0.492 × 0) = −0.128. For French residents (Germany=0, France=1): Predicted hedonism = −0.384 + (0.256 × 0) + (0.492 × 1) = 0.108.
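
The three country predictions follow mechanically from the dummy coding. The sketch below reproduces the arithmetic in Python using the coefficients quoted on this slide; the function and variable names are illustrative choices, not MLwiN names.

```python
# Coefficients quoted in the slides (UK is the reference category).
COEF = {"intercept": -0.384, "germany": 0.256, "france": 0.492}


def dummies(country):
    """Indicator (dummy) variables for a three-category country variable."""
    return {
        "germany": int(country == "Germany"),
        "france": int(country == "France"),
    }


def predict(country):
    """Predicted hedonism: intercept plus each coefficient times its dummy."""
    d = dummies(country)
    return (
        COEF["intercept"]
        + COEF["germany"] * d["germany"]
        + COEF["france"] * d["france"]
    )


predictions = {c: round(predict(c), 3) for c in ("UK", "Germany", "France")}
```

Three categories need only two dummy variables because the intercept already carries the mean of the reference category.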

Hypothesis testing for categorical predictors with more than two groups What if we want to test the France/Germany difference? We could reparameterise the model so that Germany, instead of the UK, was the reference category. Or we could conduct a Wald test of the equality of the Germany and France coefficients.

More than one predictor variable: statistical control When modelling the effects of the country predictor variable we already entered multiple dummy or indicator variables. We can add multiple predictor variables to our model, where categorical predictor variables are handled by a set of dummy variables and continuous predictor variables are handled by including the variable directly. Once we include more than one predictor variable, our model can address the issue of statistical control: does the association of one predictor variable with the response persist when we simultaneously account for further predictor variables?

Example of statistical control with the hedonism data We have already seen that women are less hedonistic than men, and that hedonism decreases with age. However, women live longer than men, so some of the gender gap will be due to the fact that women are on average older than men. Some, but how much? We can answer this question by fitting age and gender in the same model. This will tell us whether the gender gap persists after controlling for age.

Modelling gender and age simultaneously The gender effect in the model where gender is the only predictor variable is -0.156. It changes little once age is included, so the gender effect persists strongly after controlling for age.

0+ 2priori School B 1 School A School A attainment attainment 1 0 School B Prior ability Prior ability Statistical control: another example Imagine attainment scores on two schools Fitting school as a single predictor But controlling for prior ability… School B has -ve effect School B has +ve effect

Interactions between predictor variables Recall our model with age and gender effects. It may be that the gender gap changes as a function of age or, equivalently, that the age slope is not the same for men and women. We can test for this by including an interaction between age and female as an extra explanatory variable in the model. We do this by including a variable that is the product of age and female.

Gender x Age interaction effects The results give a prediction line for males (female=0) of $-0.058 - 0.153 \times 0 - 0.019\,\mathrm{age}_i + 0.002 \times 0 \times \mathrm{age}_i = -0.058 - 0.019\,\mathrm{age}_i$ and for females (female=1) of $-0.058 - 0.153 \times 1 - 0.019\,\mathrm{age}_i + 0.002 \times 1 \times \mathrm{age}_i = (-0.058 - 0.153) + (-0.019 + 0.002)\,\mathrm{age}_i$. That is, females have an intercept 0.153 lower than males and a slope 0.002 greater than males. Note that the gender difference in the slopes, 0.002, has a z-ratio of 2, so it is just statistically significant at the 5% level.

Graphing the lines male: $-0.058 - 0.019\,\mathrm{age}_i$; female: $(-0.058 - 0.153) + (-0.019 + 0.002)\,\mathrm{age}_i$. The slightly flatter (less negative) slope for females means the gender gap decreases with age.
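
The two prediction lines and the gap between them can be tabulated directly. The sketch below is written in Python and takes the interaction coefficient as 0.002, the value quoted for the gender difference in slopes. It also assumes that age in this model is centred at 46, which is suggested by the later slide's "age-46" but is not stated explicitly for this model.

```python
# Coefficients quoted in the slides for the interaction model.
B_CONS = -0.058        # intercept (males at centred age 0)
B_FEMALE = -0.153      # female shift in the intercept
B_AGE = -0.019         # age slope for males
B_FEMALE_AGE = 0.002   # extra age slope for females


def predict(female, age_c):
    """Predicted hedonism for female (0 or 1) at centred age age_c."""
    return (
        B_CONS
        + B_FEMALE * female
        + B_AGE * age_c
        + B_FEMALE_AGE * female * age_c  # product (interaction) term
    )


def gender_gap(age_c):
    """Female minus male prediction at centred age age_c."""
    return predict(1, age_c) - predict(0, age_c)


# Gap at centred ages -30, 0 and 30 (ages 16, 46 and 76 if centred at 46).
gaps = {age_c: round(gender_gap(age_c), 3) for age_c in (-30, 0, 30)}
```

The gap shrinks in magnitude as age increases, which is what the flatter female line implies.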

Examining the gender gap Recall that the gender slope difference of 0.002 was only just statistically significant at the 5% level. We may want to know whether the gender gap remains statistically significant even at higher ages, when it is diminished. The gender gap (female minus male) is $-0.153 + 0.002\,\mathrm{age}_i$, where age is centred at 46. So we can plot this function with its associated confidence envelope and see for which ages the confidence interval does not include 0 (no gender gap).
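
The confidence envelope for the gap follows from the usual variance of a linear combination of estimates. In the sketch below, $\beta_1$ (the female coefficient), $\beta_3$ (the female by age interaction coefficient) and $a$ (centred age) are labels introduced here for illustration; the slides do not name them, and the variances and covariance would have to come from the fitted model.

```latex
\begin{align*}
\widehat{\mathrm{gap}}(a) &= \hat{\beta}_1 + \hat{\beta}_3\, a \\
\mathrm{Var}\!\left(\widehat{\mathrm{gap}}(a)\right)
  &= \mathrm{Var}(\hat{\beta}_1)
   + 2a\,\mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_3)
   + a^2\,\mathrm{Var}(\hat{\beta}_3) \\
95\%\ \text{CI} &:\quad
  \widehat{\mathrm{gap}}(a) \pm 1.96
  \sqrt{\mathrm{Var}\!\left(\widehat{\mathrm{gap}}(a)\right)}
\end{align*}
```

Because the variance depends on $a$, the envelope is not a constant-width band around the gap line.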

Graphing the gender gap with 95% CI The gender gap becomes statistically insignificant at age - 46 = 30, that is, at 76 years.
