Quantitative data analysis. Module in research methods course for tourism program. Reza Mortazavi, 2014. Lecture 4.
Relationship between variables • Correlation • When two variables are linearly related (or covary) we say they are correlated either positively or negatively. • Correlation is not causation! • Open the file PLdata.dta • sum incd1000 age • scatter incd1000 age, yline(121.6) xline(22.9)
Correlation • Interpret the scatterplot. • twoway scatter incd1000 age, by(female) • Correlation coefficient • Measures the strength of the linear association between two variables. It lies in the interval [-1, 1]. • pwcorr incd1000 age, sig • H0: no correlation. • pwcorr incd1000 female, sig
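To make the definition concrete, here is a minimal sketch in Python (not Stata) of the Pearson correlation coefficient, the quantity that pwcorr reports. The function name pearson_r and the toy data are invented for this illustration and have nothing to do with PLdata.dta.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariation scaled by the spread of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data (assumption, not from the course data set)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))
```

Because the covariation is divided by the spread of both variables, the result is unit-free and always falls between -1 and 1.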
Correlation • pwcorr incd1000 education, sig • pwcorr incd1000 totexpday age, sig star(0.05) • Interpret the output! • gen neg5incd = -5*incd1000 • What do we expect in terms of correlation between neg5incd and incd1000? • scatter neg5incd incd1000 • pwcorr neg5incd incd1000, sig
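The neg5incd question can be checked with a small Python sketch (Python rather than Stata, with made-up income values). A variable that is an exact negative multiple of another should have a correlation of -1 with it, whatever the size of the multiplier.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical incomes in thousands (assumption, not values from PLdata.dta)
incd1000 = [80.0, 95.5, 120.0, 150.25, 210.0]
# Same construction as on the slide: multiply by -5
neg5incd = [-5 * v for v in incd1000]

r_neg = pearson_r(neg5incd, incd1000)
print(r_neg)  # -1 up to floating-point rounding
```

The factor 5 changes the units only; the sign flip is what turns a perfect positive relationship into a perfect negative one.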
Correlation • gen x=rnormal() • gen y=rnormal() • What do we expect? (two independent variables have been drawn randomly…) • pwcorr x y,sig
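A rough Python analogue of this experiment is sketched below. The seed and the sample size are arbitrary choices made for the illustration. For two independent draws the sample correlation should be close to zero but almost never exactly zero, which is why a significance test is reported alongside it.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(12345)   # arbitrary seed so the run is reproducible
n = 2000             # arbitrary sample size for the illustration

# Two standard normal variables drawn independently of each other
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]

r = pearson_r(x, y)
# t statistic for H0: no correlation, with n - 2 degrees of freedom
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(r, t)
```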
Correlation • Zero correlation does not mean independence • gen seq = int((_n-_N/2)) • gen seqsq=seq^2 • scatter seqsq seq • pwcorr seqsq seq
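The same point can be sketched in Python. As a simplifying assumption, the sequence below runs exactly symmetrically from -50 to 50, whereas the Stata expression on the slide is only approximately centred on zero, depending on the number of observations. The square is completely determined by the sequence, yet the linear correlation vanishes.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Symmetric integer sequence (assumption: exact symmetry, for a clean result)
seq = list(range(-50, 51))
# seqsq depends perfectly on seq, but through a U-shaped (nonlinear) relation
seqsq = [s ** 2 for s in seq]

r = pearson_r(seqsq, seq)
print(r)  # zero up to floating-point rounding
```

Positive and negative values of seq contribute equal and opposite cross-products, so they cancel. A scatterplot shows the parabola at once, while the coefficient alone would suggest no relationship.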
Caution • Correlation does not imply causation. • Statistical significance is not the same as practical significance. • Use common sense when interpreting and drawing conclusions. • Correlation is about linear association. • Use a scatterplot to detect possible nonlinear association.
Some details • Normally distributed data are assumed. • The correlation coefficient is sensitive to outliers (extreme values). • Sometimes a transformation (e.g. logarithmic) of non-normally distributed data yields approximately normal data. • Non-normal data may be converted into ordinal (ranked) data, and a non-parametric test, Spearman's rank correlation, may be used.
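The rank-based idea can be sketched in Python as follows: replace each value by its rank and compute the ordinary correlation of the ranks. The helper names ranks and spearman_rho and the data are invented for this example, and the ranking assumes there are no tied values. In Stata the corresponding command is spearman.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank 1 for the smallest value up to n for the largest (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: y grows exponentially in x, so the last value is extreme
x = [0.5, 1.0, 2.0, 3.0, 4.0, 6.0]
y = [math.exp(v) for v in x]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(r, rho)
```

The relationship is perfectly monotone, so the rank correlation is 1, while the Pearson coefficient is pulled below 1 by the nonlinearity and the extreme value.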
Regression analysis • Note that the purpose is not to go into all the details of regression analysis. Even though a couple of slides contain algebraic expressions, the exposition is not intended to be technical. • The purpose is, however, to cover the basics so that you can run your own regression analysis using software and present, interpret and discuss the results.
Purpose of Regression Analysis 1. Estimate a relationship among some variables, such as y = f(x). Here y is the dependent and x is the independent variable. For example, food consumption or tourism demand depends on income. 2. Forecast or predict the value of one variable, y, based on the value of another variable, x.
Terminology • Y is called the dependent variable, response variable, explained variable, output variable or regressand. • The X's are called independent variables, predictor variables, explanatory variables, input variables or regressors. • A model is an abstraction from reality. It is a simplified representation that focuses on some features while ignoring details.
Weekly food expenditure • y = dollars spent each week on food items. • x = consumer's weekly income. • The relationship between x and the expected value of y, given x, might be linear: $E(y|x) = \beta_1 + \beta_2 x$
[Figure: probability distributions of food expenditure, $f(y|x=480)$ and $f(y|x=800)$, with means $\mu_{y|x=480}$ and $\mu_{y|x=800}$, given weekly incomes of 480 and 800 dollars.]
[Figure: average expenditure $E(y|x) = \beta_1 + \beta_2 x$ plotted against income x, showing a linear relationship between average expenditure on food and income. The intercept is $\beta_1$ and the slope is $\beta_2 = \Delta E(y|x) / \Delta x$.]
The population parameters $\beta_1$ and $\beta_2$ are unknown population constants. The formulas that produce the sample estimates $b_1$ and $b_2$ are called the estimators of $\beta_1$ and $\beta_2$. When $b_1$ and $b_2$ denote the formulas rather than specific values, they are random variables, because their values differ from sample to sample.
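The idea that estimators vary from sample to sample can be illustrated with a small simulation in Python. The slide does not state the formulas; the sketch assumes the usual least-squares estimators, which is what regress computes. The "population" values BETA1, BETA2 and SIGMA, the seed and the sample sizes are all invented for the illustration.

```python
import random

def ols(x, y):
    """Least-squares intercept b1 and slope b2 for the line y = b1 + b2*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b2 = sxy / sxx
    b1 = mean_y - b2 * mean_x
    return b1, b2

# Hypothetical population parameters (assumption, chosen only for this sketch)
BETA1, BETA2, SIGMA = 40.0, 0.13, 20.0

random.seed(2014)  # arbitrary seed so the run is reproducible
slopes = []
for _ in range(200):                      # 200 independent samples
    x = [random.uniform(200, 1000) for _ in range(50)]
    y = [BETA1 + BETA2 * xi + random.gauss(0, SIGMA) for xi in x]
    b1, b2 = ols(x, y)                    # a new estimate for every sample
    slopes.append(b2)

mean_slope = sum(slopes) / len(slopes)
print(min(slopes), max(slopes), mean_slope)
```

Each sample gives a different slope estimate, but the estimates scatter around the fixed population value BETA2.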
Simple regression: an example • twoway (scatter totexpday incd1000) (lfit totexpday incd1000) • regress totexpday incd1000 • What is the "intercept" here? What does it mean? • What is the "slope" here? What does it mean? • Interpret your estimated model!
Simple regression: an example • In interpreting the results you have to be careful about the units of measurement. • regress totexpday inccont • What is the "intercept" here? What does it mean? • What is the "slope" here? What does it mean? Compare with the previous model.
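The effect of units can be sketched in Python with made-up numbers. The sketch assumes that the second income variable measures the same income in SEK rather than in thousands of SEK; the slides do not define inccont, so this is an assumption made for illustration only.

```python
def ols(x, y):
    """Least-squares intercept b1 and slope b2 for the line y = b1 + b2*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b2 = sxy / sxx
    b1 = mean_y - b2 * mean_x
    return b1, b2

# Hypothetical data (assumption, not values from PLdata.dta)
income_thousand = [100.0, 150.0, 200.0, 250.0, 300.0]   # thousands of SEK
spending = [520.0, 560.0, 640.0, 650.0, 730.0]          # SEK per day
# The same incomes expressed in SEK instead of thousands of SEK
income_sek = [1000 * v for v in income_thousand]

b1_k, b2_k = ols(income_thousand, spending)
b1_s, b2_s = ols(income_sek, spending)

print(b1_k, b2_k)   # slope: change in spending per 1000 SEK of income
print(b1_s, b2_s)   # slope: change in spending per 1 SEK of income
```

Under this assumption the intercept is unchanged and the slope shrinks by a factor of 1000. The fitted relationship is the same, but the slope must be read per unit of the regressor as measured.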
Simple regression: an example • Hypothesis tests: • Is income (statistically) significantly related to visitors' expenditures? • The output table gives us several ways to answer this question. • How good is our model? • R-squared • R-squared = 0.0575 in our example. How can we interpret this number?
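R-squared is the share of the sample variation in the dependent variable that the fitted line accounts for, so 0.0575 corresponds to about 5.75 percent. The Python sketch below computes it for made-up data and checks that, in a simple regression, it equals the squared correlation coefficient. The data and variable names are assumptions for the illustration.

```python
def ols(x, y):
    """Least-squares intercept b1 and slope b2 for the line y = b1 + b2*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b2 = sxy / sxx
    b1 = mean_y - b2 * mean_x
    return b1, b2

# Hypothetical data (assumption, not values from PLdata.dta)
x = [100.0, 150.0, 200.0, 250.0, 300.0, 350.0]
y = [700.0, 420.0, 640.0, 510.0, 830.0, 600.0]

b1, b2 = ols(x, y)
fitted = [b1 + b2 * xi for xi in x]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
# Total variation of y around its mean
sst = sum((yi - mean_y) ** 2 for yi in y)
# Variation left unexplained by the fitted line (sum of squared residuals)
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
r_squared = 1 - ssr / sst

# Squared Pearson correlation between x and y, for comparison
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
corr_squared = sxy ** 2 / (sxx * sst)

print(r_squared, corr_squared)
```

A low R-squared does not by itself mean that the slope is insignificant; it means that most of the variation in the dependent variable is left unexplained by this single regressor.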
Simple regression: an example • Can we predict totexpday for, say, an average person earning 200,000 SEK per year? Income is measured in thousands of SEK, so: 411.123 + 1.03526*200 = 618.18. This is a point estimate (prediction). We can also calculate, say, a 95% confidence (prediction) interval. 95% PI: (570.1205, 666.2293)
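The arithmetic on this slide can be reproduced with a few lines of Python. The coefficients and interval bounds are copied from the slide; the variable names are invented for the sketch.

```python
# Estimated coefficients as reported on the slide
intercept = 411.123
slope = 1.03526            # change in totexpday per 1000 SEK of income

# 200,000 SEK per year expressed in the regressor's units (thousands of SEK)
income_thousand = 200000 / 1000

point_prediction = intercept + slope * income_thousand
print(round(point_prediction, 3))

# Interval bounds as reported on the slide
lower, upper = 570.1205, 666.2293
midpoint = (lower + upper) / 2
half_width = (upper - lower) / 2
print(round(midpoint, 3), round(half_width, 3))
```

The interval is centred on the point prediction, so it can be read as the prediction plus or minus the half-width.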
Exercises on simple regression • regr incd1000 age • regr incd1000 education • Interpret the results!