Regression. Jennifer Kensler. Laboratory for Interdisciplinary Statistical Analysis. LISA helps VT researchers benefit from the use of Statistics. Experimental Design • Data Analysis • Interpreting Results Grant Proposals • Software (R, SAS, JMP, SPSS...). Walk-In Consulting
Regression Jennifer Kensler
Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers benefit from the use of Statistics • Experimental Design • Data Analysis • Interpreting Results • Grant Proposals • Software (R, SAS, JMP, SPSS...) Walk-In Consulting: Monday-Friday 12-2PM for questions requiring <30 mins. Collaboration: From our website request a meeting for personalized statistical advice. Great advice right now: Meet with LISA before collecting your data. Short Courses: Designed to help graduate students apply statistics in their research. All services are FREE for VT researchers. We assist with research, not class projects or homework. www.lisa.stat.vt.edu
Topics • Simple Linear Regression • Multiple Linear Regression • Regression with Categorical Variables
Simple Linear Regression • Simple Linear Regression (SLR) is used to model the relationship between two continuous variables. • Scatterplots are used to graphically examine the relationship between two quantitative variables. Sullivan (pg. 193)
Types of Relationships Between Two Continuous Variables • Positive and negative linear relationships
Types of Relationships Between Two Continuous Variables • Curved Relationship • No Relationship
Correlation • The Pearson Correlation Coefficient measures the strength of a linear relationship between two quantitative variables. The sample correlation coefficient is $r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$, where $\bar{x}$ and $\bar{y}$ are the sample means of the x and y variables respectively, and $s_x$ and $s_y$ are the sample standard deviations of the x and y variables respectively.
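As a minimal sketch, the sample correlation coefficient can be computed directly from the sample means and standard deviations. The function name `pearson_r` and the data are made up for illustration; they are not from the slides.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation: sum of products of z-scores, divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample standard deviations (divisor n - 1).
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    z_products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y))
    return sum(z_products) / (n - 1)

# Made-up illustrative data (hypothetical, not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))
```

A value of r between 0 and 1 here reflects a positive but imperfect linear relationship in the toy data.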
Properties of the Correlation Coefficient • Positive values of r indicate a positive linear relationship. • Negative values of r indicate a negative linear relationship. • Values close to +1 or -1 indicate a strong linear relationship. • Values close to 0 indicate little or no linear relationship between the variables (a curved relationship may still exist). • We only use r to discuss linear relationships between two variables. • Note: Correlation does not imply causation.
Simple Linear Regression Can we describe the behavior between the two variables with a linear equation? • The variable on the x-axis is often called the explanatory or predictor variable. • The variable on the y-axis is called the response variable.
Simple Linear Regression • Objectives of Simple Linear Regression • Determine the significance of the predictor variable in explaining variability in the response variable. • (e.g., Is per capita GDP useful in explaining the variability in life expectancy?) • Predict values of the response variable for given values of the explanatory variable. • (e.g., If we know the per capita GDP, can we predict life expectancy?) • Note: The predictor variable does not necessarily cause the response.
Simple Linear Regression Model • The Simple Linear Regression model is given by $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ where • $y_i$ is the response of the ith observation • $\beta_0$ is the y-intercept • $\beta_1$ is the slope • $x_i$ is the value of the predictor variable for the ith observation • $\varepsilon_i$ is the random error
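SLR Estimation of Parameters • The equation for the least-squares regression line is given by $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ where $\hat{y}$ is the predicted value of the response for a given value of x, and $\hat{\beta}_0$ and $\hat{\beta}_1$ are the least-squares estimates of the y-intercept and slope.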
The Residual • The residual is the observed value of y minus the predicted value of y. • The residual for observation i is given by $e_i = y_i - \hat{y}_i$
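The least-squares line and its residuals can be sketched in a few lines. This assumes the usual closed-form estimates (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x); the helper name `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up illustrative data (hypothetical, not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]                  # predicted values
residuals = [yi - fi for yi, fi in zip(y, fitted)]   # observed minus predicted
print(b0, b1)
print(residuals)
```

When the model includes an intercept, the least-squares residuals sum to zero (up to rounding), which is a quick sanity check on a fit.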
Simple Linear Regression Assumptions • Linearity • Observations are independent • Depends on how the data are collected. • Check by plotting residuals in the order in which the data were collected. • Constant variance • Check using a residual plot (plot residuals vs. $\hat{y}$). • The error terms are normally distributed. • Check by making a histogram or normal quantile plot of the residuals.
Diagnostics: Residual Plot • A residual plot is used to check the assumption of constant variance and to check model fit (is a line a good fit). • Good residual plot: no pattern
Diagnostics • Left: Residuals show non-constant variance. • Right: Residuals show non-linear pattern.
Diagnostics: Normal Quantile Plot • Left: Residuals are not normal • Right: Normality assumption appropriate
ANOVA Table for Simple Linear Regression The F-test tests whether there is a linear relationship between the two variables. Null Hypothesis: $H_0: \beta_1 = 0$ Alternative Hypothesis: $H_a: \beta_1 \neq 0$
Test for Parameters • Test whether the true y-intercept is different from 0. • Test whether the true slope is different from 0. • Note: For simple linear regression this test is equivalent to the overall F-test.
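The equivalence between the slope test and the overall F-test can be checked numerically: in simple linear regression the F statistic equals the square of the slope's t statistic. The sketch below assumes the standard sums-of-squares definitions (SSR, SSE, MSE = SSE / (n - 2)) and uses made-up data; it computes the test statistics only, not p-values.

```python
import math

# Made-up illustrative data (hypothetical, not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx            # slope estimate
b0 = y_bar - b1 * x_bar     # intercept estimate
fitted = [b0 + b1 * xi for xi in x]

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # error sum of squares
ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # regression sum of squares

mse = sse / (n - 2)                   # error mean square, n - 2 degrees of freedom
f_stat = (ssr / 1) / mse              # overall F statistic, 1 numerator df
t_stat = b1 / math.sqrt(mse / s_xx)   # t statistic for H0: slope = 0

print(f_stat, t_stat ** 2)
```

The two printed values agree up to rounding, which is why the slope test and the F-test lead to the same conclusion in the one-predictor case.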
Coefficient of Determination • The coefficient of determination, $R^2$, is the percent of variation in the response variable explained by the least squares regression line. • Note: $0 \le R^2 \le 1$ • We also have $R^2 = r^2$, the square of the sample correlation coefficient.
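A small numerical check of the coefficient of determination, assuming the standard definition as the regression sum of squares divided by the total sum of squares (equivalently, one minus SSE / SST). The data are made up for illustration.

```python
import math

# Made-up illustrative data (hypothetical, not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)   # total sum of squares (SST)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
r_squared = 1 - sse / s_yy            # share of variation explained by the line
r = s_xy / math.sqrt(s_xx * s_yy)     # sample correlation coefficient

print(r_squared, r ** 2)
```

For a single predictor, the two printed values coincide up to rounding.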
Muscle Mass Example • A nutritionist randomly selected 15 women from each ten-year age group beginning with age 40 and ending with age 79. The nutritionist recorded the age and muscle mass of each woman. The nutritionist would like to fit a model to explore the relationship between age and muscle mass. (Kutner et al. pg. 36)
JMP: Making a Scatterplot • To analyze the data click Analyze and then select Fit Y by X.
JMP: Making a Scatterplot • Assign the columns as follows. Y, Response: Muscle Mass. X, Factor: Age.
JMP: Scatterplot • This results in a scatter plot.
JMP: Simple Linear Regression • To perform the simple linear regression click on the Red Arrow and then select Fit Line.
Simple Linear Regression Results • The results on the right are displayed.
JMP: Diagnostics • Click on the Red Arrow next to Linear Fit and select Plot Residuals.
Diagnostic Plots • The plots to the right are then added to the JMP output.
Multiple Linear Regression • Similar to simple linear regression, except now there is more than one explanatory variable. • Body fat can be difficult to measure. A researcher would like to come up with a model that uses the more easily obtained measurements of triceps skinfold thickness, thigh circumference and midarm circumference to predict body fat. (Kutner et al. pg. 256)
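First Order Multiple Linear Regression Model • The multiple linear regression model with p-1 independent variables is given by $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i$ where • $\beta_0, \beta_1, \ldots, \beta_{p-1}$ are parameters • $x_{i1}, \ldots, x_{i,p-1}$ are known constants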
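One way to sketch fitting this model without any statistics package is to solve the normal equations (X'X) b = X'y, where X has a leading column of ones for the intercept. The helper names `solve` and `fit_mlr` and the data are hypothetical; in practice a package such as JMP or R does this internally with more numerically careful methods.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z

def fit_mlr(rows, y):
    """Least-squares estimates [b0, b1, ..., b_{p-1}] from the normal equations."""
    X = [[1.0] + list(row) for row in rows]        # prepend intercept column
    p = len(X[0])
    xtx = [[sum(xi[j] * xi[k] for xi in X) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(X, y)) for j in range(p)]
    return solve(xtx, xty)

# Made-up illustrative data with two predictors (hypothetical, not from the slides).
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [2.1, 4.4, 5.2, 7.4, 8.1, 10.6]

b = fit_mlr(rows, y)
fitted = [b[0] + b[1] * x1 + b[2] * x2 for x1, x2 in rows]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
print(b)
```

A useful check on any least-squares fit with an intercept is that the residuals sum to zero and are uncorrelated with each predictor column.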
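Multiple Linear Regression ANOVA Table The ANOVA F-test tests $H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0$ vs. $H_a$: not all $\beta_k$ ($k = 1, \ldots, p-1$) equal 0. Tests can also be performed for individual parameters (e.g., $H_0: \beta_k = 0$ vs. $H_a: \beta_k \neq 0$).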
Coefficient of Multiple Determination • The coefficient of multiple determination, $R^2$, is the percent of variation in the response y explained by the set of explanatory variables. • The adjusted coefficient of determination, $R^2_a$, introduces a penalty for more explanatory variables.
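The penalty can be made concrete with the usual formula for the adjusted coefficient, one minus (n - 1) / (n - p) times (1 - R squared), where n is the number of observations and p the number of parameters including the intercept. This formula is the standard textbook form and is assumed here, since the slide's own equation did not survive; the function name and inputs are hypothetical.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 for a model with p parameters (intercept included) and n observations."""
    if n <= p:
        raise ValueError("need more observations than parameters")
    return 1 - (1 - r_squared) * (n - 1) / (n - p)

# Hypothetical values for illustration: the same R^2 with more parameters
# gives a smaller adjusted value.
small_model = adjusted_r_squared(0.6, n=20, p=2)
large_model = adjusted_r_squared(0.6, n=20, p=5)
print(small_model, large_model)
```

Unlike the unadjusted coefficient, the adjusted value can decrease when a predictor that adds little explanatory power is included.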
Assumptions of Multiple Linear Regression • Observations are independent • Depends on how the data are collected (plot residuals in the order in which the data were collected). • Constant variance • Check using a residual plot (plot residuals vs. $\hat{y}$, plot residuals vs. each predictor variable). • The error terms are normally distributed. • Check by making a histogram or normal quantile plot of the residuals.
Commercial Rental Rates • A real estate company would like to build a model to help clients make decisions about properties. The company has information about rental rate (Y), age (X1), operating expenses and taxes (X2), vacancy rates (X3), and total square footage (X4). The information is regarding luxury real estate in a specific location. (Kutner et al. pg. 251)
JMP: Commercial Rental Rates • First, examine the data. Click Analyze, then Multivariate Methods, then Multivariate.
JMP: Scatterplot Matrix • For Y, Columns enter Y, X1, X2, X3 and X4. Then click OK.
JMP: Fitting The Regression Model • Click Analyze and then select Fit Model.
JMP: Fitting the Regression Model • Y: Y, Highlight X1, X2, X3 and X4 and click Add. Then click Run.
Fitting the Model • Examining the parameter estimates, we see that X3 is not significant. • Fit a new model, this time omitting X3.
JMP: Checking Assumptions • Included output • Need residuals: • Click the red arrow next to Y Response → Save Columns → Residuals
JMP: Check Normality Assumption • Analyze → Distribution → Y, Columns: Residual Y • Click the red arrow next to Distribution Residual Y and select Normal Quantile Plot.
JMP: Checking Residuals vs. Independent Variables • Analyze → Fit Y by X → Y, Columns: Residual Y; X, Factor: X1, X2, X4
Other Multiple Linear Regression Issues • Outliers • Higher Order Terms • Interaction Terms • Multicollinearity • Model Selection
Regression with Categorical Variables • Sometimes there are categorical explanatory variables that we would like to incorporate into our model. • Suppose we would like to model the profit or loss of banks last year based on bank size and type of bank (commercial, mutual savings, or savings and loan). (Kutner et al. pg. 340)
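A common way to bring a categorical variable into a regression model is to code it as 0/1 indicator variables, one for each level except a chosen reference level. The sketch below illustrates that coding for the bank-type example; the function name, the choice of reference level, and the sample values are assumptions made for illustration, not taken from the slides.

```python
def indicator_columns(values, levels, reference):
    """Return a dict mapping each non-reference level to its 0/1 indicator column."""
    kept = [level for level in levels if level != reference]
    return {level: [1 if v == level else 0 for v in values] for level in kept}

# Hypothetical bank types for four banks (made-up, not from the slides).
bank_type = ["commercial", "mutual savings", "savings and loan", "commercial"]
levels = ["commercial", "mutual savings", "savings and loan"]

# Three levels need only two indicators; "savings and loan" is the reference here.
columns = indicator_columns(bank_type, levels, reference="savings and loan")
for level, column in columns.items():
    print(level, column)
```

With this coding, each indicator's coefficient is interpreted as a difference from the reference level, holding the other predictors (such as bank size) fixed.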