Model Development and Selection of Variables
Animal Science 500, Lecture No. 11
October 7, 2010
Class Statement
• Variables included in the CLASS statement are referred to as class variables.
• The CLASS statement specifies the variables whose values define the subgroup combinations for the analysis.
• Class variables represent the various levels of some factor or effect, for example:
  • Treatment (1, …, n)
  • Season (spring, summer, fall, and winter, coded 1 through 4)
  • Breed
  • Color
  • Sex
  • Line
  • Day
  • Laboratory
Class Variables
• Are usually things you would like to account for in your model
• Can be numeric or character
• Can be continuous values
• They are generally not used in regression analyses
  • What meaning would they have?
Class Statement Options
• ASCENDING sorts the class variable in ascending order
• DESCENDING sorts the class variable in descending order
Other options of the CLASS statement generally depend on the procedure (PROC) being used, so they are not all covered here.
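As a rough illustration of the CLASS statement and its ordering options, the sketch below uses PROC MEANS, whose CLASS statement accepts ASCENDING or DESCENDING after a slash. The data set name `pigs` and the variables `breed`, `season`, and `adg` are hypothetical and stand in for whatever class and response variables are at hand.

```sas
/* Hypothetical data set PIGS: BREED and SEASON are class variables,
   ADG (average daily gain) is a continuous response. */
proc means data=pigs n mean std;
  class breed season / descending;  /* subgroups are listed in descending order of the class levels */
  var adg;                          /* statistics are computed within each breed-by-season subgroup */
run;
```

Each combination of `breed` and `season` defines one subgroup, which is what is meant by class variables defining the subgroup combinations for the analysis.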
Discrete Variables
• A discrete variable is one that cannot take on all values within the limits of the variable.
  • Limited to whole numbers
  • For example, responses to a five-point rating scale can only take on the values 1, 2, 3, 4, and 5.
  • The variable cannot have the value 1.7.
• In contrast, a continuous variable such as a person's height can take on any value within its range.
• Discrete variables are also of two types:
  • unorderable (also called nominal variables)
  • orderable (also called ordinal variables)
Discrete Variables
• Data are sometimes called categorical, as the observations may fall into one of a number of categories, for example:
  • Any trait where you score the value
  • Lameness scores
  • Body condition scores
  • Soundness scoring
    • Reproductive
    • Feet and leg
  • Behavioral traits
    • Fear test
    • Back test
    • Vocal scores
  • Body lesion scores
Discrete Variables
• When do discrete variables become continuous, or do they?
• Is a trait like number born alive considered discrete or continuous?
Model Development and Selection of Variables
Example: The general problem addressed is to identify the important soil characteristics influencing aerial biomass production of the marsh grass Spartina alterniflora.
Assumptions of the Linear Regression Model
• Linear functional form
• Fixed independent variables
• Independent observations
• Representative sample and proper specification of the model (no omitted variables)
• Normality of the residuals or errors
• Equality of variance of the errors (homogeneity of residual variance)
• No multicollinearity
• No autocorrelation of the errors
• No outlier distortion
Explanation of the Assumptions
• Linear functional form
  • Linear regression does not detect curvilinear relationships
• Independent observations
  • Representative sample from some larger population
  • If the observations are not independent, the resulting autocorrelation inflates the t, r, and F statistics, which in turn distorts the significance tests
• Normality of the residuals
  • Permits proper significance testing, as in ANOVA and other statistical procedures
• Equal variance (no heterogeneous variance)
  • Heteroskedasticity precludes generalization and external validity
  • This too distorts the significance tests being used
• Multicollinearity (many of the traits exhibit collinearity)
  • Makes the parameter estimates unstable and inflates their standard errors
  • Can prevent the analysis from running or converging (getting your answers)
• Outliers
  • Severe or numerous outliers will distort the results and may bias them
  • If outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates
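Several of these assumptions can be checked from within PROC REG. The following is a minimal sketch, not a complete diagnostic workflow; the data set `mydata` and the variables `y`, `x1`, `x2`, and `x3` are hypothetical placeholders.

```sas
/* Hypothetical data set MYDATA with response Y and predictors X1-X3. */
proc reg data=mydata;
  model y = x1 x2 x3 / vif         /* variance inflation factors: screen for multicollinearity */
                       dw          /* Durbin-Watson statistic: screen for autocorrelated errors */
                       influence;  /* influence diagnostics: flag high-influence observations */
  output out=diag r=resid p=pred;  /* save residuals and predicted values for further checks */
run;
quit;

/* Normality tests and plots for the saved residuals */
proc univariate data=diag normal plot;
  var resid;
run;
```

Plotting `resid` against `pred` from the `diag` data set is one common way to look for curvature (nonlinear form) or a fan shape (unequal variance).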
Example Data Origination (Dr. P. J. Berger)
Data: The data were published as an exercise by Rawlings (1988) and originally appeared in a study by Dr. Rick Linthurst, North Carolina State University (1979). The purpose of his research was to identify the important soil characteristics influencing aerial biomass production of the marsh grass Spartina alterniflora in the Cape Fear Estuary of North Carolina. The data were collected from three types of Spartina vegetation in each of three locations, with five random sites within each location-vegetation type (3 × 3 × 5 = 45 sites).
Example Variables
Data: The dependent variable (what is being measured) is aerial biomass. There are five substrate measurements, which are the independent variables:
• Salinity
• Acidity
• Potassium
• Sodium
• Zinc
Example Data
• Objective:
  • Find the substrate variable, or combination of variables, showing the strongest relationship to biomass. Or,
  • From the list of five independent variables (salinity, acidity, potassium, sodium, and zinc), find the combination of one or more variables that has the strongest relationship with aerial biomass.
  • Find the independent variables that can be used to predict aerial biomass.
Example analysis
• The REG Procedure
INTRODUCTION
• The REG procedure fits linear regression models by least squares.
SPECIFICATIONS
  PROC REG;
    MODEL dependents = regressors / options;
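Applied to the marsh grass example, the general form might look like the sketch below. The data set name `marsh` and the variable names `bio`, `sal`, `ph`, `k`, `na`, and `zn` are assumptions chosen for illustration; substitute the names used in your own copy of the data.

```sas
/* Hypothetical data set MARSH, one row per sampling site.
   BIO = aerial biomass; SAL, PH, K, NA, ZN = the five substrate measurements. */
proc reg data=marsh;
  model bio = sal ph k na zn;  /* full model: biomass regressed on all five predictors */
run;
quit;
```

This fits the full five-variable model, which is a natural starting point before any variable selection is attempted.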
Example analysis
• The RSQUARE Procedure
RECALL
• The RSQUARE procedure selects optimal subsets of independent variables in a multiple regression analysis.
Example analysis
  PROC RSQUARE options;
    MODEL dependents = independents / options;
(Options can appear in either the PROC RSQUARE statement or any MODEL statement.)
• SELECT=n specifies the maximum number of subset models
• INCLUDE=i requests that the first i variables after the equal sign be included in every regression
• SIGMA=n specifies the true standard deviation of the error term
• ADJRSQ computes R² adjusted for degrees of freedom
• CP computes Mallows' Cp statistic
Example analysis
  PROC RSQUARE DATA=name OUTEST=EST ADJRSQ MSE CP SELECT=n;
    MODEL dependent = variable list;
Example analysis
  PROC PRINT DATA=EST;
  PROC PLOT;
    PLOT _CP_*_P_ = 'C' _P_*_P_ = 'P' / OVERLAY;
    PLOT _MSE_*_P_ = 'M';
  RUN;
  QUIT;
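In SAS releases where RSQUARE is not available as a separate procedure, the same all-subsets search can be requested through the SELECTION=RSQUARE option of PROC REG. The sketch below assumes the hypothetical data set `marsh` and variable names `bio`, `sal`, `ph`, `k`, `na`, and `zn`; the choice BEST=3 is also only illustrative.

```sas
/* Hypothetical data set MARSH; variable names are placeholders. */
proc reg data=marsh outest=est;
  model bio = sal ph k na zn / selection=rsquare  /* evaluate subsets of every size */
                               adjrsq cp mse      /* also report adjusted R-square, Mallows' Cp, and MSE */
                               best=3;            /* keep only the 3 best subsets of each size */
run;
quit;

/* List the subset models saved by OUTEST= */
proc print data=est;
run;
```

The `est` data set should then hold one row per subset model, which can be plotted against the number of parameters in the same way as the PROC PLOT step above.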
PROC STEPWISE
The STEPWISE procedure provides five methods for stepwise regression.
General form:
  PROC STEPWISE;
    MODEL dependents = independents / options;
  RUN;
  QUIT;
** Assumes that you have at least one dependent variable and two or more independent variables. With only one independent variable you are just doing a simple regression of y on x (or x on y).
Types of Regression
• Uses of PROC REG for standard problems:
  PROC REG; /* simple linear regression */
    model y = x;
  PROC REG; /* weighted linear regression */
    model y = x;
    weight w;
  PROC REG; /* multiple regression */
    model y = x1 x2 x3;
PROC REG
General form:
  PROC REG;
    MODEL dependents = independents / options;
Options available include:
• NOINT: regression with no intercept
• FORWARD (requested in PROC REG as SELECTION=FORWARD)
  • A forward selection analysis starts out with no predictors in the model.
  • Each predictor chosen by the user is evaluated to see how much the R² would increase if it were added to the model.
  • The predictor that increases the R² the most is added if it meets the statistical condition for entry. In SAS the statistical condition is the significance level for the increase in the R² produced by adding the predictor.
  • If no predictor meets the condition, the analysis stops.
  • If a predictor is added, the second step re-evaluates all of the available predictors that have not yet entered the model. If any satisfy the statistical condition for entry, the one that increases the R² the most is added.
  • This process continues until no remaining predictor can enter.
PROC REG
General form:
  PROC REG;
    MODEL dependents = independents / options;
Options available include:
• BACKWARD (requested in PROC REG as SELECTION=BACKWARD)
  • A backward elimination analysis starts out with all of the predictors in the model.
  • At each step the predictors in the model are evaluated, and any that meet the criterion for removal are eliminated.
• STEPWISE (requested in PROC REG as SELECTION=STEPWISE)
  • Stepwise selection begins like forward selection. However, at each step the variables already in the model are first evaluated for removal.
  • Those meeting the removal criterion are evaluated to see which would lower the R² the least if removed.
  • How can a variable enter and then leave later? If two predictors both enter the model, one may later be removed because they are highly correlated, so removing one changes the R² very little, if at all.
PROC REG
General form:
  PROC REG;
    MODEL dependents = independents / options;
Options available include:
• MAXR (requested in PROC REG as SELECTION=MAXR)
  • The maximum R² option does not settle on a single model. Instead, it tries to find the "best" one-variable model, the "best" two-variable model, and so forth.
  • MAXR starts out by finding the single-variable model producing the greatest R².
  • It then adds the variable that increases the R² the most, giving a two-variable model.
  • At each model size, MAXR also considers switching a variable in the model with one outside it, and makes the switch that raises the R² the most.
  • It continues this process until adding another variable is no better than the previous model (for example, adding the 4th variable did not significantly improve the R² compared to the 3-variable model).
  • The difference between the STEPWISE and MAXR options is that in the MAXR method all switches are evaluated before any switch is made. With the STEPWISE option, the "worst" variable may be removed without considering what adding the "best" remaining variable might accomplish.
PROC REG
General form:
  PROC REG;
    MODEL dependents = independents / options;
Options available include:
• MINR (requested in PROC REG as SELECTION=MINR)
  • The MINR option closely resembles the MAXR method. However, the switch chosen with the MINR option is the switch that produces the smallest increase in R².
  • In a way, it approaches the "best" model in reverse compared to MAXR.
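The selection methods described on the preceding slides can be compared side by side. In current PROC REG syntax each method is named through the SELECTION= option of the MODEL statement; the sketch below assumes the hypothetical data set `marsh` and variable names `bio`, `sal`, `ph`, `k`, `na`, and `zn`, and the MODEL labels (`fwd`, `bwd`, and so on) are arbitrary.

```sas
/* Hypothetical data set MARSH; one labeled MODEL statement per selection method. */
proc reg data=marsh;
  fwd:   model bio = sal ph k na zn / selection=forward;   /* start empty, add predictors */
  bwd:   model bio = sal ph k na zn / selection=backward;  /* start full, remove predictors */
  step:  model bio = sal ph k na zn / selection=stepwise;  /* add, then re-check for removal */
  maxr2: model bio = sal ph k na zn / selection=maxr;      /* best-R-square switches per model size */
  minr2: model bio = sal ph k na zn / selection=minr;      /* smallest-increase switches per model size */
run;
quit;
```

Running all five on the same data is one way to see whether the methods agree on the chosen predictors; disagreement often points to correlated predictors.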
PROC REG
General form:
  PROC REG;
    MODEL dependents = independents / options;
Options available include:
• SLE=value (SLENTRY=value)
  • Sets the criterion for entry into the model. The user defines it as the significance level that the change (Δ) in R² from adding a predictor must meet for the predictor to enter.
• SLS=value (SLSTAY=value)
  • Sets the criterion for staying in the model. The user defines it as the significance level that a predictor's contribution to the R² must meet for the predictor to stay in the model.
PROC REG
• The default significance level differs by selection method unless it is changed by the user.
• The defaults are:
  • BACKWARD = 0.10 (level to stay)
  • FORWARD = 0.50 (level to enter)
  • STEPWISE = 0.15 (level to enter and level to stay)
• The user can set the level with the SLENTRY= and SLSTAY= options, for example / SLSTAY=.05.
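To avoid relying on defaults, the entry and stay levels can be written out explicitly. This is a minimal sketch using SLENTRY= and SLSTAY= (the long forms of SLE= and SLS=); the data set `marsh`, the variable names, and the particular levels 0.15 and 0.05 are assumptions for illustration only.

```sas
/* Hypothetical data set MARSH; significance levels chosen only as an example. */
proc reg data=marsh;
  model bio = sal ph k na zn / selection=stepwise
                               slentry=0.15   /* a predictor must reach this level to enter */
                               slstay=0.05;   /* a predictor must reach this level to stay */
run;
quit;
```

A stay level stricter than the entry level, as here, means a variable can enter at one step and be dropped at a later step.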
Significance Tests for the Regression Coefficients
• The significance of the parameter estimates is found with the F or t test (shown in a couple of slides).
• R² (R-Square) is the proportion of variation in the dependent variable (Y) that can be explained by the predictors (X variables) in the regression model.
• Adjusted R²
  • Predictors could be added to the model that would continue to improve the ability of the predictors to explain the dependent variable. Some of the improvement in the R-Square would be due simply to chance variation.
  • The adjusted R-Square attempts to give a more honest estimate of R-Square:
    Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1)
    where R² = the unadjusted R², n = the number of observations, and p = the number of predictors
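The adjustment can be worked through by hand in a short DATA step. The values below are hypothetical except for n = 45, which matches the 3 × 3 × 5 sampling design of the marsh grass example; the unadjusted R² of 0.68 and the choice of two predictors are made up for illustration.

```sas
/* Worked arithmetic for the adjusted R-square formula (hypothetical inputs). */
data _null_;
  r2 = 0.68;   /* hypothetical unadjusted R-square */
  n  = 45;     /* number of observations */
  p  = 2;      /* number of predictors (intercept not counted) */
  adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1);
  put adj_r2= 6.4;  /* write the result to the SAS log */
run;
```

With these inputs the adjusted value works out to about 0.665, slightly below the unadjusted 0.68; the penalty grows as more predictors are added for a fixed n.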
Significance Tests for the Regression Coefficients
• Mallows' Cp statistic
  Cp = SSE/σ² + 2p - n
  where
  SSE = the error sum of squares
  σ² = the estimate of pure error variance, from the SIGMA= option or from fitting the full model
  p = the number of parameters, including the intercept
  n = the number of observations
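The Cp formula can be checked the same way. All inputs below are hypothetical except n = 45 from the sampling design; note that p here counts the intercept, so a two-predictor subset has p = 3.

```sas
/* Worked arithmetic for Mallows' Cp (hypothetical inputs). */
data _null_;
  sse    = 6800000;  /* hypothetical error sum of squares of the subset model */
  sigma2 = 160000;   /* hypothetical error variance estimate from the full model */
  p      = 3;        /* parameters in the subset model, including the intercept */
  n      = 45;       /* number of observations */
  cp = sse / sigma2 + 2 * p - n;
  put cp= 8.2;       /* write the result to the SAS log */
run;
```

These numbers give Cp = 42.5 + 6 - 45 = 3.5. A subset whose Cp is close to its own p is usually read as showing little bias from the omitted variables.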