Model Development and Selection of Variables. Animal Science 500, Lecture No. 17, October 28, 2010
Using PROC COMPARE
• PROC COMPARE compares two SAS data sets with each other.
• It warns you if it detects observations (rows) or variables (columns) that do not agree across the two data sets.
• When there are no disagreements, you can be confident that data entry is reliable.
• To use PROC COMPARE, enter your data twice, once into each of two separate raw data files.
• Next, use the two raw data files to create two SAS data sets.
• Then use PROC COMPARE.
Using PROC COMPARE
• Example: The following example compares the two SAS data sets named PIG1 and PIG12.
    PROC COMPARE BASE = PIG1 COMPARE = PIG12 ERROR ;
       ID subjctid ;
    RUN ;
• The BASE keyword defines the data set that SAS will use as the basis for comparison.
• The COMPARE keyword defines the data set that SAS will compare with the base data set.
• The ERROR keyword requests that SAS print an error message to the SAS log if it discovers any differences when it compares the two data sets.
Using PROC COMPARE
• The ID statement tells SAS to compare rows (observations) in the data sets by the identifying variable, which here is named SUBJCTID. This variable must have a unique value for each case.
• PROC COMPARE features a number of options, many of which are designed to control the amount and type of information displayed in the listing file.
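The double-entry check described above can be sketched end to end as follows. The data values are made up for illustration (three pigs, with one weight keyed differently in the second file), and the variable name weight is an assumption; only PIG1, PIG12, and SUBJCTID come from the slides. The sorts are included because PROC COMPARE with an ID statement expects both data sets to be in ID order.

```sas
/* Hypothetical first data entry */
data pig1;
   input subjctid weight;
   datalines;
1 102.5
2 98.0
3 110.2
;
run;

/* Hypothetical second data entry; subject 2 was keyed as 89.0 instead of 98.0 */
data pig12;
   input subjctid weight;
   datalines;
1 102.5
2 89.0
3 110.2
;
run;

/* Put both data sets in ID order before comparing */
proc sort data=pig1;  by subjctid; run;
proc sort data=pig12; by subjctid; run;

/* ERROR writes an error message to the SAS log when differences are found */
proc compare base=pig1 compare=pig12 error;
   id subjctid;
run;
```

With these made-up values, the listing should flag the weight for subject 2 as the only disagreement, which is the record to recheck against the original data sheet.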
Class Statement
• Variables included in the CLASS statement are referred to as class variables.
• The CLASS statement specifies the variables whose values define the subgroup combinations for the analysis.
• Class variables represent the various levels of some factor or effect:
  • Treatment (1, …, n)
  • Season (spring, summer, fall, and winter, coded 1 through 4)
  • Breed
  • Color
  • Sex
  • Line
  • Day
  • Laboratory
Class Variables
• Are usually things you would like to account for in your model
• Can be numeric or character
• Can be continuous values
• They are generally not used in regression analyses
  • What meaning would they have?
Class Statement Options
• Ascending: sorts the class variable in ascending order
• Descending: sorts the class variable in descending order
Other options of the CLASS statement are generally specific to the procedure (PROC) being used, so they are not all covered here.
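As a small illustration of a CLASS statement with one of these options, the sketch below uses PROC MEANS, one procedure whose CLASS statement accepts ASCENDING and DESCENDING. The data set pigs_demo, its variables, and its values are hypothetical and serve only to make the step runnable.

```sas
/* Hypothetical data: breed is a character class variable, adg is the response */
data pigs_demo;
   input breed $ sex $ adg;
   datalines;
Duroc     M 0.92
Duroc     F 0.88
Yorkshire M 0.95
Yorkshire F 0.90
Landrace  M 0.91
Landrace  F 0.87
;
run;

/* Subgroup means by breed, with the breed levels listed in descending order */
proc means data=pigs_demo n mean;
   class breed / descending;
   var adg;
run;
```

Other procedures accept CLASS with different options, so the option list should be checked in the documentation for the specific PROC.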
Discrete Variables
• A discrete variable is one that cannot take on all values within the limits of the variable.
• It is limited to whole numbers.
• For example, responses to a five-point rating scale can only take on the values 1, 2, 3, 4, and 5.
• The variable cannot have the value 1.7. By contrast, a continuous variable such as a person's height can take on any value within its range.
Discrete variables are of two types:
• unorderable (also called nominal variables)
• orderable (also called ordinal variables)
Discrete Variables
• These data are sometimes called categorical, because each observation falls into one of a number of categories. For example:
• Any trait where you score the value
• Lameness scores
• Body condition scores
• Soundness scoring
• Reproductive
• Feet and leg
• Behavioral traits
• Fear test
• Back test
• Vocal scores
• Body lesion scores
Discrete Variables
• When do discrete variables become continuous, or do they?
• Is a trait like number born alive considered discrete or continuous?
Model Development and Selection of Variables
Example: The general problem addressed is to identify the important soil characteristics influencing aerial biomass production of the marsh grass Spartina alterniflora.
Assumptions of the Linear Regression Model
• Linear functional form
• Fixed independent variables
• Independent observations
• Representative sample and proper specification of the model (no omitted variables)
• Normality of the residuals or errors
• Equality of variance of the errors (homogeneity of residual variance)
• No multicollinearity
• No autocorrelation of the errors
• No outlier distortion
Explanation of the Assumptions
• Linear functional form
  • The model does not detect curvilinear relationships
• Independent observations
  • Representative sample from some larger population
  • If the observations are not independent, the resulting autocorrelation inflates the t, r, and F statistics, which in turn distorts the significance tests
• Normality of the residuals
  • Permits proper significance testing, similar to ANOVA and other statistical procedures
• Equal variance (no heterogeneous variance)
  • Heteroskedasticity precludes generalization and external validity
  • This too distorts the significance tests being used
• Multicollinearity (many of the traits exhibit collinearity)
  • Biases parameter estimation
  • Can prevent the analysis from running or converging (getting your answers)
• Outliers
  • Severe or numerous outliers will distort the results and may bias them
  • If outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates
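Several of these assumptions can be screened with standard PROC REG output. The sketch below is one possible way to do it, not the lecture's own code: the data are simulated (reg_sim, x1, x2, and y are hypothetical names), with x2 built to be collinear with x1 so that the variance inflation factors have something to show.

```sas
/* Simulated data: x2 is deliberately collinear with x1 */
data reg_sim;
   call streaminit(500);
   do obs = 1 to 40;
      x1 = rand('normal', 10, 2);
      x2 = 0.9*x1 + rand('normal', 0, 0.5);
      y  = 3 + 1.5*x1 + rand('normal', 0, 1);
      output;
   end;
run;

/* VIF screens for multicollinearity; DW is the Durbin-Watson test for autocorrelation */
proc reg data=reg_sim;
   model y = x1 x2 / vif dw;
   /* Save residuals, predicted values, studentized residuals, and Cook's D */
   output out=diag r=resid p=pred student=rstud cookd=cooksd;
run;
quit;

/* Normality tests on the residuals */
proc univariate data=diag normal;
   var resid;
run;
```

Plotting resid against pred from the diag data set is a common way to look for curvature or unequal variance, and large values of cooksd point to influential observations.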
Example Data Origination (Dr. P. J. Berger)
Data: The data were published as an exercise by Rawlings (1988) and originally appeared in a study by Dr. Rick Linthurst, North Carolina State University (1979). The purpose of his research was to identify the important soil characteristics influencing aerial biomass production of the marsh grass Spartina alterniflora in the Cape Fear Estuary of North Carolina. The data were collected from three types of Spartina vegetation in each of three locations, with five random sites within each location-by-vegetation-type combination (3 × 3 × 5 = 45 sites).
Example Variables
Data: The dependent variable (what is being measured) is aerial biomass. There are five substrate measurements, which are the independent variables:
• Salinity
• Acidity
• Potassium
• Sodium
• Zinc
Example Data
• Objective:
  • Find the substrate variable, or combination of variables, showing the strongest relationship to biomass. Or,
  • From the list of five independent variables (salinity, acidity, potassium, sodium, and zinc), find the combination of one or more variables that has the strongest relationship with aerial biomass.
  • Find the independent variables that can be used to predict aerial biomass.
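One way to approach this objective in SAS is an all-subsets search in PROC REG. The sketch below is only an illustration under stated assumptions: the data are simulated, not the Linthurst measurements, and the names marsh_sim, sal, ph, k, na, and zn are hypothetical stand-ins for salinity, acidity, potassium, sodium, and zinc.

```sas
/* Simulated stand-in for the 45 sites; these are NOT the published data */
data marsh_sim;
   call streaminit(2010);
   do site = 1 to 45;
      sal = rand('normal', 30, 4);
      ph  = rand('normal', 5, 1);
      k   = rand('normal', 800, 200);
      na  = rand('normal', 16000, 4000);
      zn  = rand('normal', 17, 8);
      /* In this simulation only ph and k truly affect biomass */
      biomass = 400*ph - 0.3*k + rand('normal', 0, 300);
      output;
   end;
run;

/* For each subset size, report the 3 best models by R-square,
   along with adjusted R-square and Mallows' Cp */
proc reg data=marsh_sim;
   model biomass = sal ph k na zn / selection=rsquare adjrsq cp best=3;
run;
quit;
```

Replacing selection=rsquare with selection=stepwise, forward, or backward gives the sequential selection methods; comparing their answers is a reasonable check on how stable the chosen subset is.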
Definition of Mixed Models by their component effects
• Mixed models contain both fixed and random effects.
• Fixed effects: factors for which the only levels under consideration are contained in the coding of those effects.
• Random effects: factors for which the levels contained in the coding of those factors are a random sample of the total number of levels in the population for that factor.
Examples of Fixed and Random Effects
• Fixed effect:
  • Sex: both males and females are included in the factor sex.
  • Breed: purebred versus crossbred, or Angus, Hereford, and Charolais, are examples of levels that would be included in the factor breed.
• Random effect:
  • Subject: the sample is a random sample of the target population.
Defining a fixed or random factor
(From D. A. Dickey, 2008, SAS Global Forum)
Classification of effects
• There are main effects: linear explanatory factors.
• There are interaction effects: joint effects over and above the component main effects.
• There are nested effects. Hierarchical designs contain nested effects: animals may be nested within treatment, which might be nested within farm.
• Such effects may be fixed or random. Their classification depends on the experimental design.
Classification of effects
• Between-subjects effects distinguish subjects that are in one group or another, but not in both.
• Experimental group is a fixed effect because the manager is considering only those groups in his experiment.
• One group is the experimental group and the other is the control group. Therefore, this grouping factor is a between-subjects effect.
• Within-subject effects are experienced by subjects repeatedly over time.
Classification of effects
• Trial is a random effect when there are several trials in the repeated measures design; all subjects experience all of the trials.
• Trial is therefore a within-subject effect.
• For example, the operator of a scanning machine may be a fixed or a random effect, depending on whether one is generalizing beyond the sample.
• If ultrasound scanner operator is a random effect, then the machine*operator interaction is a random effect.
• There are contrasts: these contrast the values of one level with those of other levels of the same effect.
Classification of Effects cont'd
• Hierarchical designs have nested effects.
• Nested effects are those with subjects within groups.
• An example would be pens of animals nested within barns, and barns nested within farms.
• SAS expresses nesting of effects by:
  • pen(barn)
  • barn(farm)
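The nesting notation above can be seen in context in a small mixed model. This is a minimal sketch on simulated data, assuming a fixed treatment effect and random farm and barn-within-farm effects; the data set nested_sim and the variables gain, trt, farm_eff, and barn_eff are hypothetical. Each pen contributes one record, so the pen-within-barn variation ends up in the residual.

```sas
/* Simulated hierarchy: 3 farms, 2 barns per farm, 4 pens per barn */
data nested_sim;
   call streaminit(17);
   do farm = 1 to 3;
      farm_eff = rand('normal', 0, 2);        /* random farm effect */
      do barn = 1 to 2;
         barn_eff = rand('normal', 0, 1);     /* random barn-within-farm effect */
         do pen = 1 to 4;
            trt = mod(pen, 2) + 1;            /* alternate treatments 1 and 2 across pens */
            gain = 100 + 5*(trt = 2) + farm_eff + barn_eff + rand('normal', 0, 3);
            output;
         end;
      end;
   end;
run;

/* trt is fixed; farm and barn(farm) are random */
proc mixed data=nested_sim;
   class farm barn trt;
   model gain = trt;
   random farm barn(farm);
run;
```

With so few farms and barns, the variance component estimates will be imprecise and may be estimated as zero; the point here is only the syntax of fixed, random, and nested terms.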
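Interactions case
• If an interaction term were included, the formula would be
  $y_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ij}$
• The interaction or crossed effect is the joint effect, over and above the individual main effects. Therefore, the main effects must be in the model for the interaction to be properly specified.
  $(\alpha\beta)_{ij} = (\mu_{ij} - \mu) - (\mu_{i\cdot} - \mu) - (\mu_{\cdot j} - \mu) = \mu_{ij} - \mu_{i\cdot} - \mu_{\cdot j} + \mu$
  Here $\mu_{ij}$ is the mean of cell $ij$, and $\mu_{i\cdot}$ and $\mu_{\cdot j}$ are the marginal means for level $i$ of the first factor and level $j$ of the second, so that $\alpha_i = \mu_{i\cdot} - \mu$ and $\beta_j = \mu_{\cdot j} - \mu$.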
Higher Order Interactions
• If 3-way interactions are in the model, then the main effects and all lower order interactions must be in the model for the 3-way interaction to be properly specified. For example, a 3-way interaction model would be:
  $y_{ijk} = \mu + a_i + b_j + c_k + (ab)_{ij} + (ac)_{ik} + (bc)_{jk} + (abc)_{ijk} + e_{ijk}$
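In SAS, the bar operator is a convenient way to respect this hierarchy, because a|b|c expands to all main effects, all 2-way interactions, and the 3-way interaction. The sketch below uses simulated data with hypothetical names (fact_sim, a, b, c, y) and builds in only the a and b main effects and an a*b interaction.

```sas
/* Simulated 2 x 2 x 2 factorial with 3 replicates per cell */
data fact_sim;
   call streaminit(28);
   do a = 1 to 2;
      do b = 1 to 2;
         do c = 1 to 2;
            do rep = 1 to 3;
               /* (a=2) and (b=2) evaluate to 1 or 0, so the product is the a*b interaction */
               y = 10 + 2*(a=2) + 1*(b=2) + 1.5*(a=2)*(b=2) + rand('normal', 0, 1);
               output;
            end;
         end;
      end;
   end;
run;

/* a|b|c is shorthand for: a b a*b c a*c b*c a*b*c */
proc glm data=fact_sim;
   class a b c;
   model y = a|b|c;
run;
quit;
```

Writing the model as a|b|c rather than listing a*b*c alone keeps every lower order term in the model, which is the requirement stated in the slide.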