
Presentation and Data • http://www.lisa.stat.vt.edu • Short Courses • Intro to SAS • Download Data to Desktop


Presentation Transcript


  1. Presentation and Data • http://www.lisa.stat.vt.edu • Short Courses • Intro to SAS • Download Data to Desktop

  2. Introduction to SAS Part 2 Mark Seiss, Dept. of Statistics

  3. Reference Material • The Little SAS Book – Delwiche and Slaughter • SAS Programming I: Essentials • SAS Programming II: Manipulating Data with the DATA Step • Presentation and Data • http://www.lisa.stat.vt.edu

  4. Presentation Outline
     Part 1: 1. Introduction to the SAS Environment 2. Working With SAS Data Sets
     Part 2: 1. Summary Procedures 2. Basic Statistical Analysis Procedures

  5. Presentation Outline • Questions/Comments • Individual Goals/Interests

  6. Summary Procedures • Print Procedure • Plot Procedure • Univariate Procedure • Means Procedure • Freq Procedure

  7. Print Procedure
     PROC PRINT prints data to the output window. By default, it prints all observations and variables in the SAS data set.
     General form: PROC PRINT DATA=input_data_set <options>; <optional SAS statements>; RUN;
     Some options:
     • input_data_set (OBS=n) – specifies the number of observations to print
     • NOOBS – suppresses printing of the observation number
     • LABEL – prints variable labels instead of variable names

  8. Print Procedure
     Optional SAS statements:
     • BY variable1 variable2 variable3; – starts a new section of output for every new value of the BY variables
     • ID variable1 variable2 variable3; – prints the ID variables on the left-hand side of the page and suppresses the observation numbers
     • SUM variable1 variable2 variable3; – prints the sum of the listed variables at the bottom of the output
     • VAR variable1 variable2 variable3; – prints only the listed variables
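
The options and statements above can be combined in a single step. The following is a minimal sketch, assuming work.state_data contains the variables state, region, expend, and total used in the later exercises. LABEL only changes the output if the variables have labels attached.

```sas
/* Sketch: print the first 10 observations of selected variables. */
/* ID puts state in the leftmost column and replaces the Obs column. */
proc print data=work.state_data (obs=10) label;
  id state;
  var region expend total;
run;
```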

  9. Print Procedure
     Assignment: Use PROC PRINT to print the state variable separately for each region.
     Note: All procedures for the remainder of the course will be run on the data set work.state_data.

  10. Print Procedure
     Solution: proc sort data=state_data; by region; run; proc print data=state_data; var state; by region; run;

  11. Plot Procedure
     PROC PLOT is used to create basic scatter plots of the data. Use PROC GPLOT or PROC SGPLOT for more sophisticated plots.
     General form: PROC PLOT DATA=input_data_set; PLOT vertical_variable * horizontal_variable / <options>; RUN;
     By default, SAS uses letters to mark points on plots: A for a single observation, B for two observations at the same point, etc.
     To specify a different character to represent a point: PLOT vertical_variable * horizontal_variable = '*';

  12. Plot Procedure
     To specify a third variable to use to mark points: PLOT vertical_variable * horizontal_variable = third_variable;
     To plot more than one variable on the vertical axis: PLOT vertical_variable1 * horizontal_variable = '2' vertical_variable2 * horizontal_variable = '1' / OVERLAY;
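
As an illustration of the OVERLAY form, here is a hedged sketch that plots both SAT component scores against expenditure on one set of axes. It assumes work.state_data holds math, verbal, and expend, as the exercises suggest. The plotting characters 'M' and 'V' are arbitrary choices.

```sas
/* Sketch: two vertical variables against one horizontal variable. */
/* 'M' marks math points and 'V' marks verbal points on the same plot. */
proc plot data=work.state_data;
  plot math*expend='M' verbal*expend='V' / overlay;
run;
```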

  13. Plot Procedure
     Assignment: Use PROC PLOT to plot SAT Verbal scores versus SAT Math scores. Use the value of the region variable to mark points.

  14. Plot Procedure
     Solution: proc plot data=state_data; plot math*verbal=region; run;

  15. Univariate Procedure
     PROC UNIVARIATE is used to examine the distribution of data. It produces summary statistics for a single variable, including mean, median, mode, standard deviation, skewness, kurtosis, quantiles, etc.
     General form: PROC UNIVARIATE DATA=input_data_set <options>; VAR variable1 variable2 variable3; RUN;
     If the VAR statement is not used, summary statistics are produced for all numeric variables in the input data set.

  16. Univariate Procedure
     Options include:
     • PLOT – produces a stem-and-leaf plot, box plot, and normal probability plot
     • NORMAL – produces tests of normality

  17. Univariate Procedure
     Assignment: Use PROC UNIVARIATE to produce a normal probability plot and test the normality of the SAT Total variable and the Expenditure variable.

  18. Univariate Procedure
     Solution: proc univariate data=state_data normal plot; var expend total; run;

  19. Means Procedure
     Similar to the Univariate procedure.
     General form: PROC MEANS DATA=input_data_set <options>; <optional SAS statements>; RUN;
     With no options or optional SAS statements, PROC MEANS prints the number of non-missing values, mean, standard deviation, minimum, and maximum for all numeric variables in the input data set.

  20. Means Procedure
     Options – Statistics Available
     Note: By default, confidence limits are computed at the 95% confidence level (ALPHA=0.05). Use the ALPHA= option to specify a different level.

  21. Means Procedure
     Optional SAS statements:
     • VAR variable1 variable2; – specifies the numeric variables for which statistics are produced
     • BY variable1 variable2; – calculates statistics for each combination of the BY variables
     • OUTPUT OUT=output_data_set; – creates a data set containing the default statistics
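
The sketch below shows how ALPHA= and the OUTPUT statement might be used together. It assumes the variables total and expend exist in work.state_data. The output data set name work.means_out is an arbitrary, hypothetical choice.

```sas
/* Sketch: request N, mean, and 90% confidence limits (ALPHA=0.10). */
/* OUTPUT without statistic keywords saves the default statistics. */
proc means data=work.state_data n mean clm alpha=0.10;
  var total expend;
  output out=work.means_out;
run;
```

Setting ALPHA=0.10 gives 90% limits. The printed statistics follow the keywords on the PROC MEANS statement, while the saved data set holds the default set.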

  22. Means Procedure
     Assignment: Use PROC MEANS to calculate the mean and variance of the expenditure variable for each region.

  23. Means Procedure
     Solution: proc sort data=state_data; by region; run; proc means data=state_data mean var; var expend; by region; run;

  24. FREQ Procedure
     PROC FREQ is used to generate frequency tables. Its most common use is to create a table showing the distribution of categorical variables.
     General form: PROC FREQ DATA=input_data_set; TABLE variable1*variable2*variable3 / <options>; RUN;
     Options:
     • LIST – prints cross tabulations in list format rather than as a grid
     • MISSING – specifies that missing values should be included in the tabulations
     • OUT=output_data_set – creates a data set containing the frequencies, in list format
     • NOPRINT – suppresses printing in the output window
     Use a BY statement to get percentages within each category of a variable.
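
A short sketch of the TABLE options above, assuming work.state_data contains region and the indicator upper_ind used in the GENMOD exercise. The output data set name work.region_counts is hypothetical.

```sas
/* Sketch: two-way table in list format, counting missing values, */
/* with the frequencies also saved to a data set. */
proc freq data=work.state_data;
  table region*upper_ind / list missing out=work.region_counts;
run;
```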

  25. FREQ Procedure
     Assignment: Use PROC FREQ to find the number of states within each region.

  26. FREQ Procedure
     Solution: proc freq data=state_data; table region; run;

  27. Summary Procedures • Questions/Comments

  28. Statistical Analysis Procedures • Correlation – PROC CORR • Regression – PROC REG • Analysis of Variance – PROC ANOVA • Chi-square Test of Association – PROC FREQ • Generalized Linear Models – PROC GENMOD

  29. CORR Procedure
     PROC CORR is used to calculate the correlations between variables. The correlation coefficient measures the linear relationship between two variables. Values range from -1 to 1.
     • Negative correlation – as one variable increases, the other decreases
     • Positive correlation – as one variable increases, the other increases
     • 0 – no linear relationship between the two variables
     • 1 – perfect positive linear relationship
     • -1 – perfect negative linear relationship
     General form: PROC CORR DATA=input_data_set <options>; VAR variable1 variable2; WITH variable3; RUN;

  30. CORR Procedure
     If the VAR and WITH statements are not used, correlations are computed for all pairs of numeric variables.
     Options include:
     • SPEARMAN – computes Spearman's rank correlations
     • KENDALL – computes Kendall's tau coefficients
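
To illustrate how VAR and WITH restrict the pairs that are computed, here is a minimal sketch. It assumes total, expend, and salary are numeric variables in work.state_data. The choice of Kendall's tau is only for illustration.

```sas
/* Sketch: correlate total with expend and with salary only, */
/* reporting Kendall's tau instead of the default Pearson coefficient. */
proc corr data=work.state_data kendall;
  var total;
  with expend salary;
run;
```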

  31. CORR Procedure
     Question: What is the correlation between the SAT Total variable and the Expenditure variable? Is it significant? Based on previous exercises, which correlation coefficient should we use?
     Assignment: Use PROC CORR to find the correlation between the SAT Total variable and the Expenditure variable.

  32. CORR Procedure
     Solution, if the normality assumption is valid: proc corr data=state_data; var total expend; run;
     Solution, if the normality assumption is not valid: proc corr data=state_data spearman; var total expend; run;

  33. REG Procedure
     PROC REG is used to fit linear regression models by least squares estimation. It is one of many SAS procedures that can perform regression analysis. It accepts only continuous independent variables (use GENMOD for categorical variables).
     General form: PROC REG DATA=input_data_set <options>; MODEL dependent=independent1 independent2 / <options>; <optional statements>; RUN;
     PROC REG statement options include:
     • PCOMIT=m – performs principal component estimation with m principal components
     • CORR – displays the correlation matrix for the variables in the model

  34. REG Procedure
     MODEL statement options include:
     • SELECTION= – specifies that a model selection procedure be conducted: FORWARD, BACKWARD, or STEPWISE
     • ADJRSQ – computes the adjusted R-square
     • MSE – computes the mean square error
     • COLLIN – performs collinearity analysis
     • CLB – computes confidence limits for the parameter estimates
     • ALPHA= – sets the significance level for confidence and prediction intervals and tests

  35. REG Procedure
     Optional statements include: PLOT dependent*independent1; – generates a plot of the data
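
The PROC REG and MODEL options can be combined as in the sketch below, which assumes total, expend, and salary are numeric variables in work.state_data. The two predictors are picked only for illustration and are not the course's chosen model.

```sas
/* Sketch: fit total on two predictors, print the correlation matrix, */
/* collinearity diagnostics, and 90% confidence limits for the estimates. */
proc reg data=work.state_data corr;
  model total = expend salary / collin clb alpha=0.10;
run;
quit;
```

PROC REG is interactive, so the QUIT statement ends the procedure after the step runs.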

  36. REG Procedure
     Assignment: Use PROC REG to generate a multiple linear regression model.
     • Dependent variable – SAT Total (total)
     • Use stepwise selection
     • Possible independent variables: average pupil to teacher ratio (PT_ratio), current expenditure per pupil (expend), estimated annual salary of teachers (salary), percentage of eligible students taking the SAT (students)

  37. REG Procedure
     Solution: proc reg data=state_data; model total=pt_ratio expend salary students / selection=stepwise; run;

  38. ANOVA Procedure
     PROC ANOVA performs analysis of variance. It is designed for balanced data (PROC GLM is used for unbalanced data) and can handle nested and crossed effects and repeated measures.
     General form: PROC ANOVA DATA=input_data_set <options>; CLASS independent1 independent2; MODEL dependent=independent1 independent2; <optional statements>; RUN;
     The CLASS statement defines the classification variables and must come before the MODEL statement.

  39. ANOVA Procedure
     Useful PROC ANOVA statement option: OUTSTAT=output_data_set – generates an output data set that contains sums of squares, degrees of freedom, statistics, and p-values for each effect in the model.
     Useful optional statement: MEANS independent1 / <comparison type>; – performs multiple comparisons analysis. Set <comparison type> to:
     • TUKEY – Tukey's studentized range test
     • BON – Bonferroni t test
     • T – pairwise t tests
     • DUNCAN – Duncan's multiple-range test
     • SCHEFFE – Scheffe's multiple comparison procedure
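
As a hedged illustration of OUTSTAT= and the MEANS statement, the sketch below compares verbal scores across regions with Bonferroni comparisons. It assumes verbal and region exist in work.state_data. The data set name work.anova_stats is hypothetical.

```sas
/* Sketch: one-way ANOVA of verbal by region. */
/* OUTSTAT= saves sums of squares, df, F statistics, and p-values. */
proc anova data=work.state_data outstat=work.anova_stats;
  class region;
  model verbal = region;
  means region / bon;
run;
quit;
```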

  40. ANOVA Procedure
     Question: Are there significant differences between the Math SAT scores of students from different regions? If there are significant differences, which regions are different?
     Assignment: Use PROC ANOVA to determine if there are significant differences in the Math SAT variable between regions. Perform multiple comparisons between regions using Tukey's adjustment.

  41. ANOVA Procedure
     Solution: proc anova data=state_data; class region; model math=region; means region / tukey; run;

  42. FREQ Procedure
     PROC FREQ can also be used to perform analysis with categorical data.
     General form: PROC FREQ DATA=input_data_set; TABLE variable1*variable2 / <options>; RUN;
     TABLE statement options include:
     • AGREE – tests and measures of classification agreement, including McNemar's test, Bowker's test, Cochran's Q test, and Kappa statistics
     • CHISQ – chi-square test of homogeneity and measures of association
     • MEASURES – measures of association, including Pearson and Spearman correlation, gamma, Kendall's tau, Stuart's tau, Somers' D, lambda, odds ratios, risk ratios, and confidence intervals
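
A minimal sketch of a chi-square test of association, assuming region and the indicator upper_ind are both present in work.state_data. With a small number of states per cell, the chi-square approximation may be poor, so the output should be read with care.

```sas
/* Sketch: cross-tabulate region by upper_ind and request */
/* chi-square tests plus measures of association. */
proc freq data=work.state_data;
  table region*upper_ind / chisq measures;
run;
```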

  43. GENMOD Procedure
     PROC GENMOD is used to fit generalized linear models, in which the response is not necessarily normal. Logistic and Poisson regression are examples of generalized linear models.
     General form: PROC GENMOD DATA=input_data_set; CLASS independent1; MODEL dependent = independent1 independent2 / DIST=<option> LINK=<option>; RUN;

  44. GENMOD Procedure
     • DIST= – specifies the distribution of the response variable
     • LINK= – specifies the link function from the linear predictor to the mean of the response
     Example – Logistic regression: DIST=binomial LINK=logit
     Example – Poisson regression: DIST=poisson LINK=log
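
The Poisson case can be sketched as follows. The course data set has no obvious count response, so the data set work.claims and the variables n_claims, region, and exposure_years are hypothetical names used only for illustration.

```sas
/* Sketch with hypothetical data: work.claims, n_claims, and */
/* exposure_years are illustrative names, not part of the course data. */
proc genmod data=work.claims;
  class region;
  model n_claims = region exposure_years / dist=poisson link=log;
run;
```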

  45. GENMOD Procedure
     Question: How do we model the probability of having a high total SAT average based on other variables in the data set? Is the dependent variable normal, or does it have a different distribution? What link function would you specify?
     Assignment: Use PROC GENMOD to perform logistic regression on the work.state_data data set.
     • Dependent variable – upper_ind
     • Independent variables: average pupil to teacher ratio (PT_ratio), current expenditure per pupil (expend), estimated annual salary of teachers (salary), percentage of eligible students taking the SAT (students), region (region)

  46. GENMOD Procedure
     Solution: proc genmod data=state_data descending; class region; model upper_ind=pt_ratio expend salary students / dist=bin link=logit; run;

  47. Statistical Analysis Procedures • Questions/Comments

  48. Attendee Questions • If time permits
