SAS Workshop
Data Analysis Using SAS
Hun Myoung Park, Ph.D.
University Information Technology Services
Center for Statistical and Mathematical Computing
Wednesday, April 2, 2014
© 2009-2010 The Trustees of Indiana University
http://www.indiana.edu/~statmath
statmath@indiana.edu
(812) 855-4740

Outline
• Descriptive Statistics
• Chi-Square Test
• Measure of Association
• T-TEST
• Analysis of Variance
• Correlation Analysis
• Ordinary Least Squares (OLS)
• Binary Logit and Probit Models
• Panel Data Models

OUTPUT DELIVERY SYSTEM
• ODS controls SAS output (format, styles, etc.).
• The HTML format is especially useful for data conversion and graphics.
    ODS HTML FILE='c:\temp\test.html';
    PROC …;
    …;
    PROC …;
    ODS HTML CLOSE;
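
The ODS pattern above could be filled in as follows. This is only a minimal sketch: it assumes the sashelp.class sample data set that ships with SAS (it is not one of the workshop's sm data sets) and reuses the file path from the slide, which must be writable on your machine.

```sas
/* Open an HTML destination; the path is the one shown on the slide */
ODS HTML FILE='c:\temp\test.html';

/* Any procedure output produced here is written to the HTML file.  */
/* sashelp.class is an assumed stand-in for the workshop data sets. */
PROC MEANS DATA=sashelp.class N MEAN STD;
   VAR height weight;
RUN;

/* Close the destination so the file is completed and released */
ODS HTML CLOSE;
```

Everything between ODS HTML FILE= and ODS HTML CLOSE is captured, so several PROC steps can share one output file.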

DESCRIPTIVE STATISTICS
• You MUST describe and examine data sets of interest carefully before conducting analyses.
• PROC REPORT
• PROC SUMMARY
• PROC UNIVARIATE
• PROC MEANS
• PROC FREQ
• PROC TABULATE
• PROC PLOT
• PROC CHART

PROC REPORT
• Provides contents and summary statistics of data sets in many flexible ways.
    PROC REPORT DATA=sm.airline NOWD HEADLINE HEADSKIP;
       COLUMN airline year cost output fuel;
       DEFINE airline / ORDER;
       DEFINE year / ORDER;
       DEFINE cost / ANALYSIS MEAN;
       DEFINE output / ANALYSIS MEAN;
       DEFINE fuel / ANALYSIS MEAN;
       BREAK AFTER airline / OL SUMMARIZE SKIP;
       RBREAK AFTER / DOL SUMMARIZE;
    RUN;

PROC SUMMARY
• Provides descriptive statistics of variables.
    PROC SUMMARY DATA=sm.cancer PRINT;
       VAR cigar bladder lung kidney;
    RUN;

PROC MEANS
• Like PROC SUMMARY, this procedure provides various descriptive statistics.
    PROC MEANS DATA=sm.grade7;
       VAR stat math;
    PROC MEANS DATA=sm.grade7 N SUM MEAN VAR;
       VAR stat math;
• Conducts a one-sample t-test (H0: mean = 0) with the T option.
    PROC MEANS DATA=sm.grade7 T STD STDERR;
       VAR stat math;
    RUN;

PROC UNIVARIATE
• Provides various descriptive statistics.
    PROC UNIVARIATE DATA=sm.airline;
       VAR cost output;
    RUN;
• Conducts a normality test and a one-sample t-test.
    PROC UNIVARIATE DATA=sm.airline NORMAL PLOT;
       VAR cost;
    RUN;

PROC UNIVARIATE (Q-Q)
• Provides Q-Q plots.
    PROC UNIVARIATE DATA=sm.airline;
       VAR cost;
       QQPLOT cost / NORMAL;
    RUN;
• PROC CAPABILITY provides P-P plots as well.
    PROC CAPABILITY DATA=sm.airline NORMAL;
       VAR cost;
       QQPLOT cost / NORMAL;
       PPPLOT cost / NORMAL;
    RUN;

PROC FREQ
• Produces frequency tables of the variables listed.
    PROC FREQ DATA=sm.airline;
       TABLES airline year;
• Produces contingency tables (cross-tables) when variables are joined with *.
    PROC FREQ DATA=sm.cancer;
       TABLES area*smoke / NOROW;
    RUN;
• NOROW, NOCOL, and NOPERCENT suppress the row, column, and overall percentages in each cell, respectively.

PROC TABULATE
• PROC TABULATE produces various statistics in table form.
• TABULATE can control formats and table layout in a sophisticated way.
• Useful when summarizing and examining data sets.
    PROC TABULATE DATA=sm.airline F=12.3;
       CLASS airline;
       VAR cost;
       TABLE airline,cost*(N MEAN STD)*(F=9.2);
       LABEL cost='Cost of Airline';
    RUN;

PROC PLOT
• Produces a plot of two variables.
    PROC PLOT DATA=sm.cancer;
       PLOT lung*cigar;
    RUN;
• Overlays two plots, each with its own plotting symbol.
    PROC PLOT DATA=sm.cancer;
       PLOT lung*cigar='%'
            kidney*cigar='*' / OVERLAY;
    RUN;

PROC CHART
• Produces various vertical and horizontal charts with many options.
    PROC CHART DATA=sm.cancer;
       HBAR cigar / TYPE=PERCENT;
    PROC CHART DATA=sm.cancer;
       VBAR lung / GROUP=smoke TYPE=MEAN;
    PROC CHART DATA=sm.cancer;
       BLOCK area / GROUP=smoke TYPE=MEAN SUMVAR=lung NOHEADER SYMBOL='X';
    RUN;

CHI-SQUARE TEST 1
• The chi-square test examines whether two variables are independent.
• PROC FREQ conducts the chi-square test with the CHISQ option of the TABLES statement.
    PROC FREQ DATA=sm.cancer;
       TABLES area*smoke / CHISQ;
• The expected frequency of each cell should be greater than 5; otherwise, the chi-square test is not reliable. The EXPECTED option prints the expected frequencies.
    PROC FREQ DATA=sm.cancer;
       TABLES area*smoke / CHISQ EXPECTED;
    RUN;

CHI-SQUARE TEST 2
• A measure of association tells the strength of the relationship.
• The MEASURES option is needed.
    PROC FREQ DATA=sm.cancer;
       TABLES area*smoke / CHISQ MEASURES;
    RUN;
• If both variables are ordinal, read gamma (ranges from -1 to 1).
• Otherwise (at least one variable is nominal), read lambda (ranges from 0 to 1).

T-TEST 1
• The t-test compares group means.

T-TEST 2
• The one-sample t-test examines whether the mean of a variable equals 0 or a hypothesized constant.
• Use PROC TTEST, UNIVARIATE, and MEANS.
    TITLE2 'One Sample T-Test';
    PROC TTEST H0=20 ALPHA=.01 DATA=sm.cancer;
       VAR lung;
    PROC UNIVARIATE MU0=20 VARDEF=DF NORMAL ALPHA=.01 DATA=sm.cancer;
       VAR lung;
    RUN;

T-TEST 3
• Use the PAIRED statement for paired t-tests.
• Data should be arranged in the wide format (one variable per measurement).
    PROC TTEST DATA=sm.cancer;
       PAIRED lung*kidney;
    RUN;
• Use operators (* and :) to request several pairs at once. With *, each variable on the left is paired with each variable on the right, so the example below tests lung-kidney and lung-bladder.
    PROC TTEST H0=3 DATA=sm.cancer;
       PAIRED (lung)*(kidney bladder);
    RUN;
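
To make the wide-format requirement concrete, here is a small sketch. The data set work.wide, the variables before and after, and all values are made up for illustration and are not part of the workshop data.

```sas
/* Hypothetical wide-format data: one row per subject,              */
/* one column per measurement occasion (values are invented).       */
DATA work.wide;
   INPUT id before after;
   DATALINES;
1 20.1 18.4
2 22.3 21.0
3 19.8 19.9
4 24.5 22.7
5 21.2 20.1
;
RUN;

/* The paired t-test analyzes the within-subject difference before - after */
PROC TTEST DATA=work.wide;
   PAIRED before*after;
RUN;
```

If the same measurements were stacked in the long form (one row per subject and occasion), they would need to be reshaped to this layout before the PAIRED statement can be used.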

T-TEST 4
• Independent-sample t-test.
    PROC TTEST H0=0 ALPHA=.05 DATA=sm.cancer;
       CLASS smoke;
       VAR lung;
• Data should be arranged in the long form, with the grouping variable in the CLASS statement.
• Check the F-test for equal variances. Read the Pooled t in case of equal variances; otherwise read the Satterthwaite t.
    PROC TTEST COCHRAN DATA=sm.cancer;
       CLASS west;
       VAR kidney;
    RUN;

ANALYSIS OF VARIANCE 1
• Use PROC ANOVA, GLM, and MIXED.
    PROC ANOVA DATA=sm.cancer;
       CLASS smoke;
       MODEL lung=smoke;
    PROC GLM DATA=sm.cancer;
       CLASS smoke;
       MODEL lung=smoke;
    PROC MIXED DATA=sm.cancer;
       CLASS smoke;
       MODEL lung=smoke;
    RUN;

ANALYSIS OF VARIANCE 2
• PROC ANOVA is designed for balanced data, while GLM and MIXED can handle both balanced and unbalanced data.
• GLM and MIXED are generally recommended for complex models.
    PROC GLM DATA=sm.cancer;
       CLASS smoke area;
       MODEL lung=smoke area / SS3;
    RUN;

CORRELATION ANALYSIS
• Pearson correlation coefficients for interval variables.
    PROC CORR DATA=sm.airline PEARSON COV;
       VAR cost output fuel load;
    RUN;
• For ordinal variables, add the SPEARMAN and/or KENDALL options to the PROC CORR statement instead of PEARSON.
    PROC CORR DATA=sm.airline SPEARMAN KENDALL;
    RUN;

ORDINARY LEAST SQUARES 1
• Classical linear regression model or ordinary least squares (OLS).
• OLS rests on strong assumptions such as linearity, constant variance, and independent variables that are not related to the errors.
• Use PROC REG with the MODEL statement.
    PROC REG DATA=sm.airline;
       MODEL cost = output fuel load;
    RUN;

ORDINARY LEAST SQUARES 2
• Imposing restrictions
    PROC REG DATA=sm.airline;
       MODEL cost = output fuel load / NOINT;
    PROC REG DATA=sm.airline;
       MODEL cost = output fuel load;
       RESTRICT load=1;
• Hypothesis Test (Wald Test)
    PROC REG DATA=sm.airline;
       MODEL cost = output fuel load;
       TEST output=0;
    RUN;

ORDINARY LEAST SQUARES 3
• Get residuals and the Durbin-Watson (DW) statistic for AR(1).
    PROC REG DATA=sm.airline;
       MODEL cost = output fuel load / R DW;
    RUN;
• Check multicollinearity.
    PROC REG DATA=sm.airline;
       MODEL cost = output fuel load / COLLIN VIF TOL;
    RUN;
• Multicollinearity is serious if tolerance < (1-R2) or .1, VIF > 10, eigenvalue < .01, condition index > 50, or proportion of variation > .8.
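
The tolerance and VIF thresholds are two views of the same quantity: for regressor k, tolerance = 1 - R2_k and VIF = 1 / tolerance, where R2_k comes from regressing that regressor on the other regressors. The sketch below tabulates this relationship; the data set name work.vif_demo and the R2 values are arbitrary choices for illustration, not results from sm.airline.

```sas
/* Tabulate tolerance and VIF for a few illustrative R-square values. */
/* r2 stands for the R-square of one regressor on the others.         */
DATA work.vif_demo;
   DO r2 = 0.5, 0.8, 0.9, 0.95;
      tolerance = 1 - r2;        /* share of variance not explained by the others */
      vif       = 1 / tolerance; /* variance inflation factor */
      OUTPUT;
   END;
RUN;

PROC PRINT DATA=work.vif_demo NOOBS;
RUN;
```

An R2 of .9 corresponds to a tolerance of about .1 and a VIF of about 10, which is why those two cutoffs appear together.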

ORDINARY LEAST SQUARES 4
• OLS has strong assumptions that are easily violated in the real world.
• PROC NLIN for nonlinear models
• PROC SYSLIN for systems of equations with correlated errors
• PROC AUTOREG and ARIMA for autocorrelation
• PROC LOGISTIC and QLIM for categorical dependent variables

LOGIT/PROBIT MODELS 1
• Use PROC LOGISTIC, PROBIT, QLIM, and GENMOD to fit logit and probit models.
    PROC LOGISTIC DESCENDING DATA=sm.trust;
       MODEL trust = educate income age male;
    PROC PROBIT DATA=sm.trust;
       MODEL trust = educate income age male / DIST=LOGISTIC;
    PROC QLIM DATA=sm.trust;
       MODEL trust = educate income age male / DISCRETE(DIST=LOGIT);
    PROC GENMOD DATA=sm.trust DESC;
       MODEL trust = educate income age male / DIST=BINOMIAL LINK=LOGIT;
    RUN;
• Without DESCENDING (or EVENT='1'), PROC LOGISTIC models the probability of the lower-ordered response value and therefore produces the opposite signs.

LOGIT/PROBIT MODELS 2
• Compute odds ratios using the UNITS statement.
    PROC LOGISTIC DATA=sm.trust;
       MODEL trust(EVENT='1') = educate income age male;
       UNITS educate=SD income=SD age=SD;
    RUN;
• For a one-standard-deviation increase in x, the odds of having 1 are expected to change by a factor of exp(b_hat*sd), the odds ratio.
• Marginal effects need additional computation.
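
The arithmetic behind these two bullets can be sketched in a DATA step. All numbers below (b_hat, sd, xb) are hypothetical placeholders, not estimates from sm.trust, and the marginal effect shown is the standard logit formula for a continuous regressor evaluated at one chosen point; it is offered as an illustration rather than as the workshop's own method.

```sas
/* Hypothetical values for illustration only; not estimates from sm.trust */
DATA work.logit_demo;
   b_hat = 0.15;   /* assumed logit coefficient of x                      */
   sd    = 2.5;    /* assumed standard deviation of x                     */
   xb    = 0.4;    /* assumed linear predictor at the point of evaluation */

   odds_ratio_sd = EXP(b_hat * sd);      /* factor change in odds per one-SD increase */
   p             = 1 / (1 + EXP(-xb));   /* predicted probability under the logit     */
   marginal      = b_hat * p * (1 - p);  /* change in probability per unit of x       */
RUN;

PROC PRINT DATA=work.logit_demo NOOBS;
RUN;
```

Unlike the odds ratio, the marginal effect depends on where it is evaluated, because p changes with the values of the regressors.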

LOGIT/PROBIT MODELS 3
• Estimate probit models.
    PROC PROBIT DATA=sm.trust;
       MODEL trust = educate income age male;
    PROC LOGISTIC DATA=sm.trust DESC;
       MODEL trust = educate income age male / LINK=PROBIT;
    PROC QLIM DATA=sm.trust;
       MODEL trust = educate income age male / DISCRETE(DIST=NORMAL);
    RUN;
• PROC LOGISTIC and QLIM are recommended.

PANEL DATA MODELS 1
• A fixed effect model assumes different intercepts among groups or periods.
• A fixed effect model is in fact a dummy variable least squares model.
• A random effect model assumes different variances among groups or periods.
• In SAS, PROC PANEL and TSCSREG fit fixed and random effect models.
• PROC PANEL is preferred.

PANEL DATA MODELS 2
• A dummy variable least squares model.
    PROC REG DATA=sm.airline;
       MODEL cost = g1-g5 output fuel load;
    RUN;
• Fixed effect model using PROC PANEL, which fits the adjusted within effect model.
    PROC PANEL DATA=sm.airline;
       ID airline year;
       MODEL cost = output fuel load / FIXONE;
    RUN;
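
The dummy variable model needs one indicator per group except the baseline. The sketch below shows one way such indicators could be built. It assumes that airline is coded 1 through 6 in sm.airline (an assumption, not something the slides state) and uses the hypothetical names d1-d5 and work.airline_lsdv so that any existing g1-g5 variables are left untouched.

```sas
/* Build group dummies for a least squares dummy variable model.    */
/* Assumption: airline takes the values 1-6; airline 6 is baseline. */
DATA work.airline_lsdv;
   SET sm.airline;
   ARRAY d{5} d1-d5;            /* hypothetical dummy names */
   DO i = 1 TO 5;
      d{i} = (airline = i);     /* 1 if the row belongs to airline i, else 0 */
   END;
   DROP i;
RUN;

/* Same specification as on the slide, using the new dummies */
PROC REG DATA=work.airline_lsdv;
   MODEL cost = d1-d5 output fuel load;
RUN;
```

The slope estimates for output, fuel, and load from this regression should agree with those from the FIXONE model; only the way the group intercepts are reported differs.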

PANEL DATA MODELS 3
• Random effect model using PROC PANEL and TSCSREG.
    PROC PANEL DATA=sm.airline;
       ID airline year;
       MODEL cost = output fuel load / RANONE;
    RUN;
    PROC TSCSREG DATA=sm.airline;
       ID airline year;
       MODEL cost = output fuel load / RANONE;
    RUN;

FACTOR ANALYSIS 1
• Extracts a small number of factors (latent variables) out of many manifest variables (observed variables).
    PROC FACTOR DATA=sm.survey;
       VAR q1-q20;
    RUN;
• Specify the rotation method (e.g., VARIMAX, PARSIMAX, EQUAMAX, and PROMAX) in ROTATE= or R= and the number of factors in NFACTORS= or N=.
    PROC FACTOR DATA=sm.survey ROTATE=VARIMAX NFACTORS=3;
       VAR q1-q20;
    RUN;

FACTOR ANALYSIS 2
• Choose the method of extracting factors in METHOD= or M=, such as principal component analysis (the default), maximum likelihood (ML), and iterated principal factor analysis (PRINIT).
    PROC FACTOR DATA=sm.survey METHOD=ML R=PROMAX N=3;
       VAR q1-q20;
• Store factor scores using OUT=. Variables Factor1, Factor2, Factor3, … are created in the output data set.
    PROC FACTOR DATA=sm.survey M=ML R=VARIMAX N=3 OUT=sm.surveyScore;
       VAR q1-q20;
    RUN;
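
Once the scores are stored, they can be examined like any other variables. The following is a minimal sketch that assumes the OUT=sm.surveyScore data set from the slide has been created with N=3, so that Factor1 through Factor3 exist.

```sas
/* Summarize the stored factor scores (assumes N=3 factors were requested) */
PROC MEANS DATA=sm.surveyScore N MEAN STD MIN MAX;
   VAR Factor1-Factor3;
RUN;

/* Inspect correlations among the factor scores */
PROC CORR DATA=sm.surveyScore;
   VAR Factor1-Factor3;
RUN;
```

The correlations give a quick check of how distinct the extracted factors are in the sample.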

RELIABILITY TEST
• PROC CORR produces Cronbach's coefficient alpha with the ALPHA option.
    PROC CORR DATA=sm.survey ALPHA NOMISS;
       VAR q1-q20;
    RUN;
• NOMISS excludes observations with missing values.
• Alpha larger than .8 indicates high reliability of measurement.
• The section labeled "Cronbach Coefficient Alpha with Deleted Variable" lists the alpha obtained if each variable is removed.