Summarizing Data STT 501 Spring 2007
Summary Statistics
• Two base SAS procedures can be used to generate summary statistics for quantitative values: PROC MEANS and PROC UNIVARIATE.
• Both procedures can compute several summary statistics: mean, standard deviation, quantiles, and several others.
• PROC UNIVARIATE has a more comprehensive set of data analysis tools available.
Means Procedure
• General syntax (lots of stuff that we won’t always use):
    PROC MEANS <option(s)> <statistic-keyword(s)>;
      BY <DESCENDING> variable-1 <... <DESCENDING> variable-n> <NOTSORTED>;
      CLASS variable(s) </ option(s)>;
      FREQ variable;
      ID variable(s);
      OUTPUT <OUT=SAS-data-set> <output-statistic-specification(s)> <id-group-specification(s)> <maximum-id-specification(s)> <minimum-id-specification(s)> </ option(s)>;
      TYPES request(s);
      VAR variable(s) </ WEIGHT=weight-variable>;
      WAYS list;
      WEIGHT variable;
    RUN;
Means Procedure
• Most likely used statements:
    • var: specifies the variables you wish to summarize.
    • class: specifies variables used to group the data before summarizing.
• Sometimes used statements:
    • freq: specifies a variable that indicates the frequency of each observation. This is useful if the data contain several occurrences of the same value that have been grouped together.
    • id: specifies variables that are not used in the analysis but that you would like to keep when creating an output data set.
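As a minimal sketch of the freq statement described above, the following uses a small made-up data set (work.grouped, with hypothetical variables score and count; none of these names come from the course data). Each row holds one distinct value and the number of times it occurred.

```sas
/* Hypothetical grouped data: one row per distinct value, plus its count. */
data work.grouped;
   input score count;
   datalines;
10 3
20 5
30 2
;
run;

/* freq tells PROC MEANS to treat each row as `count` observations of `score`. */
proc means data=work.grouped n mean;
   var score;
   freq count;
run;
```

With these three rows the procedure should report N = 10 and a mean of 19, the same result as listing all ten values individually.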
Means Procedure
• Use the “projects” data set in our data folder:
    proc means data=stt501.projects;
    run;
• What output appears?
Means Procedure
• The default behavior is to summarize ALL numeric variables, whether it makes sense or not (SAS doesn’t know any better). Use the var statement to override this behavior.
• What do you think of these?
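A short sketch of overriding the default with a var statement. It assumes the stt501 libref is already assigned to the course data folder, and it uses personel and jobtotal because both appear as analysis variables elsewhere in these slides.

```sas
/* Assumes the stt501 libref already points at the course data folder. */
proc means data=stt501.projects;
   var personel jobtotal;   /* only these two numeric variables are summarized */
run;
```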
Means Procedure
• Use a class statement to break up the analysis over 2 or more categories:
    proc means data=stt501.projects;
      class region;
      var personel;
    run;
• Summaries for personnel costs are produced for each different value of the variable region.
Means Procedure
• In the output, the variable listed in the var statement is noted at the top as the analysis variable.
• Each level of the class variable produces a separate summary.
Means Procedure
• Specifying statistics keywords:
    proc means data=stt501.projects min q1 median q3 max;
      class region pol_type;
      var personel;
    run;
• Keywords override the default summary statistics. You can find a listing of them in the help section.
• Note: with two variables in the class statement, a summary is produced for each combination of values for those variables.
Means Procedure
• With two variables in the class statement, a summary is produced for each combination.
• Statistics are provided in the order listed in the proc means statement.
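The OUTPUT statement from the general syntax can save the summaries to a data set instead of only printing them. The following is a sketch, assuming the stt501 libref is assigned; the names work.region_stats, mean_personel and median_personel are made up for illustration.

```sas
/* noprint suppresses the printed table; results go to the output data set. */
proc means data=stt501.projects noprint;
   class region;
   var personel;
   output out=work.region_stats mean=mean_personel median=median_personel;
run;

/* Inspect the saved statistics. */
proc print data=work.region_stats;
run;
```

With a class statement, the output data set also contains an overall row (where _TYPE_ is 0) in addition to one row per region.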
Univariate Procedure
• The univariate procedure is similar to the means procedure in that it accepts var and class statements.
• However, it has many more statements and options available, and it produces more output by default.
Univariate Procedure
• Try this code:
    proc univariate data=stt501.projects;
      var personel;
    run;
• The default output includes a more extensive list of summary statistics than the means procedure.
Univariate Procedure
• The first tables contain some basic summary statistics like mean, median and standard deviation.
• One table gives percentiles/quantiles, including the five number summary.
• One table gives results of testing whether the mean is zero or not.
• The second page of output has a table containing the five largest and smallest values.
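The location tests in that output use a null value of zero by default. If a different null value is of interest, the MU0= option on the proc univariate statement changes it. A sketch, assuming the stt501 libref is assigned; the value 1000 is hypothetical and chosen only for illustration.

```sas
/* mu0= sets the null value for the location tests (the default is 0). */
/* 1000 is a made-up value, not taken from the course data.            */
proc univariate data=stt501.projects mu0=1000;
   var personel;
run;
```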
Univariate Procedure
• The univariate procedure supports class statements in the same manner as PROC MEANS.
• Try this code:
    proc univariate data=stt501.projects;
      class region pol_type;
      var personel;
    run;
• The default output is now produced for each combination of values for the variables listed in the class statement.
Univariate Procedure
• Univariate can also produce histograms (for any variable that was listed in the var statement). Try this code:
    proc univariate data=stt501.projects;
      class pol_type;
      var personel;
      histogram personel;
    run;
• This produces a histogram of personnel costs for each type of pollution.
Univariate Procedure
• Note the common scaling on both the vertical and horizontal axes.
Boxplots
• We can make boxplots in SAS using (surprisingly enough) proc boxplot. Its general form is:
    proc boxplot data=dataset;
      plot analysis-var*group-var;
    run; quit;
• The data must be pre-sorted on the group variable.
Boxplots
• For example:
    proc sort data=stt501.projects out=proj_sort;
      by region;
    run;
    proc boxplot data=proj_sort;
      plot jobtotal*region;
    run; quit;
• This produces a set of side-by-side boxplots.
Boxplots
• By default, SAS produces skeletal boxplots: no outlier detection is done or shown on the graph. This can be changed with the boxstyle option.
Boxplots
• To get an outlier boxplot, use:
    proc boxplot data=proj_sort;
      plot jobtotal*region / boxstyle=schematic;
    run; quit;
• Each outlier can be tagged with an id variable as well:
    proc boxplot data=proj_sort;
      plot jobtotal*region / boxstyle=schematicid;
      id date;
    run; quit;
Boxplots
• If outliers are crowded together on the plot, their id labels will be crowded together as well.
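For reference, a schematic boxplot flags as outliers the points lying more than 1.5 interquartile ranges beyond the quartiles. The sketch below computes those cutoffs ("fences") per region by hand so they can be compared with the plot. It assumes the stt501 libref is assigned; the data set name work.fences and the variables iqr, lower_fence and upper_fence are made up for illustration.

```sas
/* Quartiles of jobtotal for each region; nway keeps only the per-region rows. */
proc means data=stt501.projects noprint nway;
   class region;
   var jobtotal;
   output out=work.fences q1=q1 q3=q3;
run;

/* Fences: values outside [lower_fence, upper_fence] would be flagged. */
data work.fences;
   set work.fences;
   iqr = q3 - q1;
   lower_fence = q1 - 1.5*iqr;
   upper_fence = q3 + 1.5*iqr;
   keep region q1 q3 iqr lower_fence upper_fence;
run;

proc print data=work.fences;
run;
```

Quartiles can be computed by slightly different definitions, so treat the fences as a check on the plot rather than an exact reproduction of it.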