Learn how to manipulate and work with data in SAS, including cleaning, reorganizing, sorting, combining datasets, and creating new variables.
STAT 6360 – Statistical Software Programming

Working with Data in SAS
Typically the data you bring into SAS need further manipulation before analysis.
• They may need to be cleaned, reorganized, rescaled, sorted, or combined with other datasets; new variables may need to be created; etc.

Assignments
One of the most basic operations on data is an assignment. An assignment statement takes the form
    variable = expression;
• variable can be a new or existing variable in the dataset.
• expression can be a constant, a variable in the dataset, or a function of one or more constants and variables (including the variable being assigned itself!).
• Assignment statements occur in a data step, and only very rarely in a PROC.

Assignments
Examples (see Lec3Examps.sas):
    freezpt=32;
    trtment="Drug A";
    ht_in=ht_ft*12;
    xsq=x**2;                  * exponentiate with the ** operator (not x^2);
    stdylength = end_date - start_date;
    dist=dist/1.61;            * convert dist from km to mi & overwrite;
    score=.;                   * set score to missing;
    trtA= (trt='A');           * create 0-1 dummy for treatment A;
    trt_numeric= 1*(trt='A') + 2*(trt='B') + 3*(trt='C');
    c=1+3*2**2;                * equals 13 not 64;
    c= 1 + 3*(2**2);           * better (easier to read);

Assignments
• Operations on missing values generate missing values. Each occurrence generates a NOTE in the log.
• Standard rules of mathematical precedence apply.
• Use parentheses and spaces to make code readable.
• Mismatched types (numeric vs. character) result in conversions.
    • E.g., if freezpt had been defined as a character variable, the first assignment on the previous slide would convert 32 to '32' before making the assignment.
    • Then, if we created a new numeric variable temp via
          temp=freezpt+0;
      a conversion would occur again, this time from character to numeric.
    • Such conversions generate a NOTE in the log.
• Such automatic conversions can be convenient and we may not mind the NOTE in the log. But such NOTEs can also indicate mistakes in our code.
• Better to convert explicitly with the INPUT function (character to numeric) or the PUT function (numeric to character) and so avoid these NOTEs. Then, when such a NOTE does appear, we know it marks a mistake to be debugged.
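
The explicit conversions recommended above might look like the following minimal sketch. The dataset name temps and the variables freezpt_char, temp and temp_label are illustrative assumptions, not taken from Lec3Examps.sas.

```sas
data temps;
  length temp_label $8;
  freezpt_char = "32";                     * character value, as if read from a text file;
  temp = input(freezpt_char, 8.);          * character to numeric with the 8. informat;
  temp_label = strip(put(temp, 8.));       * numeric to character with the 8. format, padding removed;
run;
```

Because both conversions are requested explicitly, this step should run without a character/numeric conversion NOTE in the log.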

SAS Functions
• SAS has hundreds of functions.
• It also has CALL routines. These are similar to functions but
    • They cannot be used in an assignment statement. Instead they are invoked with the CALL statement.
    • They can make assignments, though, by replacing the value(s) of one or more of their argument(s).
    • They can yield multiple values at once (their main advantage).
    • Any argument whose value they replace must be a variable, not an expression or a constant.
• E.g., the statement
      call missing(x,y,z);
  sets variables x, y and z to missing. Alternatively, this could be done in three assignment statements:
      x=.; y=.; z=.;
• I don't find it necessary to use CALL routines very often, but occasionally they are useful. I won't devote special attention to them, but they are not hard to use and you can read the SAS documentation for details.
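
As a small illustration of a CALL routine changing several of its arguments at once, here is a sketch using CALL SORTN together with CALL MISSING. The dataset name call_demo and the variables a, b, c and flag are made up for the example.

```sas
data call_demo;
  a = 3; b = 1; c = 2;
  flag = 1;
  call sortn(a, b, c);     * sorts the values across the three variables in ascending order;
  call missing(flag);      * sets flag to missing, same effect as flag=.;
run;
```

After CALL SORTN the values have been rearranged so that a holds the smallest and c the largest, which a single function call in an assignment statement could not do.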

SAS Functions
• Categories of SAS functions and CALL routines include
    Character
    Character String Matching
    Date and Time
    Descriptive Statistics
    Distance
    Financial
    Macro
    Mathematical
    Probability
    Random Number
    State & ZIP Code
    Variable Information
• Functions take one or more arguments and return a value. They are of the form
      function_name(arg1, arg2,…)
• Arguments can be variable names, constant values, or expressions.
• Functions can be nested.
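
To make the last two points concrete, the sketch below nests functions and passes an expression as an argument. The dataset name nested and all variable names are illustrative assumptions.

```sas
data nested;
  address = "athens, ga";
  state = upcase(scan(address, -1));         * inner call runs first: scan picks the last word, upcase capitalizes it;
  x = -16.4;
  root = round(sqrt(abs(x)), 0.1);           * abs, then sqrt, then round to the nearest tenth;
  avg_sq = mean(x**2, 4);                    * an expression (x**2) used as an argument;
run;
```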

SAS Functions – Examples (see Lec3Examps.sas)
    town = propcase("winston-salem, n.c.");
    state = scan("athens, GA",-1);
    xmas2014 = '25dec14'd;
    days_til_xmas = datdif(today(),xmas2014,'Actual');
    xmas_month=month(xmas2014);
    medscore = median(score1,score2,score3,score4);
    medscore = median(OF score1-score4);
    * crit val for 2-sided t test at alpha=0.05 with df=30;
    t_crit = tinv(.975,30);
    * 2-sided pval for t test with df=nu;
    pval_ttest= 2*(1- probt(abs(teststat),nu));
    x= 8+2*rannor(7);                 * gives a N(8,4) random #. 7 is seed for r.n.g.;
    uni1_10= floor(10*ranuni(7)+1);   * gives random integer b/w 1 & 10;
    angle=mod(405,360);
    area=constant('PI')*(radius**2);
    one=log(constant('E'));           * natural log;
    invalid=log(-1);                  * generates invalid argument NOTE in log;
    char_date="17mar2013";
    * input converts character to numeric w/ an informat;
    num_date=input(char_date,date9.);

Conditional Execution: IF THEN/ELSE
Conditional execution can be accomplished with an IF THEN statement. E.g.,
    * suppose missing vals were coded -99 in input data file;
    data quiz;
      input quizno score @@;
      if score=-99 then score=.;
      datalines;
    1 60 2 80 3 100 4 -99 5 80
    ;
    run;
• What follows IF is an expression. If the expression is non-zero and non-missing, the statement following THEN executes; otherwise it does not.
• The expression is often a logical comparison, but does not have to be.
• Any valid SAS statement can follow THEN.
• A group of statements can be executed conditionally with a DO END group.

Conditional Execution: IF THEN/ELSE
• An IF THEN statement can be followed by one or more ELSE statements. E.g.,
    data SEC;
      input univ $ enrollment comma6.0 @@;
      length city $11;   * comment this line to see what it does;
      if univ='UGA' then do;
        mascot="Bulldog";
        city="Athens";
      end;
      else do;
        mascot="Tiger";
        city="Baton Rouge";
      end;
      datalines;
    UGA 34,816 LSU 29,718
    ;
    run;
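
Conditional Execution: IF THEN/ELSE
• The expression is typically a logical comparison. SAS evaluates such comparisons as 1 (true) or 0 (false); a more general numeric expression after IF can also be missing, which counts as false. Examples:
    if age>=21 then legal="YES";
    else legal="NO";
    length drug $7;   * without this, drug gets length 6 from "Active" and "Placebo" is truncated;
    if TRT in ("A","B") then drug="Active";
    else drug="Placebo";
    if gender NE "F" then gender="M";
    if (birthyr ge 1946 and birthyr le 1964) then boomer=1;
    else boomer=0;
• SAS logical comparison operators have symbolic and text aliases:
    Symbolic     Text    Meaning
    =            EQ      equal to
    ^= or ~=     NE      not equal to
    >            GT      greater than
    <            LT      less than
    >=           GE      greater than or equal to
    <=           LE      less than or equal to
                 IN      equal to one of a list
    &            AND     both conditions true
    |            OR      at least one condition true
    ^ or ~       NOT     negation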

Conditional Execution: IF THEN/ELSE
• The IN comparison is very useful! It checks whether a variable's value is in a specified list.
• Another logical operator is NOT, but be careful, it can be confusing.
• The IF expression need not be in parentheses, but use them for readability.
• Note that in numeric comparisons, missing values are treated as the smallest possible value. E.g., if nsibs=. for a subject, he/she will be grouped as an only child below, when only_child should be missing:
    if (nsibs LT 1) then only_child="YES";
    else only_child="NO";
• A statement often used with IF THEN is DELETE, which deletes the current observation from the dataset. E.g., to keep data from only female subjects:
    if gender ne "F" then delete;
• An equivalent, but more puzzling, form of the above statement is the subsetting IF, which continues processing the current observation only when the expression is true:
    if gender="F";
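
One way to avoid the only-child pitfall described above is to test for a missing value before making the comparison. The following is a minimal sketch; the dataset name sibs and the three data rows are made up for illustration.

```sas
data sibs;
  input subj nsibs;
  length only_child $3;
  if missing(nsibs) then only_child = " ";       * leave the character result missing (blank);
  else if (nsibs LT 1) then only_child = "YES";
  else only_child = "NO";
  datalines;
1 0
2 .
3 2
;
run;
```

Here subject 2, whose nsibs is missing, ends up with a blank only_child rather than "YES".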

The RETAIN statement
Recall that SAS processes a data step one observation at a time, one line at a time. In detail, here are the steps:
    1. Read the current row from the input data set into the Program Data Vector (PDV).
    2. Process data step statements on that data row, one by one.
    3. Output the PDV to the output dataset.
    4. Increment the current data row to the next row in the input data set.
    5. Return to step 1.
• Thus, SAS loops through the input dataset one observation at a time.
• At any given step the data from previous or later observations are not available.
• An exception can be made with the RETAIN statement, which can be used to let a variable keep its value from the previous observation.

Data Step - Program Flow
Consider the following data step:
    data classes;
      input class $ ugrads grads @@;
      if (substr(class,5,1)+0 > 5) then gradclass=1;
      else gradclass=0;
      enroll=ugrads+grads;
      make=0;
      if ((gradclass & enroll>=5) or (gradclass=0 & enroll>=10)) then make=1;
      datalines;
    STAT4210 18 0 STAT6210 0 20 STAT8540 0 4
    ;
    run;
• Let's follow the program execution.

Data Step - Program Flow
[Animation, 13 slides: the data step above is repeated on each slide while the contents of the PDV and of the output dataset CLASSES are filled in step by step.]
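
To follow the program flow yourself, one option is to write the PDV to the log at chosen points with PUT _ALL_. The sketch below is the same data step with two PUT statements added; the PUT statements and their label text are additions for illustration and are not part of the original example.

```sas
data classes;
  input class $ ugrads grads @@;
  put "After INPUT:      " _all_;     * gradclass, enroll and make are still missing here;
  if (substr(class,5,1)+0 > 5) then gradclass=1;
  else gradclass=0;
  enroll=ugrads+grads;
  make=0;
  if ((gradclass & enroll>=5) or (gradclass=0 & enroll>=10)) then make=1;
  put "End of iteration: " _all_;     * these are the values written to CLASSES;
  datalines;
STAT4210 18 0 STAT6210 0 20 STAT8540 0 4
;
run;
```

Working through the logic by hand, the three observations written to CLASSES should be: STAT4210 with gradclass=0, enroll=18, make=1; STAT6210 with gradclass=1, enroll=20, make=1; and STAT8540 with gradclass=1, enroll=4, make=0. The log also shows the automatic variables _N_ and _ERROR_, which are in the PDV but are not written to the output dataset.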

(Back to) The RETAIN statement
At the beginning of a data step iteration, the value of any variable named on the RETAIN statement is kept from the previous iteration rather than being reset to missing.
• On the 1st iteration, there is no previous iteration from which to retain the value, so it is set to missing (by default) or to an optional initial value specified on the RETAIN statement.
• A statement of the form
      variable + expression;
  is called a sum statement and is equivalent to
      retain variable 0;
      variable=sum(variable,expression);
• See the Peyton Manning example in Lec3Examps.sas.
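
A short sketch of both ideas, RETAIN and the sum statement, is given below. It is not the Peyton Manning example from Lec3Examps.sas; the dataset name running, the variables, and the data values are invented for illustration.

```sas
data running;
  input game yards @@;
  retain max_yards;                       * kept across iterations, missing on the first one;
  total_yards + yards;                    * sum statement: implicitly retained, starts at 0;
  max_yards = max(max_yards, yards);      * MAX ignores missing arguments;
  datalines;
1 250 2 . 3 310
;
run;
```

With these values, total_yards should be 250, 250, 560 and max_yards should be 250, 250, 310 across the three observations. The missing yards value in game 2 does not wipe out the running total, because the sum statement behaves like the SUM function and skips missing values.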

Arrays
When you have a long list of variables that you want to modify or manipulate, a SAS array can be helpful.
• In SAS, an array is a temporary, ordered grouping of variables that are identified by an array name. Individual variables in the grouping can be referenced via a subscripted form of the array name.
• All variables in the array must be either numeric or character (can't be mixed).
• A SAS array is essentially a handy aliasing scheme. That is, it is a way to refer to a variable indirectly via a subscripted alias.
• This is handy if you want to do the same operation to several variables because it allows you to put that operation in a DO loop.

Arrays
• E.g., in a nutrition study you collect body measurements every 6 months on a sample of subjects, including height, weight and sitting height at baseline (time 0), at 6 mos, and at 12 mos.
• Suppose missing values were coded as -9, but you want to change them to the SAS missing value indicator, a period. Use an array!
    data bodymeas;
      input subj ht0 ht6 ht12 wt0 wt6 wt12 sit0 sit6 sit12;
      array myarray(9) ht0 ht6 ht12 wt0 wt6 wt12 sit0 sit6 sit12;
      do i=1 to 9;
        if myarray(i)=-9 then myarray(i)=.;
      end;
      datalines;
    1 142.2 142.8 143 35.3 36.6 36.5 77.3 78.05 77.5
    2 139.9 -9 141.2 52.4 -9 54.4 75.9 -9 76.65
    3 141.6 142.4 -9 32.9 32.9 -9 72.6 72.7 -9
    ;
    run;

Arrays
Syntax (< > means optional):
    ARRAY array_name(subscript) <$> <length> var1 var2 …
• E.g., in the code on the previous page we created an array called myarray with 9 elements:
    myarray(1) points to ht0
    myarray(2) points to ht6
    …
    myarray(9) points to sit12
• When we operate on myarray(i), we are really doing the operation on the ith variable in our list.
• In a sense, myarray(1), myarray(2),…,myarray(9) don't exist! They are just short-cuts to the real variables ht0, ht6, …, sit12.

Arrays
In the previous example, we took existing variables ht0, ht6, etc., and defined an array to point to them.
• An ARRAY statement can also be used to point to variables that don't exist yet.
    • In this case, their declaration in the ARRAY statement creates them at the same time that they are grouped into an array.
• In our second example, the dataset shopping contains 10 count-valued variables q1-q10. We use arrays to create new, dichotomous variables that indicate whether each count is >0.
    • Array orig groups the original variables q1-q10 for easy, subscripted reference.
    • Array new creates 10 new character variables called dichot_q1-dichot_q10.
    • When we check the value of orig(i), we are really checking the value of qi.
    • When we assign the value of new(i), we are really assigning the value of dichot_qi.
• orig(1),…,orig(10), new(1),…,new(10) aren't really variables; they are just temporary aliases to the real variables. When the data step is finished, they cease to exist.
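
The second example described above might be coded along the following lines. This is a sketch under stated assumptions: the two data rows standing in for the shopping dataset are made up, the output dataset name shopping2 is hypothetical, and the values "yes" and "no" for the new character variables are a guess at the coding rather than the lecture's own.

```sas
* stand-in for the shopping dataset, with made-up counts;
data shopping;
  input q1-q10;
  datalines;
0 2 1 0 0 5 3 0 1 0
4 0 0 0 2 1 0 0 0 7
;
run;

data shopping2;
  set shopping;
  array orig(10) q1-q10;                        * aliases for existing numeric variables;
  array new(10) $ 3 dichot_q1-dichot_q10;       * creates 10 new character variables of length 3;
  do i = 1 to 10;
    if missing(orig(i)) then new(i) = " ";      * keep missing counts missing;
    else if orig(i) > 0 then new(i) = "yes";
    else new(i) = "no";
  end;
  drop i;                                       * the loop index is not needed in the output;
run;
```

The explicit check for missing counts is there because a missing value compares as smaller than 0 and would otherwise be coded "no".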