280 likes | 291 Views
Lesson 5 - Topics. Creating new variables in the data step SAS Functions Programs 5-6 in course notes LSB 3:1-6,11-12. Creating New Variables. Direct assignments(formulas): c = a + b ; d = 2*a + 3*b + 7*c ; bmi = weight/(height*height); Indirect assignments (if/then/else)
E N D
Lesson 5 - Topics • Creating new variables in the data step • SAS Functions • Programs 5-6 in course notes • LSB 3:1-6,11-12
Creating New Variables • Direct assignments(formulas): c = a + b ; d = 2*a + 3*b + 7*c ; bmi = weight/(height*height); • Indirect assignments (if/then/else) if age < 50 then young = 1; else young = 2; if income < 15 then tax = 1; else if income < 25 then tax = 2; else if income >=25 then tax = 3;
Direct Assignments(Formulas) • Example c = a + b ; So if a = 2, b =3, c = 5; What if a is missing, what is c? C will be missing What if b is missing?
If/then/else Statements With if-then-else definitions SAS stops executing after the first true statement if income < 15 then tax = 1; else if income < 25 then tax = 2; else if income >=25 then tax = 3; What if income is 10? What if income is 23? What if income is 30? What if income is missing? Tax = 1 Tax = 2 Tax = 3 Tax = 1
Creating New Variables Create a new variable with 2 levels, one for college graduates and one for non-college graduates.
Program 5 DATA tdata; INFILE‘C:\SAS_Files\tomhs.data' ; INPUT @ 1 ptid $10. @ 49 educ 1. @123 sbp12 3. ; * This way will code missing values to the value 2; if educ < 7then grad1 = 2 ; else if educ >=7then grad1 = 1 ; * The next two ways are equivalent and are correct; if educ < 7 and educ ne .then grad2 = 2; else if educ >=7then grad2 = 1; * IN is a special function in SAS ; if educ IN(1,2,3,4,5,6) then grad3 = 2; else if educ IN(7,8,9) then grad3 = 1; New variable defines go after the input statement
PROCFREQDATA=tdata; TABLES educ grad1 grad2 grad3 ; Cumulative Cumulative educ Frequency Percent Frequency Percent --------------------------------------------------------- 1 3 3.03 3 3.03 3 4 4.04 7 7.07 4 23 23.23 30 30.30 5 14 14.14 44 44.44 6 12 12.12 56 56.57 7 16 16.16 72 72.73 8 10 10.10 82 82.83 9 17 17.17 99 100.00 Frequency Missing = 1 Cumulative Cumulative grad1 Frequency Percent Frequency Percent ----------------------------------------------------------- 1 43 43.00 43 43.00 2 57 57.00 100 100.00 Cumulative Cumulative grad2 Frequency Percent Frequency Percent ----------------------------------------------------------- 1 43 43.43 43 43.43 2 56 56.57 99 100.00 Frequency Missing = 1 Cumulative Cumulative grad3 Frequency Percent Frequency Percent ----------------------------------------------------------- 1 43 43.43 43 43.43 2 56 56.57 99 100.00 Frequency Missing = 1 Coded the missing value for educ to 2
PROCFREQDATA=tdata; TABLES educ*grad1 /MISSINGNOCUMNOPERCENTNOROWNOCOL; TITLE'Use Crosstabulation to Verify Recoding'; RUN; Table of educ by grad1 educ grad1 Frequency‚ 1‚ 2‚ Total ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ . ‚ 0 ‚ 1 ‚ 1 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 1 ‚ 0 ‚ 3 ‚ 3 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 3 ‚ 0 ‚ 4 ‚ 4 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 4 ‚ 0 ‚ 23 ‚ 23 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 5 ‚ 0 ‚ 14 ‚ 14 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 6 ‚ 0 ‚ 12 ‚ 12 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 7 ‚ 16 ‚ 0 ‚ 16 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 8 ‚ 10 ‚ 0 ‚ 10 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 9 ‚ 17 ‚ 0 ‚ 17 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ Total 43 57 100 This shows that the missing value for educ got assigned a value of 2
PROCFREQDATA=tdata; TABLES educ*grad2 /MISSINGNOCUMNOPERCENTNOROWNOCOL; TITLE'Use Crosstabulation to Verify Recoding'; RUN; Table of educ by grad2 educ grad2 Frequency| .| 1| 2| Total ---------+--------+--------+--------+ . | 1 | 0 | 0 | 1 ---------+--------+--------+--------+ 1 | 0 | 0 | 3 | 3 ---------+--------+--------+--------+ 3 | 0 | 0 | 4 | 4 ---------+--------+--------+--------+ 4 | 0 | 0 | 23 | 23 ---------+--------+--------+--------+ 5 | 0 | 0 | 14 | 14 ---------+--------+--------+--------+ 6 | 0 | 0 | 12 | 12 ---------+--------+--------+--------+ 7 | 0 | 16 | 0 | 16 ---------+--------+--------+--------+ 8 | 0 | 10 | 0 | 10 ---------+--------+--------+--------+ 9 | 0 | 17 | 0 | 17 ---------+--------+--------+--------+ Total 1 43 56 100 This shows that the missing value for educ got assigned a value of missing
* Recode sbp12 into 3 levels; if sbp12 = .then sbp12c = . ; else if sbp12 < 120then sbp12c = 1 ; else if sbp12 < 140then sbp12c = 2 ; else if sbp12 >=140then sbp12c = 3 ; With if-then-else definitions SAS stops executing after the first true statement Values < 120 will be assigned value of 1 Values 120-139 will be assigned value of 2 Values >=140 will be assigned value of 3 Missing values will be assigned to missing
PROCFREQDATA=tdata; TABLES sbp12c sbp12; RUN; OUTPUT Cumulative Cumulative sbp12c Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 1 36 39.13 36 39.13 2 43 46.74 79 85.87 3 13 14.13 92 100.00 Frequency Missing = 8 Cumulative Cumulative sbp12 Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 93 1 1.09 1 1.09 94 1 1.09 2 2.17 101 1 1.09 3 3.26 104 1 1.09 4 4.35 105 1 1.09 5 5.43 (more values) 147 1 1.09 87 94.57 148 1 1.09 88 95.65 149 1 1.09 89 96.74 153 1 1.09 90 97.83 154 1 1.09 91 98.91 158 1 1.09 92 100.00 Frequency Missing = 8
How come no values of 2 and why so many missing? * Easy but costly error to make; if sbp12 = .then sbp12c = . ; else if sbp12 < 120then sbp12c = 1 ; else if sbp12 < 140then sbp12 = 2 ; else if sbp12 >=140then sbp12c = 3 ; PROCFREQDATA=tdata; TABLES sbp12c; RUN; The FREQ Procedure Cumulative Cumulative sbp12c Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 1 36 73.47 36 73.47 3 13 26.53 49 100.00 Frequency Missing = 51
Important Facts When Creating New Variable 1. New variables are initialized to missing 2. Missing values are < any value if var < value(true if var is missing) 3. Reference missing values for numeric variables as . 4. Reference missing values for character variables as ' ' if sbp = . then ... (or if missing(sbp)) if clinic = ' ' then ...
SAS Handling of Missing Data When Creating New Variables • Direct assignments(formulas): c = a + b ; d = 2*a + 3*b + 7*c ; bmi = weight/(height*height); If any variable on the right-hand side is missing then the new variable will be missing • Indirect assignments if age < 50 then young = 1; else young=2; New variables are initialized to missing but may be given a value if any of the IF statements are true
Checks you can make to be sure new variables • are created correctly • Display original and new variables. • PROCPRINTDATA=tdata (OBS=20); • VAR educ college ; • Run PROC MEANS on original and new variable. Make sure both variables have same number of missing values. • PROCMEANSDATA=tdata; • VAR educ college; • Run PROC FREQ on original and new variable. • PROCFREQDATA=tdata; • TABLES educ college educ*college;
What Value to Set New Variable if age < 20 then teenager = 1; else if age >=20 then teenager = 2; if age < 20 then teenager = 1; else if age >=20 then teenager = 0; if age < 20 then teenager = ‘YES’; else if age >=20 then teenager = ‘NO’;
If-then-do statements * Conditionally execute several statements; * Create indicator variables for race; * Make sure race variable not missing; if race ne . thendo; white = 0; black = 0; asian = 0; other = 0; if race = 1 then white = 1; if race = 2 then black = 1; if race = 3 then asian = 1; if race = 4 then other = 1; end;
proc means n mean sum; var white black asian other; run; The MEANS Procedure Variable N Mean Sum ----------------------------------------------- white 100 0.8100000 81.0000000 black 100 0.1800000 18.0000000 asian 100 0.0100000 1.0000000 other 100 0 0 -----------------------------------------------
DO LOOPS WITH ARRAYS - Used to Shorten Code - Used when repeating same code - Used with DO/END loop ARRAY wtlb(3) wt1 wt2 wt3; ARRAY wtkg(3) newwt1 newwt2 newwt3; DO index = 1to3; wtkg(index) = wtlb(index) / 2.2; END; /* same as the following code Newwt1 = wt1 / 2.2 ; Newwt2 = wt2 / 2.2; Newwt3 = wt3 / 2.2; *************************************/
* Program 6 SAS Functions ; DATA example; INFILE‘C:\SAS_Files\tomhs.data' ; INPUT @058 height 4. @085 weight 5. @172 ursod 3. @236 (se1-se10) (1.0 + 1); bmi = (weight*703.0768)/(height*height); rbmi1 = ROUND(bmi,1); rbmi2 = ROUND(bmi,.1); lursod = LOG(ursod); seavg = MEAN (OF se1-se10); semax = MAX (OF se1-se10); semin = MIN (OF se1-se10);
* Use of dash notation ; seavg = MEAN (OF se1-se10); This is the same as seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10); The OF is very important. Otherwise SAS thinks you are subtracting se10 from se1. To use this notation the ROOT of the name must be the same.
* Two ways of computing average ; seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10); Versus seavg = (se1+se2+se3+se4+se5+se6+se7+se8+se9+se10)/10; Using mean function computes the average of non-missing values. Result is missing only if all values all missing. Using + formula requires all values be non-missing otherwise result will be missing if N(of se1-se10) > 5 then seavg = MEAN(of se1-se10); What does this statement do?
* Compute 10 new variables, 100 if se is present and 0 if not; ARRAY se (10) se1-se10; ARRAY hse(10) hse1-hse10; DO senumber = 1to10; if se(senumber) = 1then hse(senumber) = 0; else if se(senumber) in(2,3,4) then hse(senumber) = 100; END; *** For senumber = 1 the code is ************* if se1 = 1then hse1 = 0; else if se1 in(2,3,4) then hse1 = 100; New variables
PROCPRINTDATA = example (OBS=10); VAR bmi rbmi1 rbmi2 seavg semin semax ; TITLE'Listing of Selected Data for 10 Patients '; RUN; PROCFREQDATA = example; TABLES semax; TITLE'Distribution of Worse Side Effect Value'; TITLE2'Side Effect Scores Range from 1 to 4'; RUN; PROCMEANSDATA = example; VAR hse1-hse10; TITLE'Percent of Patients With Condition by Condition'; RUN; odsgraphicson; PROCUNIVARIATEDATA = example ; VAR ursod lursod; QQPLOT ursod lursod; TITLE'Quantile Plots for Urine Sodium Data'; RUN;
Listing of Selected Data for 10 Patients Obs bmi rbmi1 rbmi2 seavg semin semax 1 28.2620 28 28.3 1.1 1 2 2 35.9963 36 36.0 1.0 1 1 3 27.0489 27 27.0 1.0 1 1 4 28.2620 28 28.3 1.1 1 2 5 33.2008 33 33.2 1.0 1 1 6 27.7691 28 27.8 1.2 1 2 7 32.6040 33 32.6 1.0 1 1 8 22.4057 22 22.4 1.2 1 2 9 37.2037 37 37.2 1.1 1 2 10 33.1717 33 33.2 1.7 1 3
Distribution of Worse Side Effect Value Side Effect Scores Range from 1 to 4 The FREQ Procedure Cumulative Cumulative semax Frequency Percent Frequency Percent ---------------------------------------------------------- 1 33 33.00 33 33.00 2 52 52.00 85 85.00 3 13 13.00 98 98.00 4 2 2.00 100 100.00 2 patients had at least 1 severe side effect
Percent of Patients With Condition by Condition Type The MEANS Procedure Variable N Mean Std Dev Minimum Maximum hse1 100 12.0000000 32.6598632 0 100.0000000 hse2 100 21.0000000 40.9360181 0 100.0000000 hse3 100 8.0000000 27.2659924 0 100.0000000 hse4 100 13.0000000 33.7997669 0 100.0000000 hse5 100 10.0000000 30.1511345 0 100.0000000 hse6 100 30.0000000 46.0566186 0 100.0000000 hse7 100 16.0000000 36.8452949 0 100.0000000 hse8 100 31.0000000 46.4823199 0 100.0000000 hse9 100 7.0000000 25.6432400 0 100.0000000 hse10 100 14.0000000 34.8735088 0 100.0000000 These means are percent of patients with se