1 / 28

Lesson 5 - Topics

Lesson 5 - Topics. Creating new variables in the data step SAS Functions Programs 5-6 in course notes LSB 3:1-6,11-12. Creating New Variables. Direct assignments(formulas): c = a + b ; d = 2*a + 3*b + 7*c ; bmi = weight/(height*height); Indirect assignments (if/then/else)

jmarci
Download Presentation

Lesson 5 - Topics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lesson 5 - Topics • Creating new variables in the data step • SAS Functions • Programs 5-6 in course notes • LSB 3:1-6,11-12

  2. Creating New Variables • Direct assignments(formulas): c = a + b ; d = 2*a + 3*b + 7*c ; bmi = weight/(height*height); • Indirect assignments (if/then/else) if age < 50 then young = 1; else young = 2; if income < 15 then tax = 1; else if income < 25 then tax = 2; else if income >=25 then tax = 3;

  3. Direct Assignments(Formulas) • Example c = a + b ; So if a = 2, b =3, c = 5; What if a is missing, what is c? C will be missing What if b is missing?

  4. If/then/else Statements With if-then-else definitions SAS stops executing after the first true statement if income < 15 then tax = 1; else if income < 25 then tax = 2; else if income >=25 then tax = 3; What if income is 10? What if income is 23? What if income is 30? What if income is missing? Tax = 1 Tax = 2 Tax = 3 Tax = 1

  5. Creating New Variables Create a new variable with 2 levels, one for college graduates and one for non-college graduates.

  6. Program 5 DATA tdata; INFILE‘C:\SAS_Files\tomhs.data' ; INPUT @ 1 ptid $10. @ 49 educ 1. @123 sbp12 3. ; * This way will code missing values to the value 2; if educ < 7then grad1 = 2 ; else if educ >=7then grad1 = 1 ; * The next two ways are equivalent and are correct; if educ < 7 and educ ne .then grad2 = 2; else if educ >=7then grad2 = 1; * IN is a special function in SAS ; if educ IN(1,2,3,4,5,6) then grad3 = 2; else if educ IN(7,8,9) then grad3 = 1; New variable defines go after the input statement

  7. PROCFREQDATA=tdata; TABLES educ grad1 grad2 grad3 ; Cumulative Cumulative educ Frequency Percent Frequency Percent --------------------------------------------------------- 1 3 3.03 3 3.03 3 4 4.04 7 7.07 4 23 23.23 30 30.30 5 14 14.14 44 44.44 6 12 12.12 56 56.57 7 16 16.16 72 72.73 8 10 10.10 82 82.83 9 17 17.17 99 100.00 Frequency Missing = 1 Cumulative Cumulative grad1 Frequency Percent Frequency Percent ----------------------------------------------------------- 1 43 43.00 43 43.00 2 57 57.00 100 100.00 Cumulative Cumulative grad2 Frequency Percent Frequency Percent ----------------------------------------------------------- 1 43 43.43 43 43.43 2 56 56.57 99 100.00 Frequency Missing = 1 Cumulative Cumulative grad3 Frequency Percent Frequency Percent ----------------------------------------------------------- 1 43 43.43 43 43.43 2 56 56.57 99 100.00 Frequency Missing = 1 Coded the missing value for educ to 2

  8. PROCFREQDATA=tdata; TABLES educ*grad1 /MISSINGNOCUMNOPERCENTNOROWNOCOL; TITLE'Use Crosstabulation to Verify Recoding'; RUN; Table of educ by grad1 educ grad1 Frequency‚ 1‚ 2‚ Total ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ . ‚ 0 ‚ 1 ‚ 1 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 1 ‚ 0 ‚ 3 ‚ 3 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 3 ‚ 0 ‚ 4 ‚ 4 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 4 ‚ 0 ‚ 23 ‚ 23 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 5 ‚ 0 ‚ 14 ‚ 14 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 6 ‚ 0 ‚ 12 ‚ 12 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 7 ‚ 16 ‚ 0 ‚ 16 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 8 ‚ 10 ‚ 0 ‚ 10 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 9 ‚ 17 ‚ 0 ‚ 17 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ Total 43 57 100 This shows that the missing value for educ got assigned a value of 2

  9. PROCFREQDATA=tdata; TABLES educ*grad2 /MISSINGNOCUMNOPERCENTNOROWNOCOL; TITLE'Use Crosstabulation to Verify Recoding'; RUN; Table of educ by grad2 educ grad2 Frequency| .| 1| 2| Total ---------+--------+--------+--------+ . | 1 | 0 | 0 | 1 ---------+--------+--------+--------+ 1 | 0 | 0 | 3 | 3 ---------+--------+--------+--------+ 3 | 0 | 0 | 4 | 4 ---------+--------+--------+--------+ 4 | 0 | 0 | 23 | 23 ---------+--------+--------+--------+ 5 | 0 | 0 | 14 | 14 ---------+--------+--------+--------+ 6 | 0 | 0 | 12 | 12 ---------+--------+--------+--------+ 7 | 0 | 16 | 0 | 16 ---------+--------+--------+--------+ 8 | 0 | 10 | 0 | 10 ---------+--------+--------+--------+ 9 | 0 | 17 | 0 | 17 ---------+--------+--------+--------+ Total 1 43 56 100 This shows that the missing value for educ got assigned a value of missing

  10. * Recode sbp12 into 3 levels; if sbp12 = .then sbp12c = . ; else if sbp12 < 120then sbp12c = 1 ; else if sbp12 < 140then sbp12c = 2 ; else if sbp12 >=140then sbp12c = 3 ; With if-then-else definitions SAS stops executing after the first true statement Values < 120 will be assigned value of 1 Values 120-139 will be assigned value of 2 Values >=140 will be assigned value of 3 Missing values will be assigned to missing

  11. PROCFREQDATA=tdata; TABLES sbp12c sbp12; RUN; OUTPUT Cumulative Cumulative sbp12c Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 1 36 39.13 36 39.13 2 43 46.74 79 85.87 3 13 14.13 92 100.00 Frequency Missing = 8 Cumulative Cumulative sbp12 Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 93 1 1.09 1 1.09 94 1 1.09 2 2.17 101 1 1.09 3 3.26 104 1 1.09 4 4.35 105 1 1.09 5 5.43 (more values) 147 1 1.09 87 94.57 148 1 1.09 88 95.65 149 1 1.09 89 96.74 153 1 1.09 90 97.83 154 1 1.09 91 98.91 158 1 1.09 92 100.00 Frequency Missing = 8

  12. How come no values of 2 and why so many missing? * Easy but costly error to make; if sbp12 = .then sbp12c = . ; else if sbp12 < 120then sbp12c = 1 ; else if sbp12 < 140then sbp12 = 2 ; else if sbp12 >=140then sbp12c = 3 ; PROCFREQDATA=tdata; TABLES sbp12c; RUN; The FREQ Procedure Cumulative Cumulative sbp12c Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 1 36 73.47 36 73.47 3 13 26.53 49 100.00 Frequency Missing = 51

  13. Important Facts When Creating New Variable 1. New variables are initialized to missing 2. Missing values are < any value if var < value(true if var is missing) 3. Reference missing values for numeric variables as . 4. Reference missing values for character variables as ' ' if sbp = . then ... (or if missing(sbp)) if clinic = ' ' then ...

  14. SAS Handling of Missing Data When Creating New Variables • Direct assignments(formulas): c = a + b ; d = 2*a + 3*b + 7*c ; bmi = weight/(height*height); If any variable on the right-hand side is missing then the new variable will be missing • Indirect assignments if age < 50 then young = 1; else young=2; New variables are initialized to missing but may be given a value if any of the IF statements are true

  15. Checks you can make to be sure new variables • are created correctly • Display original and new variables. • PROCPRINTDATA=tdata (OBS=20); • VAR educ college ; • Run PROC MEANS on original and new variable. Make sure both variables have same number of missing values. • PROCMEANSDATA=tdata; • VAR educ college; • Run PROC FREQ on original and new variable. • PROCFREQDATA=tdata; • TABLES educ college educ*college;

  16. What Value to Set New Variable if age < 20 then teenager = 1; else if age >=20 then teenager = 2; if age < 20 then teenager = 1; else if age >=20 then teenager = 0; if age < 20 then teenager = ‘YES’; else if age >=20 then teenager = ‘NO’;

  17. If-then-do statements * Conditionally execute several statements; * Create indicator variables for race; * Make sure race variable not missing; if race ne . thendo; white = 0; black = 0; asian = 0; other = 0; if race = 1 then white = 1; if race = 2 then black = 1; if race = 3 then asian = 1; if race = 4 then other = 1; end;

  18. proc means n mean sum; var white black asian other; run; The MEANS Procedure Variable N Mean Sum ----------------------------------------------- white 100 0.8100000 81.0000000 black 100 0.1800000 18.0000000 asian 100 0.0100000 1.0000000 other 100 0 0 -----------------------------------------------

  19. DO LOOPS WITH ARRAYS - Used to Shorten Code - Used when repeating same code - Used with DO/END loop ARRAY wtlb(3) wt1 wt2 wt3; ARRAY wtkg(3) newwt1 newwt2 newwt3; DO index = 1to3; wtkg(index) = wtlb(index) / 2.2; END; /* same as the following code Newwt1 = wt1 / 2.2 ; Newwt2 = wt2 / 2.2; Newwt3 = wt3 / 2.2; *************************************/

  20. * Program 6 SAS Functions ; DATA example; INFILE‘C:\SAS_Files\tomhs.data' ; INPUT @058 height 4. @085 weight 5. @172 ursod 3. @236 (se1-se10) (1.0 + 1); bmi = (weight*703.0768)/(height*height); rbmi1 = ROUND(bmi,1); rbmi2 = ROUND(bmi,.1); lursod = LOG(ursod); seavg = MEAN (OF se1-se10); semax = MAX (OF se1-se10); semin = MIN (OF se1-se10);

  21. * Use of dash notation ; seavg = MEAN (OF se1-se10); This is the same as seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10); The OF is very important. Otherwise SAS thinks you are subtracting se10 from se1. To use this notation the ROOT of the name must be the same.

  22. * Two ways of computing average ; seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10); Versus seavg = (se1+se2+se3+se4+se5+se6+se7+se8+se9+se10)/10; Using mean function computes the average of non-missing values. Result is missing only if all values all missing. Using + formula requires all values be non-missing otherwise result will be missing if N(of se1-se10) > 5 then seavg = MEAN(of se1-se10); What does this statement do?

  23. * Compute 10 new variables, 100 if se is present and 0 if not; ARRAY se (10) se1-se10; ARRAY hse(10) hse1-hse10; DO senumber = 1to10; if se(senumber) = 1then hse(senumber) = 0; else if se(senumber) in(2,3,4) then hse(senumber) = 100; END; *** For senumber = 1 the code is ************* if se1 = 1then hse1 = 0; else if se1 in(2,3,4) then hse1 = 100; New variables

  24. PROCPRINTDATA = example (OBS=10); VAR bmi rbmi1 rbmi2 seavg semin semax ; TITLE'Listing of Selected Data for 10 Patients '; RUN; PROCFREQDATA = example; TABLES semax; TITLE'Distribution of Worse Side Effect Value'; TITLE2'Side Effect Scores Range from 1 to 4'; RUN; PROCMEANSDATA = example; VAR hse1-hse10; TITLE'Percent of Patients With Condition by Condition'; RUN; odsgraphicson; PROCUNIVARIATEDATA = example ; VAR ursod lursod; QQPLOT ursod lursod; TITLE'Quantile Plots for Urine Sodium Data'; RUN;

  25. Listing of Selected Data for 10 Patients Obs bmi rbmi1 rbmi2 seavg semin semax 1 28.2620 28 28.3 1.1 1 2 2 35.9963 36 36.0 1.0 1 1 3 27.0489 27 27.0 1.0 1 1 4 28.2620 28 28.3 1.1 1 2 5 33.2008 33 33.2 1.0 1 1 6 27.7691 28 27.8 1.2 1 2 7 32.6040 33 32.6 1.0 1 1 8 22.4057 22 22.4 1.2 1 2 9 37.2037 37 37.2 1.1 1 2 10 33.1717 33 33.2 1.7 1 3

  26. Distribution of Worse Side Effect Value Side Effect Scores Range from 1 to 4 The FREQ Procedure Cumulative Cumulative semax Frequency Percent Frequency Percent ---------------------------------------------------------- 1 33 33.00 33 33.00 2 52 52.00 85 85.00 3 13 13.00 98 98.00 4 2 2.00 100 100.00 2 patients had at least 1 severe side effect

  27. Percent of Patients With Condition by Condition Type The MEANS Procedure Variable N Mean Std Dev Minimum Maximum hse1 100 12.0000000 32.6598632 0 100.0000000 hse2 100 21.0000000 40.9360181 0 100.0000000 hse3 100 8.0000000 27.2659924 0 100.0000000 hse4 100 13.0000000 33.7997669 0 100.0000000 hse5 100 10.0000000 30.1511345 0 100.0000000 hse6 100 30.0000000 46.0566186 0 100.0000000 hse7 100 16.0000000 36.8452949 0 100.0000000 hse8 100 31.0000000 46.4823199 0 100.0000000 hse9 100 7.0000000 25.6432400 0 100.0000000 hse10 100 14.0000000 34.8735088 0 100.0000000 These means are percent of patients with se

  28. Log transformed value shows a better linear pattern

More Related