Learn the processes, internals, and defaults of the DATA step in SAS programming for optimal utilization of this powerful tool.
SAS Essentials: How SAS Thinks
Neil.Howard@amgen.com
“The DATA step is your most powerful programming tool. So understand and use it well.” Socrates
Objectives • understand DATA step: • processes • internals • defaults
DATA step processes:
• compilation of DATA step source code
• execution of resultant machine code

Compile and execute phases of:
• INPUT (non-SAS data)
• SET

Compile Time Activities
• syntax scan
• source code translation to machine language
• definition of input and output files

Compile Time Activities
• input buffer
• LPDV (logical program data vector)
• data set descriptor information

Creation of LPDV
• variables are added in the order seen by the compiler during parsing and interpretation of source statements

Compile Time Statements
• location critical: BY, WHERE, ARRAY, ATTRIB, FORMAT, INFORMAT, LENGTH
• location irrelevant: DROP, KEEP, LABEL, RENAME, RETAIN

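A minimal sketch of the "location irrelevant" case. The data set DROP_DEMO and the variables TEMP and RESULT are hypothetical names used only for illustration. Because DROP is handled at compile time, it can appear before the statement that first creates TEMP and still take effect.

```sas
data drop_demo ;
   drop temp ;             /* compile time: flags TEMP as dropped, wherever this statement sits */
   temp   = 2 ;            /* TEMP still exists in the LPDV during execution */
   result = temp * 10 ;    /* so it can be used in calculations */
   put _all_ ;             /* write LPDV to LOG: TEMP is shown here */
run ;

proc print data=drop_demo ;   /* only RESULT is written to the data set */
run ;
```

Moving the DROP statement to the end of the step should give the same output data set.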
Retained Variables
• all SAS special variables: _N_, _ERROR_
• all vars in RETAIN statement
• all vars from SET, MERGE, or UPDATE
• accumulator vars in SUM statement(s)

Variables Not Retained
• variables from INPUT statement
• user-defined variables (other than SUM statement)

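The difference between retained and non-retained variables can be sketched as follows. RETAIN_DEMO, KEPT, TOTAL, and NOTKEPT are hypothetical names, and the three data lines are made up for illustration.

```sas
data retain_demo ;
   retain kept 0 ;            /* named in RETAIN: value carried across iterations */
   input x ;                  /* X comes from INPUT: reset to missing each iteration */
   kept    = kept + x ;       /* running total that relies on RETAIN */
   total   + x ;              /* SUM statement: accumulator is retained, starts at 0 */
   notkept = notkept + x ;    /* plain assignment: NOTKEPT is missing at the top of
                                 every iteration, so missing + x stays missing */
   put _all_ ;                /* write LPDV to LOG */
cards ;
1
2
3
;
run ;
```

In the LOG, KEPT and TOTAL should accumulate to 1, 3, and 6, while NOTKEPT stays missing on every iteration and SAS notes that missing values were generated.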
Type and Length of Variables
• determined at compile time
• by first reference to the compiler (in the DATA step)
• numerics:
  • length is 8 during DATA step processing
  • length is an output property

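A small sketch of "first reference wins" for character variables. LENGTH_DEMO and its variable names are hypothetical. The numeric LENGTH statement is included on the assumption that, as the slide says, it affects only the stored (output) length.

```sas
data length_demo ;
   length long_first $ 12 ;        /* first reference: character length fixed at 12 */
   long_first  = 'abc' ;
   short_first = 'abc' ;           /* first reference: character length fixed at 3 */
   short_first = 'abcdefghij' ;    /* later, longer value is truncated to 'abc' */
   long_first  = 'abcdefghij' ;    /* fits within 12 */
   length n 4 ;                    /* numeric: output length 4, still 8 in the LPDV */
   n = 1 ;
   put _all_ ;                     /* write LPDV to LOG */
run ;

proc contents data=length_demo ;   /* shows the stored type and length of each variable */
run ;
```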
INPUT statement reading non-SAS data
Compile Loop and LPDV
data a ;
   put _all_ ;   *write LPDV to LOG;
   input idnum
         diagdate : mmddyy8.
         sex $
         rx_grp $ 10. ;
   time = intck ('year', diagdate, today() ) ;
   put _all_ ;   *write LPDV to LOG;
cards ;
1 09-09-52 F placebo
2 11-15-64 M 300 mg.
3 04-07-48 F 600 mg.
run;

input buffer

logical program data vector
   idnum     diagdate   sex    rx_grp   time
   numeric   numeric    char   char     numeric
   8         8          8      10       8

Building descriptor portion of SAS data set

logical program data vector
            idnum     diagdate   sex    rx_grp   time      _N_    _ERROR_
   type     numeric   numeric    char   char     numeric
   length   8         8          8      10       8
   DKR*     keep      keep       keep   keep     keep      drop   drop
*Drop/keep/rename

Execution of a DATA Step (flowchart)
• initialization of LPDV
• read input file
• end of file? Y: go to next step; N: continue
• process statements in step
• termination: implied output
• _N_ + 1, then back to initialization of LPDV

DATA Step Execution
• implied read/write loop, stopped by:
  • no more data to read
  • explicit STOP
  • no input data
  • some execution time errors

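Two of these stopping conditions can be sketched in a few lines. The first step has no input source, so the implied loop runs once. The second assumes that data set A from the earlier raw-data example exists and ends the loop with an explicit STOP.

```sas
data _null_ ;
   put 'no input data: ' _n_= ;   /* executes once, then the step ends */
run ;

data _null_ ;
   set a ;                        /* data set A built in the raw-data example */
   put idnum= _n_= ;              /* write the current values to the LOG */
   if _n_ = 2 then stop ;         /* explicit STOP: third observation is never read */
run ;
```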
Execution Time Activities
• execute initialize-to-missing (ITM)
• read from input source
• modify data using user-controlled statements
• supply values of variables to LPDV
• output observation to SAS data set

Initialization
• _N_ set to loop count
• _ERROR_ set to 0
• user variables set to missing

Execution Loop - raw data
data a ;
   put _all_ ;   *write LPDV to LOG;
   input idnum
         diagdate : mmddyy8.
         sex $
         rx_grp $ 10. ;
   time = intck ('year', diagdate, today() ) ;
   put _all_ ;   *write LPDV to LOG;
cards ;
1 09-09-52 F placebo
2 11-15-64 M 300 mg.
3 04-07-48 F 600 mg.
run;
proc contents; run;
proc print; run;

LPDV (over all executions of DATA step)
   IDNUM   DIAGDATE   SEX   RX_GRP    TIME   _N_
   .       .                          .      1
   1       -2670      F     placebo   48     1
   .       .                          .      2
   2       1780       M     300 mg.   36     2
   .       .                          .      3
   3       -4286      F     600 mg.   52     3
   .       .                          .      4

2    data a ;
3       put _all_ ;   *write LPDV to LOG;
4       input idnum
5             diagdate: mmddyy8.
6             sex $
7             rx_grp $ 10. ;
8       time = intck ('year', diagdate, today() ) ;
9       put _all_;    *write LPDV to LOG;
10   cards ;

IDNUM=. DIAGDATE=. SEX= RX_GRP= TIME=. _ERROR_=0 _N_=1
IDNUM=1 DIAGDATE=-2670 SEX=F RX_GRP=placebo TIME=49 _ERROR_=0 _N_=1
IDNUM=. DIAGDATE=. SEX= RX_GRP= TIME=. _ERROR_=0 _N_=2
IDNUM=2 DIAGDATE=1780 SEX=M RX_GRP=300 mg. TIME=37 _ERROR_=0 _N_=2
IDNUM=. DIAGDATE=. SEX= RX_GRP= TIME=. _ERROR_=0 _N_=3
IDNUM=3 DIAGDATE=-4286 SEX=F RX_GRP=600 mg. TIME=53 _ERROR_=0 _N_=3
IDNUM=. DIAGDATE=. SEX= RX_GRP= TIME=. _ERROR_=0 _N_=4
NOTE: The data set WORK.A has 3 observations and 5 variables.
NOTE: The DATA statement used 0.59 seconds.

14   run;
15
16   proc contents; run;

NOTE: The PROCEDURE CONTENTS used 0.39 seconds.

Data Set Name: WORK.A                               Observations:         3
Member Type:   DATA                                 Variables:            5
Engine:        V612                                 Indexes:              0
Created:       11:18 Saturday, January 20, 2001     Observation Length:   42
Last Modified: 11:18 Saturday, January 20, 2001     Deleted Observations: 0
Protection:                                         Compressed:           NO
Data Set Type:                                      Sorted:               NO
Label:

-----Engine/Host Dependent Information-----

Data Set Page Size:         8192
Number of Data Set Pages:   1
File Format:                607
First Data Page:            1
Max Obs per Page:           194
Obs in First Data Page:     3

-----Alphabetic List of Variables and Attributes-----

#    Variable    Type    Len    Pos
-----------------------------------
5    TIME        Num       8     34
2    DIAGDATE    Num       8      8
1    IDNUM       Num       8      0
4    RX_GRP      Char     10     24
3    SEX         Char      8     16

PROC PRINT
   IDNUM   DIAGDATE   SEX   RX_GRP    TIME
   1       -2670      F     placebo   48
   2       1780       M     300 mg.   36
   3       -4286      F     600 mg.   52

SET statement reading existing SAS data
DATA Step Compile
• no input buffer
• compiler reads descriptor portion of input SAS data set to build the LPDV
• returns same variables/attributes, including new variables

SET
• determine which SAS data set to be read
• identify next observation to be read
• copy variable values to LPDV

Execution Loop - SAS data
data sas_a ;
   put _all_ ;
   set a ;
   tot_rec + 1 ;
   put _all_ ;
run;

Building LPDV from descriptor portion of old SAS data set

logical program data vector
   idnum     diagdate   sex    rx_grp   time      tot_rec
   numeric   numeric    char   char     numeric   numeric
   8         8          8      10       8         8

Building descriptor portion of new SAS data set

LPDV (over all executions of DATA step)
   IDNUM   DIAGDATE   SEX   RX_GRP    TIME   TOT_REC   _N_
   .       .                          .      0         1
   1       -2670      F     placebo   48     1         1
   1       -2670      F     placebo   48     1         2
   2       1780       M     300 mg.   36     2         2
   2       1780       M     300 mg.   36     2         3
   3       -4286      F     600 mg.   52     3         3
   3       -4286      F     600 mg.   52     3         4

LOG
idnum=. diagdate=. sex= rx_grp= time=. tot_rec=0 _ERROR_=0 _N_=1
idnum=1 diagdate=-2670 sex=F rx_grp=placebo time=48 tot_rec=1 _ERROR_=0 _N_=1
idnum=1 diagdate=-2670 sex=F rx_grp=placebo time=48 tot_rec=1 _ERROR_=0 _N_=2
idnum=2 diagdate=1780 sex=M rx_grp=300 mg. time=36 tot_rec=2 _ERROR_=0 _N_=2
idnum=2 diagdate=1780 sex=M rx_grp=300 mg. time=36 tot_rec=2 _ERROR_=0 _N_=3
idnum=3 diagdate=-4286 sex=F rx_grp=600 mg. time=52 tot_rec=3 _ERROR_=0 _N_=3
idnum=3 diagdate=-4286 sex=F rx_grp=600 mg. time=52 tot_rec=3 _ERROR_=0 _N_=4

PROC PRINT
   IDNUM   DIAGDATE   SEX   RX_GRP    TIME   TOT_REC
   1       -2670      F     placebo   48     1
   2       1780       M     300 mg.   36     2
   3       -4286      F     600 mg.   52     3

Logic of a MERGE • compile • execute
data left;
   input ID X Y ;
cards;
1 88 99
2 66 77
3 44 55
;
data right;
   input ID A $ B $ ;
cards;
1 A14 B32
3 A53 B11
;

proc sort data=left; by ID; run;
proc sort data=right; by ID; run;
data both;
   merge left (in=inleft)
         right (in=inright);
   by ID ;
run;

logical program data vector, first iteration: MATCH
   ID   X    Y    A     B     INLEFT   INRIGHT   _N_   _ERROR_
   1    88   99   A14   B32   1        1         1     0

logical program data vector, second iteration: NO MATCH
   ID   X    Y    A     B     INLEFT   INRIGHT   _N_   _ERROR_
   2    66   77               1        0         2     0

logical program data vector, third iteration: MATCH
   ID   X    Y    A     B     INLEFT   INRIGHT   _N_   _ERROR_
   3    44   55   A53   B11   1        1         3     0

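The IN= variables shown in the LPDV exist only during execution and are not written to the output data set, but they can drive logic inside the step. A minimal sketch, assuming the sorted LEFT and RIGHT data sets above; the output name MATCHED is hypothetical:

```sas
data matched ;
   merge left  (in=inleft)
         right (in=inright) ;
   by ID ;
   if inleft and inright ;   /* subsetting IF: output only IDs found in both data sets */
   put _all_ ;               /* write LPDV to LOG for the observations that are kept */
run ;
```

With the data above, only the observations for ID 1 and ID 3 should reach MATCHED.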
Let’s try this again...
data left;
   input ID X Y ;
cards;
1 88 99
2 66 77
3 44 55
;
data right;
   input ID A $ B $ ;
cards;
1 A14 B32
3 A53 B11
;

proc sort data=left; by ID; run;
proc sort data=right; by ID; run;
data both;
   merge left (in=inleft)
         right (in=inright);
   ***** by ID (one-on-one merge);
run;

logical program data vector, first iteration: 1:1 “MATCH”
   ID   X    Y    A     B     _N_   _ERROR_
   1    88   99   A14   B32   1     0
ID = 1 OVERWRITTEN: value came from data set “right”

logical program data vector, second iteration: 1:1 “MATCH”
   ID   X    Y    A     B     _N_   _ERROR_
   3    66   77   A53   B11   2     0
ID = 2 from “left” OVERWRITTEN by 3: value came from data set “right”

logical program data vector, third iteration: 1:1 “NO MATCH”
   ID   X    Y    A     B     _N_   _ERROR_
   3    44   55               3     0
A and B MISSING: no values from “right”

Output SAS data set
   ID   X    Y    A     B
   1    88   99   A14   B32
   3    66   77   A53   B11
   3    44   55

DATA Step Conclusions
• Understanding internals and default activities allows you to:
  • make informed coding decisions
  • write flexible and efficient code
  • debug and test effectively
  • interpret results readily

Remember
• We have discussed DEFAULTS
• As soon as you add options, statements, features, etc., the default actions change; TEST them!
• You can use these same tools to track what’s happening.