500 likes | 512 Views
Learn how to read SAS data sets, create new data sets, use BY groups, and control variables using DROP and KEEP options.
E N D
4 Chapter 11 Reading SAS Data • Objectives: • Create new SAS data set from existing SAS data sets • Use BY groups to process observations • Read observations by observation number – Direct access • Stop processing when necessary • Explicit write observations to output • Understanding how SAS data sets are read.
Reading a Single SAS data set The simplest syntax for reading a SAS data set is: Data sas-data-set; SET sas-data-set; In this chapter, a various of techniques are discussed for reading a SAS Data set: General syntax: DATA sas-data-set <(drop = keep = )>; SET sas-data-set <(drop= keep= firstobs= obs= ) END= POINT= >; BY Variables; SAS statements; RUN; NOTE: We have discussed DROP=, KEEP =, FIRSTOBS= and OBS= previously in the SET statement.
Controlling Variable Input using KEEP=, DROP= Input SAS data set Raw Data File DROP= and KEEP= data set options in “read” statement PDV DROP= and KEEP= data set options associated with an output data set DROP and KEEP statements Output SAS data set
Example of using DROP=, KEEP= in both Data Statement and SET statement. SAS applies data set options to input data sets before it • evaluates programming statements • applies data set options to output data sets. In the following example: mylib.military is the input SAS data set, and army is the output SAS data set. (drop = City State Country) applies to INPUT data set, mylib.military. These variables are not read from military data set. (keep=Code Airport) applies to OUTPUT data set, Army. data army (keep=Code Airport); set mylib.military (drop=City State Country); if Type eq 'Army' then output; run;
Example of selecting observations in SET statement data army (keep=Code Airport); set mylib.military (drop=City State Country firstobs=5 obs=15); if Type eq 'Army' then output; run; • NOTE: • Only input observations 5 to 15 from mylib.military data set. • Variables City, State and Country are not read from mylib.military data set. • Variables Code and Airport are not written to army data set.
Finding the first and last observations in a group when reading SAS data set:Using BY Group with SET statement; Data pilotn (drop = state); Setmylib.pilots (drop=id city homephone); BYJobcode; Run; NOTE: The BY Jobcode statement requires a PROC SORT; BY Jobcode; When BY Jobcode is executed, SAS creates two system variables: FIRST.jobcode: assigns value 1 when the first observation in the BY group is read, and 0 otherwise. LAST.jobcode: assigns value 1 when the LAST observation in the BY group is read, and 0 otherwise.
Am example resulted from Jobcode Suppose Jobcode consists of FAT, PLT and MAT three distinct observations. DATA Pilotn; SET Mylib.pilot; BY Jobcode; Creates the following results: NOTE: First.jobcode, Last.Jobcode are not stored as variables in SAS data set. But, they are system variables. We can use them in the program.
Why we need to know the first and last observation for each group of a Variable? Example : A company wants to compute the running total of payroll for each Department and include it as a new variable in the SAS data set.
Exercise 1 Open the following program: c11_1 (see the next three slides). • Run the first data step up to proc print; run; • Check the results (this uses Length statement, and SELECT WHEN statement to define agegp variable.) • Run the next part of the program (from Proc Sort statement to run; statement). Go to SAS LOG to check the results from the PUT statement to see the values of FIRST.AGEGP and LAST.AGEGP • Run the next set of program statements from Data diab3 to the last statement. Check the results and make sure you understand how the results are obtained from this SAS program.
C11_1 program Data diabetes; set mylib.diabetes; length agegp $ 10; select; when (age < 40) agegp='YOUNG'; when (40<= age < 50) agegp='MIDDLE'; when (age >=50) agegp='SENIOR'; otherwise agegp='UNDEFINED'; end; procprint; run;
/* Use SET, BY together to create two system variables first.var, last.var */ procsort data = diabetes out=diab_s; by agegp sex; run; Data diab2 (drop= weight height age); set diab_s (keep=Sex agegp age height weight pulse); by agegp sex; bmi=weight/height**2 * 703; put agegp= sex= pulse= bmi= first.agegp= last.agegp= first.sex= last.sex=; run; /*The above PUT statement output the values of the variables to the SAS LOG. */ /* Go to SAS LOG to see how the variables FIRST.AGEGP, LAST.AGEGP, FIRST.SEX and LAST.SEX are defined. */
/*The following IF-THEN DO statements use first.agegp and last.agegp to control computing the average BMI of each age group. */ Data diab3 (drop= weight height age); set diab_s (keep=Sex agegp age height weight pulse); by agegp sex; bmi=weight/height**2 * 703; if first.agegp=1 then do; sumbmi=0; ncase=0; end; sumbmi+bmi; ncase+1; if last.agegp then do; avgbmi=sumbmi/ncase; output; end; run; procprint label; varagegpncaseavgbmi; label agegp='Age Grouping‘ ncase = 'Num of Cases‘ avgbmi = 'Average BMI'; run;
OUTPUT statement • OUTPUT; statement forces SAS to output an observation to the SAS data set before it hits the RUN; statement at the end of the data step. • By default, SAS will output an observation when the process reaches RUN; statement. This is called ‘IMPLICIT’ output. • In many situations, we need to output observations earlier in order to output more observations or observations different from the one to be output implicitly, we will need the EXPLICIT OUTPUT statement: Output; • NOTE: Once OUTPUT is used, SAS will no longer use implicit output. Since each execution will only output data once. For Exercise 1, we want to output the average BMI for each age group, which is computed inside the DO loop. Therefore, we use OUTPUT; to create observations for each age group.
Additional Exercise Delete OUTPUT statement, run the program to check the results. Note you see there are many missing values for Average BMI and Age Grouping is reported for every individual observation. However, we only want the average BMI for each age group. OUTPUT statement helps us to accomplish it.
Reading Observations Using Direct Access: POINT = option in SET statement In general, observations are read sequentially. Using POINT = option in SET statement, you can read specific observations directly. Simplest General Syntax: Data output-sas-dataset; Var_name = n; SET sas-data-set POINT= var_name; OUTPUT; STOP; RUN;
Some cautions when using POINT = var_name in the SET statement NOTE1: the var_name is a temporary variable that defines the observation numbers to be read. NOTE2: STOP statement is used to stop the input after reading the observations specified by the Temporary variable in the SET statement. This is needed because the direct access to observations DOES NOT go to the end-of-file marker and it will continue to read the data set in unlimited loop. NOTE3: By default, SAS automatically output the observation to the SAS data set when it reaches the end of the DATA step. This is IMPLICIT output. However, when using POINT = , the execution has never reached the end of the DATA Step, therefore, the IMPLICIT output is not executed to output the observation. It requires EXPLICIT OUTPUT statement to output the observation one by one to the SAS data set, even when the end of the DATA Step is not reached.
Exercise 2 Open the following program c11_2 (see the next slide) • Run the program and check the results. • Change point = obsnum in the SET statement to POINT = 5 and run the program. Observe the results NOTE: POINT = var_name. Var_name can not be a numeric constant such as 5. • Comment out the OUTPUT; statement. Run the program and see the results. NOTE: There is no observation in the data set. Why?
C11_2 program procprint data = mylib.admit; run; Data admitn0; obsnum=5; SET mylib.admit point = obsnum; output; stop; Run; Procprint data=admitn0; run;
Example of reading the odd number of observations Data admitn; Do ; Obsnum = _N_*2 -1; SET mylib.admitpoint = obsnum; If obsnum > 21 then stop; Output; End; Run; Proc print; run;
Detecting the END of a Data Setusing END = option in SET statement To identify the end of a data set allows for creating specific variables or reports that will only be conducted after the end of the data set. For example, One may be interested in only writing to output data set the final observation that consists of the sum of payroll for each subgroup of Jobcode. General Syntax: DATA SAS-dataset; SET sasdataset END= Var_name; NOTE: Var_name is a temporary variable contains end-of-file marker. Var_name = 0 until SET statement reads the last observation of the sas data set. When end-of-file marker is reached, Var_name = 1. This Variable is not added to the sas data set. DO NOT use both POINT= and END = in the same SET statement.
Exercise 3 Open the following program c11_3 • Add the PUT statement to see the values of lastcase before if statement: PUT ID = age = lastcase =; • Run the program and go to SAS LOG to observe the values of ID, age and lastcase. • Comment out the OUTPUT; statement, run the program and observe the results. Make sure you understand the difference between having OUTPUT and without OUTPUT statement.
C11_3 program data diabe5; set mylib.diabetes (keep= id Sex age height weight) end=lastcase; bmi=weight/height**2 * 703; sumbmi+bmi; ncase+1; if lastcase then do; avgbmi=sumbmi/ncase; output; end; run; /*Use lastcase as the indicator and only output the results of the data processing at the last data case. */ procprint label; varncaseavgbmi; label ncase = 'Num of Cases‘ avgbmi = 'Average BMI'; run;
Understanding How SAS Data Sets Are Read Previously, we discussed how SAS handles reading external raw data set. We pointed out the major differences are: When reading external raw data set, prior to reading a new observation, the PDV is initialized to missing. When reading SAS data set, the PDV is initialized only once when the first observation is read. The data values are retained afterwards except the new variables created in the Data Step. The following slides demonstrate how this process is executed when reading SAS data sets.
A Forecasting Application The growth rate of each division of an airline is forecast in mylib.growth. If each of the five divisions grows at its respective rate for the next three years, what will be the approximate size of each division at the end of each of the three years? Partial Listing of mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080
A Forecasting Application The output SAS data set, forecast, should contain 15 observations. Partial Listing of forecast New Division Increase Year Total APTOPS 0.075 1 220.38 APTOPS 0.075 2 236.90 APTOPS 0.075 3 254.67 FINACE 0.040 1 205.92 FINACE 0.040 2 214.16
Automatic Output By default, every DATA step contains an automatic OUTPUT statement at the end of each iteration. This automatic OUTPUT statement tells the SAS System to write observations to the data set or data sets that are created. data forecast; set mylib.growth; <additional SAS statements>; run; 1. Automatic output 2. Automatic return Automatic output to forecast sas data set
The OUTPUT Statement The explicit OUTPUT statement writes the current contents of the PDV to a SAS data set. Placing an explicit OUTPUT statement in a DATA step overrides the automatic output, and SAS adds an observation to a data set only when an explicit OUTPUT statement is executed. OUTPUT <SAS-data-set-1 …SAS-data-set-n>;
A Forecasting Application data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run;
Division $ 6 NumEmps N 8 Increase N 8 Explicit Output: Compilation data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; PDV ...
Year N 8 Division $ 6 NumEmps N 8 Increase N 8 Explicit Output: Compilation data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; PDV ...
PDV Year N 8 NewTotal N 8 Division $ 6 NumEmps N 8 Increase N 8 Explicit Output: Compilation data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; ...
PDV Year N 8 NewTotal N 8 Division $ 6 NumEmps N 8 Increase N 8 Explicit Output: Compilation data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; D ...
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 NewTotal N 8 . Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; PDV Initialize PDV to missing Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D . . . ...
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 NewTotal N 8 . Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; PDV Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D APTOPS 205 0.075 . ...
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 NewTotal N 8 . Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; PDV Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D APTOPS 205 0.075 1 ...
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 NewTotal N 8 220.38 Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; 205*(1+0.075) PDV Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D APTOPS 205 0.075 1 ...
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 NewTotal N 8 220.38 Write first observation to forecast. Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; PDV Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D APTOPS 205 0.075 1 ...
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 NewTotal N 8 220.38 Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; PDV Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D APTOPS 205 0.075 2 ...
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 NewTotal N 8 236.90 Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; 220.38*(1+0.075) PDV Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D APTOPS 205 0.075 2 ...
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 NewTotal N 8 236.90 Write second observation to forecast. Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; PDV Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D APTOPS 205 0.075 2 ...
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 NewTotal N 8 236.90 Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; PDV Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D APTOPS 205 0.075 3 ...
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 NewTotal N 8 254.67 Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; 236.90*(1+0.075) PDV Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D APTOPS 205 0.075 3 ...
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 NewTotal N 8 254.67 Write third observation to forecast. Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; PDV Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D APTOPS 205 0.075 3 ...
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 No automatic output NewTotal N 8 254.67 Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; PDV Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D APTOPS 205 0.075 3 ...
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 NewTotal N 8 254.67 Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; PDV Automatic return Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D APTOPS 205 0.075 3
Explicit Output: Execution Contents of the FORECAST data set after the first execution of the DATA step: Division Increase Year NewTotal APTOPS 0.075 1 220.38 APTOPS 0.075 2 236.90 APTOPS 0.075 3 254.67
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 NewTotal N 8 . Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; PDV Reinitialize PDV to missing Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D APTOPS 205 0.075 . ...
mylib.growth Num Division Emps Increase APTOPS 205 0.075 FINACE 198 0.040 FLTOPS 187 0.080 NewTotal N 8 . Explicit Output: Execution data forecast; drop NumEmps; set mylib.growth; Year=1; NewTotal=NumEmps*(1+Increase); output; Year=2; NewTotal=NewTotal*(1+Increase); output; Year=3; NewTotal=NewTotal*(1+Increase); output; run; Continue executing DATA step until all observations frommylib.growthare read PDV Division $ 6 NumEmps N 8 Increase N 8 Year N 8 D FINACE 198 0.040 .
Explicitly Controlling Output Partial Log NOTE: There were 5 observations read from the data set MYLIB.GROWTH. NOTE: The data set WORK.FORECAST has 15 observations and 4 variables.
Explicitly Controlling Output proc print data=forecast noobs; format NewTotal 6.; run; Partial PROC PRINT Output New Division Increase Year Total APTOPS 0.075 1 220 APTOPS 0.075 2 237 APTOPS 0.075 3 255 FINACE 0.040 1 206 FINACE 0.040 2 214