390 likes | 538 Views
The Complexities and Options of Merging/Lookups with Base/SAS . Jennifer Sinodis Metris Companies. Task. Need to Identify specific records in the Detail File based on criteria in the Lookup Information Table. Scenario. Have Two Files:
E N D
The Complexities and Options of Merging/Lookups with Base/SAS Jennifer Sinodis Metris Companies
Task • Need to Identify specific records in the Detail File based on criteria in • the Lookup Information Table Scenario • Have Two Files: • Detail Data in a Comma Delimited Text File (csv) with a Unique Key • Lookup Information in a SAS Data Set
SAS Options • MERGE Statement • PROC SQL • PUT Statement • PROC FORMAT • MODIFY Statement
Build Master_Data.csv /* Define Library for Permanent Data Sets */ LIBNAME VALSUG '/home/valsug'; /* Create Data Set to recreate Scenario */ DATA VALSUG.MASTER_DATA_CHK; INPUT ACCT_NO BRANCH_NO $ PRODUCT $; CARDS; 111222 1000A CC 111333 1000B CC 111555 1000C HE 111666 1000D CC 111777 1000B HE 111888 1000E CC 111999 1000A HE ; RUN; /* For Example: Detail Data File is unique by Account Number (Acct_No) with Branch Number (Branch_No), and Product Type (Product) */
Build Master_Data.csv (continued) /* Export SAS Data Set to create Comma Delimited Text File (.csv) */ DATA _NULL_; SET VALSUG.MASTER_DATA_CHK; FILE '/home/25495/sample_master_data.csv' delimiter = ',' DSD dropover LRECL=32767; FORMAT ACCT_NO BEST12.; FORMAT BRANCH_NO $8.; FORMAT PRODUCT $8.; PUT ACCT_NO @; PUT BRANCH_NO $ @; PUT PRODUCT $; RUN;
Build Branch_Table /* Create Lookup Branch Table */ DATA VALSUG.BRANCH_TABLE; INPUT BRANCH_NO $ BRANCH_NAME $ REGION $; CARDS; 1000A GREEN MIDWEST 1000B EASTON WEST 1000C MIDVILLE WEST 1000D COLBY EAST 1000E SUDDER EAST 1000F ALBANY EAST ; RUN; /* For Example: Lookup Table is Branch Information unique by Branch Number (Branch_No with Branch Name (Branch_Name) and Region Information */
Using MERGE Statement /* In order to use MERGE Statement, must convert csv file to a SAS Data Set */ /*---------------------------------------------------------------------------------*/ DATA MASTER_DATA; INFILE '/home/sample_master_data.csv' delimiter = ','; INPUT @1 ACCT_NO 6. @8 BRANCH_NO $ @14 PRODUCT $2.; RUN;
Using MERGE Statement - LOG 54 /*---------------------------------------------------------------------------------*/ 55 DATA MASTER_DATA; 56 INFILE '/home/sample_master_data.csv' delimiter = ','; 57 INPUT @1 ACCT_NO 6. @8 BRANCH_NO $ @14 PRODUCT $2.; 58 RUN; NOTE: The infile '/home/sample_master_data.csv' is: File Name=/home/sample_master_data.csv, Owner Name=*****,Group Name=staff, Access Permission=rw-r--r--, File Size (bytes)=112 NOTE: 7 records were read from the infile '/home/sample_master_data.csv'. The minimum record length was 15. The maximum record length was 15. NOTE: The data set WORK.MASTER_DATA has 7 observations and 3 variables. NOTE: DATA statement used: real time 0.12 seconds cpu time 0.03 seconds
Using MERGE Statement (continued) /* Both Data Sets must be Sorted to Use MERGE Statement */ PROC SORT DATA=MASTER_DATA; BY BRANCH_NO; RUN; PROC SORT DATA=VALSUG.BRANCH_TABLE; BY BRANCH_NO; RUN; DATA MASTER_DATA_EAST; MERGE MASTER_DATA (IN=INA) VALSUG.BRANCH_TABLE (IN=INB); BY BRANCH_NO; IF INA; IF REGION IN ('EAST') THEN OUTPUT; RUN; TITLE 'Using MERGE Statement'; PROC PRINT; RUN; /*---------------------------------------------------------------------------------*/
Using MERGE Statement (continued) - LOG 60 /* USING MERGE STATEMENT */ 61 PROC SORT DATA=MASTER_DATA; 62 BY BRANCH_NO; 63 RUN; NOTE: There were 7 observations read from the data set WORK.MASTER_DATA. NOTE: The data set WORK.MASTER_DATA has 7 observations and 3 variables. NOTE: PROCEDURE SORT used: real time 0.06 seconds cpu time 0.01 seconds 64 65 PROC SORT DATA=VALSUG.BRANCH_TABLE; 66 BY BRANCH_NO; 67 RUN; NOTE: There were 6 observations read from the data set VALSUG.BRANCH_TABLE. NOTE: The data set VALSUG.BRANCH_TABLE has 6 observations and 3 variables. NOTE: PROCEDURE SORT used: real time 0.04 seconds cpu time 0.01 seconds
Using MERGE Statement (continued) - LOG 69 DATA MASTER_DATA_EAST; 70 MERGE MASTER_DATA (IN=INA) VALSUG.BRANCH_TABLE (IN=INB); 71 BY BRANCH_NO; 72 IF INA; 73 IF REGION IN ('EAST') THEN OUTPUT; 74 RUN; NOTE: There were 7 observations read from the data set WORK.MASTER_DATA. NOTE: There were 6 observations read from the data set VALSUG.BRANCH_TABLE. NOTE: The data set WORK.MASTER_DATA_EAST has 2 observations and 5 variables. NOTE: DATA statement used: real time 0.08 seconds cpu time 0.02 seconds 75 76 TITLE 'Using MERGE Statement'; 77 PROC PRINT; 78 RUN; NOTE: There were 2 observations read from the data set WORK.MASTER_DATA_EAST. NOTE: PROCEDURE PRINT used: real time 0.02 seconds cpu time 0.00 seconds
Using MERGE Statement - OUTPUT Using MERGE Statement BRANCH_ BRANCH_ Obs ACCT_NO NO PRODUCT NAME REGION 1 111666 1000D CC COLBY EAST 2 111888 1000E CC SUDDER EAST
Using PROC SQL /* In order to use PROC SQL, must convert csv file to a SAS Data Set */ /*---------------------------------------------------------------------------------*/ DATA MASTER_DATA; INFILE '/home/sample_master_data.csv' delimiter = ','; INPUT @1 ACCT_NO 6. @8 BRANCH_NO $ @14 PRODUCT $2.; RUN; PROC SQL; CREATE TABLE MASTER_DATA_EAST2 AS SELECT A.*, B.BRANCH_NAME FROM MASTER_DATA A LEFT JOIN VALSUG.BRANCH_TABLE B ON A.BRANCH_NO = B.BRANCH_NO WHERE REGION EQ 'EAST'; QUIT; TITLE 'Using PROC SQL'; PROC PRINT; RUN; /*---------------------------------------------------------------------------------*/
Using PROC SQL - LOG 91 PROC SQL; 92 CREATE TABLE MASTER_DATA_EAST2 AS 93 SELECT A.*, B.BRANCH_NAME 94 FROM MASTER_DATA A LEFT JOIN VALSUG.BRANCH_TABLE B 95 ON A.BRANCH_NO = B.BRANCH_NO 96 WHERE REGION EQ 'EAST'; NOTE: Table WORK.MASTER_DATA_EAST2 created, with 2 rows and 4 columns. 97 QUIT; NOTE: PROCEDURE SQL used: real time 0.09 seconds cpu time 0.02 seconds 98 99 TITLE 'Using PROC SQL '; 100 PROC PRINT; 101 RUN; NOTE: There were 2 observations read from the data set WORK.MASTER_DATA_EAST2. NOTE: PROCEDURE PRINT used: real time 0.02 seconds cpu time 0.02 seconds
Using PROC SQL - OUTPUT Using PROC SQL BRANCH_ BRANCH_ Obs ACCT_NO NO PRODUCT NAME 1 111666 1000D CC COLBY 2 111888 1000E CC SUDDER
Using PUT Statement /* In order to use PUT Statement, must convert csv file to a SAS Data Set */ /*---------------------------------------------------------------------------------*/ DATA MASTER_DATA; INFILE '/home/sample_master_data.csv' delimiter = ','; INPUT @1 ACCT_NO 6. @8 BRANCH_NO $ @14 PRODUCT $2.; RUN; DATA _NULL_; FILE '/home/Build_Branch_Info.sas'; SET VALSUG.BRANCH_TABLE; PUT "IF BRANCH_NO EQ '" BRANCH_NO + (-1) "' THEN DO;"; PUT "BRANCH_NAME = '" BRANCH_NAME + (-1) "'; "; PUT "REGION = '" REGION + (-1) "'; END; "; RUN;
Using PUT Statement - LOG 113 DATA _NULL_; 114 FILE '/home/Build_Branch_Info.sas'; 115 SET VALSUG.BRANCH_TABLE; 116 PUT "IF BRANCH_NO EQ '" BRANCH_NO + (-1) "' THEN DO;"; 117 PUT "BRANCH_NAME = '" BRANCH_NAME + (-1) "'; "; 118 PUT "REGION = '" REGION + (-1) "'; END; "; 119 RUN; NOTE: The file '/home/Build_Branch_Info.sas' is: File Name=/home/Build_Branch_Info.sas, Owner Name=*****,Group Name=staff, Access Permission=rw-r--r-- NOTE: 18 records were written to the file '/home/Build_Branch_Info.sas'. The minimum record length was 22. The maximum record length was 32. NOTE: There were 6 observations read from the data set VALSUG.BRANCH_TABLE. NOTE: DATA statement used: real time 0.13 seconds cpu time 0.00 seconds
View Build_Branch_Info.sas IF BRANCH_NO EQ '1000A' THEN DO; BRANCH_NAME = 'GREEN'; REGION = 'MIDWEST'; END; IF BRANCH_NO EQ '1000B' THEN DO; BRANCH_NAME = 'EASTON'; REGION = 'WEST'; END; IF BRANCH_NO EQ '1000C' THEN DO; BRANCH_NAME = 'MIDVILLE'; REGION = 'WEST'; END; IF BRANCH_NO EQ '1000D' THEN DO; BRANCH_NAME = 'COLBY'; REGION = 'EAST'; END; IF BRANCH_NO EQ '1000E' THEN DO; BRANCH_NAME = 'SUDDER'; REGION = 'EAST'; END; IF BRANCH_NO EQ '1000F' THEN DO; BRANCH_NAME = 'ALBANY'; REGION = 'EAST'; END;
Using PUT Statement (continued) DATA MASTER_DATA_EAST3; SET MASTER_DATA; LENGTH REGION $8. BRANCH_NAME $8.; %inc '/home/Build_Branch_Info.sas'; IF REGION EQ 'EAST' THEN OUTPUT; RUN; TITLE 'Using PUT Statement '; PROC PRINT; RUN; /*---------------------------------------------------------------------------------*/
Using PUT Statement (continued) - LOG 121 DATA MASTER_DATA_EAST3; 122 SET MASTER_DATA; 123 124 LENGTH REGION $8. BRANCH_NAME $8.; 125 %inc '/home/Build_Branch_Info.sas'; 144 145 IF REGION EQ 'EAST' THEN OUTPUT; 146 RUN; NOTE: There were 7 observations read from the data set WORK.MASTER_DATA. NOTE: The data set WORK.MASTER_DATA_EAST3 has 2 observations and 5 variables. NOTE: DATA statement used: real time 0.10 seconds cpu time 0.01 seconds 148 TITLE 'Using PUT Statement '; 149 PROC PRINT; 150 RUN; NOTE: There were 2 observations read from the data set WORK.MASTER_DATA_EAST3. NOTE: PROCEDURE PRINT used: real time 0.03 seconds cpu time 0.00 seconds
Using PUT Statement - OUTPUT Using PUT Statement BRANCH_ BRANCH_ Obs ACCT_NO NO PRODUCT NAME 1 111666 1000D CC COLBY 2 111888 1000E CC SUDDER
Using PROC FORMAT • /* Create a format based on Branch Table */ • /*---------------------------------------------------------------------------------*/ • DATA FORMAT(KEEP = START LABEL FMTNAME); • SET VALSUG.BRANCH_TABLE; • WHERE REGION EQ 'EAST'; • START = BRANCH_NO; • LABEL = 'EAST'; • FMTNAME = '$EBRANCH'; • RUN; • PROC FORMAT CNTLIN=FORMAT; • RUN; • *CNTLIN=input-control-SAS-data-set • specifies a SAS data set from which PROC FORMAT builds informats and formats. • CNTLIN= builds formats and informats without using a VALUE, PICTURE, or INVALUE statement.
Using PROC FORMAT - LOG 158 DATA FORMAT(KEEP = START LABEL FMTNAME); 159 SET VALSUG.OLD_BRANCH_TABLE; 160 WHERE REGION EQ 'EAST'; 161 START = BRANCH_NO; 162 LABEL = 'EAST'; 163 FMTNAME = '$EBRANCH'; 164 RUN; NOTE: There were 3 observations read from the data set VALSUG.OLD_BRANCH_TABLE. WHERE REGION='EAST'; NOTE: The data set WORK.FORMAT has 3 observations and 3 variables. NOTE: DATA statement used: real time 0.08 seconds cpu time 0.01 seconds 172 PROC FORMAT CNTLIN=FORMAT; NOTE: Format $EBRANCH has been output. 173 RUN; NOTE: PROCEDURE FORMAT used: real time 0.13 seconds cpu time 0.02 seconds NOTE: There were 3 observations read from the data set WORK.FORMAT.
Using PROC FORMAT (continued) DATA _NULL_; FILE '/home/Build_Branch_Name.sas'; SET VALSUG. BRANCH_TABLE; PUT "IF BRANCH_NO EQ '" BRANCH_NO + (-1) "' THEN BRANCH_NAME = '" BRANCH_NAME + (-1) "';"; RUN; DATA MASTER_DATA_EAST4; INFILE '/home/sample_master_data.csv' delimiter = ','; INPUT @1 ACCT_NO @8 BRANCH_NO $ @; IF PUT(BRANCH_NO, $EBRANCH.) = 'EAST’ ; INPUT @14 PRODUCT $2.; LENGTH BRANCH_NAME $8.; %inc '/home/Build_Branch_Name.sas'; RUN; TITLE 'Using PROC FORMAT '; PROC PRINT; RUN; /*---------------------------------------------------------------------------------*/
Using PROC FORMAT (continued) - LOG 175 DATA MASTER_DATA_EAST4; 176 INFILE '/home/sample_master_data.csv' delimiter = ','; 177 INPUT @1 ACCT_NO @8 BRANCH_NO $ @; /* Use the Trailing @ */ 178 IF PUT(BRANCH_NO, $EBRANCH.) = 'EAST’ ; 179 INPUT @14 PRODUCT $2.; 180 LENGTH BRANCH_NAME $8.; 182 %inc '/home/25495/Build_Branch_Name.sas'; 189 190 RUN; NOTE: The infile '/home/sample_master_data.csv' is: File Name=/home/sample_master_data.csv, Owner Name=*****,Group Name=staff, Access Permission=rw-r--r--, File Size (bytes)=112 NOTE: 7 records were read from the infile '/home/sample_master_data.csv'. The minimum record length was 15. The maximum record length was 15. NOTE: The data set WORK.MASTER_DATA_EAST4 has 2 observations and 4 variables. NOTE: DATA statement used: real time 0.17 seconds cpu time 0.04 seconds
Using PROC FORMAT - OUTPUT Using PROC FORMAT BRANCH_ BRANCH_ Obs ACCT_NO NO PRODUCT NAME 1 111666 1000D CC COLBY 2 111888 1000E CC SUDDER
Using MODIFY Statement /* In order to use MODIFY Statement, must convert csv file to a SAS Data Set */ /* Must also create variables need from lookup table */ /*---------------------------------------------------------------------------------*/ DATA MASTER_DATA; INFILE '/home/sample_master_data.csv' delimiter = ','; INPUT @1 ACCT_NO 6. @8 BRANCH_NO $ @14 PRODUCT $2.; LENGTH BRANCH_NAME $8. REGION $8.; BRANCH_NAME = ' '; REGION = ' '; RUN; /* Create an Index on the Master_Data */ PROC DATASETS LIBRARY=WORK NOLIST; MODIFY MASTER_DATA5; INDEX CREATE BRANCH_NO; RUN; /* Sort Lookup Table by same Indexed Variable on Master_Data */ PROC SORT DATA=VALSUG.BRANCH_TABLE OUT=VALSUG. BRANCH_TABLE (RENAME=(BRANCH_NAME=BRANCH_NAMEX REGION = REGIONX)); BY BRANCH_NO; RUN;
Using MODIFY Statement - LOG 202 DATA MASTER_DATA5 ; 203 INFILE '/home/25495/sample_master_data.csv' delimiter = ','; 204 INPUT @1 ACCT_NO 6. @8 BRANCH_NO $ @14 PRODUCT $2.; 205 206 LENGTH BRANCH_NAME $8. REGION $8.; 207 BRANCH_NAME = ' '; 208 REGION = ' '; 209 RUN; NOTE: The infile '/home/sample_master_data.csv' is: File Name=/home/sample_master_data.csv, Owner Name=*****,Group Name=staff, Access Permission=rw-r--r--, File Size (bytes)=112 NOTE: 7 records were read from the infile '/home/sample_master_data.csv'. The minimum record length was 15. The maximum record length was 15. NOTE: The data set WORK.MASTER_DATA5 has 7 observations and 5 variables. NOTE: DATA statement used: real time 0.15 seconds cpu time 0.03 seconds
Using MODIFY Statement - LOG 211 PROC DATASETS LIBRARY=WORK NOLIST; 212 MODIFY MASTER_DATA5; 213 INDEX CREATE BRANCH_NO; NOTE: Simple index BRANCH_NO has been defined. 214 RUN; NOTE: PROCEDURE DATASETS used: real time 0.18 seconds cpu time 0.00 seconds 216 PROC SORT DATA=VALSUG.BRANCH_TABLE OUT=VALSUG.BRANCH_TABLE 217 (RENAME=(BRANCH_NAME=BRANCH_NAMEX REGION = REGIONX)); 218 BY BRANCH_NO; 219 RUN; NOTE: Input data set is already sorted; it has been copied to the output data set. NOTE: There were 6 observations read from the data set VALSUG.BRANCH_TABLE. NOTE: The data set VALSUG.BRANCH_TABLE has 6 observations and 3 variables. NOTE: PROCEDURE SORT used: real time 0.06 seconds cpu time 0.01 seconds
Using MODIFY Statement (continued) DATA MASTER_DATA5; SET VALSUG.BRANCH_TABLE; DO UNTIL (_IORC_ = %SYSRC(_dsenom)); MODIFY MASTER_DATA5 KEY= BRANCH_NO; IF _IORC_ = 0 THEN DO; BRANCH_NAME = BRANCH_NAMEX; REGION = REGIONX; IF REGION = 'EAST' THEN REPLACE; ELSE REMOVE; END; ELSE DO; _ERROR_=0; END; END; RUN; TITLE 'Using DATA MODIFY '; PROC PRINT DATA=MASTER_DATA5; RUN; /*-----------------------------------------------------------------*/ • _IORC_ = 0 indicates that the MODIFY statement was successful, and that a match was found in the transaction data set. • Override the _ERROR_ flag to prevent SAS from dumping out any records in the transaction data set that are not in the master data set. • Use the autocall macro %SYSRC with _IORC_ to monitor success of MODIFY statement. • _dsenom indicates no matching observation, used with the KEY= option.
Using MODIFY Statement (continued) - LOG 228 DATA MASTER_DATA5; 229 SET VALSUG.OLD_BRANCH_TABLE; 230 DO UNTIL (_IORC_ = %SYSRC(_dsenom)); 231 MODIFY MASTER_DATA5 KEY=BRANCH_NO; 232 IF _IORC_ = 0 THEN DO; 233 BRANCH_NAME = BRANCH_NAMEX; 234 REGION = REGIONX; 235 IF REGION = 'EAST' THEN REPLACE; 236 ELSE REMOVE; 237 END; 238 ELSE DO; 239 _ERROR_=0; 240 END; 241 END; 242 RUN; NOTE: There were 6 observations read from the data set VALSUG.OLD_BRANCH_TABLE. NOTE: The data set WORK.MASTER_DATA5 has been updated. There were 2 observations rewritten, 0 observations added and 5 observations deleted. NOTE: DATA statement used: real time 0.21 seconds cpu time 0.03 seconds
Using MODIFY Statement - OUTPUT Using MODIFY Statement BRANCH_ BRANCH_ Obs ACCT_NO NO PRODUCT NAME REGION 1 111666 1000D CC COLBY EAST 2 111888 1000E CC SUDDER EAST
Conclusion • Resulting data sets were all very similar, and accomplished our task. • Exactly which scenario you should use really depends on the data you are working with and your available resources. • Ask yourself the following: • Are the data sets indexed? • Do I have a space or processing issue with rebuilding my entire master data set? • Do I need to beware of CPU issues?