250 likes | 355 Views
SAS Hash Object: My New Best Friend. Demonstration Of Time Savings Using A Hash Object By Denise A. Kruse SAS Contractor. Program Objectives. disposition 97 obs. campaign 11,346 obs. dec_offers 6,145,029 obs. program 446 obs. disposition category 6 obs. Matching Datasets.
E N D
SAS Hash Object: My New Best Friend Demonstration Of Time Savings Using A Hash Object By Denise A. Kruse SAS Contractor
Program Objectives disposition 97 obs campaign 11,346 obs dec_offers 6,145,029 obs program 446 obs disposition category 6 obs
Matching Datasets What is the best way to get the fields from the 4 small datasets into the main population of 6.1 million observations? • PROC merge • HASH
PROC merge • Both datasets need to be sorted prior to the merge • Merge datasets • Sort again • Merge again
Sort / Merge Code procsort data=oms_prod.disposition out=disp ; by disposition_id ; run ; procsort data=dec_offers ; by disposition_id ; run ; data dec_match ; merge dec_offers (in=a) disp(keep=disposition_id description touched disposition_category_code in=b) ; by disposition_id ; if a and b ; run ;
Log For Sort / Merge NOTE: Sorting was performed by the data source. NOTE: There were 97 observations read from the data set OMS_PROD.DISPOSITION. NOTE: The data set WORK.DISP has 97 observations and 10 variables. NOTE: Compressing data set WORK.DISP decreased size by 0.00 percent. Compressed is 2 pages; un-compressed would require 2 pages. NOTE: PROCEDURE SORT used (Total process time): real time 0.23 seconds cpu time 0.00 seconds NOTE: There were 6145029 observations read from the data set WORK.DEC_OFFERS. NOTE: The data set WORK.DEC_OFFERS has 6145029 observations and 4 variables. NOTE: Compressing data set WORK.DEC_OFFERS increased size by 58.15 percent. Compressed is 38412 pages; un-compressed would require 24289 pages. NOTE: PROCEDURE SORT used (Total process time): real time 28.44 seconds cpu time 39.81 seconds NOTE: There were 6145029 observations read from the data set WORK.DEC_OFFERS. NOTE: There were 97 observations read from the data set WORK.DISP. NOTE: The data set WORK.DEC_MATCH has 6145029 observations and 7 variables. NOTE: Compressing data set WORK.DEC_MATCH decreased size by 74.94 percent. Compressed is 27499 pages; un-compressed would require 109733 pages. NOTE: DATA statement used (Total process time): real time 42.81 seconds cpu time 42.58 seconds
Sort / Merge Code Continued procsort data=oms_prod.campaign out=camp ; by campaign_id ; run ; procsort data=dec_match ; by campaign_id ; run ; data dec_match2 ; merge dec_match (in=a) camp(keep=campaign_id program_id campaign_code description in=b) ; by campaign_id ; if a and b ; run ;
Log For Sort / Merge NOTE: Sorting was performed by the data source. NOTE: There were 11346 observations read from the data set OMS_PROD.CAMPAIGN. NOTE: The data set WORK.CAMP has 11346 observations and 19 variables. NOTE: Compressing data set WORK.CAMP decreased size by 43.48 percent. Compressed is 143 pages; un-compressed would require 253 pages. NOTE: PROCEDURE SORT used (Total process time): real time 0.67 seconds cpu time 0.43 seconds NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH. NOTE: The data set WORK.DEC_MATCH has 6145029 observations and 7 variables. NOTE: Compressing data set WORK.DEC_MATCH decreased size by 74.94 percent. Compressed is 27496 pages; un-compressed would require 109733 pages. NOTE: PROCEDURE SORT used (Total process time): real time 1:09.07 cpu time 1:59.52 NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH. NOTE: There were 11346 observations read from the data set WORK.CAMP. NOTE: The data set WORK.DEC_MATCH2 has 6145029 observations and 9 variables. NOTE: Compressing data set WORK.DEC_MATCH2 decreased size by 71.53 percent. Compressed is 34306 pages; un-compressed would require 120491 pages. NOTE: DATA statement used (Total process time): real time 51.29 seconds cpu time 51.05 seconds
Sort / Merge Code Continued proc sort data=oms_prod.program out=pgm ; by program_id ; run ; proc sort data=dec_match2 ; by program_id ; run ; data dec_match3 ; merge dec_match (in=a) pgm(keep=program_id name in=b) ; by program_id ; if a and b ; run ;
Log For Sort / Merge NOTE: Sorting was performed by the data source. NOTE: There were 446 observations read from the data set OMS_PROD.PROGRAM. NOTE: The data set WORK.PGM has 446 observations and 16 variables. NOTE: Compressing data set WORK.PGM decreased size by 40.00 percent. Compressed is 6 pages; un-compressed would require 10 pages. NOTE: PROCEDURE SORT used (Total process time): real time 0.25 seconds cpu time 0.03 seconds NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH2. NOTE: The data set WORK.DEC_MATCH2 has 6145029 observations and 9 variables. NOTE: Compressing data set WORK.DEC_MATCH2 decreased size by 71.53 percent. Compressed is 34306 pages; un-compressed would require 120491 pages. NOTE: PROCEDURE SORT used (Total process time): real time 1:17.37 cpu time 2:02.37 NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH2. NOTE: There were 446 observations read from the data set WORK.PGM. NOTE: The data set WORK.DEC_MATCH3 has 6145029 observations and 10 variables. NOTE: Compressing data set WORK.DEC_MATCH3 decreased size by 72.06 percent. Compressed is 26016 pages; un-compressed would require 93107 pages. NOTE: DATA statement used (Total process time): real time 59.00 seconds cpu time 58.97 seconds
Sort / Merge Code proc sort data= oms_prod.disposition_category out=disp_cat(rename=(description=disp_desc)) ; by disposition_category_code ; run ; proc sort data=dec_match3 ; by disposition_category_code ; run ; data dec_match4 ; merge dec_match3 (in=a) disp_cat(keep=disposition_category_code disp_desc in=b) ; by disposition_category_code ; if a and b ; run ;
Log For Sort / Merge NOTE: Sorting was performed by the data source. NOTE: There were 6 observations read from the data set OMS_PROD.DISPOSITION_CATEGORY. NOTE: The data set WORK.DISP_CAT has 6 observations and 2 variables. NOTE: Compressing data set WORK.DISP_CAT increased size by 100.00 percent. Compressed is 2 pages; un-compressed would require 1 pages. NOTE: PROCEDURE SORT used (Total process time): real time 0.03 seconds cpu time 0.02 seconds NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH3. NOTE: The data set WORK.DEC_MATCH3 has 6145029 observations and 10 variables. NOTE: Compressing data set WORK.DEC_MATCH3 decreased size by 72.06 percent. Compressed is 26017 pages; un-compressed would require 93107 pages. NOTE: PROCEDURE SORT used (Total process time): real time 1:26.08 cpu time 2:14.65 NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH3. NOTE: There were 6 observations read from the data set WORK.DISP_CAT. NOTE: The data set WORK.DEC_MATCH4 has 6145029 observations and 11 variables. NOTE: Compressing data set WORK.DEC_MATCH4 decreased size by 71.05 percent. Compressed is 31209 pages; un-compressed would require 107808 pages. NOTE: DATA statement used (Total process time): real time 1:03.35 cpu time 1:03.28
HASH code data dec_match ; if _n_ = 1 then do ; IF 0 then set oms_prod.disposition(keep=disposition_id description touched disposition_category_code ) ; declare hash ht(dataset: "oms_prod.disposition") ; ht.defineKEY("disposition_id ") ; ht.defineData("disposition_id ", "description “ “touched","disposition_category_code") ; ht.defineDone() ; end ; set dec_offers ; if ht.find()=0 ; run ; No sorting !!
HASH Log NOTE: There were 97 observations read from the data set OMS_PROD.DISPOSITION. NOTE: There were 6145029 observations read from the data set WORK.DEC_OFFERS. NOTE: The data set WORK.DEC_MATCH has 6145029 observations and 7 variables. NOTE: Compressing data set WORK.DEC_MATCH decreased size by 74.94 percent. Compressed is 27499 pages; un-compressed would require 109733 pages. NOTE: DATA statement used (Total process time): real time 48.38 seconds cpu time 48.14 seconds
HASH Code data dec_match2 ; if _n_ = 1 then do ; IF 0 then set oms_prod.campaign(keep=campaign_id program_id campaign_code description ) ; declare hash ht(dataset: "oms_prod.campaign") ; ht.defineKEY("campaign_id") ; ht.defineData("campaign_id", "program_id", "campaign_code", "description") ; ht.defineDone() ; end ; set dec_match ; if ht.find()=0 ; run ;
HASH Log NOTE: There were 11346 observations read from the data set OMS_PROD.CAMPAIGN. NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH. NOTE: The data set WORK.DEC_MATCH2 has 6145029 observations and 9 variables. NOTE: Compressing data set WORK.DEC_MATCH2 decreased size by 38.33 percent. Compressed is 39071 pages; un-compressed would require 63352 pages. NOTE: DATA statement used (Total process time): real time 55.35 seconds cpu time 55.21 seconds
HASH Code data dec_match3; if _n_ = 1 then do; IF 0 then set oms_prod.program(keep=program_id name ); declare hash ht(dataset: "oms_prod.program"); ht.defineKEY("program_id"); ht.defineData("program_id", "name"); ht.defineDone(); end; set dec_match2; if ht.find()=0; run;
HASH Log NOTE: There were 446 observations read from the data set OMS_PROD.PROGRAM. NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH2. NOTE: The data set WORK.DEC_MATCH3 has 6145029 observations and 10 variables. NOTE: Compressing data set WORK.DEC_MATCH3 decreased size by 48.53 percent. Compressed is 43928 pages; un-compressed would require 85348 pages. NOTE: DATA statement used (Total process time): real time 1:00.38 cpu time 1:00.17
HASH Code data disposition_category (rename=(description=disp_desc)); set oms_prod.disposition_category; run; data dec_match4; if _n_ = 1 then do; IF 0 then set disposition_category(keep=disposition_category_code disp_desc); declare hash ht(dataset: "disposition_category"); ht.defineKEY("disposition_category_code"); ht.defineData("disposition_category_code", "disp_desc"); ht.defineDone(); end; set dec_match3; if ht.find()=0; run;
HASH Log NOTE: There were 6 observations read from the data set OMS_PROD.DISPOSITION_CATEGORY. NOTE: The data set WORK.DISPOSITION_CATEGORY has 6 observations and 2 variables. NOTE: Compressing data set WORK.DISPOSITION_CATEGORY increased size by 100.00 percent. Compressed is 2 pages; un-compressed would require 1 pages. NOTE: DATA statement used (Total process time): real time 0.02 seconds cpu time 0.01 seconds NOTE: There were 6 observations read from the data set WORK.DISPOSITION_CATEGORY. NOTE: There were 6145029 observations read from the data set WORK.DEC_MATCH3. NOTE: The data set WORK.DEC_MATCH4 has 6145029 observations and 11 variables. NOTE: Compressing data set WORK.DEC_MATCH4 decreased size by 49.47 percent. Compressed is 51750 pages; un-compressed would require 102418 pages. NOTE: DATA statement used (Total process time): real time 1:02.45 cpu time 1:02.30
Conclusion When looking for efficiencies HASH objects are definitely worth considering. In larger programs, HASH objects can save valuable processing time.
References Linda Jolley – Using Table Lookup Techniques Efficiently Jason Secosky – The DATA Step In Version 9: What’s New? Paul Dorfman- DATA Step HASH Objects As Programming Tools
Contact Information Denise A. Kruse SAS Contractor DeniseAKruse@gmail.com