290 likes | 509 Views
Hash Objects – Why Use Them?. Carolyn Cunnison SAS Technical Training Specialist. Agenda. What are a Hash objects? When should I use them? Some sample code. What are HASH objects?. Hash object can be thought of as rows of keys and data loaded into memory. Keys. Data. Data.
E N D
Hash Objects – Why Use Them? Carolyn Cunnison SAS Technical Training Specialist
Agenda • What are a Hash objects? • When should I use them? • Some sample code.
What are HASH objects? • Hash object can be thought of as rows of keys and data loaded into memory. Keys Data Data ...
Advantages of Hash Objects • Values can be hard-coded or loaded from a SAS data set. • Keys and data can be a mixture of character and numeric. • Provides in-memory data storage and retrieval. • Does not require that data be sorted. • Is sized dynamically.
When to Use Hash Objects (1) Joining tables • I cut my processing time by 90% using hash tables - You can do it too!; Jennifer K. Warner-Freeman • http://www.nesug.info/Proceedings/nesug07/bb/bb16.pdf • Jennifer took an existing Proc SQL join which took between 2 and 4 hours to run. When she rewrote the program to use Hash tables, the program ran in 11 minutes.
When to Use Hash Objects (2) Summary-less summarization • Hash-Crash and Beyond; Paul Dortman et al • http://www2.sas.com/proceedings/forum2008/037-2008.pdf • Compared PROC SUMMARY with NWAY option to Hash Object proc summary data = input nway ; class k1 k2 ; var num ; output out = summ_sum (drop = _:) sum = sum ; • The Hash Object did “the job more than twice as fast at the same time utilizing ⅓the memory”
When to Use Hash Objects (3) Dynamically output to multiple files • Paul Dortman paper (continued) • Use a Hash table instead of the following: Data out1 out2; Set tablein;… If id = 1 then output out1; Else if id = 2 then output out2;
When to Use Hash Objects (4) Removing data extremes • Knowledge Base Sample 25990 • http://support.sas.com/kb/25/990.html • Removes top and bottom 10% of data values.
When to Use Hash Objects(5) Perform data sampling without Proc Surveyselect • Better Hashing in SAS9.2; Robert Ray and Jason Secosky • http://support.sas.com/rnd/base/datastep/dot/better-hashing-sas92.pdf • Select observations from a table without replacement. • Perform sampling and data manipulation in one step.
Terminology Partial list of methods:
Business Scenario • You need to read orion.product_list and then look up information in the orion.supplier table. • Structure of orion.product_list • Structure of orion.supplier
Loading Data from a SAS Data Set • data supplier_info; • drop rc; • length Supplier_Name $40 Supplier_Address $ 45 • Country $ 2; • if _N_=1 then do; • declare hash S(dataset:'orion.supplier'); • S.definekey('Supplier_ID'); • S.definedata('Supplier_Name', • 'Supplier_Address','Country'); • S.definedone(); • call missing(Supplier_Name, • Supplier_Address,Country); • end; • set orion.product_list; • rc=S.find(); • if rc=0; • run; p305d02
Execution • data supplier_info; • drop rc; • length Supplier_Name $40 • Supplier_Address $ 45 • Country $ 2; • if _N_=1 then do; • declare hash S(dataset:'orion.supplier'); • S.definekey('Supplier_ID'); • S.definedata('Supplier_Name', • 'Supplier_Address', • 'Country'); • S.definedone(); • call missing(Supplier_Name, • Supplier_Address, • Country); • end; • set orion.product_list; • rc=S.find(); • if rc=0; • run; . . . D ...
Execution • data supplier_info; • drop rc; • length Supplier_Name $40 • Supplier_Address $ 45 • Country $ 2; • if _N_=1 then do; • declare hash S(dataset:'orion.supplier'); • S.definekey('Supplier_ID'); • S.definedata('Supplier_Name', • 'Supplier_Address', • 'Country'); • S.definedone(); • call missing(Supplier_Name, • Supplier_Address, • Country); • end; • set orion.product_list; • rc=S.find(); • if rc=0; • run; . . . D ...
Execution • data supplier_info; • drop rc; • length Supplier_Name $40 • Supplier_Address $ 45 • Country $ 2; • if _N_=1 then do; • declare hash S(dataset:'orion.supplier'); • S.definekey('Supplier_ID'); • S.definedata('Supplier_Name', • 'Supplier_Address', • 'Country'); • S.definedone(); • call missing(Supplier_Name, • Supplier_Address, • Country); • end; • set orion.product_list; • rc=S.find(); • if rc=0; • run; . . . D ...
Execution • data supplier_info; • drop rc; • length Supplier_Name $40 • Supplier_Address $ 45 • Country $ 2; • if _N_=1 then do; • declare hash S(dataset:'orion.supplier'); • S.definekey('Supplier_ID'); • S.definedata('Supplier_Name', • 'Supplier_Address', • 'Country'); • S.definedone(); • call missing(Supplier_Name, • Supplier_Address, • Country); • end; • set orion.product_list; • rc=S.find(); • if rc=0; • run; True . . . D ...
Execution • data supplier_info; • drop rc; • length Supplier_Name $40 • Supplier_Address $ 45 • Country $ 2; • if _N_=1 then do; • declare hash S(dataset:'orion.supplier'); • S.definekey('Supplier_ID'); • S.definedata('Supplier_Name', • 'Supplier_Address', • 'Country'); • S.definedone(); • call missing(Supplier_Name, • Supplier_Address, • Country); • end; • set orion.product_list; • rc=S.find(); • if rc=0; • run; Implicit OUTPUT; Implicit RETURN; . . . D ...
Execution • data supplier_info; • drop rc; • length Supplier_Name $40 • Supplier_Address $ 45 • Country $ 2; • if _N_=1 then do; • declare hash S(dataset:'orion.supplier'); • S.definekey('Supplier_ID'); • S.definedata('Supplier_Name', • 'Supplier_Address', • 'Country'); • S.definedone(); • call missing(Supplier_Name, • Supplier_Address, • Country); • end; • set orion.product_list; • rc=S.find(); • if rc=0; • run; Continue until EOF . . . D ...
Results proc print data=supplier_info(obs=10); var Product_ID Supplier_ID Supplier_Name Supplier_Address Country; title "Product Information"; run; Partial PROC PRINT Output Product Information Obs Product_ID Supplier_ID Supplier_Name Supplier_Address Country 1 210200100009 3298 A Team Sports 2687 Julie Ann Ct US 2 210200100017 3298 A Team Sports 2687 Julie Ann Ct US 3 210200200022 6153 Nautlius SportsWear Inc 56 Bagwell Ave US 4 210200200023 6153 Nautlius SportsWear Inc 56 Bagwell Ave US 5 210200300006 1303 Eclipse Inc 1218 Carriole Ct US 6 210200300007 1303 Eclipse Inc 1218 Carriole Ct US 7 210200300052 1303 Eclipse Inc 1218 Carriole Ct US 8 210200400020 1303 Eclipse Inc 1218 Carriole Ct US 9 210200400070 1303 Eclipse Inc 1218 Carriole Ct US 10 210200500002 772 AllSeasons Outdoor Clothing 553 Cliffview Dr US
Could I do the same thing with a MERGE ? • Yes. But …… • Would have to sort both tables. • Reading from disk is slower than reading from memory.
What about data size ? • Scalability of Table Lookup Techniques,Rick Langston • http://support.sas.com/resources/papers/proceedings09/037-2009.pdf • Compared Hash table, Sort/Merge, Indexing, Proc SQL and Proc Format as table lookup techniques. • Hash object processing was successful up to around 1,900,000 rows and then ran out of memory.
Did you know that…. • PROC SQL sometimes uses hashing to join tables. • Possible processing methods are: sqxjsl - Step Loop Join (Cartesian product)sqxjm - Merge Joinsqxjndx- Index Joinsqxjhsh- Hash Join • To view the method used: Procsql _method;
The HITER object • The HITER object must point to a HASH object. • Read the HITER using the following methods.
Conclusion • Hash and Hiter objects are very flexible. • Data has to fit into memory. • Results will depend on your data, your environment, and what you are trying to do. • You have to benchmark.
Want to know more? • SAS Programming III: Advanced Techniques and Efficiencies https://support.sas.com/edu/schedules.html?ctry=ca&id=279 • Also available as Live Web course.