Introduction to Using the Data Step Hash Object with Large Data Sets

Introduction to Using the Data Step Hash Object with Large Data Sets Richard Allen Peak Stat

Introduction: The DATA step hash object is an in-memory lookup table accessible from the DATA step. New in version 9. A hash object is loaded with records and is only available within the DATA step that it is created in. A hash record consists of two parts: 1: Key Part: One or more character or numeric values Must be unique 2: Data Part: Zero or more character or numeric values

Lookup occurs by passing a key to the hash object’s FIND method. If a record with the particular key is found, the data part of the hash record is copied into the DATA step record. It is not necessary for the dataset to be sorted or indexed when loading the hash object from a dataset. The key lookup occurs in memory, avoiding costly disk access and speeding up the search considerably. Along with the FIND method, there are methods to ADD, REPLACE, REMOVE and OUTPUT records to the hash object. We will only concentrate on the FIND method in the example to follow.

Example: Dataset A: Pharmacy claims with 9,008,585 records. File size is 3,352,049 kb. (3.35 gb) Dataset B: NDC dataset with drugs of interest. Full NDC file has 272,961 records Full NDC file size 85,489 kb. (85.5 mb) Object: Create a dataset with all pharmacy claims for all subjects in dataset A for the drugs of interest in dataset B.

Dataset B_1: 61 records, file size 65 kb. data Drugs(rename=(NDCnum=NDC)); set raw.redbook; if upcase(ProdNme)='AVONEX' | index(upcase(GenNme),'INTERFERON BETA-1A')>0 | index(upcase(ProdNme),'LUPRON')>0 | index(upcase(GenNme),'LEUPROLIDE')>0 | index(upcase(ProdNme),'ZOLADEX')>0 | index(upcase(GenNme),'GOSERELIN')>0; run; Dataset B_2: 55,786 records, file size 44,641 kb. data Drugs(rename=(NDCnum=NDC)); set raw.redbook(where=(TherCls<50)); run;

Method 1: Sort and Merge proc sort data=Drugs; by NDC; run; proc sort data=in.Pharmacy(rename=(Rx_Prd_ID=NDC)) out=in_Pharmacy; by NDC; run; data Method1; merge in_Pharmacy(in=in1) Drugs(in=in2 keep=NDC ProdNme GenNme); by NDC; if in1 & in2; run; B_1 as Drugs: 3,307 observations B_2 as Drugs: 2,086,842 observations

Method 2: SQL join proc sql; create table Method2 as select d.*, p.* from in.Pharmacy as p join Drugs(keep=NDC ProdNme GenNme) as d on d.NDC=p.Rx_Prd_ID; quit;

Method 3: Format and restrict data Format; set Drugs end=EOF; type='I'; fmtname='NDC'; start=NDC; label=1; output; if EOF then do; hlo='o'; label=0; output; end; run; proc format cntlin=Format; run; data Method3; set in.Pharmacy; if input(Rx_Prd_ID,NDC.)=1 then output; run;  Range is ‘other’ for formatted value (label=0)

Hashing Beginning in version 9, SAS provides two predefined component objects for use in a DATA step: the hash object and the hash iterator object. With these objects you can quickly and efficiently store, search, and retrieve data based on lookup keys. • hash object keys and data are DATA step variables. • key and data values can be • directly assigned constant values • values from a SAS data set. One uses the DATA step Component Interface and object dot notation - ObjectName.Method(Parameters) - to create and manipulate these component objects using statements, attributes, and methods.

Hash Object Creation Declare Statement Hash Object Methods ADD Method CHECK Method DEFINEDATA Method DEFINEDONE Method DEFINEKEY Method DELETE Method FIND Method OUTPUT Method REMOVE Method REPLACE Method Hash Object Attribute NUM_ITEMS Attribute  Instantiates hash object definition  Specify data part of hash object  Completes definition of hash object  Specify key part of hash object  Determine whether record with key exists in hash object

The hash iterator object is also provided as a companion for the hash object. It can not be defined without first defining a hash object. The iterator object enables you to step through the hash object records without performing a key lookup in forward or reverse key order. Hash Iterator Object Methods FIRST Method LAST Method NEXT Method PREV Method

Creating a Hash object 1. Instantiate the hash object: Use the DECLARE statement with the keyword HASH followed by the hash name to create the hash object. Options are specified within the parentheses and include DATASET to load the hash object from an existing SAS dataset and ORDERED to specify how the data is returned in key-value order. 2. Define the key part: Use the DEFINEKEY method to create a key of one or more variables. The key must be unique. 3. Define the data part: Use the DEFINEDATA method to specify zero or more data variables to be associated with the key variables. 4. Complete the definition of the hash object: Use the DEFINEDONE method to conclude the definition of the hash object.

Method 4: Hash object data Method4; length NDC $11 ProdNme GenNme $50; if _n_=1 then do; declare hash h(dataset:"Drugs"); h.defineKey('NDC'); h.defineData('ProdNme','GenNme'); h.defineDone(); call missing(ProdNme,GenNme); end; set in.Pharmacy(rename=(Rx_Prd_ID=NDC)); if h.find()=0 then output; run; • Corrects errors. DefineKey and DefineData do not add variables to PDV. This creates them in the PDV.  1. Create hash h using dataset drugs  2. Define Key Part of h as NDC variable  3. Define Data Part of h  4. Complete the definition of the hash h • Suppresses uninitialized messages. If length statement is not used, the following errors appear in the log ERROR: Type mismatch for data variable ProdNme at line 28 column 5. ERROR: Hash data set load failed at line 28 column 5. ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase. If call missing is not used, the following notes appear in the log NOTE: Variable ProdNme is uninitialized. NOTE: Variable GenNme is uninitialized If neither statement is used, the following errors appear in the log ERROR: Undeclared data symbol ProdNme for hash object at line 28 column 5. ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase.

Comparisons Each of these 4 methods were run to extract all of the the Rx claims records in dataset A that had the NDC codes in datasets B_1 and B_2. The option fullstimer was turned on to collect times and memory amounts used for comparing methods. • Real time represents the clock time it took to execute a job or step. It is heavily dependent on the capacity of the system and the current load. • CPU time represents the actual processing time required by the CPU to execute the job, exclusive of capacity and load factors. • User CPU is the CPU time spent to execute your SAS code. • System CPU is the CPU time spent to perform system overhead tasks on behalf of the SAS process.

Dataset B_1: 61 records, file size 65 kb.

Dataset B_2: 55,786 records, file size 44,641 kb.

Conclusions: For looking up a value from one dataset in another, the DATA step hash object can be used to improve the performance of the lookup and significantly reduce the time needed to complete the task. The hash object and the hash iterator object allow SAS programmers using version 9 to solve problems within the data step that were either difficult, if not impossible, to code in prior versions of SAS.

References: • Getting Started with the DATA Step Hash Object, Jason Secosky & Janice Bloom, • PharmaSUG 2007 Proceedings • Getting Started with the DATA Step Hash Iterator, Janice Bloom & Jason Secosky, • support.sas.com • -Hashing: Generations, Paul Dorfman & Gregg Snell, SUGI 28 • -DATA Step Hash Objects as Programming Tools, Paul Dorfman & Koen Vyverman, SUGI 30 • -Think FAST! Use Memory Tables (Hashing) for Faster Merging, Greg Snell, SUGI 31 • -A Hash Alternative to the PROC SQL Left Join, Ken Borowiak, NESUG 2006

Introduction to Using the Data Step Hash Object with Large Data Sets

Introduction to Using the Data Step Hash Object with Large Data Sets

Presentation Transcript

Algorithms for Large Data Sets

GADS*: Using Web Services to access large data sets

Indexing Large Trajectory Data Sets With SETI

Algorithms for Large Data Sets

Algorithms for Large Data Sets

Algorithms for Large Data Sets

Algorithms for Large Data Sets

u sing large data sets

Manipulating Large Data Sets

Working with Large Data Sets

Experiences with Large Data Sets

Analysing Large Data Sets using Formal Concept Lattices

using large data sets

Very large data sets

Experiences with Large Data Sets

using large data sets

using large data sets

Interacting with Large Data Sets

Manipulating Large Data Sets