The Skinny on Fat SAS Data Set Processing By Gopal Rajagopal PhilaSug 2005 Nov 9th 2005 A quote from a Philly celebrity Benjamin Franklin … ‘Our Constitution is in actual operation; everything appears to promise that it will last; but in this world nothing is certain but death and taxes …. and more data ’
What can I do about routine processing of FAT SAS data sets? Ask for a computer with more/faster memory, storage, CPUs, the SAS V9 SPD Server, a bigger monitor (!) etc. OR start coding without explicitly taking scaling issues into consideration OR use a 'concentric ring' approach, i.e., act locally while thinking globally … • Focus on the local SAS program and its I/O (the common culprit) • If needed, consider global and more 'expensive' solutions.
Agenda • Clarify terminology • Process Management SOP/Tips for Large data set manipulation • Discussion of some coding techniques/approaches • QC for Large data set processing • Questions
Clarify Terminology Who are you calling 'FAT'?! • FAT is a relative term. A 'FAT' data set can have 'too' many variables or rows or both. • Any data set is 'FAT' if you have 'trouble' processing it. • 'Trouble' could mean … • Not 'enough' disk storage, RAM or CPU • Program takes too long to run • Ever-changing user demands and the associated development/QC burden • Programmer's sanity/fortitude-related stress • Even for smaller data sets, some of the ideas in this presentation may be good SOP.
Clarify Terminology What is processing a ‘FAT’ SAS data set ? • The focus will be on ‘typical’ data step operations (using SAS 8.2) • We will discuss SOPs and profile a few techniques … • SOP/Tips to streamline process management • Reduce overhead(s) such as Disk space, elapsed time • Help make sub-setting of data less onerous • Examine alternatives to merging/sorting data • Discuss faster ways of frequent searching of data
Clarify Terminology: 'Efficient' Processing … huh? Efficiency can mean the following positive adjective(s) describe your SAS code … • Quicker than the status quo • Smaller than the status quo • Cleaner, i.e., the code itself is more streamlined and maintainable • For the empirically inclined, a baseline is useful in measuring changes to the status quo. Note: These adjectives do not have to be mutually exclusive.
Process Management SOP/Tips for Large data set manipulation • Usually SAS programs do not scale well, i.e., there is a big difference between making 1 cup and 1 million cups of coffee. • Localize and attack bottlenecks by using OPTIONS FULLSTIMER STIMER STIMEFMT=Z; (see the sketch below). • Your data's footprint can be heavy. So … • Schedule jobs off peak (i.e., be 'nice'). • Keep colleagues alerted about monster jobs. • The infrastructure person, your friend, does not like surprises.
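A minimal sketch of turning on the timing options before the step you want to profile (the STIMEFMT= value is host dependent, so it is left out here):

options fullstimer stimer;

/* The log for each subsequent step now reports real time, CPU time,
   memory and I/O figures, which helps localize the bottleneck step. */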
Process Management SOP/Tips for Large data set manipulation • Work on your work library … • Keep work datasets small. Delete unnecessary data sets. • Use a different work lib in a shared environment. • Use tapes (instead of online storage) to keep bean counters less unhappy • For MVS users, the AFFINITY statement can be used to minimize the # of tape drives requested. • Using tapes to store intermediate stages of data might make the process re-startable and help QC the data. • Use a small ‘random’ but ‘representative’ sample for iterative development/QC.
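A hedged sketch of the work-library housekeeping and sampling ideas above; the libref SCRATCH, its path, the data set names STEP1/STEP2 and the 1% sampling rate are all illustrative:

libname scratch '/bigdisk/userid/saswork';   /* hypothetical private scratch area        */
options user=scratch;                        /* one-level names now go here, not WORK    */

proc datasets library=scratch nolist;
   delete step1 step2;                       /* drop intermediates as soon as possible   */
quit;

data scratch.devsample;                      /* small 'random' sample for development/QC */
   set basedat.test;
   if ranuni(12345) <= 0.01;
run;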
Process Management SOP/Tips for Large data set processing • When 'nothing' works … divide and rule (!) • Split and process your data into logically consistent chunks. However, you will then have to put it back together and QC it! • To reduce code maintenance, use a parameter-driven splitting/processing algorithm for both the sample and production runs. • Consider serializing program runs if parallel runs cause a resource bottleneck. • If you need to search mountains of data frequently, consider the SAS SPD Server or loading the data into Oracle.
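One way to make the splitting parameter driven is a small macro that is called once per chunk; the macro name, the librefs and the split variable YEAR below are illustrative:

%macro process_chunk(inlib=, outlib=, year=);
   data &outlib..claims_&year;
      set &inlib..claims;
      where year = &year;            /* only this chunk is read and processed */
      /* ... chunk-level derivations go here ... */
   run;
%mend process_chunk;

%process_chunk(inlib=basedat, outlib=scratch, year=2004)
%process_chunk(inlib=basedat, outlib=scratch, year=2005)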
Programming Techniques for the Data Step Assess 'improvement(s)' for a data step by using a baseline/benchmark as a starting point … • SAS 8.2 (TS2M0) has been used for all programs. • A simulated data set of 300 million records has been used (50 to 53 Gig; see the PROC CONTENTS output near the end). • The hardware was an HP Model 9000/800. • The OS was HP-UX B.11.11 with 250 Gig of DASD storage. The Unix box had 4 CPUs and 8 Gig of RAM. • Measures such as elapsed time may illustrate the relative cost/benefit of a selected technique over another in a comparable context and environment.
Programming Techniques for the Data Step An approach to reducing I/O, a significant bottleneck, could be … • Reduce data set size (vertically and horizontally) to fit the programming task. • Minimize sequential I/O (the SAS default). • Maximize non-sequential (i.e., random) I/O. • To make your program faster … • Redirect sequential I/O to memory • Redirect non-sequential I/O to memory Note: Most systems have a lot less RAM than DASD.
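One modest way to trade memory for disk I/O in this SAS release is to enlarge the I/O buffering; the buffer values and the &idlist macro variable below are illustrative assumptions, and later SAS releases also offer SASFILE to hold a whole data set in memory:

data work.subset (bufsize=64k);              /* larger pages for the output file         */
   set basedat.test (bufno=50);              /* read many pages per physical I/O request */
   where v_id in (&idlist);                  /* &idlist: hypothetical list of wanted ids */
run;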
Programming Techniques for the Data Step Slim your SAS data sets by …. • Horizontal i.e. column oriented methods • Without losing data …. • Explicitly specifying LENGTH • Removing padded blanks from character variables. • Using KEEP, DROP to limit variables. • If feasible, using ‘Bit mapping’ technique (Caveats) • SAS Data Compression (Profiles)(Caveats) • Using surrogate keys to shorten ‘long’ variables. • Normalizing your data set.
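A hedged sketch of the column-oriented trims listed above, using variables from the test data set described later; the 3-byte length is safe only for small integers, and the $8 length assumes no value needs more than 8 characters:

data slim;
   length v1 v2 v3 3             /* shrink 8-byte integer flags to 3 bytes   */
          vchar1 $ 8;            /* shorten the padded 39-character variable */
   set basedat.test (keep=v_id v1 v2 v3 vchar1);
   vchar1 = trim(left(vchar1));  /* remove any leading blanks                */
run;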
Programming Techniques for the Data Step Slim your SAS data sets by …. • Vertical Methods i.e. subset the rows to what is needed • Using WHERE instead of IF to subset data • Using a format to subset with a WHERE clause. • Using a macro variable to subset with a WHERE clause • Using Proc SQL (with and without sort) • Using SORT/MERGE (if you have to) (Profile)
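A sketch of the WHERE-with-macro-variable approach from the list above; LOOKUP.IDS, the &idlist macro variable and the data set names are illustrative (the format-based variant is sketched later with the profiles):

proc sql noprint;
   select v_id into :idlist separated by ','
   from lookup.ids;              /* hypothetical table holding the search values */
quit;

data hits;
   set basedat.test;
   where v_id in (&idlist);      /* WHERE filters rows before they enter the PDV */
run;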
Programming Techniques for the Data Step A few ways to minimize I/O and simplify program overhead … • Use PROC APPEND to concatenate data sets instead of using the 'set A B' syntax. • Reduce divide-by-zero exception overhead (profile). • Minimize IF-THEN-ELSE overhead. • Use PROC DATASETS … • To list WORK library data sets at different points of a 'long' SAS program. • To delete unneeded WORK data sets in a long program. • Reduce I/O by avoiding a data step for tasks such as assigning or changing variable labels.
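Hedged sketches of the PROC APPEND and PROC DATASETS housekeeping ideas above; the data set, variable and label names are illustrative:

proc append base=master data=new_month;      /* only NEW_MONTH is read; MASTER grows in place */
run;

proc datasets library=work nolist;
   contents data=_all_ nods;                 /* list what is sitting in WORK mid-program      */
   delete temp1 temp2;                       /* free space from finished intermediates        */
quit;

proc datasets library=basedat nolist;
   modify test;
      label v_id = 'Subject identifier';     /* relabel without rewriting 50+ Gig of data     */
quit;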
Programming Techniques for the Data Step To minimize I/O you can … sort out sorting hassles … • Reduce data set size. • Use the TAGSORT option (50.5 vs. 57.4 for a regular sort). • Use the SORTEDBY= data set option when the data have been sorted externally. • Fine-tune sort parameters such as the sort work space. Consider alternatives to sorts/merges … • PROC FORMAT to simulate a merge without a sort • PROC SQL • POINT= (not a speed demon!)
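A sketch of the TAGSORT option, which sorts only the BY keys plus record addresses and so needs far less sort work space (at some cost in output speed); the output data set name is illustrative:

proc sort data=basedat.test out=work.sorted tagsort;
   by v_id;                      /* only the key and row addresses go through the sort */
run;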
Programming Techniques for the Data Step Using Indices I/O is a significant burden for lookups. • PROC FORMAT (using the PUT function) • Consider indexing the large SAS data set for more efficient lookups and use … • PROC SQL • the KEY= SET option • a WHERE clause (Profile) Caveats on creating/using indices
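A sketch of building an index once and then letting WHERE use it; MSGLEVEL=I makes the log report whether the index was actually chosen, and the &idlist macro variable is an illustrative assumption:

options msglevel=i;                          /* log notes say whether an index is used     */

proc datasets library=basedat nolist;
   modify test;
      index create v_id / unique;            /* one-time cost; the index file is sizeable  */
quit;

data hit_rows;
   set basedat.test;
   where v_id in (&idlist);                  /* WHERE (not IF) can exploit the index       */
run;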
QC issues in processing large volumes of data • ‘Dividing’ may be easier than ‘ruling’ from a QC/validation perspective. • Use a small ‘random’ and ‘representative’ sample for validating processing and intensive QC. • Compare a ‘before’ set of frequencies and record counts (using raw data) for critical variables with the corresponding ‘final’ numbers.
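A minimal sketch of the 'before vs. after' check; the data set and variable names are illustrative. Any unexplained shift in the frequencies or in the total record count points to a processing step that needs a closer look.

proc freq data=rawlib.raw_claims;            /* counts from the raw input            */
   tables status / missing;
run;

proc freq data=outlib.final_claims;          /* the same counts after processing     */
   tables status / missing;
run;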
QC issues in processing large volumes of data • Compare output for critical variables across time to spot anomalous trends (if historical data are available). • For a random selection of patient IDs (for example), output data from the final processed data set and follow the algorithm back to the 'raw' data. • Use tapes to store the intermediate stages of the data being processed to preserve the QC trail.
Acknowledgements I would like to thank my colleagues for their support : Margaret Coughlin Kathy Harkins Donna Usavage Anil Golla Rajavel Ganesan Girish Havildar Lisa Pyle Xingshu Zhu
Slim your data: Horizontal, i.e., column-oriented methods (1 of 3)
Initial baseline size: 50 Gig (13 numeric variables (length 8) and 2 character variables (39 chars))
Initial baseline IF run (in minutes): 26.7 (which returns 199,800,001 rows)
Technique: Use LENGTH to reduce 8 numeric variables (of default 8-byte length) to 3 bytes.
Size: 39 Gig Time to run IF: 23.2
Note: Size is smaller but the IF run time goes down only a little.
Technique: TRIM and reduce the length of the 2 character variables from 39 to 8 characters, in addition to the numeric variable lengths.
Size: 19 Gig Time to run IF: 15.4
Note: Size comes down by more than 50% and the run time drops as well.
Slim your data: Horizontal, i.e., column-oriented methods (2 of 3)
• Technique: DROP variables (including the 2 character variables and 5 numeric variables)
• Size: 6.4 Gig Time to run IF: 10:06
• Note: Size is < 15% of the original size and the IF run time is down to < 50%.
• Technique: 3 numeric binary-value (0 and 1) variables (length 8) were stored in the bits of a 3-byte integer using bit functions.
• Original size: 53.4 Gig Time to run IF: 21.8 (returns 112,504,793 rows)
• Reduced size: 48.0 Gig Time to run IF: 28.8
• Notes: Despite the smaller size …
• Time to run IF is up, probably due to the bit-function overhead of reading the bits of a particular variable.
• It took about 2 hours to create the bit-mapped variable.
• In a simple binary-variable situation, making the binary-value variable a one-byte character variable may be the better option.
Slim your data: Horizontal, i.e., column-oriented methods (3 of 3)
Technique: Compress the data set with the BINARY option (without KEEPs or DROPs)
Before compression size: 50 Gig (13 numeric variables (length 8) and 2 character variables (39 chars))
• Time to run IF: 26.7 (returning 199,800,001 rows)
• Note: It took 1 hour and 10 minutes to compress this data.
• After compression size: 21.9 Gig Time to run IF: 28.6
Note: If the output data set was also compressed, the IF step took about an hour to run.
Vertical Methods (1 of 3) The large data set had 300,000,000 records; 125 values were searched, returning 37,200,001 records.
Baseline Action: Used IF with 'IF var in (x,y,z)' syntax
Time taken: 42.6
Note: Plain sequential search with the search values listed.
Technique: Used WHERE with 'WHERE var in (&macrovar)' syntax, with a macro variable storing all the search values.
Time taken: 18.97
Note: The efficiency comes from only the selected rows being read in at the outset and the search values being held in memory.
Technique: Used PROC SQL (after sorting the big data set)
Time taken: 23.05
Note: PROC SQL's optimizer does not seem to care whether the data set is sorted!
Vertical Methods (2 of 3) The large data set had 300,000,000 records; 125 values were searched, returning 37,200,001 records.
Technique: Used 'WHERE var in (x,y,z)' syntax, listing the search values
Time taken: 23.18
Note: WHERE is better than IF. However, a plain listing of search values does not use memory (with a macro variable it took 18.97).
Technique: Used PROC SQL (no sorting of the large data set)
Time taken: 23.61
Note: SAS's SQL optimizer is optimizing well here. However, sorting the big data set does not help (that took 23.05).
Technique: Created a format to house the search values and used it in a WHERE clause with the syntax 'WHERE put(formvar,? formtmp.) EQ '999';'
Time taken: 31.26
Note: Only the selected rows come in at the outset and the search values are in memory through the format. Not all memory techniques are the same!
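For reference, a hedged sketch of the format-based subsetting idea, built here from a CNTLIN control data set rather than hard-coded values; LOOKUP.IDS (assumed to hold one row per distinct wanted id), the format name WANTFMT and the 'Y' label are illustrative:

data fmtdat;                                 /* control data set for PROC FORMAT       */
   retain fmtname 'wantfmt' type 'N';
   set lookup.ids (rename=(v_id=start));     /* one row per distinct search value      */
   label = 'Y';
run;

proc format cntlin=fmtdat;
run;

data hits;
   set basedat.test;
   where put(v_id, wantfmt.) = 'Y';          /* the format lookup happens in memory    */
run;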
Vertical Methods (3 of 3) The large data set had 300,000,000 records; 125 values were searched, returning 37,200,001 records.
Technique: Used 'IF var in (&macrovar)' with the search values in a macro variable
Time taken: 35.4
Note: Keeping the search values in memory seems to help even with the extra overhead of IF; the plain-vanilla syntax took 42.6 minutes. No sorting used.
Technique: SORT/MERGE
Time taken for SORT: 60 Time taken for MERGE: 27 Total time: 87
Note: Sort only if you have to, as it is also disk intensive and takes a lot of time.
Using SAS indices (1 of 3) We are searching for records with 125 values amongst 300 million records …
Scenario 1: Get back 125 records from 300 million unique values
Time taken using 'IF v_id in (x,y,z)' syntax: 43.1
Time taken using 'WHERE v_id in (x,y,z)' syntax: 4.1
Note: WHERE (but not IF) uses the index and reads in only what qualifies. The index on column V_ID is specified as unique.
Scenario 2: Get back 72,000,001 rows using a non-unique field with 100 distinct values
Time taken using 'IF formvar in (x,y,z)' syntax: 44.2
Time taken using 'WHERE formvar in (x,y,z)' syntax: 9.2
Note: WHERE uses the index and reads in only what qualifies. The index on column FORMVAR is specified as non-unique.
Using SAS indices (2 of 3) We are searching for records with 100 values amongst 300 million records using the KEY= option on the SET statement …
Scenario 3: Get back 100 records from 300 million unique values
Time taken using the KEY= SET option: 0.2
Note: The KEY= SET option used a unique index, which returned a 'needle' in a haystack. The index on column V_ID is specified as unique.
Scenario 4: Try to get back one-to-many, i.e., non-unique, records from 300 million records for 100 distinct values
Note: The plain KEY= SET syntax does not work here, as it picks up only the first record that qualifies for each key value.
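A hedged sketch of the indexed lookup with KEY= on a second SET statement; the lookup data set KEYS (one wanted V_ID per row) is illustrative, and a unique index on V_ID is assumed:

data found;
   set keys;                                 /* the 100 wanted key values, one per row   */
   set basedat.test key=v_id / unique;       /* direct (random) read through the index   */
   if _iorc_ = 0 then output;                /* keep only the successful lookups         */
   else _error_ = 0;                         /* suppress the error flag on misses        */
run;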
Using SAS indices (3 of 3) We are searching for records with 100 values amongst 300 million records using PROC SQL …
Scenario 5: Get back 100 records from 300 million unique-value records
Time taken using PROC SQL syntax: 3.2
Note: PROC SQL used a unique index, which returned a 'needle' in a haystack. The index on column V_ID is specified as unique.
Scenario 6: Get back 297,000,001 records from 300 million records for 100 distinct values
Time taken using PROC SQL syntax: 43.5
Note: PROC SQL does not use the index on FORMVAR because … • it is not unique • it expects to return a lot of records, i.e., it is more efficient to do a simple sequential read than to read the index as well.
Creating/Using indices (1 of 2) • The use/non-use of an index for an indexed column is determined by the SAS I/O 'cost-based' optimizer. • The ideal index-based query is a 'looking for a needle in a haystack' scenario. • The usefulness of an index also depends on the cardinality of the column being indexed. • Use the MSGLEVEL=I option to see whether your indices are being used. • Indexing a small data set is probably I/O overkill.
Creating/Using indices (2 of 2) • B-tree indices, the type that Base SAS creates, are more effective for highly unique columns than for decision-support tasks, which typically filter on non-unique columns. • Also consider the cost of the index … • Index sparingly. An index is updated whenever the data file is updated. • For a 53 Gig file with 2 indexed columns, the size of the index was 7.2 Gig. • It is better to index a search column whose cardinality is high. • An index, to justify its maintenance cost, needs to be used (i.e. read) often and updated minimally.
PROC CONTENTS output for the test data set:
Data Set Name: BASEDAT.TEST   Observations: 300000000
Member Type: DATA             Variables: 14
Engine: V8                    Indexes: 2
Created: 14:46 Wednesday, October 12, 2005   Protection:
Compressed: NO                Data Set Type:
Sorted: NO
Data Set Page Size: 16384
Number of Data Set Pages: 3260871
File Name: /epidb/data002/test.sas7bdat
Owner Name: rajagopg
File Size (bytes): 53426118656

-----Alphabetic List of Variables and Attributes-----
 #  Variable  Type  Len  Pos
 1  formvar   Num     8    0
 2  i         Num     8    8
 7  rand_var  Num     8   48
 4  v1        Num     8   24
 5  v2        Num     8   32
 6  v3        Num     8   40
 8  v11       Num     8   56
 9  v12       Num     8   64
10  v13       Num     8   72
11  v14       Num     8   80
12  v15       Num     8   88
 3  v_id      Num     8   16
13  vchar1    Char   39   96
14  vchar2    Char   39  135

 #  Index    Unique Option  Unique Values
 1  formvar                 100
 2  v_id     YES            300000000
SAS Data Set Compression nuances • Infrequently used/updated data sets are better candidates. • Data sets without 'long' character variables are poor candidates for compression. The BINARY option (v8.2) compresses numbers too. • All processing of a compressed data set implies on-the-fly decompression and the related overhead.
Original data set size: 50 Gig (13 numeric variables (length 8) and 2 character variables (39 chars))
Compressed data set size: 21.9 Gig
Time to compress the data set: 60
Time to run IF (after compression): 28.0
Time to run IF (before compression): 26.0
Time to run IF (with compression of both input and output data sets): 61.0
Note: The number of rows returned was 199,800,001.
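A minimal sketch of creating a BINARY-compressed copy; the output data set name is illustrative, and COMPRESS= can also be set as a system or LIBNAME option:

data basedat.test_cmp (compress=binary);     /* BINARY compression also squeezes numerics */
   set basedat.test;
run;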
Bitwise representation and consolidation for some types of numeric data • The approach may be useful for numeric fields that have binary values and no missing values. • Bit-oriented functions are relatively obscure. • The data step may need to be repeated if the underlying values change. • For example, for the 300 million records, 3 numeric binary-value (0 and 1) variables (length 8) were stored in the bits of a 3-byte integer using SAS bit functions.
Original size: 53.4 Gig Time to run IF for v1 = 1: 21.8 (returns 112,504,793 rows)
Reduced size: 48 Gig Time to run IF for v1 = 1: 28.8
Note: It took 2 hours to create the bit-mapped field. Using the bit flag to subset data entails higher overhead (CPU, memory etc.).
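A hedged sketch of the packing and the later bit test using the SAS bit functions BOR, BLSHIFT and BAND; the variable names follow the test data set, and the output data set names are illustrative:

data packed (drop=v1 v2 v3);
   length flags 3;                           /* one small integer replaces three 8-byte flags */
   set basedat.test;
   flags = bor(blshift(v1,2), bor(blshift(v2,1), v3));
run;

data v1_hits;
   set packed;
   if band(flags, 4) then output;            /* the bit for the old V1 is set, i.e. V1 = 1 */
run;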
Simplify program overhead and save resources such as time (1 of 2) … • Use PROC APPEND to concatenate data sets instead of the 'set x y' method, to minimize sequential I/O.
Scenario: Two identically structured data sets (large = 300 million records, small = 1 record) are being concatenated.
Time for 'set large small' syntax: 27.2
Note: I/O is needed to read both data sets and to create a new data set. The SET order does not matter.
Time for 'proc append base=big data=small' syntax: 0.00 (!)
Note: Very quick, as only the small data set is being read.
Time for 'proc append base=small data=big' syntax: 16.6
Note: Reading in the 'big' data set takes time, although no new data set is created (unlike the 'set x y' method).
Simplify program overhead and save resources such as time (2 of 2) … Minimize processing overhead, such as divide-by-zero SAS error checking, and save time.
Time needed with divide-by-zero SAS checking for 300 million records: 36
Time needed after bypassing the divide-by-zero SAS checking for 300 million records: 28.3
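A minimal sketch of bypassing the divide-by-zero handling by testing the denominator yourself; the variable names are illustrative:

data rates;
   set basedat.test;
   if v2 ne 0 then rate = v1 / v2;           /* no zero-divide note or exception handling */
   else rate = .;                            /* define the result for a zero denominator  */
run;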