Efficient SAS programming with Large Data Aidan McDermott Computing Group, March 2007
Axes of Efficiency
• processing speed:
    • CPU time
    • real (elapsed) time
• storage:
    • disk
    • memory
    • …
• user:
    • functionality
    • interface to other systems
    • ease of use
    • learning
• user development:
    • methodologies
    • reusable code
    • facilitate extension, rewriting
    • maintenance
General (and obvious) principles
• Avoid doing the job if possible.
• Keep only the data you need to perform a particular task (use drop, keep, where, and if).
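A minimal sketch of keeping only what a task needs. It uses sashelp.class, a sample dataset shipped with SAS, as a stand-in for a large dataset; the output name work.teens is hypothetical.

```sas
/* KEEP= and WHERE= on the input dataset limit what is read into the step. */
/* DROP= on the output dataset removes a variable that was only needed     */
/* for the filter. sashelp.class is a stand-in for a large dataset.        */
data work.teens (drop=age);
    set sashelp.class (keep=name age height where=(age >= 13));
run;
```

Filtering on the set statement means unwanted variables and observations never enter the data step, which is usually cheaper than reading everything and discarding it later.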
General (and obvious) principles
• Often efficient methods were written to perform the required task. Use them.
General (and obvious) principles
• Often efficient methods were written to perform other tasks. Use them with caution.
• Write data-driven code.
    • It's easier to maintain data than to update code.
• Use length statements to limit the size of variables in a dataset to no more than is needed.
    • Caveat: you don't always know what size this should be, and you don't always produce your own data.
• Use formatted data rather than the data itself.
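The length and format advice can be sketched as follows. All names here (the $regfmt format, work.sites, and its variables) are hypothetical, and the numeric length of 4 assumes a Windows or Unix host.

```sas
/* A user-defined format: store a short code, display a long label. */
proc format;
    value $regfmt 'NE' = 'Northeast'
                  'SW' = 'Southwest';
run;

data work.sites;
    /* LENGTH comes first so it fixes the storage size of each variable. */
    /* Assumption: siteid is an integer small enough for 4 bytes         */
    /* (exact up to 2,097,152 on Windows and Unix hosts).                */
    length region $2 siteid 4;
    input region $ siteid;
    format region $regfmt.;
    datalines;
NE 101
SW 202
;
run;
```

The dataset stores two bytes per region value while reports show the full label, so the lookup lives in the format rather than in the data.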
Compressing Datasets
• Compress datasets with a compression utility such as compress, gzip, winzip, or pkzip, and decompress them before running each SAS job.
    • This delays execution, and you need to keep track of data and program dependencies.
• Use a general-purpose compression utility and decompress the data within SAS for sequential access.
    • This is system dependent (it needs a named pipe) and requires sequential dataset storage.
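A related sketch, not the named-pipe approach mentioned above: on a Unix-like host, a gzipped raw text file can be decompressed on the fly with the PIPE device on a filename statement. The path, the file layout, and the variable names are assumptions for illustration only.

```sas
/* Assumption: a Unix-like host with gzip on the PATH.                   */
/* The path and the comma-delimited layout (id, admit, length) are       */
/* hypothetical.                                                         */
filename rawin pipe 'gzip -dc /data/admits.txt.gz';

data work.admits;
    infile rawin dlm=',' dsd truncover;
    input id admit :date9. length;
    format admit date9.;
run;
```

The data is read sequentially as gzip produces it, so no decompressed copy is written to disk.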
SAS internal compression
• Allows random access to the data and is very effective under the right circumstances. In some cases it doesn't reduce the size of the data by much.
• “There is a trade-off between data size and CPU time”.
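A minimal sketch of turning on SAS internal compression for a single dataset with the COMPRESS= dataset option. Here sashelp.class stands in for a large dataset, and work.class_c is a hypothetical output name.

```sas
/* COMPRESS=YES applies SAS's own compression to the output dataset.     */
/* sashelp.class is tiny, so expect little or no gain here; the benefit  */
/* shows up on large datasets, particularly ones with long character     */
/* variables.                                                            */
data work.class_c (compress=yes);
    set sashelp.class;
run;
```

SAS writes a note to the log reporting how compression affected the dataset size, which is a quick way to check whether the extra CPU time is worth it. The system option `options compress=yes;` applies the same setting to every dataset created afterwards.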
Suppose indata is a large dataset and you want to produce a version of indata without any observations.
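Two common ways to do this are sketched below; the original slide's solution is not shown in this text, so these are offered as plausible options only. To keep the example runnable, indata is built from sashelp.class, and the output names empty1 and empty2 are hypothetical.

```sas
/* Stand-in for the large dataset. */
data work.indata;
    set sashelp.class;
run;

/* Option 1: OBS=0 reads no observations; the variables and their */
/* attributes are still copied to the output dataset.             */
data work.empty1;
    set work.indata (obs=0);
run;

/* Option 2: SET defines the variables at compile time, and STOP  */
/* ends the step at execution before any observation is read.     */
data work.empty2;
    stop;
    set work.indata;
run;
```

Both versions run in roughly constant time regardless of the size of indata, because the structure comes from the dataset header rather than from the observations.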
The data step is a two-stage process
• compile phase
• execute phase
data admits;
    set admits;
    discharge = admit + length;
    format discharge date8.;
run;
PDV: compile phase
data admits;
    set admits;
    discharge = admit + length;
    format discharge date8.;
run; /* implicit output */
PDV: execute phase
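One way to watch the program data vector (PDV) during the execute phase is to print it to the log with `put _all_;`. The sketch below first builds a small admits dataset with made-up values, then reruns the step from the slides with two added put statements.

```sas
/* Made-up sample data: admission date and length of stay in days. */
data work.admits;
    input admit :date9. length;
    datalines;
01JAN2007 3
15FEB2007 10
;
run;

data work.admits;
    /* Top of each iteration: discharge has been reset to missing,     */
    /* while variables read by SET keep their previous values.         */
    put 'Top of iteration:    ' _all_;
    set work.admits;
    discharge = admit + length;
    format discharge date8.;
    /* Bottom of each iteration: the PDV as the implicit output at     */
    /* RUN writes it to the output dataset.                            */
    put 'Bottom of iteration: ' _all_;
run;
```

The log also shows the automatic variables _N_ and _ERROR_, which live in the PDV but are not written to the output dataset. The final "Top of iteration" line appears just before set finds no more observations and the step ends.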
General principles
• Use by processing whenever you can.
• Given ozone measurements identified by region, siteid, and date, calculate the mean and maximum ozone value for each region, siteid, and date.
General principles • Easy:
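A sketch of what the easy case might look like with by processing in proc means. The code from the original slide is not available here, so the dataset work.ozone, its values, and the output variable names are all assumptions.

```sas
/* Made-up sample data: one row per ozone measurement. */
data work.ozone;
    input region $ siteid date :date9. ozone;
    format date date9.;
    datalines;
NE 101 01JAN2000 31
NE 101 01JAN2000 42
NE 101 02JAN2000 28
SW 202 01JAN2000 55
;
run;

/* BY processing requires the data to be sorted by the BY variables. */
proc sort data=work.ozone;
    by region siteid date;
run;

/* One output row per region, siteid, and date. */
proc means data=work.ozone noprint;
    by region siteid date;
    var ozone;
    output out=work.daily (drop=_type_ _freq_)
           mean=mean_ozone max=max_ozone;
run;
```

If the data is already stored in sorted order, the proc sort step can be skipped, which matters for large datasets.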
General principles
• Suppose there are multiple monitors at each site and you still need to calculate the daily mean?
• Combine multiple observations onto one line and then compute the statistics?
• Suppose you want the 10% trimmed mean?
• Suppose you want the second maximum?
• Use arrays to sort the data?
• Write your own function?
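For the second maximum, one option that stays with by processing is to sort descending within each group and pick the second row. This is a sketch under assumed names (work.ozone, rank, second_max) with made-up values, not necessarily the approach the slides go on to take.

```sas
/* Made-up sample data: several monitors may report per site and day. */
data work.ozone;
    input region $ siteid date :date9. ozone;
    format date date9.;
    datalines;
NE 101 01JAN2000 31
NE 101 01JAN2000 42
NE 101 01JAN2000 38
SW 202 01JAN2000 55
SW 202 01JAN2000 47
;
run;

/* Largest ozone value first within each region, siteid, and date. */
proc sort data=work.ozone out=work.sorted;
    by region siteid date descending ozone;
run;

data work.second_max (keep=region siteid date second_max);
    set work.sorted;
    by region siteid date;
    if first.date then rank = 0;   /* restart the counter for each group */
    rank + 1;                      /* sum statement: rank is retained    */
    if rank = 2;                   /* keep only the second-largest row   */
    second_max = ozone;
run;
```

Groups with a single observation produce no output row, and tied values count as separate observations, so a tie for the maximum is reported as the second maximum.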