SAS: ARRAY PROCESSING Jordan Elm
INTRODUCTION
• Most mathematical and computer languages have some notation for repeated or related values.
• E.g., a matrix, a vector, a dimension, a table.
• In a SAS data step, this structure is called an array: a group of variables defined in a data step.
• Array elements don't need to be contiguous, the same length, or even related at all.
• All elements of an array must be the same type, either all character or all numeric.
Why do we need SAS arrays?
• Use arrays to help read and analyze repetitive data with a minimum of coding.
• An array and a loop can make the program smaller.
• Examples of use:
    • Recoding variables (e.g., missing values coded as -999)
    • Applying the same computation to many variables (e.g., Fahrenheit to Celsius)
    • Computing new variables (e.g., continuous to binary)
    • Reshaping data (wide to long, long to wide)
Example: Applying the same computation to many variables simultaneously
• For each record (row) there are 24 variables (temp1-temp24) holding the temperature for each hour of the day.
• The temperatures are in Fahrenheit and need to be converted to Celsius.
• Without an array, every conversion is written out separately:
    data;
      input etc.;
      celsius_temp1 = 5/9*(temp1 - 32);
      celsius_temp2 = 5/9*(temp2 - 32);
      ...
      celsius_temp24 = 5/9*(temp24 - 32);
    run;
Define arrays
• Define arrays and use a loop:
    data;
      input etc.;
      array temperature_array {24} temp1-temp24;
      array celsius_array {24} celsius_temp1-celsius_temp24;
      do i = 1 to 24;
        celsius_array{i} = 5/9*(temperature_array{i} - 32);
      end;
    run;
Recoding Variables
• Missing coded as -999
    data...;
      set...;
      array inc[12] faminc1 - faminc12;
      do i = 1 to 12;
        if inc[i] = -999 then inc[i] = .;
      end;
Of Note
• While TEMP1 is equivalent to the first element, TEMP2 to the second, etc., the variables do not need to be named consecutively.
• The array would work just as well with non-consecutive variable names.
    array sample_array {5} x a i r d;
• In this example, the variable x is equivalent to the first element, a to the second, etc.
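BASIC ARRAY CONCEPTS
• SAS arrays are another way to temporarily group and refer to SAS variables.
• A SAS array is not a new data structure, and the array name is not a variable.
• The ARRAY statement does create any listed element variables that do not already exist (such as celsius_temp1-celsius_temp24 in the loop example), but the array itself adds nothing to the data set.
• Rather, a SAS array provides a different name to reference a group of variables.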
BASIC ARRAY CONCEPTS
• The ARRAY statement defines variables to be processed as a group.
• The variables referenced by the array are called elements.
• Once an array is defined, the array name and an index reference the elements of the array.
• Since similar processing is generally completed on the array elements, references to the array are usually found within DO groups.
ARRAY STATEMENT
• The statement used to define an array is the ARRAY statement.
    array array-name {n} <$> <length> array-elements <(initial-values)>;
• array-name: any valid SAS name
• n: number of elements within the array
• $: indicates the elements within the array are character type variables
• length: a common length for the array elements
• array-elements: list of SAS variables to be part of the array
• initial-values: provides the initial values for each of the array elements
BASIC CONCEPTS (cont)
• The ARRAY statement is a compiler statement within the data step.
• Array references cannot be used in compiler statements such as DROP or KEEP.
• An array must be defined within the data step prior to being referenced or an error will occur.
• Defining an array within one data step and referencing the array within another data step will cause errors.
• An array must be defined within every data step where it will be referenced.
Array Statements: Special Variables
• When all numeric or all character variables in the data set are to be elements within the array, there is no need to list the individual variables as elements.
• _NUMERIC_: when all the numeric variables will be used as elements
• _CHARACTER_: when all the character variables will be used as elements
• _ALL_: when all variables in the data set will be used as elements and the variables are all the same type
    array sample_array {5} _ALL_;
Number of Elements
• n is the array subscript in the array definition; it gives the number of elements within the array.
• In the ARRAY statement the subscript is a numeric constant, a lower:upper range of constants, or an asterisk (*).
• In an array reference the subscript can also be a variable whose value is a number, or a numeric SAS expression.
• The subscript must be enclosed within braces {}, square brackets [], or parentheses ().
    array sample_array {5} _ALL_;
Number of Elements
• When the asterisk is used, it is not necessary to know how many elements are contained within the array. SAS will count the number of elements for you.
• An example of using the asterisk is when one of the special variables defines the elements.
    array allnums {*} _numeric_;
• When it is necessary to know how many elements are in the array, the DIM function can be used to return the count of elements.
    do i = 1 to dim(allnums);
      allnums{i} = round(allnums{i}, .1);
    end;
ARRAY REFERENCES
• When an array is defined with the ARRAY statement, SAS creates an array reference. The array reference is in the following form:
    array-name{n}
• The value of n is the element's position within the array.
• For example, in the temperature array the temperature for 1:00 PM is in the variable TEMP13.
• Therefore, the array reference will be:
    temperature_array{13}
ARRAY REFERENCES
• The variable name and the array reference are interchangeable.
    Variable Name    Array Reference
    temp1            temperature_array{1}
    temp2            temperature_array{2}
    temp3            temperature_array{3}
• An array reference may be used within the data step in almost any place other SAS variables may be used, including as an argument to many SAS functions.
USING ARRAY INDEXES
• The array bounds define the range of valid index values.
• In SAS, subscripts are 1-based by default, whereas arrays in other languages may be 0-based.
• When the subscript gives only the number of elements, that number is the upper bound and the lower bound defaults to 1.
• There may be scenarios when we want the index to begin at a lower bound other than 1.
• Say we only want the temperatures for the daytime, temperatures 6 through 18:
    array temperature_array {6:18} temp6 - temp18;
Identify patterns across variables using arrays
• The objective is to count the missing values in each row.
• Create a new variable named nmiss, which will be the number of missing values across variables faminc1 - faminc12.
    data mspatterns;
      set recode_missing;
      array inc(12) faminc1-faminc12; /* existing vars */
      nmiss = 0;
      do i = 1 to 12;
        if inc(i) = . then nmiss = nmiss + 1;
      end;
    run;
Reshaping DATA from Wide to Long
    data wide;
      input famid faminc96 faminc97 faminc98;
      cards;
    1 40000 40500 41000
    2 45000 45400 45800
    3 75000 76000 77000
    ;
    run;
Reshaping wide to long: creating only one variable using arrays
    data long;
      set wide;
      array Afaminc[96:98] faminc96 - faminc98;
      do year = 96 to 98;
        faminc = Afaminc[year];
        output;
      end;
      drop faminc96-faminc98;
    run;
LONG DATA
    Obs  famid  year  faminc
      1      1    96   40000
      2      1    97   40500
      3      1    98   41000
      4      2    96   45000
      5      2    97   45400
      6      2    98   45800
      7      3    96   75000
      8      3    97   76000
      9      3    98   77000
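The reverse reshape (long to wide) is listed among the uses of arrays but not shown. A minimal sketch, assuming the long data set created above is still sorted by famid and holds only the years 96 through 98; the output name wide_again is hypothetical:

```sas
data wide_again;
  set long;
  by famid;                                  /* assumes long is sorted by famid */
  array Afaminc[96:98] faminc96 - faminc98;
  retain faminc96 - faminc98;                /* keep values across the rows of one family */
  if first.famid then call missing(of Afaminc[*]);  /* reset at the start of each family */
  Afaminc[year] = faminc;                    /* year selects the target column */
  if last.famid then output;                 /* write one row per family */
  drop year faminc;
run;
```

RETAIN is what lets the three yearly rows accumulate into one observation; without it the element variables would be reset to missing on every iteration of the data step.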
TEMPORARY ARRAYS
• A temporary array is an array that only exists for the duration of the data step where it is defined.
• Useful for storing constant values (for calculations).
• There are no corresponding variables to identify the array elements.
• The elements are defined by the keyword _TEMPORARY_.
TEMPORARY ARRAYS
    array rate {6} _temporary_ (0.05 0.08 0.12 0.20 0.27 0.35);
• The asterisk subscript cannot be used when defining a temporary array; explicit array bounds must be specified.
TEMPORARY ARRAY
• For example: when a customer is delinquent in payment of their account balance, a penalty is applied. The amount of the penalty depends upon the number of months that the account is delinquent.
• Without array processing:
    if month_delinquent eq 1 then balance = balance + (balance * 0.05);
    else if month_delinquent eq 2 then balance = balance + (balance * 0.08);
    else if month_delinquent eq 3 then balance = balance + (balance * 0.12);
    else if month_delinquent eq 4 then balance = balance + (balance * 0.20);
    else if month_delinquent eq 5 then balance = balance + (balance * 0.27);
    else if month_delinquent eq 6 then balance = balance + (balance * 0.35);
TEMPORARY ARRAY
• With a temporary array, the code is simpler and can run faster.
    data ...;
      set...;
      array rate {6} _temporary_ (0.05 0.08 0.12 0.20 0.27 0.35);
      if month_delinquent ge 1 and month_delinquent le 6 then
        balance = balance + (balance * rate{month_delinquent});
TEMPORARY ARRAY
• Setting initial values is not required on the ARRAY statement. The values within a temporary array may be set in another manner within the data step.
    array rateb {6} _temporary_;
    do i = 1 to 6;
      rateb{i} = i * 0.5;
    end;
Implicit Arrays
    array clin q1b q2b q3b q4b q5b q6b q7b q8b q9b q10b q11b q12b q13b;
    do over clin;
      if clin in (2,.) then clin=0;
      else if clin > 2 then clin=1;
    end;

    data baselinedemo;
      set baselinedemo;
      array zero male Indian Asian black white more hisp rhand;
      do over zero;
        if zero=. then zero=0;
      end;
    run;
WHEN TO USE ARRAYS
• It makes sense to use arrays when there are repetitive VARIABLES that are related WITHIN A SINGLE DATA STEP and the programmer needs to iterate through most of them.
• The combination of arrays and DO loops in the data step lends power to programming.
• The variables in the array do not need to be related or contiguous.
COMMON ERRORS AND MISUNDERSTANDINGS
• INVALID INDEX RANGE
• FUNCTION NAME AS AN ARRAY NAME
• ARRAY REFERENCED IN MULTIPLE DATA STEPS, BUT DEFINED IN ONLY ONE
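The three errors listed above can be sketched in one short program. This is an illustrative example only: the data set names (work.err_demo, work.err_demo2) and variables (t1-t3, average, warmest) are made up, and the code shows the safe form of each pattern, with the failing form described in the comments.

```sas
data work.err_demo;
  array temps{3} t1-t3 (60 70 80);

  /* 1. Invalid index range: looping to 4 here would stop the step with */
  /*    an "Array subscript out of range" error. DIM keeps the loop     */
  /*    inside the declared bounds.                                     */
  do i = 1 to dim(temps);
    temps{i} = 5/9*(temps{i} - 32);
  end;

  /* 2. Function name as array name: an array called MEAN would make    */
  /*    mean(...) an array reference in this step, so a different name  */
  /*    is used and the MEAN function stays available.                  */
  array avg_in{3} t1-t3;
  average = mean(of avg_in{*});
  drop i;
run;

/* 3. Arrays do not carry over between steps: TEMPS is declared again   */
/*    before it is referenced in the second data step.                  */
data work.err_demo2;
  set work.err_demo;
  array temps{3} t1-t3;
  warmest = max(of temps{*});
run;
```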
LIMITATIONS OF ARRAY STATEMENTS
• Can only be used within a DATA step (not a PROC). If you want to perform the same action in several DATA steps or PROCs, a macro approach may be easier.
• SAS array references cannot be used:
    • as input to a macro parameter
    • in a FORMAT, LABEL, DROP, KEEP, LENGTH or OUTPUT statement
• SAS arrays refer to variables or constants (not data sets or the value of a variable).
Using MACROS To Define ARRAYS Within a Single Datastep
• Array version:
    data;
      input etc.;
      array temperature_array {24} temp1-temp24;
      array celsius_array {24} celsius_temp1-celsius_temp24;
      do i = 1 to 24;
        celsius_array{i} = 5/9*(temperature_array{i} - 32);
      end;
    run;
• Macro version:
    %macro w;
      data;
        input etc.;
        %do i = 1 %to 24;
          celsius&i = 5/9*(temp&i - 32);
        %end;
      run;
    %mend;
    %w;
CREATE A MACRO-DEFINED "ARRAY"
• Using:
    %LET x = ARRAY ELEMENTS (VARIABLES)
    %DO …
      %LET y = %SCAN
• More flexible:
    • Switching from DATA step to PROC
    • Using previously defined macros
%LET and %SCAN
• %LET creates a macro variable and assigns it a value.
• %LET can be used inside or outside of a macro program. The value can be a string of words.
• Syntax:
    %LET macro-variable = <value>;
• %SCAN searches for a word that is specified by its position in a string.
MACRO Arrays
• Programs are reusable and easier to understand.
• %byid is a MACRO I wrote which I use a lot.
    %let x = q1b q2b q3b q4b q5b q6b q7b q8b q9b q10b q11b q12b;
    %do i = 1 %to 12;
      %let y = %scan(&x, &i);
      %byid(&y)
    %end;
• Don't forget to embed this code within %macro and %mend.
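A self-contained sketch of the same %LET / %SCAN pattern, wrapped in %macro and %mend so that it runs by itself. The macro name list_vars is hypothetical, and because %byid is not shown in these slides, a %put statement stands in for it. COUNTW is used so the loop bound does not have to be hard-coded.

```sas
%macro list_vars(varlist);
  %local i name;
  /* Count the space-separated words so the loop adapts to the list length. */
  %do i = 1 %to %sysfunc(countw(&varlist, %str( )));
    /* Pull out the i-th word of the list. */
    %let name = %scan(&varlist, &i, %str( ));
    /* Stand-in for the real work, e.g. a call such as %byid(&name). */
    %put NOTE: element &i is &name;
  %end;
%mend list_vars;

%list_vars(q1b q2b q3b)
```

Passing the list as a parameter, rather than setting it with a global %LET, keeps the macro reusable with any set of variable names.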
MULTI-DIMENSION ARRAYS
• Multi-dimensional arrays may be created in two or more dimensions.
• Conceptually, a two-dimensional array is a table with rows and columns (although to SAS, it is still a group of variables).
• Within the Program Data Vector the variable structure may be visualized as:
MULTI-DIMENSION ARRAYS
• The ARRAY statement to define this two-dimensional array will be:
    array sale_array {3, 12} sales1-sales12 exp1-exp12 comm1-comm12;
• The number of elements indicates the number of rows (1st dimension) and the number of columns (2nd dimension).
• A reference must give the element number for both dimensions.
• The reference to the sixth element for the expense group in the sales array is sale_array{2,6}, which refers to EXP6.
• Three or more dimensions can be defined as well.
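A two-dimensional array is usually processed with nested DO loops, one per dimension. The sketch below reuses the sale_array definition from this slide; the data set name work.sale_demo, the made-up fill values, and the row_total array with its three variables are assumptions added for illustration.

```sas
data work.sale_demo;
  array sale_array{3, 12} sales1-sales12 exp1-exp12 comm1-comm12;
  array row_total{3} total_sales total_exp total_comm;

  /* Fill the table with made-up numbers so the step runs without input data. */
  do r = 1 to dim(sale_array, 1);        /* rows: sales, expenses, commissions */
    do c = 1 to dim(sale_array, 2);      /* columns: months 1 to 12 */
      sale_array{r, c} = 100 * r + c;
    end;
  end;

  /* Sum across the 12 months of each row. */
  do r = 1 to dim(sale_array, 1);
    row_total{r} = 0;
    do c = 1 to dim(sale_array, 2);
      row_total{r} = row_total{r} + sale_array{r, c};
    end;
  end;

  sixth_expense = sale_array{2, 6};      /* the same variable as exp6 */
  drop r c;
run;
```

Elements are assigned to the array row by row, which is why the second row lines up with exp1-exp12.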
References
• Steve First and Teresa Schudrowitz, Systems Seminar Consultants, Inc., Madison, WI. Arrays Made Easy: An Introduction to Arrays and Array Processing. Paper 242-30, SUGI 30. http://www2.sas.com/proceedings/sugi30/242-30.pdf
• Introduction to SAS. UCLA: Academic Technology Services, Statistical Consulting Group. http://www.ats.ucla.edu/stat/sas/notes2/
• http://www.ats.ucla.edu/stat/sas/seminars/SAS_arrays/default_new.htm