230 likes | 340 Views
Preventing data loss when combining SAS datasets. John Cantrell Univ. North Carolina Chapel Hill, NC. SAS has powerful tools for dataset combination. Proc SQL SAS data step. SAS data step. Datastep methods use SET, MERGE, and BY statements. The workhorses:. Matched merge & concatenation.
E N D
Preventing data loss when combining SAS datasets. John Cantrell Univ. North Carolina Chapel Hill, NC
SAS has powerful tools for dataset combination • Proc SQL • SAS data step
SAS data step • Datastep methods use SET, MERGE, and BY statements
The workhorses: • Matched merge & concatenation. • Data twothree; merge two three;run; • Forgot BY statement, yields 1 to 1 merge instead of matched merge
LOG: errors, warnings & notes • Unintentional 1 to 1 merge caught with “options mergenoby=error;” • Error messages in the LOG help to find mistakes • Not all problems yield error messages: the note, “repeats of BY values” > many-to-many merge
LOG does not flag all mistakes • A serious problem can occur without anymessage in the LOG • For example, data loss by truncation due to unequal variable lengths
Example: Concatenation • Concatenation of 2 datasets named two and three, each input dataset contains: • A numeric variable named id. • A character variable named var.
Example: variable “var” • two:var and three:var arenot the same • two:var has a length of $2, all values begin with the character “2” • three:var has a length of $3, all values begin with the character “3”
Example: the datastep • data twothree; set two three;run; • Note: the output dataset name indicates the order of input datasets in SET statement.
Example: data loss • In output: twothree:id=1 var =21 and var=31 • In inputs: two:id=1 var=21 three:id=1 var=31a • Value of var in twothree that came from three lacks the ‘a’, Why?
Explanation: • Variable length in output is determined by variable length in the first (leftmost) input dataset in the SET statement, two:var $2. • 31a does not fit into a character variable of length $2, so the ‘a’ is lost through truncation. • LOG does not mention data loss.
Example: another datastep • What would happen if we reverse the order of the datasets in the SET statement? • data threetwo; set three two;run;
Example: Matched merge • Matched merge: data from last dataset overwrites data from first dataset. • Usually re-name variables with same name to avoid overwriting (except BY variables). • Maybe you want matched values to overwrite, so you do not rename.
Example: Matched merge • data twothree; merge two three; by id;run;
Length of variable: • Determined by first, or left most, dataset in MERGE statement. • Again, the LOG is silent on data loss through truncation. • What to do?
Solutions: • Complain to SAS • Until SAS corrects, use macro before merge or concatenation to check for unequal character variable lengths • If found, re-size variables
Macro: verifyVariables.sas • Invoke before merge or concatenation. • %verifyVariables(two,three,set);data twothree; set two three;run;
How does it work? • proc contents on two and three, outputs results to datasets • Merges datasets and uses datastep to find like named character variables • It they have different lengths, puts ERROR statement into log.
Macro code and slides: • On website: http://www.unc.edu/~jcantrel/ • Macro in paper. • John Cantrell University of North Carolina Chapel Hill, NC 27599-7411 (919)843-6495 john_cantrell@unc.edu