690 likes | 988 Views
Ensuring Data Integrity with PROC COMPARE and Beyond. Scott A. Miller May 21, 2018 (Iowa SAS Users Group) May 22, 2018 (Nebraska SAS Users Group). Why Compare Data Sets?. Fixing a Bug Did we actually fix the bug? Did we create new problems in the process? Reconciliation / Validation
E N D
Ensuring Data Integrity with PROC COMPARE and Beyond Scott A. Miller May 21, 2018 (Iowa SAS Users Group)May 22, 2018 (Nebraska SAS Users Group)
Why Compare Data Sets? • Fixing a Bug • Did we actually fix the bug? • Did we create new problems in the process? • Reconciliation / Validation • Do the values in the new system match the old system? • Do the values in our report match the original source?
Benefits of Comparison • Catch problems before rollout • Documentation for Change Management • Facilitate thorough testing
Basic Syntax • This will compare all “matching” variables in both data sets. • The Base data set will be listed first in reports. proccomparebase=DSNAME1 compare=DSNAME2; run;
What is a matching variable? • To be a “match” for PROC COMPARE… • Must have the same name • Must have the same data type • NUMERIC or CHARACTER • Not required: • LENGTH • LABEL • FORMAT
Example #1 • Three changes: • Dropped color (red) • Added miles (green) • Reduced prices (blue)
Variables not in both data sets • By default, • By default, all you get is a count • Would you like to see the variable names? • Use the LISTVAR option
Variables not in both data sets proccomparebase=cars1 compare=cars2 listvar; run;
Variable names don’t match? • Just rename them with a data set option. proccomparebase=cars2 compare=cars2b (rename=(description=desc lstprice=price)); run;
Example #2 • Two changes: • Dropped 2010 BMW M3 (red) • Added 2002 Porsche 911 (green)
What happened? • By default, rows are compared side-by-side • What do we want?
What do we want • By default, rows are compared side-by-side • What do we want? • Compare by key value
ID Statement • With the ID statement, records with matching ID values will be compared. • Ideally, use a key that is unique and immutable proccomparebase=cars3 compare=cars4; id VIN; run;
ID Statement WARNING: The data set WORK.CARS3 is not sorted by the ID variables. Observations will be matched using NOTSORTED logic with the assumption that the observations correspond one-to-one. NOTE: At observation 2 the current and previous ID values are: VIN=1111. VIN=5555. ERROR: The ID variable values do not match at observation 2 in the base data set WORK.CARS3 and observation 2 in the comparison data set WORK.CARS4. (When one or both data sets are not sorted by the ID variables, or when NOTSORTED is specified, the observations must match one-to-one.) • The simple solution is to sort the data sets.
ID Statement procsortdata=cars3; by VIN; run; procsortdata=cars4; by VIN; run; proccomparebase=cars3 compare=cars4; id VIN; run;
Example #2 (reminder) • Two changes: • Dropped 2010 BMW M3 (red) • Added 2002 Porsche 911 (green)
Example #2: Report (using ID) • This part tells us 1 observation added, 1 observation deleted • This part tells us all observations with matching IDs are identical
Going Beyond the Basics • What we’ve covered up to this point will provide most of what you need to get started with PROC COMPARE. • Now, let’s cover some issues you may run into, and their solutions.
Truncated Numbers • Let’s compare these two data sets: • Identical, except ending in “.11” vs “.22”
Truncated Numbers • We are missing the last two digits
FORMAT Statement • We can specify a format value comparison report. • Caveats: • Limited to space available -- 10 characters max * • This is not big enough for DATETIME • Display the time only portion with either timeampm10. or tod10. • * using SAS 9.4 M4 proccomparebase=m1 compare=m2; format num1 best10. num2 10.8; run;
Truncated Numbers • Report now shows last two digits
MAXPRINT option • By default, the variable comparison report shows… • Max of 50 differences per variable. • Max of 500 total differences total. • You can override the defaults with MAXPRINT. proccomparebase=ds1 compare=ds2 maxprint= (# per variable, # total); run;
MAXPRINT option • If you have a lot variables: • If you want a lot of examples for one variable: proccomparebase=ds1 compare=ds2 maxprint=(10, 500); run; proccomparebase=ds1 (keep=varname1)compare=ds2 (keep=varname1) maxprint=(250, 250); run;
Same but different • Sometimes PROC COMPARE will flag two numbers as different even though they look the same. • Consider these two equations: • X = A * B * C • X = A * (B * C) • We know that mathematically, these are identical
Same but different • As expected, m1 and m2 are identical data m1; set numbers; x= a * b * c; run; data m2; set numbers; x= a *(b * c); run; numbers
Same but different • What happens when we use PROC COMPARE?
Same but different • What is going on here? • A difference of -1.46E-11 ? • -0.0000000000146 • Root cause: binary approximation of floating point • Beyond the scope of this presentation
Same but different • Many things can trigger this small difference • Rounding won’t always fix this • Real problems get buried in the noise • We need to be able to ignore small differences!
METHOD parameter • We can ignore small differences with the METHOD parameter • EXACT (default) • ABSOLUTE – will ignore differences smaller than the value specified by CRITERION • PERCENT and RELATIVE – threshold is based on the magnitude of value that is changing
Same but different • This code will ignore differences smaller than 0.01 proccomparebase=m1 compare=m2 method=absolutecriterion=0.01; run;
Going Beyond PROC COMPARE • PROC COMPARE can do some amazing things, but has its limitations • We can also compare data sets using our own code • Standard DATA steps • PROC SQL
Find Unmatched Observations • Finding Unmatched Observations is a frequently required task • Especially useful with interdependent variables • We can print samples from both data sets with all variables • Use this to determine root cause
Find Unmatched Observations dataunmatched_aunmatched_b; merge data1 (in=in_a) data2 (in=in_b); byunq_id _ALL_; if(notin_b)thenoutputunmatched_a; elseif(not in_a)thenoutputunmatched_b; run; • Replace unq_id with variable(s) that are unique ID. • Both data sets must be sorted by this unique ID
How Does This Work? dataunmatched_aunmatched_b; merge data1 (in=in_a) data2 (in=in_b); byunq_id _ALL_; if(notin_b)thenoutputunmatched_a; elseif(not in_a)thenoutputunmatched_b; run; • Observations that exactly match on all vars in BY statement are merged • _ALL_ means all variables • Unmatched records are output
Find Unmatched Observations • So which observations would be output?
Find Unmatched Observations • So which observations would be output? • unmatched_a has 3 observations • unmatched_b has 2 observations
Find Unmatched Observations • We’ve dropped var4 from second data set. • Now which observations would be output?
Find Unmatched Observations • The variable lists must match • Use KEEP or DROP • Protip: can automate with SQL dictionary table and macro variable ERROR: BY variable var4 is not on input data set WORK.DATA2. NOTE: The SAS System stopped processing this step because of errors.
Quantify the change • PROC COMPARE is good at telling you… • Is it different? • Is it the same? • What if you want to know… • Was the change significant? • Did things get better or worse?
Quantify the change • We have two data sets with a thousand scores… • What can we say about these scores?