510 likes | 784 Views
Analysis of VA Health Care Data with SAS. Paul G. Barnett, PhD September, 2003. Overview of presentation. Use of SAS to analyze VA health care data at Austin Automation Center Suggestions on how to be a more accurate programmer Suggestions on how to document work.
E N D
Analysis of VA Health Care Data with SAS Paul G. Barnett, PhD September, 2003
Overview of presentation • Use of SAS to analyze VA health care data at Austin Automation Center • Suggestions on how to be a more accurate programmer • Suggestions on how to document work
Common SAS procedures used to analyze health care data • DATA step • PROC CONTENTS • PROC MEANS • PROC SUMMARY • PROC FREQ
Definitions • Record– one observation in a dataset, e.g., a hospital stay • Variable- one of the characteristics of that observation, e.g. the length of stay • Database- a collection of records • Data types: character, numeric, SAS date format
DATA step • The procedure to manipulate datasets • Select records • Select variables • Make new variables by transforming pre-existing variables • Combine datasets
Alternatives ways to read data into a DATA step • Read raw data with INPUT • This is to convert text files to SAS • Some Austin financial data sets are text files e.g., CDR or FMS • Read existing SAS files with: • SET • MERGE
DATA step • Read in data • Takes one record at a time • Create new variables from old • KEEP or DROP statement to select variables • IF statement to select observations
Example of DATA step * Create temporary data set; DATA PM01; * Read permanent data set PTF main; SET IN1.PM01; * Select stays of one specific patient; IF SCRSSN in (‘7677569876’); * Keep the following four variables; KEEP SCRSSN STA3N ADMITDAY DISDAY;
Creating new variables * Example of calculation using SAS dates; * Dates are stored as number of days since 1/1/1960; * Find length of hospital stay; LOS=DISDAY-ADMITDAY;
Sub-setting IF statement • Select records in a data step using IF [true-false condition]; • Only records that meet [true-false condition] will be retained • It is also possible to select records without a data step • Can select records in a proc step using WHERE [true-false condition];
IF-THEN-ELSE statement If [true-false condition] then [statement to carry out when condition is true]; Else [statement to carry out when condition is not true]; * Note that you can have an if-then statement without an else statement;
Example of IF-THEN-ELSE statement * Create count of ICU days in bedsection file; * ICUDAYS are equal to the number of days in this bed section if it is the medical (12) or surgical (63) intensive care unit; IF BEDSECN=12 OR BEDSECN=63 THEN ICUDAYS=LSB; ELSE ICUDAYS=0;
IN operator • The IN operator is a shorthand way of stating a series of conditions separated by OR • The following are equivalent IF statments • IF BEDSECN IN (12,63); • IF BEDSECN=12 OR BEDSECN=63; • When data are character, they should be in quotes, e.g. • IF STA6A in (“640CNH”, “640A0 ”);
Conditions that use ICD-9 codes • ICD-9 code has 6 characters • Decimal in ICD-9 code is not recorded • Blanks are on right • Thus code ICD-9 250.1 is stored as “2501_ _”
Substring function • Manipulate character strings • Syntax: • SUBSTR(variable,start,number of chars) • Example • DXLSF=‘2501_ _’ • X=SUBSTR(DXLSF,1,3); • X has the value ‘250’
Use of Substring function to create an indicator • IF SUBSTR(DXLSF,1,3)=‘250’ THEN DIABETES=1; ELSE DIABETES=0; • Checks if DXLSF takes any value beginning with 250, 250.0, 250.1, 250.2, etc. • If condition is true, variable DIABETES is assigned a value of 1, otherwise it is assigned a value of zero.
The “=:” operator • Shorthand for substring function • with starting place at one • for number of characters on the right hand side of comparison • The following conditions are equivalent • SUBSTR(DXLSF,1,3)=‘250’; • DXLSF=:’250’;
Array Statement • A collection of variables • Useful to process series of variables in same fashion • Syntax for array with N members ARRAY Name(N) [Variables to be mapped to the array]; • Example of 10 member array ARRAY DX(10) DXLSF DXF2-DXF10; • Variables can now be referred to by their array name, e.g., DXF3 can also be called DX(3)
Do Statement Useful for a series of statements after a condition is tested IF [condition] THEN DO; [statements]; END;
Other Do Loops • Execute statements as long as condition is met DO WHILE [Condition]; [statements]; END; • Useful to repeatedly execute statements DO I=1 to 10; [statements]; END;
Example of using an array in a do loop Check all 10 diagnoses to see if any of them are the ICD-9 code for diabetes * Initialize variable to zero; DIABETES=0; * Start do loop and run it 10 times; DO X=1 to 10; IF DX(X)=:’250’ THEN DIABETES=1; END; Why can’t an else statement be used to set DIABETES=0?
Combining files: concatenation vs. merge • Concatenation • Stack data sets with different observations described by variables • SET data1 data2 • Merge • Combine datasets with common elements but different variables • MERGE data1 data2 • Requires BY statement
Example of concatenation • Data sets: • Become:
Example of merging • Data sets to be merge BY name: • Become:
MERGE • BY statement may have more than one variable • Datasets must have already been sorted • At least one dataset must have no duplication of BY variable values
Common errors with MERGE to avoid: Many-to-Many merge • One-to-one merge or one-to-many merge are o.k. • Many-to-many merge is not possible
Example of one-to-many merging • Data sets to be merged BY name: • Become:
Impossible to merge many-to-many • Data sets cannot be merged BY name:
Common errors with MERGE to avoid- • Variables in both data sets but with different value • One value will copy over the other value • Lesson: The only variables that are in both datasets should appear in the BY statement • Alternatives: • Rename variables that appear in both datasets or • Use PROC SQL
Manipulating result of merge statement with IN values * IN option indicates which data set record came from; DATA new; MERGE old1 (IN=INOLD1) old2 (IN=IN0LD2); BY NAME IF INOLD1 and INOLD2; * If statement keeps only those records that are in both data sets;
Another example of manipulating result of merge statement * IN option indicates which data set record came from; DATA both oneonly twoonly; MERGE old1 (IN=INOLD1) old2 (IN=IN0LD2); BY NAME IF INOLD1 and INOLD2 THEN OUTPUT both; ELSE IF INOLD1 THEN OUTPUT oneonly; ELSE OUTPUT twoonly; * If-else statements create dataset with records that merged, and datasets of records that didn’t merge;
PROC CONTENTS * Basic information about a new data set; PROC CONTENTS DATA=file;
PROC MEANS • Basic information about a new data set; PROC MEANS DATA=file ;VAR (list of variables); • Can specify particular statistics PROC MEANS DATA=file N MEAN MIN MAX SUM;VAR (list of variables); • Minimum and maximum useful to find unexpected values
PROC PRINT PROC PRINT DATA=source NOOBS; VAR [names of variables to print]; FORMAT variables format; SUM [numeric variables]; • NOOBS option suppresses observation number • Some useful formats: • MMDDYY. To print dates • COMMA12. To print comma delmited numbers, e.g., 21,000,000
PROC SUMMARY • Extremely useful to summarize data in a number of records • Calculate number of ICU days by medical center • Find patients who had a diagnosis of diabetes in any outpatient visit during the year
PROC SUMMARY EXAMPLE * Summarize data in file created from bed section; * NWAY option specifies not to create records with totals or subtotals; PROC SUMMARY DATA=PB01 NWAY; * Summarize by medical center; BY STA3N; * Count of ICU data created in prior data step; VAR ICUDAYS; * Create new data set with one record per med center; OUTPUT OUT=new SUM (ICUDAYS)=ICUDAYS;
PROC SUMMARY EXAMPLE 2 * Examine whether patient every had outpatient visit with diabetes diagnosis; * NWAY option specifies don’t create record with totals; PROC SUMMARY DATA=se NWAY; BY SCRSSN; VAR DIABETES; OUTPUT OUT=new MAX(DIABETES)=DIABETES; *In prior data step, variable diabetes assigned has a value of 0 or 1 based on diagnoses; * We find the maximum value achieved for this patient;
PROC SUMMARY DETAILS • CLASS vs. BY statement • Statements have equivalent effect • BY statement is for large jobs. Must sort data first. • CLASS is for small jobs. No need to presort data. • NWAY option suppresses creation of record with total • and subtotals if there are multiple CLASS variables
PROC SUMMARY details- ID statement • Specifies variable whose value is to be carried forward; here is an appropriate use: • CLASS SCRSSN • ID DOB (birth date) • This is appropriate because every record with the same SCRSSN also has the same birthrate • ID statement should not be used if different records being summarized have different values • Use the ID statement with care!
PROC FREQ • Counts number of records (frequency) with each value of a categorical variable • Can make two-dimensional table for all combinations of two categorical variables
PROC TABULATE • Find sub-total of a numeric variable for each value of a categorical variable
How to be an accurate programmer • Check • Double check • Audit • Write down what you did (Documentation)
Check your work • Check work by hand • Review log file. You should have no error messages or warnings! • Work with a small data set • Use OBS=NN option to limit number of records processed • Print sample records and check that result is what you intended
Double check your work • Double check • Find result by independent method • Use an alternate program- was the result the same? • Audit programming • Have another programmer review your code
Document your work • Use comments * Comment ; /* Block of comments */ • Write comments in program /************************* * Program: OUTCOM1.SAS * Date: September 2, 2003 * Programmer: P. Barnett *************************/
Documentation (cont.) • Use variable labels • Define label when creating a new variable LABEL DOB=“Birth date”; One label statement can label many variables • Use dataset labels • In a Data step that creates a new dataset, use option DATA OUT1.OUTCOM02 (LABEL=“QUERI Outcomes Outpatient Visits FY02”);
Documentation (cont.) • Use easily understood variable and file names • Consider giving program same name as dataset that it creates • Use indentation • to show which “DO” statement matches to which “END” statement • To show which “IF” statements matches to which “ELSE statement
Documentation (cont.) • Keep a list of programs • Program name • Date created • Purpose • Names of source data sets • Names of data sets created • Distinguish “final” programs from draft programs
Documentation (cont.) • You are the audience for your documentation!
How to resolve problems • Break program into many small steps • Print interim results after each small step • Test syntax of program commands with a small invented data set • RTFM (Read the Funny Manual) • If these steps don’t work, ask for help