530 likes | 651 Views
Alan Elliott. Dealing with Missing Values. Title "Dealing with Missing Values in SQL"; PROC SQL ; select INC_KEY,GENDER, RACE, INJTYPE, case when ISS=- 81 then . else ISS end as ISS, case when EDGCSTOTAL=- 81 then . else EDGCSTOTAL end as EDGCSTOTAL,
E N D
Dealing with Missing Values Title "Dealing with Missing Values in SQL"; PROCSQL; select INC_KEY,GENDER, RACE, INJTYPE, case when ISS=-81 then . else ISS end as ISS, case when EDGCSTOTAL=-81 then . else EDGCSTOTAL end as EDGCSTOTAL, case when AGE=-81 then . else AGE end as AGE from "C:\sasdata\trauma_sm"; quit; SEE SQLMISSING.SAS
Partial Output Missing Values
Character variables • Take INJTYPE out of initial list of variables on the SELECT statement and add the following code: end as AGE, case when INJTYPE="Burn" then "" else INJTYPE end as INJTYPE • And rerun
Results WAS BURN
Combined CASE / Conditional PROCSQL; select INC_KEY,GENDER,RACE, AGE,DISSTATUS, case when AGE LT 10 and DISSTATUS in ("Dead") then "CHILD DEATH" when AGE GE 10 and DISSTATUS in ("Dead") then "OTHER DEATH" ELSE "Alive" end as TYPEDEATH from "C:\sasdata\trauma_sm"; quit; • SEE SQLCASE.SAS
Order by TYPEDEATH (Descending) Modify the code to read … … end as TYPEDEATH from "C:\sasdata\trauma_sm" ORDER BY TYPEDEATH DESC;
Summarize and Count PROCSQL; select case when AGE LT 10 and DISSTATUS in ("Dead") then "CHILD DEATH" when AGE GE 10 and DISSTATUS in ("Dead") then "OTHER DEATH" ELSE "Alive" end as TYPEDEATH, count(calculated TYPEDEATH) as COUNTDEATH from "C:\sasdata\trauma_sm" GROUP BY TYPEDEATH; quit; SEE SQLSUMMARY1.SAS (NOTE: Take calculates out of above statement, and observe error.)
Reorder the table Change the code (at end to read) from "C:\sasdata\trauma_sm" GROUP BY TYPEDEATH order by typedeath;
SQL Summary Functions Summary Function Description AVG, MEAN Average or mean of values COUNT, FREQ, N Aggregate number of non-missing values CSS Corrected sum of squares CV Coefficient of variation MAX Largest value MIN Smallest value NMISS Number of missing values PRT Probability of a greater absolute value of Student’s t RANGE Difference between the largest and smallest values STD Standard deviation STDERR Standard error of the mean SUM Sum of values SUMWGT Sum of the weight variable values which is 1 T Testing the hypothesis that the population mean is zero USS Uncorrected sum of squares VAR Variance
Using Some Summary Functions procsql; select count(brand) as Tot_Cars, sum(minivan) as TOT_Minivans, min(CITYMPG) as MIN_MPG, max(CITYMPG) as MAX_MPG, SUM(CITYMPG)/COUNT(CITYMPG) as AVG_MPG from sasdata.cars; quit; SEE SQLSUMMARY2.SAS
Add group statement from sasdata.cars group by minivan; quit;
Compare Sort (Data Step) DATA SORT DATA MYDATA; INPUT @1 LAST $20. @21 FIRST $20. @45 PHONE $12.; Label LAST = 'Last Name' FIRST = 'First Name' PHONE = 'Phone Number'; DATALINES; Reingold Lucius 201-876-0987 Jones Pam 987-998-2948 Etc… ; *-------- Modify to sort by first name within last (by last first); PROCSORT; BY LAST FIRST; PROCPRINT LABEL NOOBS; TITLE 'ABC Company'; TITLE2 'Telephone Directory'; RUN; RUN this code, observe results. SEE DATASORT.SAS
Sort Using SQL PROCSQL; SELECT LAST LABEL="Last Name", FIRST LABEL="First Name", PHONE LABEL "Phone Number" from MYDATA ORDER by LAST, FIRST; QUIT; SEE SQLSORT.SAS Note – variables appear in table in order selected…
Results for SQL Sort Note – easier to order variable names.
Appending/Concatenating Two Files • Recall from the Data Step, to append two data files you can use the code DATA NEW; SET OLD1 OLD2; RUN; (See SQLAPPEND.SAS)
Append Files using SQL PROCSQL; select * from old1 union select * from old2; QUIT; UNION-means concatenate the query results. It produces all the unique rows that result from both queries
Basic SQL Operators Combine two or more queries in various ways by using the following set operators:
Duplicate records • Suppose there are duplicate records. • See SQLAPPEND2.SAS DATAOLD1; INPUT SUBJ $ AGE YRS_SMOKE; datalines; 001 34 12 003 44 14 004 55 35 006 21 3 011 33 11 ; DATA OLD2; INPUT SUBJ $ AGE YRS_SMOKE MARRIED; datalines; 006 21 3 . 011 33 11 1 012 25 19 0 023 65 45 1 032 71 55 1 ; RUN; This record added This record added
Union appends, keeps unique rows PROCSQL; select * from old1 union select * from old2; QUIT; One row 6 is unique. Two row 11’s are unique. (UNION keeps all unique.) This is the same code from before – only difference is the duplicated records in the data sets.
Union all • To keep all rows, use UNION ALL PROCSQL; select * from old1 union all select * from old2; QUIT; Add ALL to the code and re-run.
Results – Union All Both SUBJ 6 records included (even though not unique).
EXCEPT • To keep only the data from the first data set that are not in the 2nd set (but all variables) use EXCEPT PROCSQL; select * from old1 except select * from old2; QUIT; Run this code and observe output.
Except Output Note – Record 6 is the same, so it is not kept. Record 11 is different so it is kept.
Switch Data Set order PROCSQL; select * from old2 except select * from old1; QUIT; Run this code and observe output.
Output Note – (Same as before) Record 6 is the same, so it is not kept. Record 11 is different so it is kept.
Except ALL datalines; 001 34 12 003 44 14 004 55 35 006 21 3 006 21 3 011 33 11 ; Suppose there was a duplicate record 006 in the first data set. Using EXCEPT, record 6 would not appear in the result because there is a record 6 in the second data set. If you want a duplicate record that is not a duplicate matched in the 2nd data set to appear in the result, use EXCEPT ALL See SQLAPPEND2a.SAS Added duplicate record
Except ALL PROCSQL; select * from old1 except all select * from old2; QUIT; 6 is a duplicate record in OLD1 and there is not a duplicate 6 in OLD 2, so it appears in the results
Intersect • The INTERSECT command returns only those records that occur in both data sets. • Change EXCEPT ALL to INTERSECT PROCSQL; select * from old1 intersect select * from old2; QUIT;
Intersect Results Only record 6 was duplicated in both data sets.
Compare Union with Outer Union • UNION- produces all unique rows from both queries. • OUTER UNION – concatenates the series results • See SQLAPPEND3.SAS
Compare Output PROCSQL; select * from old1 UNION select * from old2; QUIT; PROCSQL; select * from old1 OUTER UNION select * from old2; QUIT;
Results of Outer Union Note: SQL allows you to create a data set with DUPLICATE variable names.
Cartesian Join • Combines ALL rows from one file with ALL rows from another. PROCSQL; select * from old1, old2; QUIT; SEE SQLCARTESIAN_JOIN.SAS
Cartesian Join Note SUBJ 1 appears 5 times
Using Table Aliases select a.subj, a.age, b.subj as sub_from_b, etc… from old1 a, old2 b Table Alias allows you to distinguish variables from different tables without ambiguity. Note “a” variables prefix specifies that the variable is from the table “old1” since “old1” is labeled as the “a” table below. Table old1 is labeled as table “a” in this code.
Inner Join (Using Table Alias) PROCSQL; select a.subj, a.age, b.subj, b.age, b.married as Married from old1 a, old2 b where a.subj=b.subj; QUIT; In an INNER JOIN, only observations with both key values matching are selected. SEE SQL_INNER_JOIN.SAS
Inner Join Code PROCSQL; select a.subj, a.age, b.subj, b.age, b.married as Married from old1 a, old2 b where a.subj=b.subj; QUIT; Note Married is from table “b” but will be called Married in output “Where” limits the join to those that are in BOTH Tables
Inner Join (1-to-1 merge) Example 2 PROCSQL; select * from old1, old2 where old1.subj=old2.subect; QUIT; Use the table names as the alias. SEE SQL_INNER_JOIN2.SAS
Results Note – data for both files where not in order. Only those in both tables with a matching key variable are included in the result. EXERCISE – Add the phrase “order by old1.subj” to put the table in Subject order.
One to Many Merge Suppose you have data like this… you want to match building to employees. DataLOC; input BUILDING $ Location $; datalines; A1 DALLAS A2 WACO A3 HOUSTON ; RUN; DATA EMPLOYEE; input EID $ LOC $ ROOMNUMBER; datalines; 001 A2 103 003 A1 100 005 A1 1001 006 A3 12 002 A1 101.1 ; run;