520 likes | 530 Views
This presentation provides an overview of SQL and how it can be used alongside SAS to complete a themantic project. It covers the basic operations of SQL and demonstrates their application in a step-by-step manner. No prerequisites required.
E N D
SQL and SAS Russell.Hendel@Cms.hhs.gov, Ph.d A.S.A SAS DAY – Oct 31 2007
OVERVIEW - GOALS • What is SQL? • When do you use it? • Virtues of SQL over SAS – readability, ease of learning • Virtues of SAS over SQL – speed,compactness • How can SQL help me when thinking about requirements for projects? (For laypeople) • NO PREREQUISITES NEEDED
METHOD TO PRESENT • We will use a THEMATIC presentation • We will take one project-the THEME • And show how each step can be done in SQL and SAS • Along the way we will learn everything we need to know about SQL and its virtues
THEMATIC PROJECT • START FILE • File with Medicare Eligibles • Contains state, county, codes (SSA and FIPS), zip, gender, agegroup, Pt A entitlement, Pt A term, Pt B entitlement, Part B term, Eligible count for this record
THEMATIC PROJECT • END FILE • Want Subtotals and aggregation: • Official State, County, Code list used in Office of Actuary, along with All Part D eligibles (Entitled to Part A or B) • May also want Part A and B eligibles (Part C) • Goal of presentation: To get FROM the START TO the END
SMALL OVERVIEW OF SQL • What is sql? • SQL is a computer language designed for databases • SQL is a summary of several dozen attempts to create database computer languages • It was found that all such languages had 8 items in common. SQL has these 8 items and is a COMPUTER STANDARD for any database language • Each database language should be SQL compliant (DB2, SAS, ORACLE, etc)
OVERVIEW OF SQL • A database FILE is simply a file with ROWS and COLUMNS • Each ROW is a RECORD • Each COLUMN is a FIELD • The most typical example is say the children in your family • Each child would have Fields associated with them FirstName, Age, School they attend, Birthday etc • This information is STORED in a rectangular table
OVERVIEW OF SQL • A DATABASE is a collection of DATABASE FILES • For example besides your database file for your family you may have a database from the yellow pages which lists for each age group stores carrying clothing, toys and other FIELDS • You can use the DATABASE to find out what stores are good for each child to buy what they want
OVERVIEW OF SQL • We just explained what a DATABASE is • We also gave a simple example • Now we explain what a DATABASE LANGUAGE is • That is we explain what you would want to do with your database files • It turns out there are 8 operations that every database should be able to do • We will give an alternate version of these 8 basic operations that are used in practice
SQL OVERVIEW • A database language should enable you to • 1)Take a SUBSET OF ROWS of a database file (For example: All children in your family under 5) In sql we call this SELECTING • 2) Take a SUBSET OF COLUMNS (e.g. FirstName and School) In SQL we call this PROJECTING • 3) Make a new database file by UNIONing two database files with common fields
SQL OVERVIEW • 4) JOIN two tables by keeping the number of rows the same but adding columns(For example take my family database file and add a column showing the store where to buy clothing for that child) • 5) Suppose you take your FAMILY DATABASE FILE and JOIN to it your EXPENSES for each child(Another database file). • You might want to SUBTOTAL expenses by GENDER
SQL OVERVIEW • In other words: You might want a new table with two rows that tells you BOY – SO MANY $; GIRLS – SO MANY $ • In SQL we refer to these as AGGREGATE FUNCTIONS. • SQL allows 5 types of AGGREGATION: Sum, count, average, min, max
SQL OVERVIEW • SUMMARY: • Every database file should allow you to • SELECT certain rows (e.g. children <5) • PROJECT columns (e.g. School data) • UNION(Add rows) • JOIN(Add columns (e.g. add cloth stores • AGGREGATE FUNCTIONS (sum,…) • The good news: Only need to learn 5 pieces of code – Great advantage of SQL-ease of learning
HOW DO YOU DO A PROJECT • To do a project you • START with certain FIELDS • Decide what FIELDS you want to END • Decide how to GO from START to END • At each step you can do 1 of 5 things • SELECT,PROJECT,JOIN, UNION, AGGREGATE
QUERY OPTIMIZATION • In general each project will allow SEVERAL ways to go from START to END • Database experts have identified rules that OPTIMIZE projects The basic rule is • Do SELECT,PROJECT early • Do JOIN, AGGREGATE later
OUR PROJECT: STEP 1 • Step 1 in ANY project is data cleaning • I have all this data on eligibles. • But is the data all OK • One way of checking is listing the STATES involved. I will then DELETE those states that are BAD • This is a SELECTION (reduction of ROWS) and uses the OPTIMIZATION RULE (SELECT early)
OUR PROJECT STEP 1 • STEP 1 is done the same whether in SAS or SQL. Here is the code. We will explain the underlined keywords • proc sql; • create table d.state as • select distinct substr(sc,1,2) as st • from d.Start; • ;quit; • Run;
OUR PROJECT: Step 2 • The output of the code gives me a list of all states codes in my eligible file • I find two codes I don’t recognize 00 99 • These are codes for badly coded records • About 1% of the records are coded this way • This is actually quite normal • I will now get rid of them using a SELECT
PROJET: STEP 2 SELECT • PROGRAM to delete bad states • proc sql; • create table d.Start2 as • select distinct * • from d.Start • where substr(sc,1,2) not in ('99','00'); • quit;
PROJECT: Step 2 SAS • Here is the SAS code accomplishing the same function – Notice how the SQL and SAS are about the same in readability and compactness • data d.Start2; • set d.Start; • if substr(sc,1,2) in ('99','00') then delete; • run; • SAS in general is usually more compact; SQL is more readable (EXERCISE: Why / How ?)
SQL JOINS - INTERLUDE • PROBLEM: Many versions of SQL • Each one slightly different • Differences lie in JOIN operation • Best not to rely on documentation • Here are some simple programs • We test the 5 types of JOINS • JOIN, INNER JOIN, OUT JOIN, LEFT JOIN and RIGHT JOIN
SQL JOINS • We need two datasets TEMP and TEMP2 each with two fields. Here is TEMP which has 2 columns A,B • data d.temp; • input A B;
SQL JOINS • Here is the CARD statement for TEMP • cards; • 1 1 • 1 2 • 1 3 • Etc. FIELD / COL A can be 1,2,3 • FIELD colum B can be 1,2,3 • We get 9 records
SQL JOIN • Data set TEMP2 • data d.temp2; • input A C; • cards; • 1 7 • 2 8 • 4 9 • ;
SQL JOIN • What should you notice • TEMP.A has 3 values 1,2,3 • TEMP2.A has 3 values 1,2,4 • TEMP.A has values not in TEMP2.A • TEMP2.A has values not in TEMP • There are also common values • Can you the audience name the above
SQL JOIN • What else should you notice • TEMP has COLUMNS A,B • TEMP2 has COLUMNS A,C • The two files have a COMMON COLUMN • The two files also have their own columns • The common column allows a JOIN • Think back to CLOTH stores for KIDS • What was the field we JOINED ON (ANS=)
SQL JOIN • Here is code for an SQL JOIN • proc sql; • create table d.temp3 as • select * • from d.temp join d.temp2 • on temp.A = temp2.A; • quit;
SQL JOIN • The OUTPUT of the JOIN query is a table • A B C Joined • 1 1 7 • 1 2 8 Temp Temp2 • 1 3 9 (1,3) join (1,9) • The table only has A=1,2. A cannot equal 3,4 • The JOIN in SQL SAS takes only A values common to BOTH tables and leaves out A values in only one table
SQL JOIN • In other languages the JOIN behaves differently • When you want only values common to both files you use the INNER JOIN • The INNER JOIN query has identical language to the JOIN query except that the word JOIN is replaced by INNER JOIN. In SAS the output is the same
SQL JOIN • Suppose I wanted all values from my TEMP table whether or not they are linked in my temp2 table • Is this reasonable? • Sure it is. Think back to our example • You want all your children listed WHETHER OR NOT there is store to buy clothing in
SQL JOIN • So you want the A column in the JOIN table to have all A values in the TEMP table whether or not they have values in the TEMP2 table • We call this a LEFT JOIN because in describing the JOIN OF TEMP WITH TEMP2, “TEMP” is on the LEFT and “TEMP2” is on the right. Neat!?!
SQL JOIN • You make a LEFT JOIN query with identical language to a JOIN query except that the word JOIN is replaced by LEFT JOIN • A typical record in the LEFT JOIN query with values from TEMP but not from TEMP2 could look like this • A=3 B=1 C=. • Here the “.” in the C column indicates a MISSING value. • If the field was not numeric the C col is BLANK
SQL JOIN • The RIGHT JOIN query has all A values from the TEMP2 table even if they don’t occur in the TEMP1 table • The language is identical except you use the word RIGHT JOIN • The output has the PERIOD (or blank) for missing values.
PROJECT: Step 3 • So far I removed badly coded states (00,99) • I want my final table to only list states and counties found in the office of the actuary tables for which we have rates • So I have to do some more ROW reduction (SELECTION) • How do I accomplish this
PROJECT STEP 3 • Well I take my START2 table • I will JOIN START2 (add columns) from the Office of Actuary Table • The JOIN column will be _________? • The STATE field (or its code) • What type of JOIN? Well I don’t want anything not in the ACTUARY table and I don’t want anything not in my START2 • So I want an INNER JOIN (or JOIN in SAS)
PROJECT: STEP 3 • The SQL codes is as follows • proc sql; • create table d.START3 as • SELECT * • FROM d.Start2 Inner Join d.Actuary • ON Start2.sc = Actuary.sc; • quit;
PROJECT STEP 3 • I now have in my file • All records with eligible counts • Where the state and county codes are also recognized in the actuary file • I am now in a position to do some more subsetting and get my totals
PROJECT STEP 3 • But am I done with step 3? • Shouldn’t I look at the data deleted • That is: Shouldn’t I look at the records whose county codes arent in the actuary file • I can do this doing some queries as follows
PROJECT: STEP 3 • proc sql; • CREATE table d.Start3LeftOut as • SELECT * • FROM d.Start2 Left Join d.Actuary • ON Start2.sc = Actuary.sc ; CREATE table d.Start3LeftOut as • select * • from d.Start3LeftOut • where d.Start3LeftOut.state = ' ‘; quit;
PROJECT: STEP 3 • IN WORDS • I Left Join my original file with the actuary • I then inspect those records with a blank or period • Those are precisely the records that could not be matched • I now review these codes or subtotal them to find out how bad the data is
PROJECT STEP 3 • Although SQL is READABLE and carries across many platforms, SAS is much more compact. Here is the SAS code to accomplish all the preceding; • SAS however requires sorting the data first • SAS also requires using the IN variable • This is technical sas code which should only be used by people who are familiar with it
PROJECT STEP 3 • proc sort data=d.Actuary;by sc; run; • proc sort data=d.Start2; by sc; run; • data d.Start3 d.Start3a d.Start3b; • merge d.Start3(in=t1) d.Actuary(in=t2); • by sc; • if t1 and t2 then output d.Start3; • if t1 and not t2 then output d.Start3A; • if not t1 and t2 then output d.Start3B; • run;
PROJECT: STEP 4 • What next? • Remember the OPTIMIZE golden rule • SELECT / PROJECT early • GET RID Of unwanted ROWS/COL • We got rid of the unwanted ROWS • So next we get rid of unwanted COLS
PROJECT STEP 4 • SAS CODE (Compact) • data d.Start4; • set d.Start3; • keepsc st county elig fips fg fa fat fb fbt;run; • SQL CODE (Slightly more readable?); • PROC SQL; CREATE TABLE d.Start3 AS • SELECT sc,st ,county ,elig ,fips ,fg ,fa ,fat ,fb ,fbt • FROM d.Start3; QUIT;
PROJECT STEP 5 • Next step is given as an exercise • I explained the importance of LOOKING at the data vs. EXPECTING it TO LOOK a certain way • So I should CHECK on the various possibliities of A,B Entitlement Term; • This is similar to checking on the STATES • It allows me to see all the cases I am revuing • QUIZ: What are the key words to be used?
PROJECT STEP 6 • I have done all my row/col reductions • I am now ready to SUBTOTAL by STATE COUNTY (I don’t want to subtotal by zip) • The sas code is very compact /technical • (Fa, fat=Pt A entitlement flag, Part A “T”erm;) • PROC MEANS Data=d.Start4 noprint; • var elig; • by sc st county fips fa fat fb fbt; • output out=d.Start6 SUM; run; data d.Start6;set d.Start6 ; drop _Type_ _Freq_;
PROJECT STEP 6 • Here is the more readable SQL code • proc sql; • create table d.Start6 as • select sc,st,county,fips,FA,FAT,FB, FBT , ,SUM(Elig) as Elig • from d.Start4 • group by sc,st,county,fips,FA,FAT,FB,FBT • order by elig desc; • Note New keywords (SUM, GROUP BY, ORDER)
PROJECT STEP 7 • What next? Well we have ENTITLEMENT STATUSES and TERM STATUSES • But I want PART D status • Part D = Entitled to A or B • = Pt A status=Yes(1) and Part A term=0 • OR Pt B status=Yes and Part B term=0 • Need code to make a new field • Will only give SQL version since late • Might also want (to add) A and B for which we use AND vs OR
PROJECT: STEP 7 • PROC SQL;create table d.Start7 as • select *, • CASE • WHEN (fa='1' and fat='0') and (fb='1' and fbt='0') • THEN 'A and B' • WHEN (fa='1' and fat='0') or (fb='1' and fbt='0') • THEN 'A or B' • ELSE ' ' • END as STATUS • from d.Start6; QUIT; RUN;
PROJECT COMPLETION • There are two steps left • First: We have to ONLY select rows where the status is A and B or A or B (depending what we want). • Exercise: What is the KEYWORD? What is a previous example? • LASTLY: We have to AGGREGATE AGAIN (Why again???) • What do we GROUP BY? What do we AGGREGATE?