1 / 52

SQL and SAS

This presentation provides an overview of SQL and how it can be used alongside SAS to complete a themantic project. It covers the basic operations of SQL and demonstrates their application in a step-by-step manner. No prerequisites required.

paular
Download Presentation

SQL and SAS

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. SQL and SAS Russell.Hendel@Cms.hhs.gov, Ph.d A.S.A SAS DAY – Oct 31 2007

  2. OVERVIEW - GOALS • What is SQL? • When do you use it? • Virtues of SQL over SAS – readability, ease of learning • Virtues of SAS over SQL – speed,compactness • How can SQL help me when thinking about requirements for projects? (For laypeople) • NO PREREQUISITES NEEDED

  3. METHOD TO PRESENT • We will use a THEMATIC presentation • We will take one project-the THEME • And show how each step can be done in SQL and SAS • Along the way we will learn everything we need to know about SQL and its virtues

  4. THEMATIC PROJECT • START FILE • File with Medicare Eligibles • Contains state, county, codes (SSA and FIPS), zip, gender, agegroup, Pt A entitlement, Pt A term, Pt B entitlement, Part B term, Eligible count for this record

  5. THEMATIC PROJECT • END FILE • Want Subtotals and aggregation: • Official State, County, Code list used in Office of Actuary, along with All Part D eligibles (Entitled to Part A or B) • May also want Part A and B eligibles (Part C) • Goal of presentation: To get FROM the START TO the END

  6. SMALL OVERVIEW OF SQL • What is sql? • SQL is a computer language designed for databases • SQL is a summary of several dozen attempts to create database computer languages • It was found that all such languages had 8 items in common. SQL has these 8 items and is a COMPUTER STANDARD for any database language • Each database language should be SQL compliant (DB2, SAS, ORACLE, etc)

  7. OVERVIEW OF SQL • A database FILE is simply a file with ROWS and COLUMNS • Each ROW is a RECORD • Each COLUMN is a FIELD • The most typical example is say the children in your family • Each child would have Fields associated with them FirstName, Age, School they attend, Birthday etc • This information is STORED in a rectangular table

  8. OVERVIEW OF SQL • A DATABASE is a collection of DATABASE FILES • For example besides your database file for your family you may have a database from the yellow pages which lists for each age group stores carrying clothing, toys and other FIELDS • You can use the DATABASE to find out what stores are good for each child to buy what they want

  9. OVERVIEW OF SQL • We just explained what a DATABASE is • We also gave a simple example • Now we explain what a DATABASE LANGUAGE is • That is we explain what you would want to do with your database files • It turns out there are 8 operations that every database should be able to do • We will give an alternate version of these 8 basic operations that are used in practice

  10. SQL OVERVIEW • A database language should enable you to • 1)Take a SUBSET OF ROWS of a database file (For example: All children in your family under 5) In sql we call this SELECTING • 2) Take a SUBSET OF COLUMNS (e.g. FirstName and School) In SQL we call this PROJECTING • 3) Make a new database file by UNIONing two database files with common fields

  11. SQL OVERVIEW • 4) JOIN two tables by keeping the number of rows the same but adding columns(For example take my family database file and add a column showing the store where to buy clothing for that child) • 5) Suppose you take your FAMILY DATABASE FILE and JOIN to it your EXPENSES for each child(Another database file). • You might want to SUBTOTAL expenses by GENDER

  12. SQL OVERVIEW • In other words: You might want a new table with two rows that tells you BOY – SO MANY $; GIRLS – SO MANY $ • In SQL we refer to these as AGGREGATE FUNCTIONS. • SQL allows 5 types of AGGREGATION: Sum, count, average, min, max

  13. SQL OVERVIEW • SUMMARY: • Every database file should allow you to • SELECT certain rows (e.g. children <5) • PROJECT columns (e.g. School data) • UNION(Add rows) • JOIN(Add columns (e.g. add cloth stores • AGGREGATE FUNCTIONS (sum,…) • The good news: Only need to learn 5 pieces of code – Great advantage of SQL-ease of learning

  14. HOW DO YOU DO A PROJECT • To do a project you • START with certain FIELDS • Decide what FIELDS you want to END • Decide how to GO from START to END • At each step you can do 1 of 5 things • SELECT,PROJECT,JOIN, UNION, AGGREGATE

  15. QUERY OPTIMIZATION • In general each project will allow SEVERAL ways to go from START to END • Database experts have identified rules that OPTIMIZE projects The basic rule is • Do SELECT,PROJECT early • Do JOIN, AGGREGATE later

  16. OUR PROJECT: STEP 1 • Step 1 in ANY project is data cleaning • I have all this data on eligibles. • But is the data all OK • One way of checking is listing the STATES involved. I will then DELETE those states that are BAD • This is a SELECTION (reduction of ROWS) and uses the OPTIMIZATION RULE (SELECT early)

  17. OUR PROJECT STEP 1 • STEP 1 is done the same whether in SAS or SQL. Here is the code. We will explain the underlined keywords • proc sql; • create table d.state as • select distinct substr(sc,1,2) as st • from d.Start; • ;quit; • Run;

  18. OUR PROJECT: Step 2 • The output of the code gives me a list of all states codes in my eligible file • I find two codes I don’t recognize 00 99 • These are codes for badly coded records • About 1% of the records are coded this way • This is actually quite normal • I will now get rid of them using a SELECT

  19. PROJET: STEP 2 SELECT • PROGRAM to delete bad states • proc sql; • create table d.Start2 as • select distinct * • from d.Start • where substr(sc,1,2) not in ('99','00'); • quit;

  20. PROJECT: Step 2 SAS • Here is the SAS code accomplishing the same function – Notice how the SQL and SAS are about the same in readability and compactness • data d.Start2; • set d.Start; • if substr(sc,1,2) in ('99','00') then delete; • run; • SAS in general is usually more compact; SQL is more readable (EXERCISE: Why / How ?)

  21. SQL JOINS - INTERLUDE • PROBLEM: Many versions of SQL • Each one slightly different • Differences lie in JOIN operation • Best not to rely on documentation • Here are some simple programs • We test the 5 types of JOINS • JOIN, INNER JOIN, OUT JOIN, LEFT JOIN and RIGHT JOIN

  22. SQL JOINS • We need two datasets TEMP and TEMP2 each with two fields. Here is TEMP which has 2 columns A,B • data d.temp; • input A B;

  23. SQL JOINS • Here is the CARD statement for TEMP • cards; • 1 1 • 1 2 • 1 3 • Etc. FIELD / COL A can be 1,2,3 • FIELD colum B can be 1,2,3 • We get 9 records

  24. SQL JOIN • Data set TEMP2 • data d.temp2; • input A C; • cards; • 1 7 • 2 8 • 4 9 • ;

  25. SQL JOIN • What should you notice • TEMP.A has 3 values 1,2,3 • TEMP2.A has 3 values 1,2,4 • TEMP.A has values not in TEMP2.A • TEMP2.A has values not in TEMP • There are also common values • Can you the audience name the above

  26. SQL JOIN • What else should you notice • TEMP has COLUMNS A,B • TEMP2 has COLUMNS A,C • The two files have a COMMON COLUMN • The two files also have their own columns • The common column allows a JOIN • Think back to CLOTH stores for KIDS • What was the field we JOINED ON (ANS=)

  27. SQL JOIN • Here is code for an SQL JOIN • proc sql; • create table d.temp3 as • select * • from d.temp join d.temp2 • on temp.A = temp2.A; • quit;

  28. SQL JOIN • The OUTPUT of the JOIN query is a table • A B C Joined • 1 1 7 • 1 2 8 Temp Temp2 • 1 3 9 (1,3) join (1,9) • The table only has A=1,2. A cannot equal 3,4 • The JOIN in SQL SAS takes only A values common to BOTH tables and leaves out A values in only one table

  29. SQL JOIN • In other languages the JOIN behaves differently • When you want only values common to both files you use the INNER JOIN • The INNER JOIN query has identical language to the JOIN query except that the word JOIN is replaced by INNER JOIN. In SAS the output is the same

  30. SQL JOIN • Suppose I wanted all values from my TEMP table whether or not they are linked in my temp2 table • Is this reasonable? • Sure it is. Think back to our example • You want all your children listed WHETHER OR NOT there is store to buy clothing in

  31. SQL JOIN • So you want the A column in the JOIN table to have all A values in the TEMP table whether or not they have values in the TEMP2 table • We call this a LEFT JOIN because in describing the JOIN OF TEMP WITH TEMP2, “TEMP” is on the LEFT and “TEMP2” is on the right. Neat!?!

  32. SQL JOIN • You make a LEFT JOIN query with identical language to a JOIN query except that the word JOIN is replaced by LEFT JOIN • A typical record in the LEFT JOIN query with values from TEMP but not from TEMP2 could look like this • A=3 B=1 C=. • Here the “.” in the C column indicates a MISSING value. • If the field was not numeric the C col is BLANK

  33. SQL JOIN • The RIGHT JOIN query has all A values from the TEMP2 table even if they don’t occur in the TEMP1 table • The language is identical except you use the word RIGHT JOIN • The output has the PERIOD (or blank) for missing values.

  34. PROJECT: Step 3 • So far I removed badly coded states (00,99) • I want my final table to only list states and counties found in the office of the actuary tables for which we have rates • So I have to do some more ROW reduction (SELECTION) • How do I accomplish this

  35. PROJECT STEP 3 • Well I take my START2 table • I will JOIN START2 (add columns) from the Office of Actuary Table • The JOIN column will be _________? • The STATE field (or its code) • What type of JOIN? Well I don’t want anything not in the ACTUARY table and I don’t want anything not in my START2 • So I want an INNER JOIN (or JOIN in SAS)

  36. PROJECT: STEP 3 • The SQL codes is as follows • proc sql; • create table d.START3 as • SELECT * • FROM d.Start2 Inner Join d.Actuary • ON Start2.sc = Actuary.sc; • quit;

  37. PROJECT STEP 3 • I now have in my file • All records with eligible counts • Where the state and county codes are also recognized in the actuary file • I am now in a position to do some more subsetting and get my totals

  38. PROJECT STEP 3 • But am I done with step 3? • Shouldn’t I look at the data deleted • That is: Shouldn’t I look at the records whose county codes arent in the actuary file • I can do this doing some queries as follows

  39. PROJECT: STEP 3 • proc sql; • CREATE table d.Start3LeftOut as • SELECT * • FROM d.Start2 Left Join d.Actuary • ON Start2.sc = Actuary.sc ; CREATE table d.Start3LeftOut as • select * • from d.Start3LeftOut • where d.Start3LeftOut.state = ' ‘; quit;

  40. PROJECT: STEP 3 • IN WORDS • I Left Join my original file with the actuary • I then inspect those records with a blank or period • Those are precisely the records that could not be matched • I now review these codes or subtotal them to find out how bad the data is

  41. PROJECT STEP 3 • Although SQL is READABLE and carries across many platforms, SAS is much more compact. Here is the SAS code to accomplish all the preceding; • SAS however requires sorting the data first • SAS also requires using the IN variable • This is technical sas code which should only be used by people who are familiar with it

  42. PROJECT STEP 3 • proc sort data=d.Actuary;by sc; run; • proc sort data=d.Start2; by sc; run; • data d.Start3 d.Start3a d.Start3b; • merge d.Start3(in=t1) d.Actuary(in=t2); • by sc; • if t1 and t2 then output d.Start3; • if t1 and not t2 then output d.Start3A; • if not t1 and t2 then output d.Start3B; • run;

  43. PROJECT: STEP 4 • What next? • Remember the OPTIMIZE golden rule • SELECT / PROJECT early • GET RID Of unwanted ROWS/COL • We got rid of the unwanted ROWS • So next we get rid of unwanted COLS

  44. PROJECT STEP 4 • SAS CODE (Compact) • data d.Start4; • set d.Start3; • keepsc st county elig fips fg fa fat fb fbt;run; • SQL CODE (Slightly more readable?); • PROC SQL; CREATE TABLE d.Start3 AS • SELECT sc,st ,county ,elig ,fips ,fg ,fa ,fat ,fb ,fbt • FROM d.Start3; QUIT;

  45. PROJECT STEP 5 • Next step is given as an exercise • I explained the importance of LOOKING at the data vs. EXPECTING it TO LOOK a certain way • So I should CHECK on the various possibliities of A,B Entitlement Term; • This is similar to checking on the STATES • It allows me to see all the cases I am revuing • QUIZ: What are the key words to be used?

  46. PROJECT STEP 6 • I have done all my row/col reductions • I am now ready to SUBTOTAL by STATE COUNTY (I don’t want to subtotal by zip) • The sas code is very compact /technical • (Fa, fat=Pt A entitlement flag, Part A “T”erm;) • PROC MEANS Data=d.Start4 noprint; • var elig; • by sc st county fips fa fat fb fbt; • output out=d.Start6 SUM; run; data d.Start6;set d.Start6 ; drop _Type_ _Freq_;

  47. PROJECT STEP 6 • Here is the more readable SQL code • proc sql; • create table d.Start6 as • select sc,st,county,fips,FA,FAT,FB, FBT , ,SUM(Elig) as Elig • from d.Start4 • group by sc,st,county,fips,FA,FAT,FB,FBT • order by elig desc; • Note New keywords (SUM, GROUP BY, ORDER)

  48. PROJECT STEP 7 • What next? Well we have ENTITLEMENT STATUSES and TERM STATUSES • But I want PART D status • Part D = Entitled to A or B • = Pt A status=Yes(1) and Part A term=0 • OR Pt B status=Yes and Part B term=0 • Need code to make a new field • Will only give SQL version since late • Might also want (to add) A and B for which we use AND vs OR

  49. PROJECT: STEP 7 • PROC SQL;create table d.Start7 as • select *, • CASE • WHEN (fa='1' and fat='0') and (fb='1' and fbt='0') • THEN 'A and B' • WHEN (fa='1' and fat='0') or (fb='1' and fbt='0') • THEN 'A or B' • ELSE ' ' • END as STATUS • from d.Start6; QUIT; RUN;

  50. PROJECT COMPLETION • There are two steps left • First: We have to ONLY select rows where the status is A and B or A or B (depending what we want). • Exercise: What is the KEYWORD? What is a previous example? • LASTLY: We have to AGGREGATE AGAIN (Why again???) • What do we GROUP BY? What do we AGGREGATE?

More Related