750 likes | 922 Views
Getting data. The first step in data analysis. Learning Objective. What is a relational database? Contact your DBA for how to connect to your database(s)? How to write queries using PROC SQL. Using SAS/BASE® to connect to third-party relational data base software to extract data needed for
E N D
Getting data The first step in data analysis
Learning Objective What is a relational database? Contact your DBA for how to connect to your database(s)? How to write queries using PROC SQL Using SAS/BASE® to connect to third-party relational data base software to extract data needed for • program evaluation • research using administrative data • operational reports e.g. routine surveillance SHRUG, 2014-05-02
What is a relational database? • Set of tables • tables made up of rows and columns • Trade names of relational databases (RDB): • Oracle, Teradata, SQL Server, DB2, Access • RDB is software which is designed to retain large amounts of data • transactional DB • reporting/warehousing DB SHRUG, 2014-05-02
What is a relational database? • Transactional DB designed to increase the speed for front-end users • complex table and table join structures • Warehousing DB designed for efficient storage and retrieval for reporting • simpler table designs and table join structures • Queries for either design use same syntax (code) • queries for warehouses will be simpler to write SHRUG, 2014-05-02
What is a relational database? • Why use relational databases? • relational databases use a concept called “normalization” • Normalization reduces the amount of redundant data and allows for updates to data with less error • There are degrees of normalization • first degree • second degree • third degree and higher degrees SHRUG, 2014-05-02
First degree normalization Table 1: Subjects with demographic information • First degree normalization • each row pertains to a single entity: a patient, an encounter, a physician • each column pertains to a characteristic of the entity: e.g. date of birth, sex, date of encounter, etc SHRUG, 2014-05-02
Violation of first degree normalization Table 1: Subjects with improper 1NF • What impact does violating the first degree normalization have on your query • if you want all patients born in Canada? • if you want all male patients? SHRUG, 2014-05-02
Second degree normalization Table 2: Business addresses • Table 2 has employer information about rows in Table 1 • The table above has some redundant information: • name is repeated from Table 1, province is embedded in the postal code • Better design – two or even 3 tables SHRUG, 2014-05-02
Second degree normalization Table 2: Revised with 2NF Table 3: Creating a secondary table for 2NF SHRUG, 2014-05-02
Second degree normalization • Table 2 now no longer contains name – it’s replaced with the subject ID • to get the subject’s name we link the table to the table in the first example, using SUBJID/ID column • we get the province and city by linking Table 2 and 3 using the POSTALCODE column • SUBJID is a primary key in Tables 1 and 2 • POSTALCODE is a foreign key in Table 2, but a primary key in Table 3 SHRUG, 2014-05-02
Primary/Foreign Keys • primary key – a column or combination of columns that uniquely identify each row in the table • e.g. patient medical record needs at least 3 columns to identify a unique record: patient ID, date of encounter, and provider ID • foreign key – a column or combination of columns that is used to link data between two tables SHRUG, 2014-05-02
Questions about 2NF? • Can you see the advantage of splitting the data into different tables? • share examples of your data where normalization is used • higher degrees of normalization work similarly to the examples above • you have to go through more tables for higher levels of normalization in order to link to the data that you need SHRUG, 2014-05-02
Getting access to data: What do you need from DBA? • Explain to DBA that you need to query data, but have no need to write to the database • this helps them to determine where you belong on a user matrix • DBA or IT install necessary software on your machine • Google has lots of information on SAS Connect • SAS Connect documentation SHRUG, 2014-05-02
How SAS authenticates User name is provided by DBA/IT In this example the password is held in the macro DBPASS Statement to have Oracle print any messages to the SAS log proc sql; connect to oracle (user = <userid> password="&dbpass” path = prod ); %put &sqlxmsg; This is an example of “pass-through” code SHRUG, 2014-05-02
Using a LIBNAME to connect • Recall that slide 13 showed pass-through facility in SAS • most of the query is done on the database • Can use libname statement to connect instead of pass-through • advantage to this method is that you are programming in SAS (using SAS functions and formats) • SAS determines which program (SAS or RDB) will handle statements more efficiently SHRUG, 2014-05-02
Using a LIBNAME to connect 1. 2. 3. • The name of the library • Tells SAS that you are using an ODBC engine • DSN – use the name of the database that was used to set up the odbc connection • NOTE: schema statement is not always required Example using a libname statement: libnameoncoodbcdsn='Oncolog' schema=dbo; SHRUG, 2014-05-02
Seeing your data - Views Once view is created, you use the EXPLORER tab in SAS and use as normal dataset SHRUG, 2014-05-02
Seeing your data - Views Using the “view columns” in SAS EXPLORER SHRUG, 2014-05-02
Seeing your data - Views • Double click on table to get to see the data • NOTE: columns that identify personal information have been removed from this screen shot SHRUG, 2014-05-02
Other ways to view data • You may have software from the RDB: • TOAD (for Oracle) • SQL Developer (for Oracle) • SQL Server • Teradata • All vendors may have some limited function “development” software that allows: • Viewing data • Viewing the “type” of a column: char, num, date, etc. • Writing SQL queries SHRUG, 2014-05-02
Sample view from SQL Developer SHRUG, 2014-05-02
Syntax: Single table - 1 of 2 PROC SQL DATA STEP proc sql; create <table/view> <name> as select <var1> , <var2> , etc from <table/view> where <apply data filters> quit; data <dataset name>; set <dataset> ( keep= <list of variables> where=(<apply filters>)); run; Example: Create a dataset (table) with men aged 50 to 74. Assume the source table is called “demographics” and contains variables: subjectID, age and sex SHRUG, 2014-05-02
Syntax: Single table – 2 of 2 PROC SQL DATA STEP proc sql; create table men5074 as select subjectID , age from work.demographics where sex=‘M’ and age between 50 and 74 ; quit; data men5074 (drop=sex); set work.demographics (keep=subjectid sex age where=(sex='M' and 50<=age<=74)); run; SHRUG, 2014-05-02
Syntax: Multiple tables Request for report received Realize as you look through data elements needed for complete the request that relevant columns reside in two or more tables Background Information • Need to know which tables and which columns are relevant. Useful to have data dictionary, otherwise 3rd party software very helpful • Need to know what filters to apply: sex, time period of interest, diagnosis codes, etc are all commonly applied filters SHRUG, 2014-05-02
Map: Multiple tables • Create a map to guide your query • names of tables that go in ‘FROM’ statement of SQL or ‘SET’ statement in DATA step • names of columns that you need • use meaningful arrows to connect SHRUG, 2014-05-02
Syntax: Multiple table DATA step – 1 of 2 *** sort the first dataset; proc sort data=<dataset1>; by <var(s)>; run; *** sort the second dataset; proc sort data=<dataset2>; by <var(s)>; *** same var(s) as first sort; run; SHRUG, 2014-05-02
Syntax: Multiple table DATA step - 2 of 2 *** find records common to both tables; data <result dataset>; merge <dataset1> (in=in_a) <dataset2> (in=in_b); by <var(s)>; *** we only want a list of records with data in table a AND in table b; if in_a and in_b; run; SHRUG, 2014-05-02
Syntax: Multiple table PROC SQL using temporary tables proc sql feedback; create table <result table name> as select <columns from either or both tables below> from *** temporary table from table1; (select <column(s)> from <table 1> where <apply filter(s)>) a inner join *** temporary table from table2; (select <column(s)> from <table 2> where <apply filter(s)>) b on a.<pk>=b.<pk> ; quit; SHRUG, 2014-05-02
Syntax: Multiple table PROC SQL. “Oracle-style (PL/SQL)” proc sql feedback; create table <result table> as select <columns from one or more tables> from <table1> a , <table2> b where a.pk=b.pk <apply additional filters> ; quit; SHRUG, 2014-05-02
Types of joins RIGHT join – join Table A to B only if an observation exists in Table B SHRUG, 2014-05-02
Compare syntax PROC SQL DATA STEP • no need to sort temporary tables • needed to think about type of join • in this case wanted patients only if they were in both tables • join columns need to be same type but can have different names (slides 6 and 9) • needed to sort data by subjectID • key variable to join demographic to cancersite table • “by” variables need to have same name and type What would you do if you found out that one record in table 1 matched to multiple records in table 2? SHRUG, 2014-05-02
Issues arising from multiple table queries • Table relationships are important: • one-to-one: each record in first table has a maximum of one record in the second table (through primary key) • one-to-many: each record in one table may have multiple rows in second table. Example: • Table 1 contains all patients with a flag indicating whether or not they are “active” • Table 2 contains all GP appointments for each patient • many-to-many SHRUG, 2014-05-02
Examples! Task 1 - single table Task 2 – two tables Task 3 – multiple tables Task 4 – reusing a table multiple times SHRUG, 2014-05-02
Task 1a - Participants Approach 1 Create view of the table required and use SAS to analyze data You are asked to provide a count of the female participants in a cancer screening program who are aged 50 years as of May 31, 2013. Break down the birth dates by month SHRUG, 2014-05-02
Task 1a – background information • demographic data for participants is stored in table “PARTICIPANTS” • sex_cd is a coded variable: 222=F, 223=M, 240=U • birth_dt is the column containing birth dates • although birth_dt appears as a date type column, in SAS Oracle dates are datetime types in SAS • For a participant to be considered 50 years of age on May 31, 2013, their birthday must occur between June 1, 1962 and May 31, 1963 SHRUG, 2014-05-02
Task 1a – Create view using pass-through code proc sql feedback noprint; connect to oracle as myconn (user=&userid password=&pw path=&path); create view participant as select * from connection to myconn (select * from csprod.participant where sex_cd=222 and trunc(birth_dt) between to_date('19620601','YYYYMMDD‘) and to_date('19630501','YYYYMMDD‘) and del_dt is null ); disconnect from myconn; quit; SHRUG, 2014-05-02
Task 1a – Create view: Parsing the code. Slide 1 of 3 • Create view participant • this syntax translates to “Create a view called ‘participant’ • select * • ‘*’ is a wildcard and means select all • where • “%”, “_” – see Task 1b • multiple (%) or single (_) byte of data, in contrast to the entire column. Only used to scan a column. SHRUG, 2014-05-02
Task 1a – Create view: Parsing the code. Slide 2 of 3 • trunc() • recall that SAS will treat Oracle, Teradata, SQL dates as DATETIME • trunc() is an Oracle function that looks only at the DATE part of the column SHRUG, 2014-05-02
Task 1a – Create view: Parsing the code. Slide 3 of 3 • to_date(‘<yourdate>’, ‘<yourdate format>’) • in this example to_date(‘19620601’, ‘YYYYMMDD’) • take the string 19620601 and treat it as a date with the format YYYYMMDD • could use other formats: YYYYMONDD, MM-DD-YYYY, etc • BETWEEN operator – works as you would expect, includes both lower and upper limits specified SHRUG, 2014-05-02
Task 1a – Contents of view NOTE: The contents show exactly the same columns as slide 31 SHRUG, 2014-05-02
Task 1a – Count of birth month • More detailed analysis uncovered that missing month/day combinations were defaulted to January 1 SHRUG, 2014-05-02
Task 1b - Like operator (1 of 4) REQUEST Find a list of CCI codes for hysterectomy use of single table example of PROC SQL with SAS data filter using “like” operator SHRUG, 2014-05-02
Task 1b – Like operator (2 of 4) SHRUG, 2014-05-02
Task 1b – Like operator (3 of 4) proc sql feedback; create table hystcd as select a.f1 as cci_code , a.f3 as long_desc , substr(a.f1,1,5) as rubric from work.cci_raw as a where upcase(a.f3) like '%HYSTERECTOMY%' or (upcase(a.f3) like '%EXCISION%' and upcase(a.f3) like '%UTERUS%') or (upcase(a.f3) like '%EXCISION%' and upcase(a.f3) like '%CERVIX%') order by a.f1 ; quit; SHRUG, 2014-05-02
Task 1b – Like operator (4 of 4) RESULTING DATASET SHRUG, 2014-05-02
Task 2 – Participants in the CRC program • Report the number of men and women who turned 60 as of May 31, 2013, enrolled in the colorectal cancer screening program. Do not include participants with unknown sex • participant table contains demographic information: sex, birth date • participant_program table contains data for participants and screening program • program_id=1 indicates colorectal cancer screening program • program_status_cd=263 indicates that a participant is active in the program SHRUG, 2014-05-02
Task 2 – Mapping your query SHRUG, 2014-05-02
Translating your map into sql code - 1 of 3 *** METHOD 2 - Oracle pass through. Simple code; proc sql feedback noprint; connect to oracle as myconn (user=&userid password=&pw path=&path); create table part60 as select * from connection to myconn (select ptc.gender , count(*) from (select participant_id , sex_cd SHRUG, 2014-05-02
Translating your map into sql code – 2 of 3 *** METHOD 2 - Oracle pass through. Simple code; , case when sex_cd=222 then 'F' else 'M' end as gender from csprod.participant where trunc(birth_dt) between to_date('19520601','YYYYMMDD') and to_date('19530531','YYYYMMDD') and sex_cd <> 240 and del_dt is null) ptc SHRUG, 2014-05-02
Translating your map into sql code – 3 of 3 inner join (select participant_id from csprod.participant_program where program_id=1 and program_status_cd=263 and del_dt is null) pp on ptc.participant_id=pp.participant_id group by ptc.gender ; disconnect from myconn; quit; SHRUG, 2014-05-02