300 likes | 507 Views
SQL 101: An Introduction to writing queries with SQL at the University of Minnesota (Day 3). Peter M. Radcliffe Senior Analyst College of Liberal Arts March 8 th , 2007. Agenda. Advanced Topics in SQL Queries Having clause Case/Decode statement Subqueries Wrap-up Q&A. SQL Skeleton.
E N D
SQL 101:An Introduction to writing queries with SQL at the University of Minnesota (Day 3) Peter M. Radcliffe Senior Analyst College of Liberal Arts March 8th, 2007
Agenda • Advanced Topics in SQL Queries • Having clause • Case/Decode statement • Subqueries • Wrap-up Q&A
SQL Skeleton • Recall our skeleton SQL Query from previous classes: SELECT [*, columns, or aggregate functions] FROM [table name(s)] ON statement [criteria for joining tables] WHERE [conditions, including comparison functions =,<,<>,etc.] AND/OR/IN statements [additional conditions] IS (NOT) NULL statements [working with null values] GROUP BY [columns] HAVING [conditions, including comparison functions] ORDER BY [columns]
HAVING Clause • HAVING is the equivalent of the WHERE clause for aggregates • While the WHERE clause operates on the individual rows of data, HAVING operates on the aggregates created in the SELECT clause • HAVING can use the same relational arguments that can be used in the WHERE clause • HAVING can use multiple criteria joined by AND & OR statements, just as in the WHERE clause
HAVING Examples • To check for groups with multiple records: • HAVING count(*) > 1 • To find students who have enrolled for up to 4 terms • HAVING count(term) <= 4 • To find courses with high average grades • HAVING avg(grd_pts_per_unit) > 3.5 • To find students with a low probability of retention • HAVING avg(retained) < .75 • Retained in this example is a calculated 0/1 column
Example Query with HAVING • Departments with large numbers of PO transactions SELECT a.area_class, a.area, c.area_name30, count(*) AS records, sum(a.dollar_amt) AS total FROM dwfs_tran_hist a INNER JOIN dwfs_coa_area c ON a.area = c.area WHERE a.area_class = 'TCLA‘ AND a.fisc_yr = '2007’ AND a.trans_cd = 'PO’ AND a.acct_type = '21' GROUP BY a.area_class, a.area, c.area_name30 HAVING count(*) > 100 ORDER BY a.area_class, a.area, c.area_name30
CASE Statement • CASE statements replace existing values from a column with new values supplied in the CASE syntax • Since the Data Warehouse is a read-only environment, the values in the tables themselves cannot be changed, rather the results produced by the query are changed • Typically, the modified values are placed in a new column, but it is possible to assign the new column the same name as the old one, effectively substituting the new column for the old one
CASE Statement • CASE statements can be used in several different ways: • To “recode” values in a column to a desired new value • To assign values to missing data (similar to NVL command) or set valid values to missing (e.g. change 0 to NULL) • To assign labels to values (e.g. change the “010” acad_level code to “Freshman”) • To collapse values into a smaller set for grouping (e.g. set both “010” and “015” acad_level codes to “Freshman”) • In conjunction with aggregation commands to create conditional sums, averages, and counts
CASE Syntax Each CASE statement has the following components: • CASE (to invoke the command) • The name of the source column • WHEN (to identify the value or values to change) • The actual value or range of values to change • THEN (to identify the new values) • The actual new values to replace the old values • ELSE (optional, to handle values not listed in WHEN) • The actual value to assign when ELSE condition used • END (to terminate the command) • AS [new_column_name] to label the new column
CASE Syntax • There are two versions of the syntax for CASE statements, depending on the number of values being recoded • When a single value is being changed, the syntax is: CASE column WHEN old_value THEN new_value ELSE else_value END AS new_name • When a set of values or a more complex condition is being used, the syntax is: CASE WHEN column {operator} old_value THEN new_value ELSE else_value END AS new_name
CASE Syntax Examples • Single Value: CASE acad_level_bot WHEN ‘010’ THEN ‘Freshman’ ELSE ‘Other’ END AS freshman_id • Multiple Values CASE WHEN acad_level_bot IN (‘010’,’015’) THEN ‘Freshman’ ELSE ‘Other’ END AS freshman_id
CASE: Multiple Recodes • Frequently, there are multiple values in the original column to be set to new values in the new column • To accomplish this, repeat the WHEN and THEN commands, but keep single CASE, ELSE, and END commands • Order of WHEN and column name still depends on the complexity of the conditions used to identify the target values in the original column
CASE: Multiple Recode Example 1 CASE acad_level_bot WHEN ‘010’ THEN ‘Freshman’ WHEN ‘015’ THEN ‘Freshman’ WHEN ‘020’ THEN ‘Sophomore’ WHEN ‘025’ THEN ‘Sophomore’ WHEN ‘030’ THEN ‘Junior’ WHEN ‘035’ THEN ‘Junior’ WHEN ‘040’ THEN ‘Senior’ WHEN ‘045’ THEN ‘Senior’ ELSE ‘Error’ END AS acad_level
CASE: Multiple Recode Example 2 CASE WHEN a.total_sat_score >= 1600 then 36 WHEN a.total_sat_score >= 1580 and a.total_sat_score < 1600 THEN 35 WHEN a.total_sat_score >= 1520 and a.total_sat_score < 1580 THEN 34 WHEN a.total_sat_score >= 1470 and a.total_sat_score < 1520 THEN 33 WHEN a.total_sat_score >= 1420 and a.total_sat_score < 1470 THEN 32 WHEN a.total_sat_score >= 1380 and a.total_sat_score < 1420 THEN 31 WHEN a.total_sat_score >= 1340 and a.total_sat_score < 1380 THEN 30 WHEN a.total_sat_score >= 1300 and a.total_sat_score < 1340 THEN 29 <…lines removed…> WHEN a.total_sat_score >= 500 and a.total_sat_score < 560 then 11 WHEN a.total_sat_score < 500 then 0 ELSE 99 END AS sat_act_equiv
CASE: Multiple Recode Example 3 CASE WHEN home_loc_cd in ('003','019','037','053','123','139','163') THEN 'Metro' WHEN (home_loc_cd between '000' and '200' and home_loc_cd not in ('003','019','037','053','123','139','163')) THEN 'GreaterMN' WHEN home_loc_cd = 'MN' THEN 'GreaterMN' WHEN home_loc_cd = 'WI' THEN 'Wisconsin' WHEN home_loc_cd in ('ND','SD') THEN 'Dakotas' WHEN home_loc_cd in ('AK','AL','AR','AS','AZ','CA','CO','CT','DC','DE','FL','GA','GU', 'HI','IA','ID','IL','IN','KY','KS','LA','MA','MD','ME','MI','MO','MS', 'MT','NC','NE','NH','NJ','NM','NV','NY','OH','OK','OR','PA','PR','RI', 'SC','TN','TT','TX','UT','VA','VI','VT','WA','WV','WY','USA') THEN 'USA' WHEN home_loc_cd IS NULL THEN 'Missing' ELSE 'Foreign' END AS region
CASE and Aggregates • One of the limitations of the aggregate functions in SELECT and GROUP BY clauses we discussed earlier is that it is not possible to produce summaries at different levels of aggregation • CASE statements, when combined with aggregate functions, allow for the creation of “conditional” aggregate functions, where records are only summarized if they meet certain conditions, therefore allowing summaries to be created at multiple levels of aggregation simultaneously
CASE and Aggregates • To integrate CASE statements into aggregate functions, place the entire CASE statement, minus the “AS” column naming, inside the aggregate function, as below: • (Conditional Sum) SUM(CASE term WHEN ‘1069’ THEN tuition_net_instr ELSE 0 END) AS Fall_2006_Tuition • (Conditional Count) SUM(CASE acad_career WHEN ‘UGRD’ THEN 1 ELSE 0 END) AS ug_total
CASE and Aggregates Example SELECT acad_career, SUM(CASE acad_group_distr WHEN ‘TCLA’ THEN tuition_net_acad_prog ELSE 0 END) as reg_tuition, SUM(CASE area_class WHEN ‘TCLA’ THEN tuition_net_instr ELSE 0 END) as instr_tuition FROM ps_dwta_stdnt_crse WHERE institution = ‘UMNTC’ AND term = ‘1069’ GROUP BY acad_career ORDER BY acad_career DESC
CASE and Aggregates Results • Query produced conditional summaries for tuition generated by CLA students and CLA courses • Where acad_group_distr was not ‘TCLA’ the tuition generated was set to zero, so non-CLA students did not contribute to the total • Where area_class was not ‘TCLA’, the tuition generated was also set to zero, so non-CLA courses did not contribute to the total
SUBQUERIES • SUBQUERIES, like UNIONS, are ways of creating compound queries that produce a single set of results. • There are multiple ways to accomplish almost everything in SQL. • In theory, any result that can be produced through a SUBQUERY can also be produce through a properly formed set of table JOINS. • For some questions, however, SUBQUERIES are the easiest method.
SUBQUERIES • As their name suggests, SUBQUERIES are secondary, dependent queries attached to the main query. • SUBQUERIES can have all of the same components as the primary query, and can return either a single record or a set of records. • While it is not absolutely necessary, SUBQUERIES often have JOIN conditions to match records in the primary query, and are often based on SELF JOINS, pulling data from the same source as the primary query.
SUBQUERIES & Current Records • When working with the University of Minnesota Data Warehouse, one of the most powerful uses of SUBQUERIES is to find the most current record, using effective dates and effective sequences. • Many tables in the Data Warehouse have a “current” flag of some form that allows the most current record to be identified with a simple WHERE clause statement (e.g. WHERE curr_flg = ‘Y’). • The current flag, however, does not work if you want to reconstruct the state at some earlier point in time.
SUBQUERIES: Locations • SUBQUERIES can be placed anywhere in a query, including SELECT, WHERE, or HAVING clauses, or embedded inside another SUBQUERY. • The most common location, however, is in the WHERE clause, and that will be the focus of the examples that follow. • In the WHERE clause, a SUBQUERY allows information from a different table (or another part of the same table) to be used as a criterion for selecting rows.
SUBQUERIES: Basic Structure SELECT columns FROM tables WHERE conditions (SELECT columns FROM tables WHERE conditions) GROUP BY grouping_columns HAVING conditions ORDER BY ordering_columns
SUBQUERIES: Example 1 SELECT DISTINCT a.emplid, a.term, a.subject, a.catalog_nbr FROM ps_dwsa_stdnt_enrl_hs a WHERE a.institution = 'UMNTC' AND a.acad_career = 'UGRD' AND a.emplid in (SELECT DISTINCT b.emplid FROM ps_dwsa_stdnt_enrl_hs b WHERE b.subject = 'CHIC‘ AND b.catalog_nbr = '1112') AND a.term > (SELECT MAX(c.term) FROM ps_dwsa_stdnt_enrl_hs c WHERE a.emplid = c.emplid AND c.subject = 'CHIC' AND c.catalog_nbr = '1112')
SUBQUERIES: Example 2 SELECT a.emplid, a.area, a.org, sum(a.ytd_exp_amt+a.enc_amt) as amount FROM ps_dwpy_monthly_sum_07 a WHERE a.area = '465' AND a.org = '1506' AND a.obj in ('7000','7005','7010','7015') AND a.acctg_pd = (SELECT MAX(b.acctg_pd) FROM ps_dwpy_monthly_sum_07 b WHERE a.emplid = b.emplid AND a.area = b.area AND a.org = b.org AND a.obj = b.obj) GROUP BY a.emplid, a.area, a.org, a.obj HAVING sum(a.ytd_exp_amt+a.enc_amt) > 0 ORDER BY a.emplid, a.area, a.org, a.obj
Wrap-up Q&A Questions? Thank you!