450 likes | 629 Views
SQL 101: An Introduction to writing queries with SQL at the University of Minnesota (Day 2). Peter M. Radcliffe Senior Analyst College of Liberal Arts February 15 th , 2007. Grouping. Another method for reducing information is to calculate aggregate or summary functions on the data
E N D
SQL 101:An Introduction to writing queries with SQL at the University of Minnesota (Day 2) Peter M. Radcliffe Senior Analyst College of Liberal Arts February 15th, 2007
Grouping • Another method for reducing information is to calculate aggregate or summary functions on the data • Remembering our SQL skeleton, this adds new clauses: • SELECT • FROM • WHERE • GROUP BY • HAVING • ORDER BY • We will return to HAVING later – first we will discuss adding aggregations to SELECT and using GROUP BY
SELECT: Aggregate Functions • Some aggregate functions are: • SUM = Total of all values in the numeric expression • AVG = Average of values in numeric expression • COUNT = Number of non-null values in expression (can be used with all data-types, not just numeric) • COUNT(*) = The number of selected rows…always returns one and only one value…the count of the rows in the column or expression) • MAX(expression) = The highest value (can be used with all data-types, not just numeric) • MIN(expression) = The lowest value (can be used with all data-types, not just numeric)
SELECT: Aggregate Functions • Aggregate Examples: • SELECT AVG(tuition * 2) • Returns the average tuition paid if tuition was doubled • SELECT AVG(DISTINCT tuition * 2) • If you include DISTINCT in your expression, it will eliminate duplicate values before calculating the sum, average, or count of an expression. • SELECT MIN(descr) • The description value that is first alphabetically (e.g. “Accounting”) • SELECT MAX(descr) • The description value that is last alphabetically (e.g. “Zoology”)
Aggregates and WHERE clause • You cannot use aggregates in the WHERE clause, however, you can use WHERE to restrict the rows that are used in your aggregate calculation to specific conditions • The HAVING clause is used to restrict rows based on aggregates • EXAMPLE: SELECT AVG(tuition_net_instr) FROM sysadm.ps_dwta_stdnt_crse WHERE institution = ‘umntc’
SELECT: Aggregate Functions • Aggregate Function Example (Question): • What will be the difference between results for the following statements (assuming data is being retrieved from same table source)? • SELECT COUNT(emplid) • SELECT COUNT(DISTINCT emplid) • SELECT DISTINCT COUNT(emplid) • Number of rows with valid values for emplid • Number of unique values of emplid • Same as #1, since count function is producing only one value, therefore select distinct does not change results
SELECT: Aggregates & Details • You cannot simultaneously produce both individual results and aggregates. • Solutions: • You can run separate queries to get row-by-row data or aggregate results. • Join tables (addressed soon) where one table holds the aggregate data and the other the detail data • This can often be produced using a “subquery”, which we will also cover later
GROUP BY: Overview • GROUP BY • Optional • Use to organize results by groups represented in column(s) • GROUP BY works like DISTINCT, as it divides the table into groups and returns one row for each group • Whenever you use GROUP BY, each item in the select list has to produce a single value per set.
SELECT & GROUP BY • The SELECT and GROUP BY clauses must work together • All non-aggregate columns in SELECT must be in GROUP BY • The reverse is not true, but generally a good idea • That is, you can group by a column that is not in the SELECT statement, but since it does not appear in the output, you will not know what aggregate corresponds to what grouping
SELECT & GROUP BY Example • Compare these queries, modifying our earlier examples: 1) SELECT count(*) FROM sysadm.ps_dwhr_job WHERE jobcode = ‘9370’ GROUP BY name 2) SELECT name, count(*) FROM sysadm.ps_dwhr_job WHERE jobcode = ‘9370’ GROUP BY name
GROUP BY & SELECT DISTINCT • GROUP BY creates a list of distinct values much like SELECT DISTINCT • EXAMPLE SELECT acad_plan FROM ps_acad_plan_tbl GROUP BY acad_plan …produces the same results as… SELECT DISTINCT acad_plan FROM ps_acad_plan_tbl …because no aggregate function was specified
GROUP BY: Multi-Column Groups • Frequently, we want to define a group by multiple characteristics, for example combinations of Area, Fund, and Org in CUFS financial data. • Groups can be defined by as many variables as desired, so long as all of the columns that form the group (the non-aggregated columns) are specified in the GROUP BY clause • To display results sorted by the same columns in the same order, repeat the columns from the GROUP BY clause in the ORDER BY clause • If a different sort order is desired, the GROUP BY and ORDER BY clauses can contain different columns in a different order
GROUP BY: Multi-Column Groups • Example: Instructional (75%) Tuition by Course Section SELECT acad_org, acad_org_descr, subject, catalog_nbr, class_section, sum(tuition_net_instr) as instructional_tuition FROM sysadm.ps_dwta_stdnt_crse WHERE institution = ‘UMNTC’ AND area_class = ‘TPUB’ AND term = ‘1069’ GROUP BY acad_org, acad_org_descr, subject, catalog_nbr, class_section ORDER BY acad_org_descr, acad_org, subject, catalog_nbr, class_section
GROUP BY: Multi-Aggregate Groups • Within any group, as many aggregates as desired may be calculated, including any combination of available functions (for example sums, averages, and counts) • However, all the aggregations will take place at the same level (defined by the same group) • It is therefore not possible to calculate, for example, collegiate and departmental totals in the same query • We will address how to produce aggregates at different levels simultaneously when we discuss the CASE statement
GROUP BY: Multi-Aggregate Groups • Example: Registration Tuition Summary SELECT acad_career, acad_level_bot, SUM(tuition_net_acad_group) as reg_tuition, SUM(unt_taken) as sch, COUNT(distinct emplid) as headcount FROM sysadm.ps_dwta_stdnt_crse WHERE institution = ‘UMNTC’ AND acad_group_distr = ‘TPHR’ AND term = ‘1069’ GROUP BY acad_career, acad_level_bot ORDER BY acad_career desc, acad_level_bot
GROUP BY: Null Values • Null values in the grouping column(s) • Nulls are grouped together as a separate group. • A COUNT function will show the number of rows in that group, and therefore the number of null values in the grouping column(s). • Null values in the aggregate function(s) • Most aggregate functions ignore nulls • A COUNT function, for example, will count only the non-null rows.
Grouping and Null Values • You can categorize NULL values in a named group within a column with the following: SELECT NVL(ethnic_group, “Unknown”), count(*) FROM PS_DWSA_STIX_1069 GROUP BY ethnic_group • Results: You will have coded all NULLS as Unknown and then counted the number of rows in each known ethnic category as well as the number of rows in the “Unknown” group containing NULLS.
SELECT: Null Values • To count the number of null values in a column: SELECT ethnic_group, count(*) FROM ps_dwsa_demo_addr GROUP BY ethnic_group OR SELECT count(*) FROM ps_dwsa_demo_addr WHERE ethnic_group is null
Including Null Values in Aggregates • You can also include NULLS in your calculations by adding an OR comparison to the WHERE clause: SELECT COUNT(*) FROM sysadm.ps_dwsa_stix_1069 WHERE ethnic_cd = ‘Unknown’ OR ethnic_cd is NULL • Note that the aggregate function must not itself involve calculations on null values (e.g. calculating an average)
Connecting Multiple Queries • UNION and UNION ALL • Assembles data from any number of queries into a single result set • Each query must have the same columns in their select statement, but the rest of the queries can vary • UNION eliminates duplicate rows produced by the various queries • UNION ALL leaves duplicate rows in the result set
UNION Example SELECT “Registration” as tuition_type, acad_career SUM(tuition_net_acad_prog) as tuition, SUM(ufees_acad_prog) as ufees, SUM(unt_taken) as sch, COUNT(DISTINCT emplid) as headcount FROM sysadm.ps_dwta_stdnt_crse WHERE term = ‘1069’ AND acad_group_distr = ‘TNUR’ UNION SELECT “Instruction” as tuition_type, acad_career, SUM(tuition_net_instr) as tuition, SUM(ufees_instr) as ufees, SUM(unt_taken) as sch, COUNT(DISTINCT emplid) as headcount FROM sysadm.ps_dwta_stdnt_crse WHERE term = ‘1069’ AND acad_group_distr = ‘TNUR’
Working with Multiple Tables • The information needed to answer some questions may not be contained in a single table • To connect tables, a JOIN statement is used in the FROM clause • Linking between tables • Common column(s) • Concordant product (all-to-all) • Structure of joins • One-to-One • One-to-Many • Many-to-Many • One or Many to None
JOIN: Overview • Optional • Part of FROM clause used if you want to retrieve and manipulate data from more than one table • JOIN works by connecting two or more tables by ‘joining’ them, generally on one or more common field, or JOIN column(s) • If there are no common columns identified, a concordant product will be created, linking each row in each table to each row in the other table(s), creating massive tables that consume system resources • On rare occasions, concordant product joins are created purposely
JOIN: How it Works • JOIN prompts the system to • Examine all possible combinations of rows from the two tables • Eliminate the rows that are not specified in WHERE clause • JOIN outlines the specifications that data must meet in order to be included in the query output • If there are NULLS in the JOIN column, the JOIN will not include NULL values • This means NULLS in one table will not be matched to NULLS in the other table
JOIN: Types of Joins • Types of joins • Inner Join – Only rows that match in both tables • Outer Join – All rows from one table and those that match in the other table • Full Join – All rows from both tables, connected where common columns match • Columns from outer or full joins where no matching row was found will contain NULL values • We will also discuss “self joins”, where a table is joined to itself. These “self joins” can take on any of the above forms.
JOIN Clause: Aliases • When tables are joined, it is important to identify which table contains each of the columns in the SELECT clause • Where a column appears in only one of the joined tables, the system will be able to find it without additional information • However, where each table contains columns with the same name (such as the join conditions), the system needs to be told which table’s column to use
JOIN Clause: Aliases • The format for the columns in the SELECT statement is: TABLENAME.COLUMN • The standard convention for simplifying this process is to give each table a short, often single-letter alias, so the user does not have to type the entire table name before each column • This alias is assigned by typing the desired alias immediately after the name of the table in the FROM and JOIN clauses
JOIN Clause: Aliases • Example: FROM sysadm.ps_dwta_stdnt_crse a INNER JOIN sysadm.ps_dwta_stdnt_tuition b • The SELECT clause therefore looks like: SELECT a.emplid, b.tuition_res Instead of… SELECT sysadm.ps_dwta_stdnt_crse.emplid, sysadm.ps_dwta_stdnt_tuition.tuition_res
JOIN Clause: ON statement • To identify the columns to use when matching rows from the tables to be joined, they can be listed immediately after the JOIN statement beginning with the keyword ON, and using the same logical operators as WHERE FROM sysadm.ps_dwta_stdnt_crse a INNER JOIN sysadm.ps_dwta_stdnt_tuition b ON a.emplid = b.emplid AND a.institution = b.institution AND a.term = b.term AND a.acad_career = b.acad_career WHERE a.institution = ‘UMNTC’
JOIN Clause: ON or WHERE • While the ON statement provides cleaner organization, the JOIN conditions can also be listed in the WHERE clause along with any additional selection criteria: FROM sysadm.ps_dwta_stdnt_crse a INNER JOIN sysadm.ps_dwta_stdnt_tuition b WHERE a.emplid = b.emplid AND a.institution = b.institution AND a.term = b.term AND a.acad_career = b.acad_career AND a.institution = ‘UMNTC’
JOINS: Inner Join • INNER JOIN clause (uses = or other logical operator): • You want to view only data from two tables when the values on the join column(s) are equal or satisfy the logical operator • Any logical operator can be used to match records, for example: ‘=’, ‘>’, “>=’, ‘<’, ‘<=’, ‘!=’, or <> • Note that while many logical operators can be used, relationships other than = will produce many matches between the files, returning large numbers of rows • Given the size of the tables in the data warehouse, these joins may produce overwhelming results – proceed with caution in using logical operators other than =
JOINS: Inner Join • Join conditions (matching columns) can appear either in the WHERE clause or in a separate section of the JOIN clause beginning with the keyword “ON” SELECT a.emplid, a.acad_prog_primary, b.tuition_res FROM sysadm.ps_dwta_stdnt_crse a INNER JOIN sysadm.ps_dwta_stdnt_tuition b ON a.emplid = b.emplid
JOINS: Inner Join Mechanics 1234567 7586424 3002847 1234567 2874502 9657513 7654321 8473621 4654785 7654321 • Joining two tables on the column “emplid” Table 1 Table 2
JOINS: Inner Join Example • INNER JOIN EXAMPLE: Tuition by residency of student select 'Instruction' as tuit_type, a.acad_career, b.tuition_res, sum(a.tuition_net_instr) as tuition, sum(a.unt_taken) as sch, count(distinct a.emplid) as head from sysadm.ps_dwta_stdnt_crse_pr a inner join sysadm.ps_dwta_stdnt_tuition_pr b on a.emplid = b.emplid and a.institution = b.institution and a.term = b.term and a.acad_career = b.acad_career where a.term = '1069' and a.area_class = 'TMED' and a.acad_career = 'UGRD' group by a.acad_career, b.tuition_res order by b.tuition_res, a.acad_career desc
JOINS: Inner Join Results Instructional Tuition by Academic Career and Tuition Residency
JOINS: Inner Join Recap • Inner Joins select only those rows in each table that have the same values on the columns specified in the join condition • Null values in the join conditions are not matched to other records because NULL is not a specific value and therefore does not match to another NULL value • Any rows in either table that are not matched on all of the join conditions will be dropped, even though the row may have perfectly valid and relevant data in one table
JOINS: Outer Join • OUTER JOINS keep all rows from one table and connect data from those rows of other table that match the base table on the join condition columns • Critical to decide which table provides the base, therefore maintaining all its rows, and which table provides only selected rows • It is often useful to use the NVL(column,value) command to replace NULL values on the secondary table with default values, e.g. “Missing”, “None”, or 0
JOIN: Types of Joins • TWO TYPES OF OUTER JOINS: • LEFT OUTER JOIN clause: • Include all rows from the first-named table and only the records in the second-named table that meet the join condition (some systems represent this by using the symbol */ , the * representing the table that you want all of the data from). • RIGHT OUTER JOIN clause: • Include all rows from the second-named table and only the records in the first-named table that meet the join condition (some systems represent this by using the symbol /* , the * representing the table that you want all of the data from).
Outer Join Mechanics • Left outer join on the column “emplid” Table 1 Table 2 1234567 7586424 3002847 1234567 NULL 2874502 9657513 NULL 7654321 8473621 4654785 7654321 NULL
JOIN: Outer Join Example • OUTER JOIN EXAMPLE: SCH by Summer Term Grad Assistants SELECT a.AREA, SUM(a.EXP_AMT) as PAYROLL, SUM(b.UNT_TAKEN) as SCH FROM PS_DWPY_MONTHLY_SUM_HIST_VW a LEFT OUTER JOIN PS_DWTA_STDNT_CRSE b ON a.EMPLID = b.EMPLID AND CONCAT('1',CONCAT((SUBSTR(a.FISC_YR,3,2)),'5')) = b.TERM WHERE a.AREA_CLASS = 'TCLA' AND a.FISC_YR = '2006' AND a.OBJ IN ('7000','7005','7010','7015') AND a.JOBCODE IN ('9571','9572','9573','9574','9575') GROUP BY a.AREA ORDER BY a.AREA
JOINS: Outer Join Results SCH and Payroll Totals for Summer Term Grad Assistants
JOINS: Full Outer Join • FULL OUTER JOINS retain all rows from both tables, connecting those rows with matching data in the columns used for the join conditions in both tables • Good for reconciliation where it is important to see what records appear in one table but not the other • In general for the data warehouse, usefulness is limited because with the large tables, it is generally not desirable to maintain all the rows from both tables
JOINS: Outer Join Recap • OUTER JOINS are used to maintain some or all of the non-matching records • OUTER JOINS can be LEFT, RIGHT, or FULL, depending on which table has the rows that are to be maintained • It is critical to make sure there are no WHERE clause criteria that rely on the secondary table, as only those rows that can be matched will have valid values on the criteria columns, and therefore the OUTER JOIN will reduce to an equivalent INNER JOIN • In general, OUTER JOINS are of limited usefulness with the data warehouse due to the extreme size of the tables
JOINS: Multiple Joins Example • Example from CLAIR application called IRIS: FROM t_iris_reporting_crse_sctn r LEFT OUTER JOIN t_iris_core_sir s ON r.cla_xlist_sctn_id = s.cla_xlist_sctn_id INNER JOIN t_iris_core_dict_term t ON r.term = t.terM INNER JOIN web_admin..t_cla_security_dict_cla_dept d ON isnull(s.cla_acad_org_funding,r.cla_acad_org) = d.cla_acad_org
JOINS: Self Joins • SELF JOINS involve connecting a table to another copy of itself • SELF JOINS can take the form of INNER, OUTER, or FULL JOINS • Can be done for many reasons, including creating a full list of units with details for some (OUTER JOIN) or connecting data from multiple rows that share common linking values (join conditions) • The primary reason to use SELF JOINS with data warehouse tables, however, is for constructing SUBQUERIES