330 likes | 452 Views
SQL 101: An Introduction to writing queries with SQL at the University of Minnesota. Peter M. Radcliffe Senior Analyst College of Liberal Arts February 1 st , 2007. Class Objectives. Understand the structure and purpose of data warehouse tables Write simple queries using… …multiple tables
E N D
SQL 101:An Introduction to writing queries with SQL at the University of Minnesota Peter M. Radcliffe Senior Analyst College of Liberal Arts February 1st, 2007
Class Objectives • Understand the structure and purpose of data warehouse tables • Write simple queries using… • …multiple tables • …summary columns and grouping • …calculated columns • …inner and outer joins • …case statements • …subqueries • Understand the data warehouse query tool
Class Assumptions • You haven’t written queries in SQL before (or at least not much) • You have access to the data warehouse, but haven’t used it (or at least not much) • You are familiar with Excel • You perform data analysis as part of your job • If you have more background that this, let me know and we’ll speed up the early portion of the course
Relational Databases • Database tables versus spreadsheets • Each table is like an individual spreadsheet • Structure is more important • Entries must be in rows and columns (no isolated cells) • Computations on cells are more difficult, they are more commonly conducted on entire columns • Visualizing a spreadsheet may help with properly constructing a query to give the desired results
Relational Databases • Column and row structure • Rows are observations or records • Columns are the quantities or characteristics observed or recorded • Linking between tables • Common column(s) • Concordant product (all-to-all) • Structure of joins • One-to-One • One-to-Many • Many-to-Many • One or Many to None
Simple Query Structure • Sentence and clause analogy • Each query can be stated as a sentence of the form, “Return observations on the following characteristics, from the following tables, meeting the following criteria, and organized in the following way.” • Each clause in the sentence corresponds to a particular portion of a query. • Clauses • SELECT • FROM • WHERE • ORDER BY • Will address “GROUP BY” and “HAVING” clauses when we discuss summary columns
Using DW Web Query Tool • Data Warehouse web site: http://dw.umn.edu • Web Query Tool login: https://dw1.umn.edu/queries/queries0.asp • Data warehouse user id and password is separate from x.500 • Options for “Querying Method” • Build a Query (point-and-click query building tool) • Free-Form Query (we’ll focus on this) • Choose a Saved Query (after you’ve created some) • Delete Saved Queries (ditto) • Re-Group Saved Queries • List Saved Queries • Will return to these functions after creating first query
First Example Query • Question: • Who are the Senior Analysts at the University of Minnesota? • What do we want to know about them? • Name, CUFS Area code, CUFS Area Name • Columns: name, cufs_area, area_name30 • Source: • PeopleSoft Employment Table, ps_dwhr_job • Criteria: • Senior Analysts, jobcode = 9370
First Example Query Continued • Log into DW web query tool, choosing “Free-Form Query” • Type the following query into the box: SELECT name, cufs_area, area_name30 FROM sysadm.ps_dwhr_job WHERE jobcode = ‘9370’‘ ORDER BY name • Leave “Choose Row Limit” and “Print to File” dialogs on their default values (“10 Rows Paging” and “No File Type Selected”).
First Example Query Continued • Before clicking on “Submit Query” button, notice a couple details • Paragraph style (new lines for each command, indents, etc.) • This is all completely unnecessary. It is simply for clarity. • Syntax • Commas between column names are necessary. • Single quotes are necessary for string values (words), but not for numeric data (in this case, even the seemingly numeric jobcode is a string) • Capitalization matters for the string values in the where clause, but not for the rest of the query.
First Example Query: Results • Viewing Query Results • Click on the “Submit Query” button to view the results • There should be eight individuals listed • Looking at the data, you can imagine this as a spreadsheet with eight rows of data, plus column headers, and three columns. • These results can be downloaded by choosing a different output format, or they can be copied and pasted into Excel.
Saving Results • Saving queries with DW query tool • Notice the “Save Query” button between the query and the results • You can enter a name for a query group, or select and existing one using the left two dialog boxes. • You can enter a name for the query itself or select an existing name if you want to update/replace a query you have already created. • After entering or selecting names, you can hit the “Save Query” button to add it to your personal query library
Managing Saved Queries • You will see your saved queries, organized in groups • Scroll down if necessary to find your newly saved query • Click on radio button next to saved query • Click on “Show Query” button at bottom of list • You should now see your query again in the “Edit Box”
SQL Structure • SELECT [necessary – Used to choose columns from table(s) that you want in your output] • FROM [necessary - State the table(s) that you are pulling the data from] • JOIN [optional - Clause used if you want to pull data from more than one table] • WHERE [optional - Use if you want to limit the results that are pulled based on specified conditions/qualifications] • GROUP BY [optional – Use if you want to organize your results by groups represented in column(s)] • HAVING [optional – For setting the specifications for the groups that you want to display] • ORDER BY [optional – Use if you want the results ordered in ascending (ASC) or descending order (DESC)]
SELECT Clause: Overview • Necessary – Used to choose columns from table(s) that you want in your output • How SELECT works: • Returns only the columns that you list in your SELECT statement. • Rows returned are specified in WHERE clause. • Thus, a SELECT statement indicates the columns to be included in the query output
FROM Clause: Overview • Necessary • Identifies the table(s) from which to retrieve data • In relational databases, data is generally split into multiple tables. The data warehouse was designed to combine data in ways that make inquiry-based sense. • However, it may still be necessary to recombine tables to answer data questions • TIP: Visualize the end result, or table, that you want to see. What columns are in the proposed table? In what tables do those columns reside?
WHERE Clause: Overview • Optional, if omitted all rows are returned • Uses Logical operators to define criteria. • Thus, a WHERE clause indicates the rows to be included in the query output • Columns in the WHERE clause do not have to be included in the SELECT list
ORDER BY Clause: Overview • Optional • Used to present results ordered in ascending (ASC, the default) or descending order (DESC) • If you use the ORDER BY statement, the columns that you are ordering by have to be included in the SELECT statement. • If this clause is omitted, the data will appear in the order it is in the source table
Second Example Query • We want to reduce the number of cases returned by removing duplicates • Edit your first query, adding only the word “DISTINCT” SELECT DISTINCT name, cufs_area, area_name30 FROM sysadm.ps_dwhr_job WHERE jobcode = ‘9370’ ORDER BY name • Leave “Choose Row Limit” and “Print to File” dialogs on their default values (“10 Rows Paging” and “No File Type Selected”), and click “Submit Query”.
SELECT DISTINCT • SELECT DISTINCT • Optional - Used to indicate that you want unique rows (or records) of data for the column(s) specified, thus reducing duplicates • Eliminates rows that are identical on all selected columns • Note that values for columns in the record (row) other than those selected may be different
Second Query Example Results • Switching from SELECT to SELECT DISTINCT eliminated only one row, where all the values for the selected columns were identical • Note that the same individuals still appear on multiple rows, but now those rows indicate different offices where those people worked • If any of these names are familiar, you may well notice that some of them are no longer at the University of Minnesota, and have not been for some time • Clearly, we need to use more criteria to reduce the list
WHERE Clause: Overview • Quick reminder from before: • Optional, if omitted all rows are returned • Uses Logical operators to define criteria • Thus, a WHERE clause indicates the rows to be included in the query output • Columns in the WHERE clause do not have to be included in the SELECT list
WHERE Clause: Operators • Comparisons (=, <, >, >=, <=, !=, <>) • WHERE efc > unmet_need • Combinations of Logical Conditions (AND, OR, NOT) • WHERE unmet_need < 2000 OR efc > 15000 • WHERE effdt > 7/01/2005 AND effdt < 6/30/2006 • Ranges (BETWEEN and NOT BETWEEN) • WHERE effdt BETWEEN 7/01/2005 AND 6/30/2006 • Lists (IN, NOT IN) • WHERE institution IN(‘umntc’) • WHERE ethnicity NOT IN (‘Unknown’,’’,NS) • Unknown Values (IS NULL and IS NOT NULL) • WHERE ethnicity IS NULL • Character Matches (LIKE and NOT LIKE). • Outlines patterns that must be included in quotes by using one or more wildcard symbols (% means any string of zero or more characters; _ means any single character) • WHERE zip LIKE ‘5%’ • WHERE zip NOT LIKE ‘55417’
WHERE: How it Works • Follows a sequential order of operations for arithmetic and logical operators: • Parentheses • Multiplication and Division, • Subtraction and Addition, • NOT, • AND, and • OR • Reads and performs operations left to right
WHERE Clause: Conjunctions • Conjunctions in WHERE clause • AND means the row must meet all of the listed criteria • OR means the row must meet the criteria listed before the OR clause, or the criteria after the where clause • Order and grouping are critical, use parentheses to adjust order of operations • Examples – note the differences in these criteria: • WHERE jobcode = ‘9370’ AND status_flg = ‘C’ AND empl_status = ‘A’ • WHERE jobcode = ‘9370’ AND status_flg = ‘C’ OR empl_status = ‘A’ • WHERE jobcode = ‘9370’ AND (status_flg = ‘C’ AND empl_status = ‘A’)
WHERE Clause: IN & NOT IN • To select rows with many different values that are not in a simple range, you can use several OR statements • Example: WHERE area = ‘466’ or area = ‘481’ or area = ‘502’ • Alternatively, the same statement can be constructed using an IN statement Example: WHERE area IN (‘466’,’481’,’502’) • NOT IN can be used to exclude a set of values Example: WHERE area NOT IN (‘999’,’000’,’ZZZ’)
WHERE: NULL VALUES • Not all rows have data in them for all columns • Where data is actually missing, rather than coded as zero, “missing”, “unknown”, etc., are called null values • Nulls are not actual values, and therefore cannot be referred to in any calculation or comparison • Instead, separate operators are used in the WHERE clause to refer to null values: • IS NULL (example: WHERE cum_gpa IS NULL) • IS NOT NULL (example: WHERE cum_gpa IS NOT NULL) • Sometimes lead to compound statements • WHERE act_comp = 0 OR act_comp IS NULL
Null values in SELECT • In addition to the WHERE clause, null values can be addressed in the SELECT clause • To prevent null values from appearing as the word “NULL” or generally to provide a meaningful label, use the NVL command to replace the null values • Example: SELECT DISTINCT NVL(ethnic_group,’Missing’) FROM sysadm.ps_dwsa_stix_1069 • If the column contains string data, remember to use the single quotes in specifying the replacement value
Third Example Query • Use multiple criteria to reduce rows returned • Edit your previous query to add additional criteria: SELECT DISTINCT name, cufs_area, area_name30 FROM sysadm.ps_dwhr_job WHERE jobcode = ‘9370’ AND status_flg = ‘C’ AND empl_status = ‘A’ ORDER BY name • Submit the query and view results, there should now be four records
Aside: DW Codebooks • How do we know what columns to use in the previous queries? • Click on button marked “Information” at top of Data Warehouse query page (right-clicking and selecting “open in a new window” will make it possible to switch back and forth between codebooks and query window) • Top option on list is “Data Dictionaries”, click here • Click on “Online HR-Job and Demographics Data Dictionary” • Click on “Individual Table Descriptions” • Click on “PS_DWHR_JOB” • Click on “Click Here to See Layout and Sources”
DW Codebooks Continued • Table displayed has information on the columns • # - position in the table, not important for our use • Column Name – label of column to be used in queries • Column Descr – short, sometimes helpful description of column • Type/Size – Data type (string, numeric, etc.) and number of characters in the column • Table Source – PeopleSoft “setup” table from which the column information is taken when building the data warehouse tables • Column Source – Column in the source table that is imported, or “calculated” if the data is created when the table is built • Clicking on a column name will bring up a longer description and often allow you to run a definition query
SELECT: Aliases and Wildcards • To see all of the columns in a table, use the wildcard * in place of the column names • Example: SELECT * FROM ps_area_tbl • This is the simplest form of query that can be executed • To rename a column, give it an “alias” by following the original column name with your desired column name • Example: SELECT emplid AS student_id • Note that the word “AS” is optional in most systems, but is generally helpful for readibility
SELECT: Substring Function • May need only part of a string column • Example: the first four digits of University of Minnesota academic plans indicates the program, the next three the degree sought, and the last two the college that owns the plan. SELECT acad_plan(substring,1,4) as major_code SELECT acad_plan(substring,5,3) as degree_sought SELECT acad_plan(substring,8,2) as college_code