390 likes | 506 Views
SQL. What is SQL?. DML DDL Partially declarative Based on the algebra via the tuple calculus, and therefore its core has an elegant set-theoretic foundations Provides a sound foundation for mathematically precise optimization
E N D
What is SQL? • DML • DDL • Partially declarative • Based on the algebra via the tuple calculus, and therefore its core has an elegant set-theoretic foundations • Provides a sound foundation for mathematically precise optimization • Has very simple structure; optimized for moving large numbers of tuples at once • Set-based, retrieval-based
What are relational databases? Tables and SQL, a programming language engineered for high volume data applications “must always be correct” data management apps transaction-based applications non-network, non-object based data
Details of Select, continued • Select * means all columns (attributes) • Using arithmetic • SELECT invoice_total - payment_total – credit_total AS balance_due • Using a function • SELECT CONCAT(first_name, ' ', last_name) AS full_name • Renaming a column • SELECT invoice_number AS "Invoice Number", invoice_date AS Date, invoice_total AS Total FROM invoices
Calculations & concatenation & strings SELECT invoice_total, payment_total, credit_total, invoice_total - payment_total - credit_total AS balance_due FROM invoices SELECT vendor_city, vendor_state, CONCAT(vendor_city, vendor_state) FROM vendors SELECT vendor_contact_first_name,vendor_contact_last_name CONCAT(LEFT(vendor_contact_first_name, 1), LEFT(vendor_contact_last_name, 1)) AS initials FROM vendors
Date function, round function SELECT invoice_date, DATE_FORMAT(invoice_date, '%m/%d/%y') AS 'MM/DD/YY', DATE_FORMAT(invoice_date, '%e-%b-%Y') AS 'DD-Mon-YYYY' FROM invoices SELECT invoice_date, invoice_total, ROUND(invoice_total) AS nearest_dollar, ROUND(invoice_total, 1) AS nearest_dime FROM invoices
Where clause format WHERE [NOT] search_condition_1 {AND|OR} [NOT] search_condition_2 ... WHERE NOT (invoice_total >= 5000 OR NOT invoice_date <= '2011-08-01') WHERE invoice_total < 5000 AND invoice_date <= '2011-08-01'
In phrase and nested selects WHERE test_expression [NOT] IN ({subquery|expression_1 [, expression_2]...}) WHERE vendor_id IN (SELECT vendor_id FROM invoices WHERE invoice_date = '2011-07-18')
Like clause WHERE vendor_city LIKE 'SAN%' Cities that will be retrieved “San Diego”, “Santa Ana” WHERE vendor_name LIKE 'COMPU_ER%' Vendors that will be retrieved “Compuserve”, “Computerworld”
Order By clause SELECT vendor_name, CONCAT(vendor_city, ', ', vendor_state, ' ', vendor_zip_code) AS address FROM vendors ORDER BY vendor_name
Chapter 4: Manipulating multiple tables • Join • Equijoin • Natural join • Outer join • Self join • N-way join
Explicit versus implicit joins SELECT vendor_name, invoice_number, invoice_date, line_item_amount, account_description FROM vendors v JOIN invoices i ON v.vendor_id = i.vendor_id JOIN invoice_line_items li ON i.invoice_id = li.invoice_id JOIN general_ledger_accountsgl ON li.account_number = gl.account_number WHERE invoice_total - payment_total - credit_total > 0 ORDER BY vendor_name, line_item_amount DESC SELECT invoice_number, vendor_name FROM vendors v, invoices i WHERE v.vendor_id = i.vendor_id ORDER BY invoice_number
Left outer join and natural join SELECT vendor_name, invoice_number, invoice_total FROM vendors LEFT JOIN invoices ON vendors.vendor_id = invoices.vendor_id ORDER BY vendor_name SELECT invoice_number, vendor_name FROM vendors NATURAL JOIN invoices ORDER BY invoice_number
Union operator SELECT 'Active' AS source, invoice_number, invoice_date, invoice_total FROM active_invoices WHERE invoice_date >= '2011-06-01' UNION SELECT 'Paid' AS source, invoice_number, invoice_date, invoice_total FROM paid_invoices WHERE invoice_date >= '2011-06-01' ORDER BY invoice_total DESC
Chapter 5: Aggregates in queries AVG([ALL|DISTINCT] expression) SUM([ALL|DISTINCT] expression) MIN([ALL|DISTINCT] expression) MAX([ALL|DISTINCT] expression) COUNT([ALL|DISTINCT] expression) COUNT(*) – counts nulls
Issues with aggregate data • This is when we intentionally create a new form of “object” and shift to ones that are deliberately not related to individual tuples • It’s selections and projections that often unintentionally lose a perspective on object identity
Aggregate examples SELECT COUNT(*) AS number_of_invoices, SUM(invoice_total – payment_total – credit_total) AS total_due FROM invoices WHERE invoice_total – payment_total – credit_total > 0 SELECT 'After 1/1/2011' AS selection_date, COUNT(*) AS number_of_invoices, ROUND(AVG(invoice_total), 2) AS avg_invoice_amt, SUM(invoice_total) AS total_invoice_amt FROM invoices WHERE invoice_date > '2011-01-01‘ SELECT 'After 1/1/2011' AS selection_date, COUNT(*) AS number_of_invoices, MAX(invoice_total) AS highest_invoice_total, MIN(invoice_total) AS lowest_invoice_total FROM invoices WHERE invoice_date > '2011-01-01'
Aggregate examples, continued SELECT MIN(vendor_name) AS first_vendor, MAX(vendor_name) AS last_vendor, COUNT(vendor_name) AS number_of_vendors FROM vendors SELECT COUNT(DISTINCT vendor_id) AS number_of_vendors, COUNT(vendor_id) AS number_of_invoices, ROUND(AVG(invoice_total), 2) AS avg_invoice_amt, SUM(invoice_total) AS total_invoice_amt FROM invoices WHERE invoice_date > '2011-01-01'
Having clause (with group by) SELECT select_list FROM table_source [WHERE search_condition] [GROUP BY group_by_list] [HAVING search_condition] [ORDER BY order_by_list]
Group by and Having: complexities • Group by: how to vertically group tuples • Having: which groups will be included in the final result • Note: All of this is after the Where clause is applied • Group by: based on columns or expressions that have columns in them • If there are any calculations done in the Select clause, this happens after the Group by clause; i.e., it is performed for each group that results from the Group by • Group by can be nested if you specify more than one column • Order by operators are performed after the Having
Having example, group by example SELECT vendor_id, ROUND(AVG(invoice_total), 2) AS average_invoice_amount FROM invoices GROUP BY vendor_id HAVING AVG(invoice_total) > 2000 ORDER BY average_invoice_amountDESC (Note: 2 is the number of decimals in the result.) SELECT vendor_name, COUNT(*) AS invoice_qty, ROUND(AVG(invoice_total),2) AS invoice_avg FROM vendors JOIN invoices ON vendors.vendor_id = invoices.vendor_id WHERE invoice_total > 500 GROUP BY vendor_name ORDER BY invoice_qty DESC
These are the same… SELECT invoice_date, COUNT(*) AS invoice_qty, SUM(invoice_total) AS invoice_sum FROM invoices GROUP BY invoice_date HAVING invoice_date BETWEEN '2011-05-01' AND '2011-05-31' AND COUNT(*) > 1 AND SUM(invoice_total) > 100 ORDER BY invoice_dateDESC SELECT invoice_date, COUNT(*) AS invoice_qty, SUM(invoice_total) AS invoice_sum FROM invoices WHERE invoice_date BETWEEN '2011-05-01' AND '2011-05-31' GROUP BY invoice_date HAVING COUNT(*) > 1 AND SUM(invoice_total) > 100 ORDER BY invoice_date DESC
We will shift our focus a bit… • We’ll run the actually queries that appear in chapters 5 onward • The slides will contain overview material and not the actual queries
Chapter 6: subqueries • Often used to pass an aggregate value to a parent query • Often a good way to book-keep what might have been a very complex WHERE clause, with perhaps a multiway join • A good way to make a query look more readable to someone who uses it later • Subqueries can also be reused in other queries • Note: you cannot use the SELECT attributes from the embedded query in the parent query unless it directly references the appropriate table(s) in the outer FROM clause
Subqueries, continued • Important tool: IN operator, which is a set “element of” operator (written with an epsilon), and you can negate the IN (not IN) • Often you use SOME or ALL or ANY, if you want to return multiple values (i.e., multiple tuples, perhaps with only one attribute each) • The default if you don’t use one of the operators above is to return a single value • ALL is a set “for all elements of” operator (written with an upside down A) • ANY is a set “there exists” operator (written with a backward E)
Subqueries, continued • The default is that a subquery only executes once, but you can use a “correlated” query so that it will run once for each row processed by the parent query. • This breaks the “execute from the inside out” paradigm on an uncorrelated subquery. • EXISTS is often used with correlated queries • You can put a subquery in a HAVING, FROM, or SELECT clause as well • But such queries get messy and we will skip this for now. • It’s a good idea to write and test subqueries independently whenever possible, unless they are trivial
Examples from chapter 6 • 7: passing a single value • 6: ANY • 3: IN • 8: NOT EXISTS • 5: ALL
Chapter 7: changing the database state • This is when we need a transaction protocol • Updates must never overlap with each other or with read-only queries • Read-only queries can overlap • But we want to increase through put by supporting as much “concurrency” as possible • Each transaction has the potential to update the DB state
2 Phase Transactions • Each SQL program is within a begin and end transaction pair • Each transaction has its own workspace for DB items it is going to update • Any transactions that overlap in execution time will appear to have run in some serial order • This is done by transactions requesting read and write locks (also known as shared and exclusive locks) • Read locks can be shared with other readers • Write locks cannot be shared with readers or writers • All locks held until the end of the transaction • They are released and then the changes that a transaction has made are moved to the DB
Serializability of transactions • The net effect is that the transactions that overlap in execution time appear to have run in some serial order • Transactions can be undone by throwing away the local store (conceptually, at least) • The write period at the end of the transaction must be atomic • The two phases: • Request read, write, upgrade locks (and wait on locks) and process • Release locks and move updates to the DB • There is a notion of “serializability” which means that the actual schedule of executed steps corresponds to some serial order of running the transactions
Interesting facts on transactions… • Various legal schedules might produce different results • A crash during phase two can lead the database inconsistent • The transaction manager uses a lot of overhead resources handling locks • We still need to be able to roll the database back and rerun transaction logs • User must control the nature of overlapping transactions or there might be very little true concurrency • In a distributed database, the lock manager is a bottleneck because all components of the database must move in lockstep
Updating data: changing the DB state INSERT INTO invoices VALUES (115, 97, '456789', '2011-08-01', 8344.50, 0, 0, 1, '2011-08-31', NULL); INSERT INTO invoices VALUES (116, 97, '456701', '2011-08-02', 270.50, 0, 0, 1, '2011-09-01', NULL), (117, 97, '456791', '2011-08-03', 4390.00, 0, 0, 1, '2011-09-02', NULL), (118, 97, '456792', '2011-08-03', 565.60, 0, 0, 1, '2011-09-02', NULL);
Updating tables, continued USE ex; INSERT INTO color_sample (color_number) VALUES (606); INSERT INTO color_sample (color_name) VALUES ('Yellow'); INSERT INTO color_sample VALUES (DEFAULT, DEFAULT, 'Orange'); INSERT INTO color_sample VALUES (DEFAULT, 808, NULL); INSERT INTO color_sample VALUES (DEFAULT, DEFAULT, NULL);
Updating tables, continued INSERT INTO invoice_archive SELECT * FROM invoices WHERE invoice_total - payment_total - credit_total = 0; INSERT INTO invoice_archive (invoice_id, vendor_id, invoice_number, invoice_total, credit_total, payment_total, terms_id, invoice_date, invoice_due_date) SELECT invoice_id, vendor_id, invoice_number, invoice_total, credit_total, payment_total, terms_id, invoice_date, invoice_due_date FROM invoices WHERE invoice_total - payment_total - credit_total = 0;
Updating, continued UPDATE invoices SET payment_date = '2011-09-21', payment_total = 19351.18 WHERE invoice_number = '97/522'; UPDATE invoices SET terms_id = 1 WHERE vendor_id = 95; UPDATE invoices SET credit_total = credit_total + 100 WHERE invoice_number = '97/522';
Updating, continued DELETE FROM general_ledger_accounts WHERE account_number = 306; DELETE FROM invoice_line_items WHERE invoice_id = 78 AND invoice_sequence = 2; DELETE FROM invoice_line_items WHERE invoice_id = 12; DELETE FROM invoice_line_items WHERE invoice_id IN (SELECT invoice_id FROM invoices WHERE vendor_id = 115);