280 likes | 410 Views
Info systems Spring 2013. Focus of class. Cutting edge topics in information systems Conceptual material, as opposed to coding If you don’t have a good knowledge of relational databases and SQL, now is a good time to study it… A good MySQL book: murach mysql. Class blog & reaching me.
E N D
Focus of class • Cutting edge topics in information systems • Conceptual material, as opposed to coding • If you don’t have a good knowledge of relational databases and SQL, now is a good time to study it… • A good MySQL book: murachmysql
Class blog & reaching me • http://rogerking.me • All assignments and announcements will be on the blog. • I will not send out class email. • Only read blog entries starting on December 30. • To contact me • buzzking@ymail.com • Do not use my CU mail, please. It is inundated with Spam. • 303 437 7419. • To talk to me, see me after class, or send email for an appointment. • rogerbuzzking on Skype, but I do not stay logged into Skype; you must call or email first.
Required books • books: • NoSQL Distilled and • Seven Databases in Seven Weeks
Assignments and exams • Two exams – material will be from • 2 required books • Slides posted on blog, along with corresponding lectures • Handouts posted on blog • Exam and projects • Midterm – 35 % of final grade • On March 11 • Second exam – 35 % of final grade • This will be on the last schedule day of class, NOT at the final slot • i.e., it will be on May 1 • Small projects/assignments – 30% of final grade total
The First Assignment • Build a simple application using Mongo • Groups of 1 or 2 • Due date to be posted on blog • The application will hold very large video and audio objects • More later…
Suggestions… • Please come to class. • Feel free to suggest references and/or topics. • Speak up at any time – it would be nice to have an interactive class.
Likely and possible topics • Post relational database systems • Full text processing • “Big” data • Databases and XML • Media and databases (images, video, audio, animation) • Personal information systems (not just databases) • Integrating diverse data sets • ???
Review of relational databases • Schema based technology • Small schema • Lots of data • SQL, set-based data manipulation • Multi-language application development • Strict 2 phase ACID transactions • Rigidly structured, identically structured, simply structured data • User defined types extensions for “objects in the small”
Relational basics • Tables (relations) • Columns (attributes) • Domains • Rows (tuples) • Primary and candidate keys • Foreign keys • Null values • Triggers • SQL as DDL and DML
Relational applications • A lot of data, but in amounts no longer considered huge • Transactions manipulating sets of rows • Complete accuracy is required • Not real time or interactive • If distributed, only a modest number of centrally controlled servers • If distributed, there is a single, unified schema • Complex objects built by connecting tables • Logic of data is mostly in schema and SQL, not app • Mix of reads and writes, without writes being minimal • SQL programs (run as transactions) compiled once, and optimized, run many times
Important facts about SQL DBs • Elegant, set theoretic underpinnings • Originally defined with an algebra: • Projection, join, union/difference/intersection, selection • Closed under this algebra • SQL is a tuple calculus based on the algebra • Declarative • Far more compact than algebra • Equally elegant algebra for optimization • Query results are legal view tables • Serious impedance mismatch between o-o languages and SQL
Syntax of SQL SELECT select_list FROM table_sourceWHERE search_condition
Example SELECT invoice_total, payment_total, credit_total, invoice_total - payment_total - credit_total AS balance_due FROM invoices WHERE NOT (invoice_total >= 5000 OR NOT invoice_date <= '2011-08-01')
Example SELECT 'Paid' AS source, invoice_number, invoice_date, invoice_total FROM paid_invoices WHERE invoice_date >= '2011-06-01' ORDER BY invoice_total DESC
Manipulating multiple tables • You can use the Where clause, T.color = R.color • You can use an explicit join • Equijoin, outerjoin SELECT vendor_name, invoice_number, invoice_total FROM vendors LEFT JOIN invoices ON vendors.vendor_id = invoices.vendor_id ORDER BY vendor_name
Aggregates AVG([ALL|DISTINCT] expression) SUM([ALL|DISTINCT] expression) MIN([ALL|DISTINCT] expression) MAX([ALL|DISTINCT] expression) COUNT([ALL|DISTINCT] expression) COUNT(*) – counts nulls
Example SELECT COUNT(*) AS number_of_invoices, SUM(invoice_total – payment_total – credit_total) AS total_due FROM invoices WHERE invoice_total – payment_total – credit_total > 0
Strict 2 phase transactions • Each SQL program is within a begin and end transaction pair • Each transaction has its own workspace for DB items it is going to update • Any transactions that overlap in execution time will appear to have run in some serial order • This is done by transactions requesting read and write locks (also known as shared and exclusive locks) • Read locks can be shared with other readers • Write locks cannot be shared with readers or writers • All locks held until the end of the transaction • They are released and then the changes that a transaction has made are moved to the DB
Serializability & transactions • The net effect is that the transactions that overlap in execution time appear to have run in some serial order • Transactions can be undone by throwing away the local store (conceptually, at least) • The write period at the end of the transaction must be atomic • The two phases: • Request read, write, upgrade locks (and wait on locks) and process • Release locks and move updates to the DB • There is a notion of “serializability” which means that the actual schedule of executed steps corresponds to some serial order of running the transactions
Major concept: DB “state” • Changes at the end of each transaction (potentially) • Relies on 2 Phase transactions • We tradeoff throughput in favor of having safe state changes
Normalization • Start with an entity model • Map to tables • Create PKs and FKs • Create other constraints • Normalize tables • Goals • Minimize redundant data • Minimize “update anomalies” • Minimize misinterpreted db by documenting more constraints • Problems • Skinny tables and more joins • Increases impedance mismatch
Dependencies • FD • We say that ai FD-> aj • Or “ai functionally determines aj” • MVD-> • We say that ai MVD-> aj • Or “ai multivalued determines aj” • Note: the right side of an FD or an MVD can be a set of attributes
Third normal form • First (1NF) The value stored at the intersection of each row and column must be a scalar value, and a table must not contain any repeating columns. • Second (2NF) Every non-key column must depend on the entire primary key. • Third (3NF) Every non-key column must depend only on the primary key.
Fixed 3NF and 4NF • Boyce-Codd (BCNF) A non-key column can’t be dependent on another non-key column. • Fourth (4NF) A table must not have more than one multivalued dependency, where the primary key has a one-to-many relationship to non-key columns.
Example Customer ID Address ZIP 18 112 First 80304 17 123 Ash 80303 16 123 Ash 80303 Break into two tables: Customer ID Address Address Zip
Example Mothers_PhoneFathers_PhoneChild_Name Break into: Mothers_PhoneChild_Name 3030000000 Sue 3031111111 Sue AndFathers_PhoneChild_Name 3032222222 Sue 3033333333 Sue Note: both fields needed for PK
Stored programs • Stored procedures (can be called by an application) • Stored functions (can be called by an SQL program) • Triggers (tied to an operation like INSERT) • Events (tied to a clock)