310 likes | 439 Views
CS186 Class Wrap-Up. R&G Chapters 1-28 Lecture 28. Administrivia. Final Exam Friday 12/12, 5pm – 8pm, Room 4 LeConte You may have 2 pages of notes, both sides The exam is cumulative Final Exam Review Tuesday 12/9, 1pm-3pm, 306 Soda Hall Homework 5 Due Monday, 12/8. News.
E N D
CS186 Class Wrap-Up R&G Chapters 1-28 Lecture 28
Administrivia • Final Exam • Friday 12/12, 5pm – 8pm, Room 4 LeConte • You may have 2 pages of notes, both sides • The exam is cumulative • Final Exam Review • Tuesday 12/9, 1pm-3pm, 306 Soda Hall • Homework 5 • Due Monday, 12/8
News • Winter Consulting’s 2003 survey of Largest DBs • http://mxtest.wintercorp.com/vldb/2003_TopTen_Survey/TopTenWinners.asp • The largest single database is 29,232 GB! • That’s a single database at France Telecom • Many companies have TBs of data, but usually spread out among multiple databases, file systems, etc. • In 2001, largest DB was ~10TB
News (cont.) – Top Transaction Processing DBs • Land Registry, 18.3 terabytes • BT plc, 11.7 terabytes • United Parcel Service, 9.0 terabytes • Caica Econômica Federal, 6.9 terabytes • US Patent and Trademark Office, 5.4 terabytes • Verizon Communications, 5.3 terabytes • Bureau of Customs and Border Protection, 4.1 TB • Hewlett Packard, 3.2 terabytes • Boeing, 3.1 terabytes • CheckFree Corp, 2.9 terabytes
News (cont) – Top Decision Support DBs • France Telecom, 29.2 terabytes • AT&T, 26.3 terabytes • SBC, 24.8 terabytes • Anonymous, 16.2 terabytes • Amazon.com, 13.0 terabytes • Kmart, 12.6 terabytes • Claria Corp., 12.1 terabytes • HIRA, 11.9 terabytes • FedEx Services, 10.0 terabytes • Vodafone, 9.1 terabytes
Lessons? (from the survey and this course) • DBs are a huge part of business today • Companies have *lots* of data • (imagine tuning UPSs database with 41 billion rows!) • DBs are based on theory of data modelling, with lots of practical data management on top • nice mix of theoretical and practical • In most jobs, useful to understand how DBs work
Today • What topics did we cover? • What topics did we *not* cover?
First, what topics did we not cover? • In the book: • Chapter 21 – Security and Authorization • Chapter 22 – Parallel and Distributed DBs • Chapter 23 – Object-Database Systems • Chapter 24 – Deductive Databases • Chapter 25 – Data Warehousing and Decision Support • Chapter 27 – XML Data • Chapter 28 – Spatial Data Management • Not in the book • Federated Databases...
And what topics did we cover? • Chapters 1-20, and 26
Overview of Database Systems • What is a Database? A Database System? • What are the useful characteristics of DBs? • When should you use a database? When is the file system better?
2. Database Design/ER Models • Databases support many levels of abstraction • possible to design at abstract level in one form, store data in very different form • The E-R Model • Useful for design, easier for human to understand • Specify entities, attributes, relationships • Possible to convert ER schemas to Relational Schemas
3. The Relational Model • Most common data model for databases • Based on tables: rows and columns • Tables connected using key/foreign keys • Integrity Constraints • Domain constraints for field values • Referential integrity for keys/foreign keys • Other constaints specified by real world • e.g. 0.0 <= gpa <= 4.0
4. Relational Algebra and Calculus • Relational algebra • Operators that act on sets of tuples • σ, Π, , –, , , etc. • “procedural” • Relational Calculus • Uses first-order logic to describe query result • does not describe how to get result, i.e. declaritive • studied Tuple Relational Calculus, variables are tuples {S |S Sailors S.rating > 7}
5. SQL: Queries, Constraints, Triggers • Data Definition Language (DDL) • Create Table • Constraints & Triggers • Data Manipulation Language (DML) SELECT [DISTINCT] target-list FROM relation-list WHERE qualification GROUP BY grouping-list HAVING group-qualification • Set Operations, subqueries, etc.
6. Database Applications • How to access DBs from programs • embedded SQL, SQLJ • Dynamic APIs: ODBC, JDBG • Cursors: a way to iterate over relations • Stored procedures in database language • Accessing other programs from databases • Extending postgres with C code
7. Internet Applications • Internet basics: URIs, HTTP stateless protocol • Web data formats: XML, HTML, DTD • Different architectures • Single-tier • Client-server (thick or thin client) • Three-tier architecture • Web browser/thin client • App server running business logic • Database maintaining data
8. Storage and Indexing • Different file organizations • Heap Files (unordered) • Sorted Files • Clustered Files • Unclustered Tree • Unclustered Hash • Tradeoffs in I/O costs for various operations
9. Storing Data: Disks and Files • Hierarchy of storage • Keeping data in files on disk • How to arrange fields into records • How to arrange records into pages • How to arrange pages into files • Managing disk and memory • Buffer management • LRU, MRU, Clock, etc.
10. Tree-Structured Indexes • Trees best for range queries, o.k. for equality • ISAM • less common, usually best for data that doesn’t change • index doesn’t adjust, instead uses overflow pages if leaves fill • B-Trees • present in virtually all databases • tree adjusts index to stay balanced • you should understand these pretty well after Hw4
11. Hash-Based Indexes • Hash indexes best for equality, useless for range queries • Static hashing • only good when data doesn’t change • uses overflow buckets • Extendible hashing • uses directory of buckets, when overflow, double directory size • never needs overflow buckets • Linear hashing • no directory, just a number indicating which buckets have split • may need overflow buckets, but doesn’t need directory
12. Overview of Query Evaluation • System catalogs – info about all tables • includes statistics about field values • Access paths – how to get at tuples • file scan, indexes • Query plan – tree of relational operators
13. External Sorting • Database can sort any amount of info, even if it doesn’t fit in memory • Sort runs that fit in memory, then merge sorted runs together • Used in Hw5
14. Evaluating Relational Operators • How to implement: • Selection • Projection • Join Algorithms: • Nested Loops • Indexed Nested Loops • Sort-Merge Join • Hash-Join
15. A Typical Relational Query Optimizer • Break query into query blocks • Enumerate possible query plans • Evaluate cost for each, choose cheapest
16. Overview of Transactions • Transactions, unit of atomicity • ACID properties • anomolies with concurrent execution • Introduction to logging
17. Concurrency Control • Anomalies • Precedences Graphs • Schedule Charateristics • Seriazable, View Serializable, Conflict Serializable, Recoverable, Avoids Cascading Abort, Strict • Locking approaches: 2PL, strict 2PL • dealing with deadlock • Hierarchical locking • Locking in B-Trees • Non-locking approaches • Optimistic CC • Timestamp CC • Multiversion CC
18. Crash Recovery • Effects of Buffer Management on recovery • Write-ahead log • Transaction abort • Checkpointing • Aries algorithm • Analysis phase • Redo phase • Undo phase
19. Schema Refinement & Normal Forms • Functional dependencies • A B, whenever A is the same, B must be same • FDs allow us to determine candidate keys, normal forms, qualities of decomposition • Tradeoffs between data replication, dependency preservatn • Always must have lossless join decompositions • BCNF has little replication, may need to join to check FDs • 3NF may have replication, but can preserve FDs
20. Physical Database Design and Tuning • Once a DB is running, many changes may improve performance • First need to understand workload • What are typical queries? Which queries are most important? • Indexes – what will improve queries • Schema Changes • denormalize to reduce joins • supernormalize to reduce table size, contention • Rewriting Queries • avoid queries that the optimizer will do poorly on
24. Data Mining • What is Data Mining? • Process of Data Mining • Different classes of DM Algorithms • Supervised • Unsupervised
Summary • Databases are highly important today • DB Design based on theoretical foundation • Numerous practical/implementation issues addressed to make them run efficiently • This course covered enough practical and theoretical so you can use and understand DBs