1 / 41

Final Exam Review

Final Exam Review. Lecture 31. Administrivia. Office hours 1:15 – 2:15 today also available via e-mail: ebenh@eecs.berkeley.edu TAs will have extra office hours, see class web page, news group Final Exam May 16 8-11 a.m. Location: 55 Warren Closed book, 2 pages of notes, both sides

eytan
Download Presentation

Final Exam Review

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Final Exam Review Lecture 31

  2. Administrivia • Office hours 1:15 – 2:15 today • also available via e-mail: ebenh@eecs.berkeley.edu • TAs will have extra office hours, • see class web page, news group • Final Exam May 16 8-11 a.m. • Location: 55 Warren • Closed book, 2 pages of notes, both sides • IEEE will provide pastries and juice starting at 7:30!

  3. Final Exam Topics • Up to midterm 1 (25%) • Relational Model & Query Languages (Roth) • Relational Algebra and Calculus • SQL • Database Implementation (Haber) • Disks, buffers, files • Indexes: B-Trees, Hash Indexes • Between midterm 1 & midterm 2 (25%) • Query Execution • Relational Operators (Haber) • Sorting (Haber) • Joining (Haber) • Query Optimization (Roth) • Since midterm 2 (48%) • Database Design (Haber) • The ER Model • Functional Dependencies & Normalization • Transactions, Concurrency Control, & Recovery (Roth) • Guest Lectures (2%)

  4. Why are databases interesting? • Theoretical foundation • Modelling structure of information • Relations: sets of identically structured tuples • Design constraints: FDs and correct decompositions • Formal query languages • Algebra: operators on relations, return relations • Calculus: declarative specification of query result • Practical application of theory • Using computer structures • pages, files, memory, buffer pools, indexes • ACID properties (xacts, concur. control, recovery) • Reasonable efficiency

  5. Review Outline • Up to midterm 1 • Relational Model & Query Languages • Relational Algebra and Calculus • SQL • Database Implementation • Disks, buffers, files • Indexes: B-Trees, Hash Indexes • Between midterm 1 & midterm 2 • Query Execution • Relational Operators • Sorting • Joining • Query Optimization • Since midterm 2 • Database Design • The ER Model • Functional Dependencies & Normalization • Transactions, Concurrency Control, & Recovery

  6. DBMS components • Talks to DBMS to manage data for a specific task • -> e.g. app to withdraw/deposit money or provide a history of the account Database application Query Optimization and Execution • Figures out the best way to answer a question • -> There is always more than 1 way to skin a cat…! • Provides generic ways to combine data • -> Do you want a list of customers and accounts or the total account balance of all customers? Relational Operators • Provides efficient ways to extract data • -> Do you need 1 record or a bunch? Access Methods • Makes efficient use of RAM • -> Think 1,000,000 simultaneous requests! Buffer Management • Makes efficient use of disk space • -> Think 300,000,000 accounts! Disk Space Management DB

  7. The Storage Hierarchy Smaller, Faster • Main memory (RAM) for currently used data. • Disk for the main database (secondary storage). • Tapes for archiving older versions of the data (tertiary storage). Bigger, Slower Source: Operating Systems Concepts 5th Edition

  8. Arm movement Disks are slow. Why? Transfer time • Time to access (read/write) a disk block: • seek time (moving arms to position disk head on track) • rotational delay (waiting for block to rotate under head) • transfer time (actually moving data to/from disk surface) Seek time Rotational delay

  9. Disk Space Manager • Lowest layer of DBMS software manages space on disk (using OS file system or not?). • Higher levels call upon this layer to: • allocate/de-allocate a page • read/write a page • Best if a request for a sequence of pages is satisfied by pages stored sequentially on disk! • Responsibility of disk space manager. • Higher levels don’t know how this is done, or how free space is managed. • Though they may make performance assumptions! • Hence disk space manager should do a decent job.

  10. DB Buffer Management in a DBMS Page Requests from Higher Levels • Buffer pool information table contains: <frame#, pageid, pin_count, dirty> BUFFER POOL disk page free frame MAIN MEMORY DISK choice of frame dictated by replacement policy

  11. Buffer Management • Keeps a group a disk pages in memory • Records whether each is pinned • What happens when all pages pinned? • Whan happens when a page is unpinned? • Replacement • When all frames used, but not pinned, and new page requested? • How is the replaced page chosen? • Least Recently Used (LRU) • Most Recently Used (MRU) • Clock • Advantages? Disadvantages?

  12. What is in Database Pages? • Database contains files, which are made up of… • Pages, which are made up of… • Records, which are made up of… • Fields, which hold single values.

  13. Rid = (i,N) Page i Rid = (i,2) Rid = (i,1) Pointer to start of free space N 16 24 20 N . . . 2 1 # slots SLOT DIRECTORY How are records/pages organized? • depends on whether fields variable, or fixed length • In Minibase, array of type/offsets, followed by data. • depends on whether records variable, fixed length. • Minibase: slot array at beginning of page, records compacted at end of page. F1 F2 F3 F4 Array of Field Offsets

  14. Data Page 1 Header Page Data Page 2 Data Page N DIRECTORY How are files organized? • Unordered Heap File: chained directory pages, containing records that point to data pages. • Other possibilities: sorted files, clustered indexes, unclustered index + heap file • Many tradeoffs between them

  15. B: The number of data pages R: Number of records per page F: Fanout of B-Tree S: Time required for equality search * Don’t Use Index I/O Cost of Operations

  16. Indexes • Can be used to store data records (alt 1), or be an auxillary data structure that referrs to existing file of records (alt 2, 3) • Many types of index (B-Tree, Hash Table, R-Tree, etc.) • How do you choose the right index? • Difference between clustered and unclustered indexes? Index entries UNCLUSTERED CLUSTERED direct search for data entries Data entries Data entries (Index File) (Data file) Data Records Data Records

  17. Review Outline • Up to midterm 1 • Relational Model & Query Languages • Relational Algebra and Calculus • SQL • Database Implementation • Disks, buffers, files • Indexes: B-Trees, Hash Indexes • Between midterm 1 & midterm 2 • Query Execution • Relational Operators • Sorting • Joining • Query Optimization • Since midterm 2 • Database Design • The ER Model • Functional Dependencies & Normalization • Transactions, Concurrency Control, & Recovery

  18. Review: Query Processing • Queries start out as SQL • Database translates SQL to one or more Relational Algebra plans • Plan is a tree of operations, with access path for each • Access path is how each operator gets tuples • If working directly on table, can use scan, index • Some operators, like sort-merge join, or group-by, need tuples sorted • Often, operators pipelined, getting tuples that are output from earlier operators in the tree • Database estimates cost for various plans, chooses least expensive

  19. Cost of Operations • Selections • Projections • Sorting, a.k.a. Order By • Removing duplicates, a.k.a. Select Distinct • Joins

  20. Selections: “age < 20”, “fname = Bob”, etc • No index • Do sequential scan over all tuples • Cost: N I/Os • Sorted data • Do binary search • Cost: log2(N) I/Os • Clustered B-Tree • Cost: 2 or 3 to find first record + 1 I/O for each #qualifying pages • Unclustered B-Tree • Cost: 2 or 3 to find first RID + ~1 I/O for each qualifying tuple • Clustered Hash Index • Cost: ~1.2 I/Os to find bucket, all tuples inside • Unclustered Hash Index • Cost: ~1.2 I/Os to find bucket, + ~1 I/O for each matching tuple

  21. Projection • Expensive when eliminating duplicates • Can do this via: • Sorting: cost no more than external sort • Cheaper if you project columns in initial pass, since more projected tuples fit in each page. • Hashing: build a hash table, duplicates will end up in the same bucket

  22. Sorting • External Merge Sort • Minimum amount of memory: 3 pages • Initial runs of 3 pages • Then 2-way merge of sorted runs (2 pages for inputs, one for outputs) • #of passes: 1 + log2(N/3) • With more memory, fewer passes • With B pages, #of passes: 1 + log(B-1)(N/B) • I/O Cost = 2N * (# of passes) • Using B+ Trees for Sorting • Idea: • Retrieve records in order by traversing leaf pages. • Is this a good idea? Cases to consider: • B+ tree is clusteredGood idea! • B+ tree is not clusteredCould be a very bad idea! • I/O Cost • Clustered tree: ~ 1.5N • Unclustered tree: 1 I/O per tuple, worst case!

  23. Remove duplicates with Hashing • Idea: • Many ops don’t need the data ordered • e.g.: removing duplicates in DISTINCT • e.g.: finding matches in JOIN • Good enough to match all tuples with equal values • Hashing does this! • And may be cheaper than sorting! (Hmmm…!) • But how to do it for data sets bigger than memory?? • If we can hash in two passes -> cost is 4N • How big of a table can we hash in two passes? • B-1 “partitions” result from Phase 0 • Each should be no more than B pages in size • 2 passes possible if table smaller than B(B-1) i.e.: can hash a table of size N pages in about √N space • Note: assumes hash function distributes records evenly! • Have a bigger table? Recursive partitioning!

  24. Sorting vs Hashing • Based on our simple analysis: • Same memory requirement for 2 passes • Same IO cost • Digging deeper … • Sorting pros: • Great if input already sorted (or almost sorted) • Great if need output to be sorted anyway • Not sensitive to “data skew” or “bad” hash functions • Hashing pros: • Highly parallelizable • Can exploit extra memory to reduce # IOs

  25. Nested Loops Joins • R, with M pages, joins S, with N Pages • Nested Loops • Simple nested loops • Insanely inefficient M + PR*M*n • Paged nested loops – only 3 pages of memory • M + M*N • Blocked nested loops – B pages of memory • M + M/(B-2) * N • If M fits in memory (B-2), cost only M + N • Index nested loops • M + PR*M* index cost • Only good in M verysmall

  26. Sort-Merge Join • Simple case: • sort both tables on join column • Merge • Cost: external sort cost + merge cost • 2M*(1 + log(B-1)(M/B)) + 2N*(1 + log(B-1)(N/B)) + M + N • Optimized Case: • If we have enough memory, do final merge and join in same pass. This avoids final write pass from sort, and read pass from merge • Can we merge on 2nd pass? Only in #runs from 1st pass < B • #runs for R is M/B. #runs for S is N/B. • Total #runs ~~ (M+N)/B • Can merge on 2nd pass if M+N/B < B, or M+N < B2 • Cost: 3(M+N)

  27. Cost of Hash Join • Partitioning phase: read+write both relations  2(|R|+|S|) I/Os • Matching phase: read+write both relations |R|+|S| I/Os • Total cost of 2-pass hash join = 3(|R|+|S|) Q: what is cost of 2-pass merge-sort join? Q: how much memory needed for 2-pass sort join? Q: how much memory needed for 2-pass hash join?

  28. Summary: Hashing vs. Sorting • Sorting pros: • Good if input already sorted, or need output sorted • Not sensitive to data skew or bad hash functions • Hashing pros: • Often cheaper due to hybrid hashing • For join: # passes depends on size of smaller relation • Highly parallelizable

  29. Review Outline • Up to midterm 1 • Relational Model & Query Languages • Relational Algebra and Calculus • SQL • Database Implementation • Disks, buffers, files • Indexes: B-Trees, Hash Indexes • Between midterm 1 & midterm 2 • Query Execution • Relational Operators • Sorting • Joining • Query Optimization • Since midterm 2 • Database Design • The ER Model • Functional Dependencies & Normalization • Transactions, Concurrency Control, & Recovery

  30. Review: Database Design • Requirements Analysis • user needs; what must database do? • Conceptual Design • high level descr (often done w/ER model) • Logical Design • translate ER into DBMS data model • Schema Refinement • consistency, normalization • Physical Design - indexes, disk layout • Security Design - who accesses what

  31. name cost pname age ssn age Policy Dependents Employees Review: the ER Model • Entities and Entity Set (boxes) • Relationships and Relationship sets (diamonds) • binary • n-ary • Key constraints (1-1,1-M, M-N, arrows on 1 side) • Participation constraints (bold for Total) • Weak entities - require strong entity for key

  32. ISA (`is a’) Hierarchies name ssn lot • attributes inherited. • If we declare A ISA B, every A entity is also considered to be a B entity. Employees • Overlap constraints: Can Simon be an Hourly_Emps as well as a Contract_Emps entity? (Allowed/disallowed) • Covering constraints: Does every Employees entity also have to be an Hourly_Emps or a Contract_Emps entity? (Yes/no) • Conversions between Relational schema <-> ER Diagram hours_worked hourly_wages ISA contractid Contract_Emps Hourly_Emps

  33. Review: Functional Dependencies • Properties of the real world • Decide when to decompose relations • Help us find keys • Help us evaluate Design Tradeoffs • Want to reduce redundancy, avoid anomalies • Want reasonable efficiency • Must avoid lossy decompositions • F+: closure, all dependencies that can be inferred from a set F • A+: attribute closure, all attributes functionally determined by the set of attributes A • G: minimal cover, smallest set of FDs such that G+ == F+

  34. Problems Due to R  W • Update anomaly: Can we modify W in only the 1st tuple of SNLRWH? • Insertion anomaly: What if we want to insert an employee and don’t know the hourly wage for his or her rating? (or we get it wrong?) • Deletion anomaly: If we delete all employees with rating 5, we lose the information about the wage for rating 5! Hourly_Emps

  35. Review: Normal Forms • A property of a single relation • Tells us something about redundancy in reln • Reln R with FDs F is in BCNF if, for all X  A in F+ A  X (called a trivial FD), or X is a superkey for R. • Reln R with FDs F is in 3NF if, for all X  A in F+ A  X (called a trivial FD), or X is a superkey of R, or A is part of some candidate key (not superkey!) for R. (sometimes stated as “A is prime”)

  36. Review: Decomposition • If reln violates normal form, decompose • but must have lossless decomposition • Lossless decomposition: • decomposition of R into X and Y is lossless if and only if X  Y is a key for either X or Y • If W  Z holds over R and (W  Z) is empty, then decomposition of R into R-Z and WZ is loss-less. • Algorithm: • For each FD W  Z in R that violates normal form, decompose R into R-Z and WZ. Repeat as needed. • Order not important, but can produce very different results

  37. Review: Dependency Preservation • decompose too much, and it might be necessary to join tables to check FDs • decomposition of R into X and Y is dependencypreserving if (FX FY ) + = F + • FX is all FDs involving only attributes in X • FY is all FDs involving only attributes in Y • Not always obvious • ABC, A  B, B  C, C  A, decomposed into AB and BC. • Is this dependency preserving? Is C  A preserved? • note: F + contains F  {A  C, B  A, C  B}, so… • FAB contains A B and B  A; FBC contains B  C and C  B • So, (FAB FBC)+ contains C  A

  38. Minimal Cover for a Set of FDs • G: minimal cover, smallest set of FDs such that G+ == F+ • Closure of F = closure of G. • Right hand side of each FD in G is a single attribute. • If we modify G by deleting an FD or by deleting attributes from an FD in G, the closure changes. • Every FD in G is needed, and ``as small as possible’’ in order to get the same closure as F. • e.g., F+ = {A  B, B  C, C  A, B  A, C  B, A  C} • several minimal covers: {A  B, B  A, C  B, B  C} (AB + BC) • or {A  C, C  A, B  C, C  B} (AC + BC) • or {A  B, B  A, C  A, A  C} (AB + AC) • e.g., A  B, ABCD  E, EF  GH, ACDF  EG minimal cover: • A  B, ACD  E, EF  G and EF  H

  39. BCNF and Dependency Preservation • In general, there may not be a dependency preserving decomposition into BCNF. • But, you can always find dependency-preserving decomposition into 3NF • Top down: • Decompose until it is in 3NF • Compute minimal cover for FDs • If minimal cover contains a FD X  Y is not preserved, add reln XY • Bottom up: • Compute minimal cover • For each FD X  Y in minimal cover,create reln XY • Why does this work? Minimal cover doesn’t include redundant transitive dependencies, which don’t need to be preserved

  40. Questions? • Up to midterm 1 • Relational Model & Query Languages • Relational Algebra and Calculus • SQL • Database Implementation • Disks, buffers, files • Indexes: B-Trees, Hash Indexes • Between midterm 1 & midterm 2 • Query Execution • Relational Operators • Sorting • Joining • Query Optimization • Since midterm 2 • Database Design • The ER Model • Functional Dependencies & Normalization • Transactions, Concurrency Control, & Recovery

  41. Thank you! • See you next Wednesday.

More Related