Final Exam Review

Final Exam Review Lecture 31

Administrivia • Office hours 1:15 – 2:15 today • also available via e-mail: ebenh@eecs.berkeley.edu • TAs will have extra office hours, • see class web page, news group • Final Exam May 16 8-11 a.m. • Location: 55 Warren • Closed book, 2 pages of notes, both sides • IEEE will provide pastries and juice starting at 7:30!

Final Exam Topics • Up to midterm 1 (25%) • Relational Model & Query Languages (Roth) • Relational Algebra and Calculus • SQL • Database Implementation (Haber) • Disks, buffers, files • Indexes: B-Trees, Hash Indexes • Between midterm 1 & midterm 2 (25%) • Query Execution • Relational Operators (Haber) • Sorting (Haber) • Joining (Haber) • Query Optimization (Roth) • Since midterm 2 (48%) • Database Design (Haber) • The ER Model • Functional Dependencies & Normalization • Transactions, Concurrency Control, & Recovery (Roth) • Guest Lectures (2%)

Why are databases interesting? • Theoretical foundation • Modelling structure of information • Relations: sets of identically structured tuples • Design constraints: FDs and correct decompositions • Formal query languages • Algebra: operators on relations, return relations • Calculus: declarative specification of query result • Practical application of theory • Using computer structures • pages, files, memory, buffer pools, indexes • ACID properties (xacts, concur. control, recovery) • Reasonable efficiency

Review Outline • Up to midterm 1 • Relational Model & Query Languages • Relational Algebra and Calculus • SQL • Database Implementation • Disks, buffers, files • Indexes: B-Trees, Hash Indexes • Between midterm 1 & midterm 2 • Query Execution • Relational Operators • Sorting • Joining • Query Optimization • Since midterm 2 • Database Design • The ER Model • Functional Dependencies & Normalization • Transactions, Concurrency Control, & Recovery

DBMS components • Talks to DBMS to manage data for a specific task • -> e.g. app to withdraw/deposit money or provide a history of the account Database application Query Optimization and Execution • Figures out the best way to answer a question • -> There is always more than 1 way to skin a cat…! • Provides generic ways to combine data • -> Do you want a list of customers and accounts or the total account balance of all customers? Relational Operators • Provides efficient ways to extract data • -> Do you need 1 record or a bunch? Access Methods • Makes efficient use of RAM • -> Think 1,000,000 simultaneous requests! Buffer Management • Makes efficient use of disk space • -> Think 300,000,000 accounts! Disk Space Management DB

The Storage Hierarchy Smaller, Faster • Main memory (RAM) for currently used data. • Disk for the main database (secondary storage). • Tapes for archiving older versions of the data (tertiary storage). Bigger, Slower Source: Operating Systems Concepts 5th Edition

Arm movement Disks are slow. Why? Transfer time • Time to access (read/write) a disk block: • seek time (moving arms to position disk head on track) • rotational delay (waiting for block to rotate under head) • transfer time (actually moving data to/from disk surface) Seek time Rotational delay

Disk Space Manager • Lowest layer of DBMS software manages space on disk (using OS file system or not?). • Higher levels call upon this layer to: • allocate/de-allocate a page • read/write a page • Best if a request for a sequence of pages is satisfied by pages stored sequentially on disk! • Responsibility of disk space manager. • Higher levels don’t know how this is done, or how free space is managed. • Though they may make performance assumptions! • Hence disk space manager should do a decent job.

DB Buffer Management in a DBMS Page Requests from Higher Levels • Buffer pool information table contains: <frame#, pageid, pin_count, dirty> BUFFER POOL disk page free frame MAIN MEMORY DISK choice of frame dictated by replacement policy

Buffer Management • Keeps a group a disk pages in memory • Records whether each is pinned • What happens when all pages pinned? • Whan happens when a page is unpinned? • Replacement • When all frames used, but not pinned, and new page requested? • How is the replaced page chosen? • Least Recently Used (LRU) • Most Recently Used (MRU) • Clock • Advantages? Disadvantages?

What is in Database Pages? • Database contains files, which are made up of… • Pages, which are made up of… • Records, which are made up of… • Fields, which hold single values.

Rid = (i,N) Page i Rid = (i,2) Rid = (i,1) Pointer to start of free space N 16 24 20 N . . . 2 1 # slots SLOT DIRECTORY How are records/pages organized? • depends on whether fields variable, or fixed length • In Minibase, array of type/offsets, followed by data. • depends on whether records variable, fixed length. • Minibase: slot array at beginning of page, records compacted at end of page. F1 F2 F3 F4 Array of Field Offsets

Data Page 1 Header Page Data Page 2 Data Page N DIRECTORY How are files organized? • Unordered Heap File: chained directory pages, containing records that point to data pages. • Other possibilities: sorted files, clustered indexes, unclustered index + heap file • Many tradeoffs between them

B: The number of data pages R: Number of records per page F: Fanout of B-Tree S: Time required for equality search * Don’t Use Index I/O Cost of Operations

Indexes • Can be used to store data records (alt 1), or be an auxillary data structure that referrs to existing file of records (alt 2, 3) • Many types of index (B-Tree, Hash Table, R-Tree, etc.) • How do you choose the right index? • Difference between clustered and unclustered indexes? Index entries UNCLUSTERED CLUSTERED direct search for data entries Data entries Data entries (Index File) (Data file) Data Records Data Records

Review: Query Processing • Queries start out as SQL • Database translates SQL to one or more Relational Algebra plans • Plan is a tree of operations, with access path for each • Access path is how each operator gets tuples • If working directly on table, can use scan, index • Some operators, like sort-merge join, or group-by, need tuples sorted • Often, operators pipelined, getting tuples that are output from earlier operators in the tree • Database estimates cost for various plans, chooses least expensive

Cost of Operations • Selections • Projections • Sorting, a.k.a. Order By • Removing duplicates, a.k.a. Select Distinct • Joins

Selections: “age < 20”, “fname = Bob”, etc • No index • Do sequential scan over all tuples • Cost: N I/Os • Sorted data • Do binary search • Cost: log2(N) I/Os • Clustered B-Tree • Cost: 2 or 3 to find first record + 1 I/O for each #qualifying pages • Unclustered B-Tree • Cost: 2 or 3 to find first RID + ~1 I/O for each qualifying tuple • Clustered Hash Index • Cost: ~1.2 I/Os to find bucket, all tuples inside • Unclustered Hash Index • Cost: ~1.2 I/Os to find bucket, + ~1 I/O for each matching tuple

Projection • Expensive when eliminating duplicates • Can do this via: • Sorting: cost no more than external sort • Cheaper if you project columns in initial pass, since more projected tuples fit in each page. • Hashing: build a hash table, duplicates will end up in the same bucket

Sorting • External Merge Sort • Minimum amount of memory: 3 pages • Initial runs of 3 pages • Then 2-way merge of sorted runs (2 pages for inputs, one for outputs) • #of passes: 1 + log2(N/3) • With more memory, fewer passes • With B pages, #of passes: 1 + log(B-1)(N/B) • I/O Cost = 2N * (# of passes) • Using B+ Trees for Sorting • Idea: • Retrieve records in order by traversing leaf pages. • Is this a good idea? Cases to consider: • B+ tree is clusteredGood idea! • B+ tree is not clusteredCould be a very bad idea! • I/O Cost • Clustered tree: ~ 1.5N • Unclustered tree: 1 I/O per tuple, worst case!

Remove duplicates with Hashing • Idea: • Many ops don’t need the data ordered • e.g.: removing duplicates in DISTINCT • e.g.: finding matches in JOIN • Good enough to match all tuples with equal values • Hashing does this! • And may be cheaper than sorting! (Hmmm…!) • But how to do it for data sets bigger than memory?? • If we can hash in two passes -> cost is 4N • How big of a table can we hash in two passes? • B-1 “partitions” result from Phase 0 • Each should be no more than B pages in size • 2 passes possible if table smaller than B(B-1) i.e.: can hash a table of size N pages in about √N space • Note: assumes hash function distributes records evenly! • Have a bigger table? Recursive partitioning!

Sorting vs Hashing • Based on our simple analysis: • Same memory requirement for 2 passes • Same IO cost • Digging deeper … • Sorting pros: • Great if input already sorted (or almost sorted) • Great if need output to be sorted anyway • Not sensitive to “data skew” or “bad” hash functions • Hashing pros: • Highly parallelizable • Can exploit extra memory to reduce # IOs

Nested Loops Joins • R, with M pages, joins S, with N Pages • Nested Loops • Simple nested loops • Insanely inefficient M + PR*M*n • Paged nested loops – only 3 pages of memory • M + M*N • Blocked nested loops – B pages of memory • M + M/(B-2) * N • If M fits in memory (B-2), cost only M + N • Index nested loops • M + PR*M* index cost • Only good in M verysmall

Sort-Merge Join • Simple case: • sort both tables on join column • Merge • Cost: external sort cost + merge cost • 2M*(1 + log(B-1)(M/B)) + 2N*(1 + log(B-1)(N/B)) + M + N • Optimized Case: • If we have enough memory, do final merge and join in same pass. This avoids final write pass from sort, and read pass from merge • Can we merge on 2nd pass? Only in #runs from 1st pass < B • #runs for R is M/B. #runs for S is N/B. • Total #runs ~~ (M+N)/B • Can merge on 2nd pass if M+N/B < B, or M+N < B2 • Cost: 3(M+N)

Cost of Hash Join • Partitioning phase: read+write both relations  2(|R|+|S|) I/Os • Matching phase: read+write both relations |R|+|S| I/Os • Total cost of 2-pass hash join = 3(|R|+|S|) Q: what is cost of 2-pass merge-sort join? Q: how much memory needed for 2-pass sort join? Q: how much memory needed for 2-pass hash join?

Summary: Hashing vs. Sorting • Sorting pros: • Good if input already sorted, or need output sorted • Not sensitive to data skew or bad hash functions • Hashing pros: • Often cheaper due to hybrid hashing • For join: # passes depends on size of smaller relation • Highly parallelizable

Review: Database Design • Requirements Analysis • user needs; what must database do? • Conceptual Design • high level descr (often done w/ER model) • Logical Design • translate ER into DBMS data model • Schema Refinement • consistency, normalization • Physical Design - indexes, disk layout • Security Design - who accesses what

name cost pname age ssn age Policy Dependents Employees Review: the ER Model • Entities and Entity Set (boxes) • Relationships and Relationship sets (diamonds) • binary • n-ary • Key constraints (1-1,1-M, M-N, arrows on 1 side) • Participation constraints (bold for Total) • Weak entities - require strong entity for key

ISA (`is a’) Hierarchies name ssn lot • attributes inherited. • If we declare A ISA B, every A entity is also considered to be a B entity. Employees • Overlap constraints: Can Simon be an Hourly_Emps as well as a Contract_Emps entity? (Allowed/disallowed) • Covering constraints: Does every Employees entity also have to be an Hourly_Emps or a Contract_Emps entity? (Yes/no) • Conversions between Relational schema <-> ER Diagram hours_worked hourly_wages ISA contractid Contract_Emps Hourly_Emps

Review: Functional Dependencies • Properties of the real world • Decide when to decompose relations • Help us find keys • Help us evaluate Design Tradeoffs • Want to reduce redundancy, avoid anomalies • Want reasonable efficiency • Must avoid lossy decompositions • F+: closure, all dependencies that can be inferred from a set F • A+: attribute closure, all attributes functionally determined by the set of attributes A • G: minimal cover, smallest set of FDs such that G+ == F+

Problems Due to R  W • Update anomaly: Can we modify W in only the 1st tuple of SNLRWH? • Insertion anomaly: What if we want to insert an employee and don’t know the hourly wage for his or her rating? (or we get it wrong?) • Deletion anomaly: If we delete all employees with rating 5, we lose the information about the wage for rating 5! Hourly_Emps

Review: Normal Forms • A property of a single relation • Tells us something about redundancy in reln • Reln R with FDs F is in BCNF if, for all X  A in F+ A  X (called a trivial FD), or X is a superkey for R. • Reln R with FDs F is in 3NF if, for all X  A in F+ A  X (called a trivial FD), or X is a superkey of R, or A is part of some candidate key (not superkey!) for R. (sometimes stated as “A is prime”)

Review: Decomposition • If reln violates normal form, decompose • but must have lossless decomposition • Lossless decomposition: • decomposition of R into X and Y is lossless if and only if X  Y is a key for either X or Y • If W  Z holds over R and (W  Z) is empty, then decomposition of R into R-Z and WZ is loss-less. • Algorithm: • For each FD W  Z in R that violates normal form, decompose R into R-Z and WZ. Repeat as needed. • Order not important, but can produce very different results

Review: Dependency Preservation • decompose too much, and it might be necessary to join tables to check FDs • decomposition of R into X and Y is dependencypreserving if (FX FY ) + = F + • FX is all FDs involving only attributes in X • FY is all FDs involving only attributes in Y • Not always obvious • ABC, A  B, B  C, C  A, decomposed into AB and BC. • Is this dependency preserving? Is C  A preserved? • note: F + contains F  {A  C, B  A, C  B}, so… • FAB contains A B and B  A; FBC contains B  C and C  B • So, (FAB FBC)+ contains C  A

Minimal Cover for a Set of FDs • G: minimal cover, smallest set of FDs such that G+ == F+ • Closure of F = closure of G. • Right hand side of each FD in G is a single attribute. • If we modify G by deleting an FD or by deleting attributes from an FD in G, the closure changes. • Every FD in G is needed, and ``as small as possible’’ in order to get the same closure as F. • e.g., F+ = {A  B, B  C, C  A, B  A, C  B, A  C} • several minimal covers: {A  B, B  A, C  B, B  C} (AB + BC) • or {A  C, C  A, B  C, C  B} (AC + BC) • or {A  B, B  A, C  A, A  C} (AB + AC) • e.g., A  B, ABCD  E, EF  GH, ACDF  EG minimal cover: • A  B, ACD  E, EF  G and EF  H

BCNF and Dependency Preservation • In general, there may not be a dependency preserving decomposition into BCNF. • But, you can always find dependency-preserving decomposition into 3NF • Top down: • Decompose until it is in 3NF • Compute minimal cover for FDs • If minimal cover contains a FD X  Y is not preserved, add reln XY • Bottom up: • Compute minimal cover • For each FD X  Y in minimal cover,create reln XY • Why does this work? Minimal cover doesn’t include redundant transitive dependencies, which don’t need to be preserved

Questions? • Up to midterm 1 • Relational Model & Query Languages • Relational Algebra and Calculus • SQL • Database Implementation • Disks, buffers, files • Indexes: B-Trees, Hash Indexes • Between midterm 1 & midterm 2 • Query Execution • Relational Operators • Sorting • Joining • Query Optimization • Since midterm 2 • Database Design • The ER Model • Functional Dependencies & Normalization • Transactions, Concurrency Control, & Recovery

Thank you! • See you next Wednesday.

Final Exam Review