Review for Midterm 2

Review for Midterm 2 Shahram Ghandeharizadeh

Midterm 2 • Scheduled for April 30th • 4 papers • Variant indexes. • Access path selection. • Overview of query optimization. • Mining Association Rules. • Paper on cache management is not included because it was covered by your project. • Midterm 2 is worth 35% of your grade.

Variant Indexes • A read-mostly database that is updated infrequently. • Complex indexes to speedup queries. • Focuses on physical designs to enhance performance.

Key Observations: A handful of products, a PROD table with tens of rows. Many millions of rows for SALES tables. Example Data Warehouse SALES TIME Cid Pid Day Amt dollar_cost Unit_sales PROD Day Week Month Year Holliday Weekday Pid Name Size Weight Package_type

A B+-Tree on Major Holidays • A B+-tree index on different holidays of the SALES table. B+-tree Leaf page (Pres day, (1,2), (1, 3), (1, 4), (2,1), …. Value List Joe, Big Mac, Lab day, … Jane, Happy Meal, Pres day, … Shideh, Happy Meal, Pres day, … Mary, Fries, Pres day, … Kam, Happy Meal, Pres day, … Harry, Big Mac, Pres day, … Henry, Big Mac, Pres day, … Bob, Big Mac, Pres day, …

A B+-Tree on Major Holidays • A B+-tree index on different holidays of the SALES table. B+-tree Leaf page (Pres day, (1,2), (1, 3), (1, 4), (2,1), …. Value List RID List Joe, Big Mac, Lab day, … Jane, Happy Meal, Pres day, … Shideh, Happy Meal, Pres day, … Mary, Fries, Pres day, … Kam, Happy Meal, Pres day, … Harry, Big Mac, Pres day, … Henry, Big Mac, Pres day, … Bob, Big Mac, Pres day, …

Conjunctive Queries • Count number of Big Mac Sales on “President’s Day” assuming a B+-tree on product (pid) and day of SALES • With RID-Lists • Get the Value-List for “Big Mac” using the B+-tree, obtain RID-List1. • Get the Value-List for “President’s Day” using the B+-tree, obtain RID-List2. • Compute set-intersect of RID-List1 and RID-List2 • Count the number of RIDs in the intersection set. • Is there a better way? • Yes, use bit-maps and logical bit-wise operands.

Bitmap Indexes • Use a bitmap to represent the existence of a record with a certain attribute value. • Example: If a record has the indexed attribute value “Big Mac” then its corresponding entry in the bitmap is set to one. Otherwise, it is a zero.

A Bitmap • A Bitmap B is defined on T as a sequence of M bits. • For each row r with row number j that has the property P, we set bit j in B to one; all other bits are set to zero. • Assuming fix sized disk pages that hold p records, RID of record j is (j/p, j%p). Page is j/p, slot number is j%p. Pres Day, 0100001100111111110000011001….. Record 0

A B+-Tree on Major Holidays • A B+-tree index on different holidays of the SALES table. B+-tree Leaf page (Pres day, 01111111…. Joe, Big Mac, Lab day, … Jane, Happy Meal, Pres day, … Shideh, Happy Meal, Pres day, … Mary, Fries, Pres day, … Kam, Happy Meal, Pres day, … Harry, Big Mac, Pres day, … Henry, Big Mac, Pres day, … Bob, Big Mac, Pres day, …

Conjunctive Queries • Count number of Big Mac Sales on “President’s Day” assuming a B+-tree on product (pid) and day of SALES • With RID • Get the Value-List for “Big Mac” using the B+-tree, obtain RID-List1. • Get the Value-List for “President’s Day” using the B+-tree, obtain RID-List2. • Compute set-intersect of RID-List1 and RID-List2 • Count the number of RIDs in the intersection set. • With bit maps • Get the Value-List for “Big Mac” using the B+-tree, obtain bit-map1. • Get the Value-List for “President’s Day” using the B+-tree, obtain bit-map2. • Recall Existence Bitmap (EBM) identify rows that exist. • Let RES = logical AND of bit-map1, bit-map2, and EBM. • Count the number of bits set to one to identify how many Big Macs were sold on “President’s Day”.

Variant Indexes • Midterm 2 ignores: • MEDIAN, N-TILE, Column-Product as aggregates. • Section 5.

Access Path Selection • Formulates a cost prediction for each access plan, using the following cost formula: COST = Page fetches + W * (RSI Calls) • W is an adjustable weighting factor between I/O and CPU. • RSI calls is an approximation for CPU utilization. • Assumptions: • WHERE tree is considered to be in conjunctive normal form, • Every disjunct is called a boolean factor.

How? • Enumerating the different execution plans, • Estimate the cost of performing each plan, • Pick the cheapest plan. • Definition of cost is as follows: COST = Page fetches + W * (RSI Calls)

Clustered B+-Tree • A B+-tree on the gpa attribute 3.6 (2.3, (1, 1)) (3, (2,1)) (3.7, (3, 1)) (3.9, (4,1)) (2.5, (1,2)) (3.1, (2,2)) (3.8, (3,2)) (3.9, (4,2)) (2.8, (1,3)) (3.2, (2,3) (3.8, (3,3)) (4, (4,3)) (2.8, (1,4)) (3.8, (3,4)) (3.5, (2,4)) (4, (4,4)) Chad, 28, 2.3, LS Mary, 24, 3, ECE Bob, 21, 3.7, CS Chris, 22, 3.9, CS Kathy, 18, 3.8, LS Vera, 17, 3.9, EE James, 24, 3.1, ME Chang, 18, 2.5, CS Lam, 22, 2.8, ME Kane, 19, 3.8, ME Louis, 32, 4, LS Tom, 20, 3.2, EE Pat, 19, 2.8, EE Leila, 20, 3.5, LS Martha, 29, 3.8, CS Shideh, 16, 4, CS

Non-Clustered B+-Tree • A random I/O for every qualifying record 3.6 (2.3, (4, 2)) (3, (1,2)) (3.7, (1, 1)) (3.9, (4,1)) (2.5, (2,3)) (3.1, (3,3)) (3.8, (3,2)) (3.9, (2,4)) (2.8, (2,2)) (3.2, (1,3) (3.8, (2,1)) (4, (3,1)) (2.8, (3,4)) (3.8, (1,4)) (3.5, (4,3)) (4, (4,4)) Bob, 21, 3.7, CS Kane, 19, 3.8, ME Louis, 32, 4, LS Chris, 22, 3.9, CS Mary, 24, 3, ECE Lam, 22, 2.8, ME Martha, 29, 3.8, CS Chad, 28, 2.3, LS Tom, 20, 3.2, EE Chang, 18, 2.5, CS James, 24, 3.1, ME Leila, 20, 3.5, LS Kathy, 18, 3.8, LS Vera, 17, 3.9, EE Pat, 19, 2.8, EE Shideh, 16, 4, CS

Questions • How are relations and segments related? If slide #4, you state segments may contain more than one relation, and then the next bullet says “at most one relation per segment.”. What is going on?

Questions • How are relations and segments related? If slide #4, you state segments may contain more than one relation, and then the next bullet says “at most one relation per segment.”. What is going on? • Best clarified with an example: • Segment 1 may contain the Emp, Dept, and Revenues tables/relations. • The Emp relation can be assigned to Segment 1 only. It may NOT be assigned to both Segments 1 and 2.

NINDX(I), the number of pages in index I. • TCARD(T), the number of pages in the segment that hold tuples of relation T. Questions • The cost of retrieving a range of records from a clustered B+-tree: Should not this be (depth_of_B+-tree + F(pred) * TCARD) or something related to the depth of the tree rather than NINDX, since you only have to navigate through to the leaf nodes once and do a record scan once you’ve reached the correct leaf node?

NINDX(I), the number of pages in index I. • TCARD(T), the number of pages in the segment that hold tuples of relation T. Question • Should not the cost for a non-clustered B+-tree also involve the depth of the tree rather than NINDX?

NINDX(I), the number of pages in index I. • TCARD(T), the number of pages in the segment that hold tuples of relation T. Question • Should not the cost for a non-clustered B+-tree also involve the depth of the tree rather than NINDX? • It should include the depth of the tree. • NINDX must be included because the leaf pages of the B+-tree must be visited for the qualifying records.

NINDX(I), the number of pages in index I. • TCARD(T), the number of pages in the segment that hold tuples of relation T. Questions • Why is the cost of a merge-scan NINDX(R) + NINDX(S) rather than the sum of segment scans of R and S, since you still have to visit every page that contains tuples of R and S?

NINDX(I), the number of pages in index I. • TCARD(T), the number of pages in the segment that hold tuples of relation T. Questions • Why is the cost of a merge-scan NINDX(R) + NINDX(S) rather than the sum of segment scans of R and S, since you still have to visit every page that contains tuples of R and S? • The merge-scan employs the sorted order of the entries in the leaf pages of the B+-tree index structures.

Overview of Query Optimization • Extends discussion to: • Correlation queries, use of outer-join to “flatten” nested queries. • Raises more questions than providing answers. • A good starting point for: • A practitioner who wants to build an optimizer for a relational DBMS. • A Ph.D. student interested in writing a dissertation in the area of query optimization techniques.

Mining Association Rules • Objective: Discover association Rule over basket data. • Motivation: valuable for cross-marketing and attached mailing applications. • Example: 98% of customers who purchase tires and auto accessories also get automotive services done. • Key contributions: • Fast algorithms: • Apriori, AprioriTid, and AprioriHybrid • Pay attention to terminology, definitions, and the general framework.

Database Systems • This course has introduced you to: • A storage manager and its use, BDB. • Spatial indexing, R-Trees. • Parallel DBMS. • Alternative technologies for applications that do not require ACID transactions. • Google FS, MapReduce, etc. • Optimization techniques for relational DBMSs. • Knowledge Discovery. • Efficient query processing techniques. • RAID and Use of flash memory in enterprises: • Steve Kleiman’s lecture.

Database Systems • Many important topics remain: • Data mining, Data cubes, Data visualization techniques. • Papers are from 1990s: Neumerous follow-on papers!

Review for Midterm 2