
CS 345: Topics in Data Warehousing


Presentation Transcript


  1. CS 345: Topics in Data Warehousing (Tuesday, October 19, 2004)

  2. Review of Thursday’s Class • Bridge tables • Hierarchies • Multi-Valued Dimensions • Extraction-Transformation-Load • Data staging area vs. data warehouse • Assigning surrogate keys • Detecting changed rows • Detecting duplicate dimension rows

  3. Outline of Today’s Class • Database System Architecture • Memory management • Secondary storage (disk) • Query planning process • Joins • Nested Loop Join • Merge Join • Hash Join • Grouping • Sort vs. Hash

  4. Memory Management • Modern operating systems use virtual memory • Address space larger than physical main memory • Layer of indirection between virtual addresses and physical addresses • Physical memory serves as a cache for most recently accessed memory locations • Infrequently accessed memory locations written to disk when cache fills up • Memory transferred in large blocks called pages • Disk access: latency is large relative to throughput • Benefit from locality of reference • Transfer overheads amortized over many requests

  5. Memory Management • Database management systems (DBMSs) use a similar model to virtual memory • Data is stored persistently on disk • Main memory serves as cache for frequently accessed data • This memory cache is called the buffer pool • Data organized into blocks called pages • DBMSs override built-in virtual memory • Database operations are very data intensive • Careful management of disk access essential for good performance • DBMS has specialized knowledge of its data access patterns • Specialized memory management includes: • Prefetching pages likely to be needed in the future • Essential / frequently accessed pages pinned in memory • Multiple buffer pools with different parameters for different needs • Parameters: Page size, page replacement policy • Needs: Relations, indexes, temporary results, sort space

  6. Disk Access Time • Seek cost + rotational latency + transfer cost • Seek cost • Move the disk head to the right place • Physical machinery • Rotational latency • Wait for the disk to spin • Average wait = ½ rotation • Transfer cost • Read the data while the disk spins • (Diagram: disk drive, with the seek cost and rotational latency labeled)
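
For a rough sense of scale, the three components can be added up for a single page read. The drive figures below (seek time, spindle speed, transfer rate, page size) are assumptions chosen for illustration, not numbers from the lecture:

```python
# Single-page access time for an assumed circa-2004 commodity disk.
avg_seek_ms = 9.0            # average seek time (assumed)
rpm = 7200                   # spindle speed (assumed)
transfer_mb_per_s = 50.0     # sustained transfer rate (assumed)
page_kb = 8                  # page size (assumed)

rotational_latency_ms = 0.5 * (60_000 / rpm)               # wait half a rotation on average
transfer_ms = page_kb / 1024 / transfer_mb_per_s * 1000    # time to read one page

total_ms = avg_seek_ms + rotational_latency_ms + transfer_ms
print(f"seek {avg_seek_ms:.1f} ms + rotation {rotational_latency_ms:.2f} ms "
      f"+ transfer {transfer_ms:.2f} ms = {total_ms:.2f} ms")
# -> seek 9.0 ms + rotation 4.17 ms + transfer 0.16 ms = 13.32 ms
```

Almost all of the time is overhead (seek plus rotation) rather than transfer, which motivates the next slide.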

  7. Random vs. Sequential I/O • Seek cost and rotational latency are overhead • Independent of the transfer size • ~ 20 ms on commodity disks • One large transfer is much cheaper than many small transfers • Random vs. Sequential I/O • I/O = Input/Output • Random I/O: Read/write an arbitrary page • Sequential I/O: Read/write the next page • Sequential I/O is 10-20x faster than random I/O • Avoiding random I/O will be a theme • Cost of sequential I/O is less • Prefetching easy when access patterns predictable
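
Continuing the toy model above, compare fetching 10,000 pages with one random I/O each versus streaming them in a single sequential pass (all figures are again assumed):

```python
pages = 10_000
page_kb = 8
overhead_ms = 13.0                            # assumed seek + rotational latency per random I/O
transfer_ms = page_kb / 1024 / 50.0 * 1000    # per-page transfer at an assumed 50 MB/s

random_ms = pages * (overhead_ms + transfer_ms)      # pay the overhead for every page
sequential_ms = overhead_ms + pages * transfer_ms    # pay it once, then stream

print(f"random: {random_ms / 1000:.1f} s   sequential: {sequential_ms / 1000:.2f} s")
# -> random: 131.6 s   sequential: 1.58 s
```

This naive model overstates the gap relative to the 10-20x rule of thumb above (real systems reorder requests, and sequential scans get interrupted), but the direction of the effect is the same.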

  8. Query Planning Process • Many different query plans are possible • Logically equivalent ways of answering the same query • Job of the query optimizer: choose the most efficient plan for a given query • Most efficient = minimize resource consumption • Resources = CPU, disk I/O • Consuming fewer resources → faster query responses • Design of query optimizer • Rule-based • Apply heuristic transformations (e.g. push selections down) • Cost-based • Estimate cost (resource usage) of each possible query plan • Choose the plan with lowest estimated cost • Cost estimation uses statistics collected by the optimizer • In practice, optimizers use a combination of the two • Search space of possible plans is too large to exhaustively explore • Heuristics guide which types of plans are considered for cost-based optimization

  9. Query Planning Pipeline (diagram): SQL query → parse → parse tree → query rewriting → best logical query plan → physical plan generation → best physical query plan → execute → result (statistics feed into plan selection)

  10. Join Processing • OLAP queries involve joins • Fact table joins to dimension tables • Primary-key / Foreign-key joins • Choice of join algorithm has large impact on query cost • Naïve join processing algorithm: to join relations A and B: • Generate “Cartesian product” of A and B • Each A tuple is combined with each B tuple • Apply selection conditions to each tuple in the Cartesian product • Problem: Intermediate result can be huge! • Relation A has 10 million rows • Relation B has 100,000 rows • A x B has 1 trillion rows (many terabytes of data) • Need more efficient, logically equivalent procedure

  11. Nested Loop Join (NLJ) • NLJ (conceptually), joining R1 and R2 on column C:
      for each r ∈ R1 do
        for each s ∈ R2 do
          if r.C = s.C then output the (r, s) pair
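
The same conceptual loop as a minimal Python sketch, with tuples represented as dicts and the join column passed in (the names here are illustrative, not from the slides):

```python
def nested_loop_join(r1, r2, col):
    """Naive nested-loop equi-join: compare every R1 tuple with every R2 tuple."""
    for r in r1:                      # outer loop
        for s in r2:                  # inner loop
            if r[col] == s[col]:
                yield r, s            # output the matching (r, s) pair

r1 = [{"C": 1, "a": "x"}, {"C": 2, "a": "y"}]
r2 = [{"C": 2, "b": "p"}, {"C": 3, "b": "q"}]
print(list(nested_loop_join(r1, r2, "C")))    # one match, on C = 2
```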

  12. Nested Loop Join • In theory, multi-way nested loop joins are possible • Joining n relations → n nested “for” loops • In practice, only binary (2-way) join operators are implemented • Joining n relations → cascade of n-1 binary joins • Nested-loop join terminology: • Outer input: scanned in outer loop of join • Inner input: scanned in inner loop of join • (Diagram: a cascade of two binary joins, A joined with B, then the result joined with C)

  13. Analysis of Nested Loop Join • Outer has m tuples, inner has n tuples • CPU complexity = O(m*n) • For each of m outer tuples, compare with each of n inner tuples • Quadratic running time • What about I/O complexity? • Depends on amount of memory available • We’ll assume B tuples per page • Consider four scenarios • Use 2 pages of memory • Use (m/B + n/B) pages of memory • Use (n/B + 1) pages of memory • Use (k + 1) pages of memory • For k < m/B

  14. Nested Loop Join (2 pages) • One page for outer, one page for inner • NLJ algorithm: • Load 1st outer block into memory • Load all n/B blocks of inner, one after another • Load 2nd outer block into memory • Load all n/B blocks of inner, one after another • … • Load (m/B)th outer block into memory • Load all n/B blocks of inner, one after another • Total I/O cost = m/B * (n/B + 1)

  15. Nested Loop Join (m/B+n/B pages) • Load both inputs entirely into memory • All computation takes place in memory • No further paging necessary • Total I/O cost = m/B + n/B • Requires a lot of memory • Practical if both inputs are small

  16. Nested Loop Join (n/B + 1 pages) • Allocate 1 page for outer • Load entire inner into memory (n/B pages) • NLJ algorithm: • Load entire inner into memory • Load 1st outer block into memory • Inner is already in memory → no additional I/O needed • Load 2nd outer block into memory • Inner is already in memory → no additional I/O needed • … • Load (m/B)th outer block into memory • Inner is already in memory → no additional I/O needed • Total I/O cost = m/B + n/B • Same I/O cost as previous slide • Practical if inner input is small

  17. Nested Loop Join (k+1 pages) • Allocate 1 page for inner • Load k pages of outer into memory at a time • NLJ algorithm: • Load 1st group of k outer blocks into memory • Load all n/B blocks of inner, one after another • Load 2nd group of k outer blocks into memory • Load all n/B blocks of inner, one after another • … • Load (m/kB)th group of k outer blocks into memory • Load all n/B blocks of inner, one after another • Total I/O cost = m/kB * (n/B + k) • Reduction from 2-page case by a factor of k
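
A sketch of this block-at-a-time variant in Python. Memory pages are simulated by slicing the outer input into groups of k*B tuples, and the inner input is passed as a callable so it can be "re-read from disk" once per group (all names and sizes are illustrative):

```python
from itertools import islice

def block_nested_loop_join(outer, make_inner_scan, col, k, B):
    """Keep k pages (k*B tuples) of the outer input in memory; scan the inner once per group."""
    outer_iter = iter(outer)
    while True:
        group = list(islice(outer_iter, k * B))   # load the next group of k outer blocks
        if not group:
            break
        for s in make_inner_scan():               # stream all n/B inner blocks
            for r in group:
                if r[col] == s[col]:
                    yield r, s

outer = [{"C": i % 5} for i in range(20)]
inner = [{"C": 2}, {"C": 4}]
matches = block_nested_loop_join(outer, lambda: iter(inner), "C", k=2, B=4)
print(len(list(matches)))    # 8 matches: four outer tuples each for C = 2 and C = 4
```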

  18. Summary of NLJ Analysis • CPU usage is O(m*n) • Quadratic in size of inputs • I/O usage depends on memory allocated • If one input is small: • Make the small input the inner • I/O complexity is O(m/B + n/B) • Linear in size of inputs • The minimum possible • Otherwise: • Make the small input the outer • Slightly better than the other way around • I/O complexity is O(m/kB * (n/B + k)) • When k+1 pages of memory are available • Approximately mn/(kB²)

  19. Improving on Nested Loops Join • For each outer tuple, we scan the entire inner relation to find matching tuples • Maybe we can do better by using data structures • Two main ideas: • Sorting • Hashing

  20. Sorting • Using sorting to improve NLJ • First we sort the inner relation • Then, for each outer tuple, find matching inner tuple(s) using binary search over sorted inner • CPU usage improves • O(mn) → O(m log n) + sort cost • Sub-quadratic • This strategy is called indexed nested-loop join • To be covered in more detail on Thursday • Implementation details • Use B-Tree instead of binary tree as search structure • Larger branching factor → better I/O efficiency
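
A sketch of the idea in Python, with a sorted list plus binary search (the bisect module) standing in for the B-Tree index:

```python
import bisect

def indexed_nlj(outer, inner, col):
    """Sort the inner input once, then binary-search it for each outer tuple."""
    inner_sorted = sorted(inner, key=lambda t: t[col])   # the "index build" / sort cost
    keys = [t[col] for t in inner_sorted]
    for r in outer:
        i = bisect.bisect_left(keys, r[col])             # O(log n) probe per outer tuple
        while i < len(keys) and keys[i] == r[col]:
            yield r, inner_sorted[i]                     # emit all matching inner tuples
            i += 1
```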

  21. Merge Join • What if we sort both inputs instead of one? • No need to perform binary search for each outer tuple • Since outer tuples are also in sorted order, access to inner relation is structured • Binary search accesses pages in arbitrary order → random I/O • Merge join uses sequential I/O • CPU usage improves • O(mn) → O(m + n) + sort costs • (Diagram: indexed NLJ performs a random I/O into the sorted inner for each outer tuple; merge join scans both sorted inputs sequentially in lockstep)
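
A sketch of merge join in Python over two inputs that are already sorted on the join column; the inner while-loops handle duplicate keys on either side:

```python
def merge_join(left, right, col):
    """Both inputs must already be sorted on `col`; runs in O(m + n) plus output size."""
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][col], right[j][col]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            j_end = j                                    # find the run of equal keys on the right
            while j_end < len(right) and right[j_end][col] == lk:
                j_end += 1
            while i < len(left) and left[i][col] == lk:  # pair it with the run on the left
                for jj in range(j, j_end):
                    yield left[i], right[jj]
                i += 1
            j = j_end

left = [{"k": 1}, {"k": 3}, {"k": 3}]
right = [{"k": 3}, {"k": 4}]
print(len(list(merge_join(left, right, "k"))))    # 2 pairs, both with k = 3
```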

  22. External-Memory Sorting • How to efficiently sort large data sets? • Consider I/O complexity • Sort a relation with n tuples, B tuples per block • Case 1: Use n/B pages of memory • Case 2: Use k+1 pages of memory • Case 1: Use n/B pages of memory • Read the relation into memory • n/B sequential disk reads • Sort in main memory using quicksort • No disk access required • Write the sorted output back to disk • n/B sequential disk writes • Total: 2n/B disk I/Os (all sequential) • Requires lots of memory → only useful for small inputs

  23. External-Memory Sorting • Case 2: Use k pages of memory • k ≥ sqrt(n/B) • Two phases (run generation + merge) • Run generation • Repeat for each group of k blocks: • Read group of k blocks into memory • Sort using quicksort • Write sorted “run” back out to disk • Merge • Read 1st page from each run into memory • Merge to produce globally sorted output • Write output pages to disk as they are filled • As a page is exhausted, replace by next page in run • Each block read + written once per phase • Total disk I/Os: 4n/B • (Diagram: sorted runs on disk are merged through memory into the final output)
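
A compact in-memory simulation of the two phases in Python, where the sorted runs that would be written to disk are just lists and heapq.merge plays the role of the multi-way merge (sizes and names are illustrative):

```python
import heapq

def external_sort(values, run_size):
    """Phase 1: sort memory-sized runs; Phase 2: k-way merge of the sorted runs."""
    runs = []
    for start in range(0, len(values), run_size):
        run = sorted(values[start:start + run_size])   # quicksort one memory-full
        runs.append(run)                               # "write the run back to disk"
    return list(heapq.merge(*runs))                    # read the runs back, page by page, and merge

print(external_sort([5, 2, 7, 3, 5, 1, 4], run_size=3))   # [1, 2, 3, 4, 5, 5, 7]
```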

  24. Hashing • Store inner input in a hash table • Tuples grouped together into buckets using a hash function • Apply hash function to join attributes (called the key) to determine bucket • Idea: all records with the same key must be in the same bucket • For each outer tuple • Hash key to determine which bucket it goes in • Scan that bucket for matching inner tuples • CPU usage improves • O(mn) → O(m + n) • Assuming a good hash function with few collisions • Collision = tuples with different keys hash to same bucket • Only useful for equality joins • Not for join conditions like WHERE A.foo > B.bar • 99% of all joins are equality joins

  25. Hash Join Analysis • Hash join terminology • Inner input is called build input • Outer input is called probe input • Build a hash table based on one input, probe into that hash table using the other • Probe has m tuples, build has n tuples • Using (n/B + 1) pages of memory: • Allocate 1 page for probe input • Read build input into memory + create hash table • Uses n/B pages • Read 1st page of probe input • Hash each tuple and search bucket for matching build tuples • Requires no additional I/O • Repeat for 2nd and subsequent pages of probe input • I/O complexity = O(n/B + m/B) • Suitable when one input is small
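
A sketch of the in-memory build/probe pattern in Python, with a dict of lists standing in for the hash table (the relation and column names are made up for illustration):

```python
from collections import defaultdict

def hash_join(probe, build, col):
    """Build a hash table on the (smaller) build input, then stream the probe input past it."""
    table = defaultdict(list)
    for s in build:                            # build phase: O(n)
        table[s[col]].append(s)
    for r in probe:                            # probe phase: O(m) plus output size
        for s in table.get(r[col], []):
            yield r, s

fact = [{"C": 2, "amt": 10}, {"C": 3, "amt": 5}, {"C": 2, "amt": 7}]
dim = [{"C": 2, "name": "x"}, {"C": 4, "name": "y"}]
print(len(list(hash_join(fact, dim, "C"))))    # 2 matches, both on C = 2
```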

  26. Partitioned Hash Join • Two-phase algorithm (Partition + Join) • Uses k blocks of memory • Phase 1: Partition • Performed separately for build and probe inputs • Read 1st page into memory • Assign each tuple to a partition 1..k using hash function • Each partition is allocated 1 block of memory • Tuples assigned to same partition are placed together • When the block fills up, write it to disk • Each partition stored in a contiguous region on disk • Repeat for 2nd and subsequent pages • Phase 2: Join phase • Join Partition 1 of probe input with Partition 1 of build input using ordinary hash join • Use a secondary hash function • Repeat with Partition 2 of each input, etc. • A form of recursion • Partitioning creates smaller problem instances
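
A sketch of the two phases in Python. The partitions that would be spilled to disk are simulated with in-memory lists, Python's built-in hash() stands in for the partitioning hash function, and the dict used inside each per-partition join plays the role of the secondary hash function:

```python
from collections import defaultdict

def partitioned_hash_join(probe, build, col, k):
    """Phase 1: hash-partition both inputs into k partitions on the join column.
    Phase 2: run an ordinary in-memory hash join on each matching pair of partitions."""
    def partition(tuples):
        parts = [[] for _ in range(k)]
        for t in tuples:
            parts[hash(t[col]) % k].append(t)     # would be written out per partition
        return parts

    for b_part, p_part in zip(partition(build), partition(probe)):
        table = defaultdict(list)                  # per-partition hash table on the build side
        for s in b_part:
            table[s[col]].append(s)
        for r in p_part:                           # probe within the same partition only
            for s in table.get(r[col], []):
                yield r, s
```

Because equal keys always land in the same partition number for both inputs, each probe partition only ever needs to be matched against the build partition with the same number.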

  27. Partition Phase • (Diagram: unsorted input is read from disk into memory and hashed out into partitions on disk; similar to the merge phase of merge join, only backwards)

  28. Summary of Join Methods • When 1 relation fits entirely in memory, all join methods have linear I/O complexity • When both relations are large: • NLJ has quadratic I/O complexity • Merge join and hash join have linear I/O complexity • CPU complexity • Quadratic for NLJ • N log N for merge join • Linear for hash join • Rules of Thumb: • Use nested loop join for very small inputs • Use merge join for large inputs that are already sorted • Use hash join for large, unsorted inputs

  29. Query Plans with Multiple Joins • Two kinds of execution: • Pipelined • No need to store intermediate results on disk • Available memory must be shared between joins • Materialized • No memory contention between joins • Spend I/Os writing intermediate results to disk • Blocking operators • Sort, partition, build hash table • Pipelining is not possible across a blocking operator • Interesting orders • Sorted input → sorted output • NLJ: output is sorted by same key as outer input • Merge join: output is sorted by same key as inputs • Partitioned input → partitioned output • Hash join: output is partitioned by same key as inputs • Take advantage of interesting orders • Eliminate sorting / partitioning phases of later joins • (Diagram: a plan joining A, B, and C as a cascade of two binary joins)

  30. Grouping • First partition tuples on grouping attributes • Tuples in same group placed together • Tuples in different groups separated • Then scan tuples in each partition, compute aggregate expressions • Two techniques for partitioning • Sorting • Sort by the grouping attributes • All tuples with same grouping attributes will appear together in sorted list • Hashing • Hash by the grouping attributes • All tuples with same grouping attributes will hash to same bucket • Sort / re-hash within each bucket to resolve collisions
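
A sketch of the hash-based variant in Python, computing SUM for each group; the sort-based variant would instead sort on the grouping columns and detect group boundaries during a single scan (column names below are illustrative):

```python
from collections import defaultdict

def hash_group_sum(rows, group_col, agg_col):
    """GROUP BY group_col computing SUM(agg_col) via a hash table of running totals."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_col]] += row[agg_col]   # same grouping value -> same bucket
    return dict(totals)

sales = [{"store": "A", "amt": 10}, {"store": "B", "amt": 5}, {"store": "A", "amt": 7}]
print(hash_group_sum(sales, "store", "amt"))     # {'A': 17, 'B': 5}
```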
