1 / 28

CS 345: Topics in Data Warehousing

CS 345: Topics in Data Warehousing. Tuesday, November 9, 2004. Review of Thursday’s Class. Index Selection Selecting Views and Indexes Together Storage Systems Mirroring, Striping, and Parity RAID Levels. Outline of Today’s Class. Eliminating Bottlenecks Multi-channel RAID

shea
Download Presentation

CS 345: Topics in Data Warehousing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS 345:Topics in Data Warehousing Tuesday, November 9, 2004

  2. Review of Thursday’s Class • Index Selection • Selecting Views and Indexes Together • Storage Systems • Mirroring, Striping, and Parity • RAID Levels

  3. Outline of Today’s Class • Eliminating Bottlenecks • Multi-channel RAID • Clusters of Processors • Shared-Memory (SMP) • Shared-Disk • Shared-Nothing • Horizontal Partitioning • Partitioning and Clustered Indexes • Multi-Dimensional Clustering • Clustered File Organization • Integer Mapped Tables

  4. Eliminating Storage Bottlenecks • Storage systems have various components • Improving performance of one component through parallelism can make another component the bottleneck • Initially, disk transfer rate is limiting factor • Solution: Add more disks • Eventually, I/O bus becomes saturated • Solution: Add new I/O buses (and controllers) • Multi-channel RAID Disk System Bus I/O Bus CPU DiskController Disk

  5. Multi-channel RAID Disk I/O Bus DiskController System Bus Disk CPU Disk I/O Bus DiskController Disk

  6. Eliminating Processing Bottlenecks • Three architectures to add processing power • Shared-memory clusters • Also called Symmetric Multi-Processing (SMP) • One computer, many processors • Each processor has its own cache • All processors share same memory, disks • Shared-disk clusters • Computers sharing the same storage system • Each processor has its own cache, memory • All processors share same disks • Shared-nothing clusters • Independent computers connected on a LAN • Each processor has its own cache, memory, disks

  7. Parallel Architectures Disk Disk SymmetricMultiprocessor(SMP) CPU CPU CPU CPU CPU CPU Disk Disk Shared Disk Cluster Shared Nothing Cluster CPU Disk CPU Disk Disk CPU Disk CPU Disk Disk Disk CPU CPU

  8. Pros and Cons • Shared-memory (SMP) machines • Advantages: • Easy to administer (one big server) • Avoids transmitting data over network • Disadvantages: • Single point of failure (not fault tolerant) • Limited scalability • Fixed set of options (2, 4, 8, 16, 32, or 64 processors) • Cost increases dramatically at higher end • Shared-nothing clusters • Advantages: • Most cost-efficient choice for large systems • Continuous scalability • Easy to make fault tolerant through redundancy • Disadvantages: • Communication costs can be high • Shared-disk clusters • Intermediate between SMPs and shared-nothing clusters • Not very popular

  9. Horizontal Partitioning • Divide rows of a table into partitions based on value of some attribute(s) • Rows from same partition are stored together • Rows from different partitions are separated • Two kinds of partitioning • Hash partitioning • Rows divided into partitions using a hash function • Range partitioning • Each partition holds a range of attribute values

  10. Horizontal Partitioning Example Partition 1 (Age 20-29) Partition 2 (Age 30-39) Partition 3 (Age 40-49)

  11. Partitioning and Shared-Nothing • Store each partition on a different node • Partitioned join • Join two tables that are partitioned on the join key • Requires local operations only • No data needs to be shipped around Fact TablePartition 1Dimension TablePartition 1 Fact TablePartition 2Dimension TablePartition 2 Fact TablePartition 3Dimension TablePartition 3

  12. Partitioning and Shared-Nothing • Joins with repartitioning • Necessary when one or both inputs are not partitioned on the join key • Repartitioned join • Step 1: Partition each input on join key • Step 2: Assign each partition to a node • Step 3: Send rows to the proper nodes • Step 4: Perform local join

  13. Partitioning and Data Warehouses • Shared-nothing data warehouse • Partition fact tables across cluster nodes • Replicate dimension tables across cluster nodes • Replicated dimension tables • Dimension tables are small • Storing multiple copies of them is cheap • No communication needed for parallel joins • Partitioned fact tables • Fact tables are big • Process queries in parallel for each partition • Divide the work among the nodes in the cluster • One big dimension • Sometimes one dimension table is quite big (e.g. customer) • Can partition the big dimension table • Partition fact table on key of big dimension

  14. Partitioning and Shared-Disk / SMP • Shared-nothing cluster • Each disk can be accessed by only one node • Typically: Each partition is stored on a different node • Avoid unnecessary network communication • Shared-disk / Shared-memory cluster • Any node can access any disk • Typically: All partitions are striped across all disks • Achieve maximum I/O parallelism CPU CPU CPU CPU CPU CPU

  15. Partition Pruning • Often some partitions can be excluded entirely based on query filters • Table is partitioned on Age (20-29, 30-39, 40-49) • Query says “WHERE Age < 35” • No need to scan Age 40-49 partition • This is called partition pruning • Effect of partition pruning on I/O parallelism • If each disk stores a single partition: • Partition pruning reduces I/O parallelism • Disks for pruned partitions are idle • If all partitions are striped across all disks: • Maximum I/O parallelism regardless of pruning

  16. Manageability and Partitioning • Indexing partitioned tables • Local index • Indexes the data from a single partition • Stored with that partition • Global index • Indexes the data from all partitions • Could be stored anywhere, or replicated • Reducing load time via partitioning • Often fact tables are partitioned on Date • Also indexes, aggregate tables, etc. • Newly loaded records go into the last partition • Only indexes & aggregates for that partition need to be updated • All other partitions remain unchanged • Expiring old data • Often older data is less useful / relevant for data analysts • To reduce data warehouse size, old data is often deleted • If data is partitioned on date, simply delete the oldest partition

  17. Partitioning and Clustered Indexes • Recall: only one clustered index per table • Rows physically ordered based on clustered index search key • Queries filtering on that search key have very good I/O performance • Sequential access • Relevant records packed together into few pages • Queries filtering on other attributes have poor I/O performance • Unless only a few rows need to be retrieved • Idea: Combine partitioning and clustering • Partition on attribute A • Each partition is clustered on attribute B • Reasonably good I/O performance for queries that filter on either A or B

  18. Partitioning & Clustered Indexes Partitioned on A,Clustered on B Clustered on A • Clustered index on A • “A=10” tuples are contiguous on disk • “B=5” tuples are spread out • Filtering on A = Excellent • Filtering on B = Poor • Partitioned on A with clustered index on B • “A=10” tuples are confined to one partition • “B=5” tuples are in a few contiguous blocks (one per partition) • Filtering on A = Very Good • Filtering on B = Very Good A=10 B=5

  19. Multi-Dimensional Clustering • Goal: structured I/O access for multiple attributes • Similar to the last idea of combining partitioning and clustering • Divide data into “chunks” by grouping on multiple dimensions • For example, partition on Age, State, Gender • One chunk is Age=24, State=CA, Gender=F • Records from each chunk are stored together on disk in one or more fixed-size blocks • No block can contain data from more than one chunk • Some wasted space due to partially-filled blocks • Construct “block indexes” with one index entry per block • Instead of one index entry per RID • Block index can be built on single or multiple attributes

  20. Multi-Dimensional Clustering • 4 different Age values • 4 different State values • 16 chunks in all Some space is wasted due to partially filled blocks

  21. MDC Pros and Cons • Advantages of Multi-Dimensional Clustering • Block index is smaller than RID index • One row per block not one row per record • Records with same value are packed together in a relatively small number of blocks • Decent I/O performance • Easy to prefetch • Some of the benefits of clustered indexes, but for multiple dimensions • Disadvantages of Multi-Dimensional Clustering • Sparsity leads to space blow-up • If each block holds only a few records, lots of wasted space • MDC is similar to MOLAP storage format • Similar pros and cons

  22. Avoiding Sparsity in MDC • Trade-off in choosing block size • Big block size → Fewer blocks are needed → Block indexes are smaller • Small block size → Less wasted space due to partially-filled blocks • Block size should be at least as large as page size • Choose block size such that most chunks fill at least one block • Avoiding sparsity • Chunks are less than one page → MDC works poorly • Don’t choose too many clustering attributes • Avoid clustering attributes with high cardinality

  23. Clustered File Organization • Clustered file organization • Another physical design option offered by some DBMSs • Used primarily for “master-detail” data • Not commonly used with star schemas • How it works • R and S are two relations sorted on a common join key • Tuples for R and S are interleaved on the same pages on disk • Each R tuple is followed by all joining S tuples • Conceptually, similar to materialized view • Computing the join of R and S is very fast • Scanning R or S alone is slower because irrelevant data (tuples form the other relation) is mixed in with relevant data

  24. Clustered File Organization Standard File Organization Relation R Relation S Clustered File Organization

  25. Integer Mapped Tables • Dimension tables are often wide • Large number of attributes • Attributes often consist of descriptive text • When dimensions also have many rows, query performance can be affected • Joining to big dimension tables can be expensive • How to reduce the width of the dimension table? • Break into several mini-dimensions • Move some attributes to an outrigger table • “Integer mapping” (not standard terminology)

  26. Integer Mapped Tables • Construct auxiliary mapping table for each descriptive text column • Mapping table has schema (int_value, real_value) • Assign unique integer value to each text value • Replace long text columns with short integer values in dim. table • Integers serve as “pointers” to actual data values in mapping table • Use a view to provide user-friendly dimensions • Join physical dimension table to mapping table • Replace integer values with actual descriptive values • Result looks like ordinary dimension with no integer mapping

  27. Integer Mapped Tables Integer mapped dimension table Ordinary dimension table Mapping tables Long values in mapping tables Non-mappeddimensioncolumns Non-mappeddimensioncolumns Long columnsto be mapped Long valuesreplaced byintegers

  28. Pros and Cons of Integer Mapping • Pros and Cons • Thinner dimension table → Better query performance • Requires extra join to mapping table → Worse query performance • Extra complexity (but can be hidden behind view) • Whether query performance is improved depends on the data characteristics • Some queries may be faster while others are slower • Mostly useful for: • Very large dimension tables • Rarely-queried columns • Also useful for especially large data types • E.g. Image, embedded XML file

More Related