C-Store: A Column-oriented DBMS. New England Database Group (Stonebraker et al., Brandeis/Brown/MIT/UMass-Boston). Extended for Big Data Reading Group presentation by Shimin Chen
Relational Database • [Diagram: a table as a grid with columns Attribute1–Attribute3 and rows Record 1–Record 3] • e.g. Customer(cid, name, address, discount), Product(pid, name, manufacturer, price, quantity), Order(oid, cid, pid, quantity)
Current DBMS -- “Row Store” • [Diagram: Records 1–4 stored contiguously, one row after another] • E.g. DB2, Oracle, Sybase, SQLServer, …
Row Stores are Write Optimized (use whiteboard) • Store fields in one record contiguously on disk • Use small (e.g. 4K) disk blocks • Use B-tree indexing • Align fields on byte or word boundaries • Assume shifting data values is costly • Transactions: write-ahead logging
Row Stores are Write Optimized • Can insert and delete a record in one physical write • Good for on-line transaction processing (OLTP) • But not for read-mostly applications • Data warehouses • Customer Relationship Management (CRM) • Electronic library card catalogs • …
At 100K Feet…. • Read-optimized: • Periodically a bulk load of new data • Long period of ad-hoc queries • Benefit: • Ad-hoc queries read 2 columns out of 20 • Column store reads 10% of what a row store reads • Previous pioneering work: • Sybase IQ (early ’90s) • Monet (see CIDR ’05 for the most recent description)
C-Store Technical Ideas • Data storage: • Only materialized views (perhaps many) • Compress the columns to save space • No alignment • Big disk blocks • Innovative redundancy • Optimize for grid (cluster) computing • Focus on sorting rather than indexing • Automatic physical DBMS design • Column-oriented optimizer and executor
How to Evaluate This Paper…. • None of the ideas in isolation merit publication • Judge the complete system by its (hopefully intelligent) choice of • Small collection of inter-related powerful ideas • That together put performance in a new sandbox
Outline • Overview • Read-optimized column store • Query execution and optimization • Handling transactional updates • Performance • Summary
Data Model • Projection (materialized view): • some number of columns from a fact table • plus columns in a dimension table – with a 1-n join between fact and dimension table • (conceptually) no duplicate elimination • Stored in order of a sort key(s) • Note: the base table itself is not stored anywhere
Example • Logical base tables: • EMP (name, age, salary, dept) • DEPT (dname, floor) • Example projections • EMP1 (name, age | age) • EMP2 (dept, age, DEPT.floor | DEPT.floor) • EMP3 (name, salary | salary) • DEPT1 (dname, floor | floor)
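To make the projection notation concrete, here is a small hypothetical Python sketch (not from the paper; make_projection and the sample rows are invented for illustration) of how EMP1 could be laid out: its columns are stored contiguously, ordered on the sort key that follows the “|”.

```python
# Hypothetical sketch: C-Store projections over the EMP base table.
# Each projection stores whole columns, ordered on its sort key
# (the attribute after "|"), rather than storing rows.

EMP_SCHEMA = ("name", "age", "salary", "dept")

emp_rows = [
    ("alice", 29, 60000, "sales"),
    ("bob",   41, 90000, "eng"),
    ("carol", 35, 75000, "eng"),
]

def make_projection(rows, attrs, sort_attr, schema=EMP_SCHEMA):
    """Build column lists for `attrs`, sorted on `sort_attr`."""
    idx = {a: i for i, a in enumerate(schema)}
    ordered = sorted(rows, key=lambda r: r[idx[sort_attr]])
    return {a: [r[idx[a]] for r in ordered] for a in attrs}

# EMP1 (name, age | age): both columns stored in age order.
emp1 = make_projection(emp_rows, ["name", "age"], "age")
print(emp1)  # {'name': ['alice', 'carol', 'bob'], 'age': [29, 35, 41]}
```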
Optimize for Grid Computing • I.e. shared-nothing • Horizontal partitioning and intra-query parallelism as in Gamma • Paper talks about “Grid computers … may have tens to hundreds of nodes …”
Projection Detail #1 • Each projection is horizontally partitioned into “segments” • Segment identifier • Unit of distribution and parallelism • Value-based partitioning on key ranges of the sort key(s) • Column-wise storage inside each segment • Storage key: ordinal record number within the segment • calculated as needed
Projection Detail #2 • Different encoding schemes for different columns • Depends on ordering and value distribution • Self-order, few distinct values: (value, position, num_entries) triples • Foreign-order, few distinct values: (value, bitmap), where the bitmap is run-length encoded • Self-order, many distinct values: block-oriented delta encoding • Foreign-order, many distinct values: gzip
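As a concrete illustration of the first scheme, here is a hypothetical sketch (the function name and sample data are invented) of run-length encoding a self-ordered column with few distinct values into (value, position, num_entries) triples:

```python
# Hypothetical sketch of the type-1 encoding: a self-ordered column
# with few distinct values becomes (value, first_position, run_length)
# triples, one per run of equal values.

def encode_self_ordered(column):
    """Run-length encode a sorted column into (value, position, num_entries)."""
    triples = []
    for pos, v in enumerate(column):
        if triples and triples[-1][0] == v:
            value, start, n = triples[-1]
            triples[-1] = (value, start, n + 1)
        else:
            triples.append((v, pos, 1))
    return triples

dept = ["eng", "eng", "eng", "sales", "sales", "ship"]
print(encode_self_ordered(dept))
# [('eng', 0, 3), ('sales', 3, 2), ('ship', 5, 1)]
```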
Big Disk Blocks • Tunable • Big (minimum size is 64K)
Reconstructing the Base Table from Projections • Join index: • Projection T1 has M segments, projection T2 has N segments • T1 and T2 cover the same base table • The join index consists of M tables, one per T1 segment • Entry: segment ID and storage key of the corresponding record in T2 • In general, multiple join indices are needed to reconstruct a base table • Join indices are costly to store and maintain • Each column is expected to appear in multiple projections • This reduces the number of join indices needed
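A rough sketch (hypothetical Python; the layout of t2_segments and the helper name are invented for illustration) of what one join-index table holds: for each record of a T1 segment, in T1's sort order, the (segment ID, storage key) of the matching record in T2.

```python
# Hypothetical sketch of a C-Store join index: for each record in a
# T1 segment, store the (segment_id, storage_key) of the corresponding
# record in projection T2.

# T1 = EMP1 (name, age | age); T2 = EMP3 (name, salary | salary).
t1_names = ["alice", "carol", "bob"]               # T1 segment, in age order
t2_segments = {0: ["alice", "bob"], 1: ["carol"]}  # T2: 2 segments, salary order

def build_join_index(t1_keys, t2_segments):
    """Map each T1 record to the T2 (segment_id, storage_key) holding it."""
    location = {name: (sid, sk)
                for sid, seg in t2_segments.items()
                for sk, name in enumerate(seg)}
    return [location[k] for k in t1_keys]

print(build_join_index(t1_names, t2_segments))
# [(0, 0), (1, 0), (0, 1)]  -- where alice, carol, bob live in T2
```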
Innovative Redundancy • Hardly any warehouse is recovered by redo from the log • It takes too long! • Instead, store enough projections to ensure K-safety • A column can appear in K different projections • Rebuild dead objects from elsewhere in the network
Automatic Physical DBMS Design • Accept a “training set” of queries and a space budget • Choose the projections and join indices auto-magically • Re-optimize periodically based on a log of the interactions
Outline • Overview • Read-optimized column store • Query execution and optimization • Handling transactional updates • Performance • Summary
Operators • Decompress • Select: generate a bitstring • Mask: bitstring + projection → selected rows • Project: choose a subset of columns • Concat: combine multiple projections that are sorted in the same order • Sort • Permute: according to a join index • Join • Aggregation operators • Bitstring operators
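A minimal sketch (hypothetical Python, not C-Store's actual operator code) of how Select and Mask compose: Select turns a predicate over one column into a bitstring, and Mask applies that bitstring to another column of the same projection.

```python
# Hypothetical sketch of two C-Store operators:
# Select produces a bitstring; Mask applies it to a projection's column.

def select(column, predicate):
    """Return a bitstring (list of 0/1) marking rows satisfying predicate."""
    return [1 if predicate(v) else 0 for v in column]

def mask(bitstring, column):
    """Return only the column values whose bit is set."""
    return [v for bit, v in zip(bitstring, column) if bit]

age  = [29, 35, 41]            # EMP1 columns, stored in age order
name = ["alice", "carol", "bob"]

bits = select(age, lambda a: a > 30)   # [0, 1, 1]
print(mask(bits, name))                # ['carol', 'bob']
```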
Execution • Query plan: a tree of operators (data flow) • Leaf: accessing the data storage • Internal: calls “get_next” • Operators return 64KB blocks
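A toy sketch of this pull-based model (hypothetical Python; BLOCK_SIZE stands in for the 64KB blocks the paper's operators exchange): each internal operator calls get_next on its child and processes one block at a time.

```python
# Hypothetical sketch of the pull-based ("get_next") execution model:
# each operator pulls fixed-size blocks of values from its child.

BLOCK_SIZE = 4  # stand-in for the paper's 64KB blocks

class Scan:
    """Leaf operator: reads a stored column in fixed-size blocks."""
    def __init__(self, column):
        self.column, self.pos = column, 0

    def get_next(self):
        block = self.column[self.pos:self.pos + BLOCK_SIZE]
        self.pos += BLOCK_SIZE
        return block or None  # None signals end of input

class Filter:
    """Internal operator: calls get_next on its child, filters each block."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def get_next(self):
        block = self.child.get_next()
        if block is None:
            return None
        return [v for v in block if self.predicate(v)]

plan = Filter(Scan(list(range(10))), lambda v: v % 2 == 0)
while (block := plan.get_next()) is not None:
    print(block)  # [0, 2], then [4, 6], then [8]
```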
Column Optimizer (discussion) • Cost-based estimation for query plan construction • Chooses the projections on which to run the query • Cost model includes compression types • Decides when to apply the “mask” operator • Built for snowflake schemas • Which are simple to optimize without exhaustive search • Extensions are being looked at
Outline • Overview • Read-optimized column store • Query execution and optimization • Handling transactional updates • Performance • Summary
Online Updates Are Necessary • Transactional updates are needed even in a read-mostly environment • Online updates for error corrections • Real-time data warehouses • Reduce the delay between the OLTP system and the warehouse toward zero
Solution – a Hybrid Store • Write-optimized column store (WS) → tuple mover (batch rebuilder) → read-optimized column store (RS; what we have been talking about so far)
Write Store • Column store • Horizontally partitioned like the read store • 1:1 mapping between RS segments and WS segments • Storage keys are explicitly stored • B-tree maps sort key → storage key • No compression (the data size is small)
Handling Updates • Optimize read-only queries: do not hold locks • Snapshot isolation • The query runs on a snapshot of the data • Ensure all transactions related to this snapshot have already committed • Each WS site keeps an insertion vector (with timestamps) and a deletion vector (updates become insertions plus deletions) • Maintain a high water mark and a low water mark across the WS sites: • HWM: all transactions at or before the HWM have committed • LWM: no record in the read store was inserted after the LWM • Queries can specify any effective time between the LWM and the HWM (sketch below)
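A simplified sketch (hypothetical Python; real C-Store timestamps are coarse epochs, per the next slide) of the resulting visibility rule: a WS record is visible to a query running at effective time T, with LWM ≤ T ≤ HWM, iff it was inserted at or before T and not deleted at or before T.

```python
# Hypothetical sketch of snapshot-isolation visibility in the write store.
# Records carry insertion/deletion timestamps; a read-only query runs
# "as of" an effective time T chosen so that LWM <= T <= HWM.

def visible(insert_time, delete_time, query_time):
    """A WS record is visible at query_time iff it was inserted by then
    and not yet deleted by then (delete_time=None means never deleted)."""
    if insert_time > query_time:
        return False
    return delete_time is None or delete_time > query_time

LWM, HWM = 5, 17
T = 12                       # query's effective time, LWM <= T <= HWM
print(visible(10, None, T))  # True: inserted at 10, never deleted
print(visible(10, 11, T))    # False: deleted at 11, before the snapshot
print(visible(14, None, T))  # False: inserted after the snapshot
```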
HWM and Epochs • TA: a time authority updates a coarse-grained timer; each tick of the timer is an epoch, the granularity at which the HWM advances
Transactions • Undo from a log (which does not need to be persistent) • Redo by rebuilding from elsewhere in the network
Tuple-Mover • Read the RS segment • Merge the WS segment into a new version of the RS segment; do not update in place • Record the last move time for this segment in WS • Invariant: T_lastmove ≤ LWM • The time authority periodically sends out a new LWM epoch number (sketch below)
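A minimal sketch (hypothetical Python; tuple_move and its arguments are invented names) of the merge-out step: build a new RS segment from the old one plus committed WS inserts, dropping committed deletes, instead of updating in place.

```python
# Hypothetical sketch of the tuple mover's merge-out: build a *new*
# RS segment from the old one plus committed WS inserts (no update in
# place), dropping records whose deletions have also committed.

def tuple_move(rs_segment, ws_inserts, ws_deletes):
    """Return a new RS segment in sort-key order: (old RS - committed
    deletes) merged with committed WS inserts."""
    survivors = [r for r in rs_segment if r not in ws_deletes]
    return sorted(survivors + ws_inserts)

old_rs = [10, 20, 30, 40]   # records identified by sort-key value
new_rs = tuple_move(old_rs, ws_inserts=[25, 35], ws_deletes={20})
print(new_rs)               # [10, 25, 30, 35, 40]
```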
Current Performance • Varying the storage budget: • 100X faster than a popular row store while using 40% of the space • 10X faster than a popular column store in 70% of the space • 7X faster than a popular row store in 1/6th of the space • Code available under a BSD license
Summary • Column store optimized for read queries • Cluster parallelism • Interesting data organization (projections, compression, join indices) • Handling transactional updates (hybrid WS/RS store)