C-Store: A Column-oriented DBMS. New England Database Group (Stonebraker et al., Brandeis/Brown/MIT/UMass-Boston). Extended for Big Data Reading Group presentation by Shimin Chen
Relational Database • [Diagram: a table as a grid with columns Attribute1–Attribute3 and rows Record 1–Record 3] • e.g. Customer(cid, name, address, discount), Product(pid, name, manufacturer, price, quantity), Order(oid, cid, pid, quantity)
Current DBMS -- “Row Store” • [Diagram: Records 1–4 stored contiguously, one row after another] • E.g. DB2, Oracle, Sybase, SQLServer, …
Row Stores are Write Optimized (use whiteboard) • Store fields in one record contiguously on disk • Use small (e.g. 4K) disk blocks • Use B-tree indexing • Align fields on byte or word boundaries • Assume shifting data values is costly • Transactions: write-ahead logging
Row Stores are Write Optimized • Can insert and delete a record in one physical write • Good for on-line transaction processing (OLTP) • But not for read-mostly applications • Data warehouses • Customer Relationship Management (CRM) • Electronic library card catalogs • …
At 100K Feet…. • Read-optimized: • Periodically a bulk load of new data • Long period of ad-hoc queries • Benefit: • Ad-hoc queries read 2 columns out of 20 • Column store reads 10% of what a row store reads • Previous pioneering work: • Sybase IQ (early ’90s) • Monet (see CIDR ’05 for the most recent description)
C-Store Technical Ideas • Data storage: • Only materialized views (perhaps many) • Compress the columns to save space • No alignment • Big disk blocks • Innovative redundancy • Optimize for grid (cluster) computing • Focus on sorting rather than indexing • Automatic physical DBMS design • Column-oriented optimizer and executor
How to Evaluate This Paper…. • None of the ideas in isolation merit publication • Judge the complete system by its (hopefully intelligent) choice of • Small collection of inter-related powerful ideas • That together put performance in a new sandbox
Outline • Overview • Read-optimized column store • Query execution and optimization • Handling transactional updates • Performance • Summary
Data Model • Projection (materialized view): • some number of columns from a fact table • plus columns in a dimension table – with a 1-n join between fact and dimension table • (conceptually) no duplicate elimination • Stored in order of a sort key(s) • Note: the base table itself is not stored anywhere
Example • Logical base tables: • EMP (name, age, salary, dept) • DEPT (dname, floor) • Example projections • EMP1 (name, age | age) • EMP2 (dept, age, DEPT.floor | DEPT.floor) • EMP3 (name, salary | salary) • DEPT1 (dname, floor | floor)
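To make the projection notation concrete, here is a small hypothetical Python sketch (not from the paper; make_projection and the sample rows are invented for illustration) of how EMP1 could be laid out: its columns are stored contiguously, ordered on the sort key that follows the “|”.

```python
# Hypothetical sketch: C-Store projections over the EMP base table.
# Each projection stores whole columns, ordered on its sort key
# (the attribute after "|"), rather than storing rows.

EMP_SCHEMA = ("name", "age", "salary", "dept")

emp_rows = [
    ("alice", 29, 60000, "sales"),
    ("bob",   41, 90000, "eng"),
    ("carol", 35, 75000, "eng"),
]

def make_projection(rows, attrs, sort_attr, schema=EMP_SCHEMA):
    """Build column lists for `attrs`, sorted on `sort_attr`."""
    idx = {a: i for i, a in enumerate(schema)}
    ordered = sorted(rows, key=lambda r: r[idx[sort_attr]])
    return {a: [r[idx[a]] for r in ordered] for a in attrs}

# EMP1 (name, age | age): both columns stored in age order.
emp1 = make_projection(emp_rows, ["name", "age"], "age")
print(emp1)  # {'name': ['alice', 'carol', 'bob'], 'age': [29, 35, 41]}
```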
Optimize for Grid Computing • I.e. shared-nothing • Horizontal partitioning and intra-query parallelism as in Gamma • Paper talks about “Grid computers … may have tens to hundreds of nodes …”
Projection Detail #1 • Each projection is horizontally partitioned into “segments” • Segment identifier • Unit of distribution and parallelism • Value-based partitioning on key ranges of the sort key(s) • Column-wise storage inside each segment • Storage key: ordinal record number within the segment • calculated as needed
Projection Detail #2 • Different encoding schemes for different columns • Depends on ordering and value distribution • Self-order, few distinct values: (value, position, num_entries) triples • Foreign-order, few distinct values: (value, bitmap), where the bitmap is run-length encoded • Self-order, many distinct values: block-oriented delta encoding • Foreign-order, many distinct values: gzip
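As a concrete illustration of the first scheme, here is a hypothetical sketch (the function name and sample data are invented) of run-length encoding a self-ordered column with few distinct values into (value, position, num_entries) triples:

```python
# Hypothetical sketch of the type-1 encoding: a self-ordered column
# with few distinct values becomes (value, first_position, run_length)
# triples, one per run of equal values.

def encode_self_ordered(column):
    """Run-length encode a sorted column into (value, position, num_entries)."""
    triples = []
    for pos, v in enumerate(column):
        if triples and triples[-1][0] == v:
            value, start, n = triples[-1]
            triples[-1] = (value, start, n + 1)
        else:
            triples.append((v, pos, 1))
    return triples

dept = ["eng", "eng", "eng", "sales", "sales", "ship"]
print(encode_self_ordered(dept))
# [('eng', 0, 3), ('sales', 3, 2), ('ship', 5, 1)]
```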
Big Disk Blocks • Tunable • Big (minimum size is 64K)
Reconstructing the Base Table from Projections • Join index: • Projection T1 has M segments, projection T2 has N segments • T1 and T2 cover the same base table • The join index consists of M tables, one per T1 segment • Entry: segment ID and storage key of the corresponding record in T2 • In general, multiple join indices are needed to reconstruct a base table • Join indices are costly to store and maintain • Each column is expected to appear in multiple projections • This reduces the number of join indices needed
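A rough sketch (hypothetical Python; the layout of t2_segments and the helper name are invented for illustration) of what one join-index table holds: for each record of a T1 segment, in T1's sort order, the (segment ID, storage key) of the matching record in T2.

```python
# Hypothetical sketch of a C-Store join index: for each record in a
# T1 segment, store the (segment_id, storage_key) of the corresponding
# record in projection T2.

# T1 = EMP1 (name, age | age); T2 = EMP3 (name, salary | salary).
t1_names = ["alice", "carol", "bob"]               # T1 segment, in age order
t2_segments = {0: ["alice", "bob"], 1: ["carol"]}  # T2: 2 segments, salary order

def build_join_index(t1_keys, t2_segments):
    """Map each T1 record to the T2 (segment_id, storage_key) holding it."""
    location = {name: (sid, sk)
                for sid, seg in t2_segments.items()
                for sk, name in enumerate(seg)}
    return [location[k] for k in t1_keys]

print(build_join_index(t1_names, t2_segments))
# [(0, 0), (1, 0), (0, 1)]  -- where alice, carol, bob live in T2
```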
Innovative Redundancy • Hardly any warehouse is recovered by redo from the log • It takes too long! • Instead, store enough projections to ensure K-safety • A column can appear in K different projections • Rebuild dead objects from elsewhere in the network
Automatic Physical DBMS Design • Accept a “training set” of queries and a space budget • Choose the projections and join indices auto-magically • Re-optimize periodically based on a log of the interactions
Outline • Overview • Read-optimized column store • Query execution and optimization • Handling transactional updates • Performance • Summary
Operators • Decompress • Select: generate a bitstring • Mask: bitstring + projection → selected rows • Project: choose a subset of columns • Concat: combine multiple projections that are sorted in the same order • Sort • Permute: according to a join index • Join • Aggregation operators • Bitstring operators
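A minimal sketch (hypothetical Python, not C-Store's actual operator code) of how Select and Mask compose: Select turns a predicate over one column into a bitstring, and Mask applies that bitstring to another column of the same projection.

```python
# Hypothetical sketch of two C-Store operators:
# Select produces a bitstring; Mask applies it to a projection's column.

def select(column, predicate):
    """Return a bitstring (list of 0/1) marking rows satisfying predicate."""
    return [1 if predicate(v) else 0 for v in column]

def mask(bitstring, column):
    """Return only the column values whose bit is set."""
    return [v for bit, v in zip(bitstring, column) if bit]

age  = [29, 35, 41]            # EMP1 columns, stored in age order
name = ["alice", "carol", "bob"]

bits = select(age, lambda a: a > 30)   # [0, 1, 1]
print(mask(bits, name))                # ['carol', 'bob']
```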
Execution • Query plan: a tree of operators (data flow) • Leaf: accessing the data storage • Internal: calls “get_next” • Operators return 64KB blocks
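A toy sketch of this pull-based model (hypothetical Python; BLOCK_SIZE stands in for the 64KB blocks the paper's operators exchange): each internal operator calls get_next on its child and processes one block at a time.

```python
# Hypothetical sketch of the pull-based ("get_next") execution model:
# each operator pulls fixed-size blocks of values from its child.

BLOCK_SIZE = 4  # stand-in for the paper's 64KB blocks

class Scan:
    """Leaf operator: reads a stored column in fixed-size blocks."""
    def __init__(self, column):
        self.column, self.pos = column, 0

    def get_next(self):
        block = self.column[self.pos:self.pos + BLOCK_SIZE]
        self.pos += BLOCK_SIZE
        return block or None  # None signals end of input

class Filter:
    """Internal operator: calls get_next on its child, filters each block."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def get_next(self):
        block = self.child.get_next()
        if block is None:
            return None
        return [v for v in block if self.predicate(v)]

plan = Filter(Scan(list(range(10))), lambda v: v % 2 == 0)
while (block := plan.get_next()) is not None:
    print(block)  # [0, 2], then [4, 6], then [8]
```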
Column Optimizer (discussion) • Cost-based estimation for query plan construction • Chooses the projections on which to run the query • Cost model includes compression types • Decides when to apply the “mask” operator • Built for snowflake schemas • Which are simple to optimize without exhaustive search • Extensions are being looked at
Outline • Overview • Read-optimized column store • Query execution and optimization • Handling transactional updates • Performance • Summary
Online Updates Are Necessary • Transactional updates are needed even in a read-mostly environment • Online updates for error corrections • Real-time data warehouses • Reduce the delay between the OLTP system and the warehouse toward zero
Solution – a Hybrid Store • Write-optimized column store (WS) → tuple mover (batch rebuilder) → read-optimized column store (RS; what we have been talking about so far)
Write Store • Column store • Horizontally partitioned like the read store • 1:1 mapping between RS segments and WS segments • Storage keys are explicitly stored • B-tree maps sort key → storage key • No compression (the data size is small)
Handling Updates • Optimize read-only queries: do not hold locks • Snapshot isolation • The query runs on a snapshot of the data • Ensure all transactions related to this snapshot have already committed • Each WS site keeps an insertion vector (with timestamps) and a deletion vector (updates become insertions plus deletions) • Maintain a high water mark and a low water mark across the WS sites: • HWM: all transactions at or before the HWM have committed • LWM: no record in the read store was inserted after the LWM • Queries can specify any effective time between the LWM and the HWM (sketch below)
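A simplified sketch (hypothetical Python; real C-Store timestamps are coarse epochs, per the next slide) of the resulting visibility rule: a WS record is visible to a query running at effective time T, with LWM ≤ T ≤ HWM, iff it was inserted at or before T and not deleted at or before T.

```python
# Hypothetical sketch of snapshot-isolation visibility in the write store.
# Records carry insertion/deletion timestamps; a read-only query runs
# "as of" an effective time T chosen so that LWM <= T <= HWM.

def visible(insert_time, delete_time, query_time):
    """A WS record is visible at query_time iff it was inserted by then
    and not yet deleted by then (delete_time=None means never deleted)."""
    if insert_time > query_time:
        return False
    return delete_time is None or delete_time > query_time

LWM, HWM = 5, 17
T = 12                       # query's effective time, LWM <= T <= HWM
print(visible(10, None, T))  # True: inserted at 10, never deleted
print(visible(10, 11, T))    # False: deleted at 11, before the snapshot
print(visible(14, None, T))  # False: inserted after the snapshot
```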
HWM and Epochs • TA: a time authority updates a coarse-grained timer; each tick of the timer is an epoch, the granularity at which the HWM advances
Transactions • Undo from a log (which does not need to be persistent) • Redo by rebuilding from elsewhere in the network
Tuple-Mover • Read the RS segment • Merge the WS segment into a new version of the RS segment; do not update in place • Record the last move time for this segment in WS • Invariant: T_lastmove ≤ LWM • The time authority periodically sends out a new LWM epoch number (sketch below)
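A minimal sketch (hypothetical Python; tuple_move and its arguments are invented names) of the merge-out step: build a new RS segment from the old one plus committed WS inserts, dropping committed deletes, instead of updating in place.

```python
# Hypothetical sketch of the tuple mover's merge-out: build a *new*
# RS segment from the old one plus committed WS inserts (no update in
# place), dropping records whose deletions have also committed.

def tuple_move(rs_segment, ws_inserts, ws_deletes):
    """Return a new RS segment in sort-key order: (old RS - committed
    deletes) merged with committed WS inserts."""
    survivors = [r for r in rs_segment if r not in ws_deletes]
    return sorted(survivors + ws_inserts)

old_rs = [10, 20, 30, 40]   # records identified by sort-key value
new_rs = tuple_move(old_rs, ws_inserts=[25, 35], ws_deletes={20})
print(new_rs)               # [10, 25, 30, 35, 40]
```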
Current Performance • Varying the storage budget: • 100X faster than a popular row store while using 40% of the space • 10X faster than a popular column store in 70% of the space • 7X faster than a popular row store in 1/6th of the space • Code available under a BSD license
Summary • Column store optimized for read queries • Cluster parallelism • Interesting data organization (projections, compression, join indices) • Handling transactional updates (hybrid WS/RS store)