C-Store: How Different are Column-Stores and Row-Stores?

C-Store: How Different are Column-Stores and Row-Stores? Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 8, 2009

Row and Column Stores

Limitation of Current Comparison • Assuming a row store does not use any column-oriented physical design. • Columns of the same logical tuple is traditionally stored together. • In fact it is possible to simulate a column store in a row store.

Simulating a Column Store in a Row Store • Vertical Partitioning • Indexing Every Column • Materialized views • C-Tables • Vertical Partitioning + Run-Length Encoding

Simulation 1: Vertical Partitioning • Each column makes its own table • N columns result in N tables, each table stores (tuple-id, attr) pairs. • Need to do tuple reconstruction • In C-store, columns are sorted in the same order. • In MonetDB, base columns are kept in insertion order, updates only happen to cracker columns. • In this “simulation” scenario, an integer “tuple-id” column is associated to each column. • Use hash join on “tuple-id” column to reconstruct tuples.

Two Problems of Vertical Partitioning • It requires the “tuple-id” column to be explicitly stored in each column. • Waste space and disk bandwidth. • Most row-stores store a relatively large header on every tuple • further waste space.

Tuple Header • A tuple header provides metadata about the tuple. • For example, in Postgres, each tuple contains a 27 bytes header including information such as • Insert transaction timestamp. • Number of attributes in the tuple. • NULL flags • Length of tuple header

Tuple Header in a Column Store • A column-store puts tuple header in separate columns • Some information in the tuple header can be removed • For example, in MonetDB, each logical column stores (key, attr) pairs. • Hence, the number of attributes is always 2. • And there is no need for NULL flags, say we simply don’t store the tuple with NULL value.

Simulation 2: Indexing Every Column • Base relations are stored using a standard, row-oriented physical design. • And an additional secondary B-Tree is built on every column of every table. • Join columns on tuple-id. • This approach answers queries by reading values directly from the indexes.

One Problems of Indexing Every Column • If a column has no predicate on it, this approach requires • its index to be scanned to extract the needed values. • One optimization is to create indexes with composite keys SELECT AVG(salary) FROM emp WHERE age > 40; If we have a composite index with an (age, salary) key, we can answer this query directly from the index.

Simulation 3: Materialized Views • For every query in the benchmark workload, create an optimal set of materialized views. • Involve only the columns needed to answer queries. • Do not pre-join columns from different tables in these materialized views. • Problem of materialized views. • It requires query workload to be known in advance.

Comparison Results by Abadi et al. (SIGMOD’2008) • Using a simplified version of TPC-H called the Star Schema Benchmark (SSBM). • Major Results: • None of the three attemps to simulate a column store in a row store are particularly effective. • The Materialized View approach is the best • The Vertical Partitioning approach is the moderate • The Indexing Every Column approach is the worst.

Relevant Factors for Successful Simulation • For successfully simulating a column store in a row store, we may have to • Store tuple headers separately • Use virtual tuple-id to do tuple reconstruction (i.e., joins). • Maintain heap files (for base relations) in guaranteed position order.

Simulation 4: the C-Table Approach(By Bruno, CIDR’2009) The main idea is to extend the Vertical Partitioning approach to explicitly the Run-Length Encoding (RLE) of tuple values. A C-Table is a sequence of (f, v, c) triples: f is the starting tuple-id or position. v is the data value. c is the count (length). First sort by a, then b, and finally c.

An Interesting Property of C-Table • For any pair of tuples t1 and t2, possibly on different c-tables, • the range [f1, f1 + c1 -1] and [f2, f2 + c2 -1] do not partially overlap, • i.e., they are either disjoint, or the one associated with the column deeper in the sort order is included in the other. • This property allows us to use specific query rewriting to combine information from different c-tables to answer queries.

Query Rewriting for C-Tables The Schema D1: (lineitem | l_shipdate, l_suppkey)

Comparison Results by Bruno. (CIDR’2009) • Using TPC-H. • Using for comparison a loose lower bound on any column store implementation • Manually compute how many pages in disk need to be read by any column store execution plan, • And measure the time taken to just read the pages. • Results: • The C-Table approach can compete with column stores. • Vision: • Row-stores should be able to incorporate column specific optimizations.

Column-Specific Optimizations • Late materialization • Construct tuples as late as possible. • Block iteration • Multiple-tuples-at-a-time • Column-specific compression • Run-Length Encoding • Invisible join on star schema (New) • Like semi-join in distributed DBMS.

Schema of the SSBM Benchmark

An Example Query for Illustrating Invisible Join

Invisible Join: First Phase Each predicate is applied to the appropriate dimension table to extract a list of dimension table keys that satisfy the predicate.

Invisible Join: Second Phase Each hash table is used to extract the positions of tuples in the fact table that satisfy the corresponding predicate.

Invisible Join: Third Phase The third phase uses the list of satisfying positions P in the fact table to get foreign key values and hence needed data values from the corresponding dimension table.

Which Column-Specific Optimization is most significant? • Late materialization • Improves performance by a factor of 3. • Block iteration • Improves performance by 5% - 50%. • Column-specific compression • Improves performance by a factor of 2 on average • Improves performance by a factor of 10 on queries that access sorted data. • Invisible join on star schema • Improves performance by 50% - 70%.

Open Question • Building a complete row-store that can transform into a column-store on workloads where column-stores perform well.

References • Daniel J. Abadi, Samuel R. Madden, and Nabil Hachem. Column-Stores vs. Row-Stores: How Different Are They Really?. In SIGMOD, 2008. • Allison L. Holloway, David J. DeWitt. Read-optimized databases, in depth. In VLDB, 2008. • N Bruno. Teaching an Old Elephant New Tricks. In CIDR’2009.

C-Store: How Different are Column-Stores and Row-Stores?

C-Store: How Different are Column-Stores and Row-Stores?

Presentation Transcript

Price Busters Furniture

Grocery Store Survey

Places around Town

Four Aces

Lecture 4 Trade Area Delimitation and Analysis

Best In France Case study : LIDL

THIS

Chapter 7 Storage

What is Marketing…??

There are 230,000 Retail Establishments 30,000 Major Chain Stores 487 Power Centres which represents some 162M square f

200

CORPORATE EDUCATION

Jeo-pardy

Chap3 Morgan Importing (p. 98)

THIS

THIS

Joel Resnick Flemington New Jersey

THIS

THIS

THIS

Stream Store review and (GET) 100 items bonus pack