C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009
High Compressibility in Column Store • Each attribute is stored in a separate column. • A column store can use not only traditional compression techniques • Dictionary encoding, Huffman encoding, etc. • But also column-oriented techniques • Run-Length Encoding
Benefits of Compression in DBMS • Reduces the size of data • Improves I/O performance • Reduces seek times: data are stored nearer to each other. • Reduces transfer times: there is less data to transfer. • Increases buffer hit rate: the buffer can hold a larger fraction of the data.
How to Query a Compressed Column? • De-compress the data • Query the compressed column directly? • Run-Length Encoding
A Simple Example • In column C1, the value “42” appears 1000 times consecutively. • Assume C1 is compressed by Run-Length Encoding. • Query: SUM(C1) • Result: 42 * 1000 = 42000, computed from the single run without decompressing it.
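The example above can be sketched in a few lines of Python. This is only an illustration: the `(value, run_length)` pair layout and the function names `rle_encode` and `rle_sum` are assumptions made for this sketch, not C-Store's actual storage format or API.

```python
from typing import List, Tuple

# Hypothetical run layout for this sketch: (value, run_length).
Run = Tuple[int, int]

def rle_encode(column: List[int]) -> List[Run]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs: List[Run] = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_sum(runs: List[Run]) -> int:
    """SUM over the compressed column: one multiplication per run."""
    return sum(value * length for value, length in runs)

c1 = [42] * 1000          # the column from the slide
runs = rle_encode(c1)     # a single run: [(42, 1000)]
total = rle_sum(runs)     # 42 * 1000
```

The sum costs one multiplication per run instead of one addition per row, which is where the saving comes from when runs are long.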
History of Compression in DBMS • 80s: • Focus on compression ratio • 90s: • CPU cost of compressing/decompressing should be less than the savings from reducing the size of data. • Now: • CPU speed has increased much faster than memory speed and disk speed. • Light-weight compression is attractive: cheap CPU cycles are traded for less I/O.
Reducing CPU cost on compressed data (Graefe and Shapiro, 1991) • Lazy Decompression • Data is decompressed only when it needs to be operated on. • Query the compressed data directly • Exact-match comparison, natural join • Possible if the constant portion of the predicate is compressed in the same way as the data.
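As an illustration of querying compressed data directly, the sketch below runs an exact-match predicate against a dictionary-encoded column by encoding the constant with the same dictionary and comparing codes. Dictionary encoding is chosen here only as a convenient example; the helper names (`build_dictionary`, `dict_encode`, `select_equal`) and the sample data are hypothetical.

```python
from typing import Dict, List

def build_dictionary(column: List[str]) -> Dict[str, int]:
    """Assign a small integer code to each distinct value."""
    return {value: code for code, value in enumerate(sorted(set(column)))}

def dict_encode(column: List[str], dictionary: Dict[str, int]) -> List[int]:
    """Replace every value by its code."""
    return [dictionary[v] for v in column]

def select_equal(encoded: List[int], dictionary: Dict[str, int],
                 constant: str) -> List[int]:
    """Evaluate `column = constant` without decoding any column value."""
    code = dictionary.get(constant)   # compress the constant the same way
    if code is None:                  # constant never occurs in the column
        return []
    return [pos for pos, c in enumerate(encoded) if c == code]

retflag = ["A", "N", "R", "N", "A", "N"]
dictionary = build_dictionary(retflag)
encoded = dict_encode(retflag, dictionary)
matches = select_equal(encoded, dictionary, "N")   # matching positions
```

Only the constant is translated; the column values stay in their encoded form for the whole scan.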
New Work in C-Store • Simultaneously applies an operation to multiple values in a single column. • Introduces a novel architecture for passing compressed data between query operators. • Minimizes operator code complexity • While maximizing the chances for direct operation on compressed data.
Review: Basic Concepts of C-Store • A logical table is physically represented as a set of projections. • Each projection consists of a set of columns • Columns are stored separately, along with a common sort order defined by SORT KEY. • Each column appears in at least one projection. • A column can have different sort orders if it is stored in multiple projections.
An example of C-Store Projection • LINEITEM(shipdate, quantity, retflag, suppkey | shipdate, quantity, retflag) • First sorted by shipdate • Second sorted by quantity • Third sorted by retflag • Sorting increases locality of data. • Favors Compression Techniques such as Run-Length Encoding
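To see why a sorted projection favors Run-Length Encoding, the following sketch counts runs per column before and after sorting by (shipdate, quantity, retflag). The rows are synthetic and the value ranges (5 ship dates, 3 quantities, 3 return flags) are assumptions chosen only to keep the example small.

```python
import random
from itertools import groupby
from typing import List, Sequence

def count_runs(column: Sequence) -> int:
    """Number of RLE runs needed to encode the column."""
    return sum(1 for _ in groupby(column))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Synthetic rows (shipdate, quantity, retflag); shipdate is a day number.
rows = [(rng.randint(1, 5), rng.randint(1, 3), rng.choice("ANR"))
        for _ in range(1000)]

unsorted_runs: List[int] = [count_runs([r[i] for r in rows]) for i in range(3)]

rows.sort()  # tuple order = shipdate, then quantity, then retflag
sorted_runs: List[int] = [count_runs([r[i] for r in rows]) for i in range(3)]
```

After sorting, the first column needs at most one run per distinct shipdate, the second at most one run per (shipdate, quantity) pair, and so on, so the leading sort-key columns compress best.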
C-Store Operators vs. Relational Operators • Selection • Produces bitmaps that can be efficiently combined. • Mask • Materializes a set of values from a column and a bitmap. • Permute • Reorders a column using a join index. • Projection • Projecting a column is free. • Two columns in the same order can be concatenated for free. • Join • Produces positions rather than values.
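A minimal sketch of how Selection and Mask fit together: each selection yields a position bitmap, bitmaps from different columns are combined with a bitwise AND, and Mask materializes the surviving values. Bitmaps are modeled as plain lists of booleans, and the function names and sample columns are assumptions for illustration only.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def select(column: Sequence[T], predicate: Callable[[T], bool]) -> List[bool]:
    """Selection: one bit per position, set where the predicate holds."""
    return [predicate(v) for v in column]

def bitmap_and(a: List[bool], b: List[bool]) -> List[bool]:
    """Combine two selections over columns stored in the same order."""
    return [x and y for x, y in zip(a, b)]

def mask(column: Sequence[T], bitmap: List[bool]) -> List[T]:
    """Mask: materialize the values at positions whose bit is set."""
    return [v for v, bit in zip(column, bitmap) if bit]

quantity = [5, 12, 7, 30, 12]
retflag = ["A", "N", "N", "R", "N"]

bm_quantity = select(quantity, lambda q: q > 6)
bm_retflag = select(retflag, lambda f: f == "N")
combined = bitmap_and(bm_quantity, bm_retflag)
result = mask(quantity, combined)
```

Values are touched only at the end, once the combined bitmap says which positions qualify.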
Compressed Query Execution: Two Classes for Each New Compression Technique • Compression Block • Encapsulates an intermediate representation for compressed data. • DataSource operator • Reads in compressed pages from disk and converts them into compression blocks.
A Compression Block • Contains a buffer of the column data in compressed format. • Provides an API that allows the buffer to be accessed in several ways.
Accessing Properties of Compression Block • isOneValue() • Returns whether or not the block contains just one value (and many positions for that value). • isValueSorted() • Returns whether or not the block’s values are sorted. • isPosContig() • Returns whether or not the block is a consecutive subset of a column.
Properties of Compression Block: for Various Encoding Schemes.
Iterator Access: where decompression cannot be avoided. • getNext() • Transiently decompresses the next value in the compression block • Returns that value along with the position in the uncompressed column. • asArray() • Decompresses the entire compression block • And returns a pointer to an array of data in the uncompressed column type.
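The two iterator-style calls can be sketched for a run-length-encoded block as follows. The class `RLEBlock`, its snake_case method names, and the use of `None` to signal exhaustion are assumptions of this sketch; the slides only name `getNext()` and `asArray()`.

```python
from typing import List, Optional, Tuple

class RLEBlock:
    """Hypothetical compression block holding one RLE run."""

    def __init__(self, value: int, start_pos: int, run_length: int) -> None:
        self.value = value
        self.start_pos = start_pos
        self.run_length = run_length
        self._offset = 0  # how many values get_next() has handed out

    def get_next(self) -> Optional[Tuple[int, int]]:
        """Transiently decompress one value; return (value, position)."""
        if self._offset >= self.run_length:
            return None  # block exhausted
        pos = self.start_pos + self._offset
        self._offset += 1
        return (self.value, pos)

    def as_array(self) -> List[int]:
        """Decompress the whole block into a plain list of values."""
        return [self.value] * self.run_length

block = RLEBlock(value=42, start_pos=100, run_length=3)
pairs = []
while (item := block.get_next()) is not None:
    pairs.append(item)
```

An operator that cannot exploit the compressed form falls back to one of these two calls, so decompression stays confined to the block.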
Block Information Methods (1): Getting Data without Decompression • For Run-Length Encoding • A compression block consists of a single RLE triple of the form (value, start_pos, run_length) • getSize(): • Returns run_length; • getStartValue(): • Returns value; • getEndPosition(): • Returns (start_pos + run_length - 1);
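The three methods map directly onto the fields of the RLE triple. The sketch below assumes a Python dataclass named `RLETriple` with snake_case method names; both are illustrative choices rather than the actual C-Store classes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RLETriple:
    """Hypothetical RLE compression block: (value, start_pos, run_length)."""
    value: int
    start_pos: int
    run_length: int

    def get_size(self) -> int:
        """Number of positions covered by the run."""
        return self.run_length

    def get_start_value(self) -> int:
        """The single value stored in the block."""
        return self.value

    def get_end_position(self) -> int:
        """Last position covered by the run (inclusive)."""
        return self.start_pos + self.run_length - 1

block = RLETriple(value=42, start_pos=1, run_length=1000)
# An aggregate can be answered from the block information alone:
block_sum = block.get_start_value() * block.get_size()
```

None of these calls expands the run, so an aggregate such as the sum above is computed in constant time per block.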
Block Information Methods (2): Getting Data without Decompression • For bitmaps • A compression block is a consecutive subset of a bitmap for a single value. • getSize(): • Returns the number of on bits in the block (i.e., a bit string). • getStartValue(): • Returns the value with which the bit string is associated. • getEndPosition(): • Returns the position of the last on bit in the bit string.
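The same three methods can be sketched for a bitmap block. Here the bit string is modeled as a Python `str` of `'0'`/`'1'` characters plus the column position of its first bit; this representation, the class name `BitmapBlock`, and returning `None` when no bit is set are all assumptions of the sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BitmapBlock:
    """Hypothetical bitmap compression block for a single value."""
    value: int        # the value this bit string is associated with
    start_pos: int    # column position of the first bit in `bits`
    bits: str         # e.g. "0110100"; '1' marks a position holding `value`

    def get_size(self) -> int:
        """Number of on bits, i.e. how many positions hold the value."""
        return self.bits.count("1")

    def get_start_value(self) -> int:
        """The value the bit string belongs to."""
        return self.value

    def get_end_position(self) -> Optional[int]:
        """Column position of the last on bit, or None if no bit is set."""
        last = self.bits.rfind("1")
        return None if last < 0 else self.start_pos + last

block = BitmapBlock(value=7, start_pos=10, bits="0110100")
occurrences = block.get_size()
```

Unlike an RLE run, the positions in a bitmap block are generally not contiguous, which is why the end position has to be located by finding the last on bit.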
Compression-Aware Optimization • Natural Join • One input column is compressed by Run-Length Encoding. • The other input column is uncompressed. • Do the join directly on the RLE triples. • Reduces the number of operations by a factor of k, where k is the run length of the RLE triple. • Count
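The join optimization can be sketched as a nested-loop equi-join in which each RLE triple is compared once against every value of the uncompressed column, and a match emits position pairs for the whole run. The function name, the nested-loop strategy, and the comparison counter are assumptions made to keep the illustration short; they are not taken from the C-Store executor.

```python
from typing import List, Tuple

RLETriple = Tuple[int, int, int]   # (value, start_pos, run_length)

def join_rle_with_uncompressed(
    runs: List[RLETriple], right: List[int]
) -> Tuple[List[Tuple[int, int]], int]:
    """Return (left_pos, right_pos) pairs and the number of comparisons."""
    pairs: List[Tuple[int, int]] = []
    comparisons = 0
    for value, start_pos, run_length in runs:
        for right_pos, right_value in enumerate(right):
            comparisons += 1       # one comparison covers the whole run
            if value == right_value:
                pairs.extend((left_pos, right_pos)
                             for left_pos in range(start_pos,
                                                   start_pos + run_length))
    return pairs, comparisons

# Left column [1, 1, 1, 2, 2] stored as two RLE triples.
runs = [(1, 0, 3), (2, 3, 2)]
right = [2, 1, 3]
pairs, comparisons = join_rle_with_uncompressed(runs, right)
```

With two runs and three right-hand values the sketch performs 6 comparisons, whereas joining the five decompressed left values would need 15; each run of length k saves a factor of k in comparisons.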
Summary: Integrating Compression and Execution • Operate directly on compressed data whenever possible • Using compression blocks as an intermediate representation of data. • Degenerate to a lazy decompression scheme when decompression cannot be avoided • Iterating through values in a compression block. • Reduce query executor complexity • By abstracting general properties of compression techniques.
References • Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik. C-Store: A Column Oriented DBMS. In VLDB, 2005. http://db.csail.mit.edu/projects/cstore/vldb.pdf • Daniel J. Abadi, Samuel R. Madden, and Miguel C. Ferreira. Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, June 2006, Chicago, USA. http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf