C-Store: Integrating Compression and Execution

C-Store: Integrating Compression and Execution. Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009. High Compressibility in Column Store. Each attribute is stored in a separate column. A Column Store can not only use traditional compression techniques

Presentation Transcript


  1. C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

  2. High Compressibility in Column Store • Each attribute is stored in a separate column. • A column store can use not only traditional compression techniques • Dictionary encoding, Huffman encoding, etc. • But also column-oriented techniques • Run-Length Encoding
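As a minimal sketch of why a column compresses well under Run-Length Encoding, the snippet below collapses each run of equal values into one triple. The triple layout (value, start_pos, run_length) follows the form given later in the slides; the function name `rle_encode`, the sample column, and the 0-based positions are assumptions made for illustration.

```python
from itertools import groupby

def rle_encode(column):
    """Encode a column as (value, start_pos, run_length) triples.

    Positions are 0-based here; that choice is an assumption of this sketch.
    """
    triples = []
    pos = 0
    for value, group in groupby(column):
        run_length = sum(1 for _ in group)
        triples.append((value, pos, run_length))
        pos += run_length
    return triples

# Hypothetical sorted column: nine values shrink to three triples.
retflag = ["A", "A", "A", "N", "N", "R", "R", "R", "R"]
encoded = rle_encode(retflag)
print(encoded)
```

The longer the runs, the fewer triples are needed, which is why sorted columns are a good fit for this scheme.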

  3. Benefits of Compression in DBMS • Reduces the size of data • Improves I/O performance • Reduces seek times: data are stored nearer to each other. • Reduces transfer times: there is less data to transfer. • Increases buffer hit rate: the buffer can hold a larger fraction of the data.

  4. How to Query a Compressed Column? • De-compress the data • Query the compressed column directly? • Run-Length Encoding

  5. A Simple Example • In column C1, the value “42” appears 1000 times consecutively. • Assume C1 is compressed by Run-Length Encoding. • Query: SUM(C1) • Answer: 42 * 1000 = 42000, computed directly on the compressed run.
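The example above can be sketched in a few lines: each run contributes value times run length, so the sum needs one multiplication per run instead of one addition per value. The helper name `rle_sum` and the 0-based start position are assumptions of this sketch.

```python
def rle_sum(triples):
    """SUM over an RLE-compressed column of (value, start_pos, run_length) triples."""
    return sum(value * run_length for value, _start_pos, run_length in triples)

# Column C1 from the slide: the value 42 repeated 1000 times, stored as one triple.
c1_compressed = [(42, 0, 1000)]
c1_uncompressed = [42] * 1000

total = rle_sum(c1_compressed)
print(total)
```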

  6. History of Compression in DBMS • 80s: • Focus on compression ratio • 90s: • CPU cost of compressing/decompressing should be less than the savings of reducing the size of data. • Now: CPU speed increases much faster than memory speed and disk speed. • Light-weight Compression is good

  7. Reducing CPU cost on compressed data (Graefe and Shapiro, 1991) • Lazy Decompression • Data is decompressed only when an operator needs it. • Query the compressed data directly • Exact-match Comparison, Natural Join • Possible if the constant portion of the predicate is compressed in the same way as the data
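One way to picture the exact-match case is with dictionary encoding: the constant in the predicate is encoded once with the same dictionary, and the comparison then runs on the codes without decompressing any value. This is only a sketch under that assumption; `dictionary_encode` and `select_equal` are hypothetical names, not functions from the cited work.

```python
def dictionary_encode(column):
    """Return (dictionary, codes): each distinct value maps to a small integer code."""
    dictionary = {value: code for code, value in enumerate(sorted(set(column)))}
    return dictionary, [dictionary[value] for value in column]

def select_equal(codes, dictionary, constant):
    """Positions where column == constant, comparing codes only."""
    code = dictionary.get(constant)  # compress the constant the same way as the data
    if code is None:                 # constant never occurs in the column
        return []
    return [pos for pos, c in enumerate(codes) if c == code]

city = ["Paris", "Guangzhou", "Paris", "Boston", "Guangzhou"]
dictionary, codes = dictionary_encode(city)
positions = select_equal(codes, dictionary, "Guangzhou")
print(positions)
```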

  8. New Work in C-Store • Simultaneously applies an operation to multiple values in a single column. • Introduces a novel architecture for passing compressed data between query operators. • Minimizes operator code complexity • While maximizing the chances of direct operation on compressed data.

  9. Review: Basic Concepts of C-Store • A logical table is physically represented as a set of projections. • Each projection consists of a set of columns • Columns are stored separately, along with a common sort order defined by SORT KEY. • Each column appears in at least one projection. • A column can have different sort orders if it is stored in multiple projections.

  10. An example of C-Store Projection • LINEITEM(shipdate, quantity, retflag, suppkey | shipdate, quantity, retflag) • First sorted by shipdate • Second sorted by quantity • Third sorted by retflag • Sorting increases locality of data. • Favors Compression Techniques such as Run-Length Encoding
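The effect of the sort order on Run-Length Encoding can be sketched by counting runs per column before and after sorting. The rows below are made-up sample data covering only the three sort-key columns (shipdate, quantity, retflag), and the helper names are assumptions of this sketch.

```python
from itertools import groupby

def count_runs(column):
    """Number of RLE runs needed to store the column."""
    return sum(1 for _ in groupby(column))

def runs_per_column(rows):
    """Run count for each column of a list of row tuples."""
    return [count_runs(column) for column in zip(*rows)]

# Hypothetical (shipdate, quantity, retflag) rows in arrival order.
rows = [
    ("2009-03-02", 5, "N"),
    ("2009-03-01", 7, "A"),
    ("2009-03-01", 5, "N"),
    ("2009-03-02", 5, "A"),
    ("2009-03-01", 5, "A"),
    ("2009-03-01", 7, "A"),
]

unsorted_runs = runs_per_column(rows)
sorted_runs = runs_per_column(sorted(rows))  # sort by shipdate, quantity, retflag
print(unsorted_runs, sorted_runs)
```

In this sample the leading sort column ends up with the fewest runs, and the benefit shrinks for columns further down the sort key.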

  11. C-Store Operators vs. Relational Operators • Selection • Produces bitmaps that can be efficiently combined. • Mask • Materializes a set of values from a column and a bitmap. • Permute • Reorders a column using a join index. • Projection • Projecting a column is free. • Two columns in the same order can be concatenated for free. • Join • Produces positions rather than values.
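A rough sketch of how Selection and Mask fit together: each selection yields a bitmap with one bit per position, bitmaps from different columns are combined with a bitwise AND, and Mask materializes values only at the end. Plain Python lists of booleans stand in for real bitmaps here, and the function names and sample data are assumptions.

```python
def select(column, predicate):
    """Selection: one boolean per position instead of the matching values."""
    return [bool(predicate(value)) for value in column]

def bitmap_and(left, right):
    """Combine two bitmaps of equal length."""
    return [a and b for a, b in zip(left, right)]

def mask(column, bitmap):
    """Mask: materialize the values whose bit is set."""
    return [value for value, bit in zip(column, bitmap) if bit]

quantity = [5, 12, 7, 30, 12]
retflag = ["A", "N", "A", "N", "N"]

# WHERE quantity > 6 AND retflag = 'N'
combined = bitmap_and(select(quantity, lambda q: q > 6),
                      select(retflag, lambda f: f == "N"))
result = mask(quantity, combined)
print(result)
```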

  12. Join over Two Columns: An Example

  13. Compressed Query Execution: Two Classes for Each New Compression Technique • Compression Block • Encapsulates an intermediate representation for compressed data. • DataSource operator • Reads in compressed pages from disk and converts them into compression blocks.
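To make the DataSource role concrete, here is a small sketch that turns raw page bytes into RLE compression blocks. The page layout (three little-endian 64-bit integers per triple) is purely hypothetical and is not C-Store's on-disk format; `write_page` and `rle_data_source` are illustrative names, and plain tuples stand in for compression block objects.

```python
import struct

# Hypothetical page layout: consecutive (value, start_pos, run_length) records.
TRIPLE = struct.Struct("<qqq")

def write_page(triples):
    """Serialize RLE triples into one page of bytes (stand-in for a disk page)."""
    return b"".join(TRIPLE.pack(*triple) for triple in triples)

def rle_data_source(pages):
    """Yield one compression block (here a plain tuple) per triple in each page."""
    for page in pages:
        for triple in TRIPLE.iter_unpack(page):
            yield triple

pages = [write_page([(42, 0, 1000)]),
         write_page([(7, 1000, 3), (9, 1003, 1)])]
blocks = list(rle_data_source(pages))
print(blocks)
```

Operators downstream would then see only blocks, never page bytes, which is the separation the two classes are meant to provide.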

  14. A Compression Block • contains a buffer of the column data in compressed format • Provides an API that allows the buffer to be accessed in several ways.

  15. Accessing Properties of Compression Block • isOneValue() • Returns whether or not the block contains just one value (and many positions for that value). • isValueSorted() • Returns whether or not the block’s values are sorted. • isPosContig() • Returns whether or not the block is a consecutive subset of a column.
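The sketch below illustrates how an operator might branch on these properties instead of on the compression scheme itself. The two block classes, the snake_case method names, and the `block_min` operator are assumptions of this sketch; ascending order is assumed for "sorted", and the uncompressed block conservatively reports that it is not one-valued.

```python
class RLEBlock:
    """A single RLE run: one value covering run_length consecutive positions."""
    def __init__(self, value, start_pos, run_length):
        self.value, self.start_pos, self.run_length = value, start_pos, run_length
    def is_one_value(self): return True
    def is_value_sorted(self): return True   # a single value is trivially sorted
    def is_pos_contig(self): return True
    def get_start_value(self): return self.value
    def as_array(self): return [self.value] * self.run_length

class UncompressedBlock:
    """A plain slice of a column."""
    def __init__(self, values):
        self.values = list(values)
    def is_one_value(self): return False     # assumption: not checked per block
    def is_value_sorted(self):
        return all(a <= b for a, b in zip(self.values, self.values[1:]))
    def is_pos_contig(self): return True
    def get_start_value(self): return self.values[0]
    def as_array(self): return self.values

def block_min(block):
    """MIN over one block, written against properties rather than encodings."""
    if block.is_one_value() or block.is_value_sorted():
        return block.get_start_value()       # no need to look at other values
    return min(block.as_array())             # fall back to scanning the values

print(block_min(RLEBlock(42, 0, 1000)), block_min(UncompressedBlock([8, 2, 6])))
```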

  16. Properties of Compression Block: for Various Encoding Schemes

  17. Iterator Access: where decompression cannot be avoided. • getNext() • Transiently decompresses the next value in the compression block • Returns that value along with the position in the uncompressed column. • asArray() • Decompresses the entire compression block • And returns a pointer to an array of data in the uncompressed column type.
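For an RLE block, iterator access could look roughly like the sketch below. A Python generator plays the role of repeated getNext() calls and a list stands in for the returned array; both are adaptations for illustration, as are the class and method names.

```python
class RLEBlock:
    """One RLE triple (value, start_pos, run_length)."""
    def __init__(self, value, start_pos, run_length):
        self.value = value
        self.start_pos = start_pos
        self.run_length = run_length

    def get_next(self):
        """Yield (value, position) pairs one at a time, decompressing transiently."""
        for offset in range(self.run_length):
            yield self.value, self.start_pos + offset

    def as_array(self):
        """Decompress the whole block into a list of values."""
        return [self.value] * self.run_length

block = RLEBlock(value=7, start_pos=10, run_length=3)
pairs = list(block.get_next())
values = block.as_array()
print(pairs, values)
```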

  18. Block Information Methods (1): Getting Data without Decompression • For Run-Length Encoding • A compression block consists of a single RLE triple of the form (value, start_pos, run_length) • getSize(): • Returns run_length; • getStartValue(): • Returns value; • getEndPosition(): • Returns (start_pos + run_length - 1);
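The three methods above translate almost directly into code. In this sketch a `NamedTuple` holds the triple; the class name and the snake_case method names are assumptions, while the returned expressions follow the slide.

```python
from typing import NamedTuple

class RLETriple(NamedTuple):
    """A compression block holding one (value, start_pos, run_length) triple."""
    value: int
    start_pos: int
    run_length: int

    def get_size(self):
        return self.run_length

    def get_start_value(self):
        return self.value

    def get_end_position(self):
        # The last position covered by the run is inclusive, hence the -1.
        return self.start_pos + self.run_length - 1

block = RLETriple(value=42, start_pos=100, run_length=1000)
print(block.get_size(), block.get_start_value(), block.get_end_position())
```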

  19. Block Information Methods (2): Getting Data without Decompression • For bitmaps • A compression block is a consecutive subset of a bitmap for a single value. • getSize(): • Returns the number of on bits in the block (i.e., a bit string). • getStartValue(): • Returns the value with which the bit string is associated. • getEndPosition(): • Returns the position of the last on bit in the bit string.
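The bitmap counterpart can be sketched the same way. Here the bit string is a Python list of 0/1 integers and `start_pos` is the column position of the first bit; this representation, the class name, and returning `None` when no bit is set are all assumptions of the sketch.

```python
class BitmapBlock:
    """A consecutive slice of the bitmap associated with one value."""
    def __init__(self, value, start_pos, bits):
        self.value = value
        self.start_pos = start_pos   # column position of bits[0]
        self.bits = list(bits)       # 1 means the column holds `value` there

    def get_size(self):
        """Number of on bits in the block."""
        return sum(self.bits)

    def get_start_value(self):
        """The value this bit string is associated with."""
        return self.value

    def get_end_position(self):
        """Column position of the last on bit, or None if no bit is set."""
        for offset in range(len(self.bits) - 1, -1, -1):
            if self.bits[offset]:
                return self.start_pos + offset
        return None

block = BitmapBlock(value="N", start_pos=8, bits=[0, 1, 1, 0, 1, 0])
print(block.get_size(), block.get_start_value(), block.get_end_position())
```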

  20. Compression-Aware Optimization • Natural Join • One input column is compressed by Run-Length Encoding, • The other input column is uncompressed • Do the join directly • Reduces the number of operations by a factor of k, where k is the run-length of the RLE triple. • Count
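The saving can be sketched with a simple nested-loop equi-join that compares each run, rather than each value, against the uncompressed column. The output keeps matches in compressed form as (left_start_pos, run_length, right_pos); that output shape, the function name, and the nested-loop strategy are assumptions made for illustration.

```python
def join_rle_with_uncompressed(rle_triples, right_column):
    """Equi-join an RLE column with an uncompressed column, one comparison per run.

    Returns (matches, comparisons). Each match (left_start_pos, run_length,
    right_pos) stands for run_length joined position pairs.
    """
    matches = []
    comparisons = 0
    for value, start_pos, run_length in rle_triples:
        for right_pos, right_value in enumerate(right_column):
            comparisons += 1
            if value == right_value:
                matches.append((start_pos, run_length, right_pos))
    return matches, comparisons

# Left column [1, 1, 1, 2, 2] stored as two runs; right column uncompressed.
left = [(1, 0, 3), (2, 3, 2)]
right = [2, 1, 2]
matches, comparisons = join_rle_with_uncompressed(left, right)
print(matches, comparisons)
```

On this sample the join makes 6 comparisons, against 15 if the left column were scanned value by value; each run of length k replaces k comparisons with one.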

  21. Summary: Integrating Compression and Execution • Operate directly on compressed data whenever possible • Using compression blocks as an intermediate representation of data. • Degenerate to a lazy decompression scheme when decompression cannot be avoided • Iterating through values in a compression block. • Reduce query executor complexity • By abstracting general properties of compression techniques.

  22. References • Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran and Stan Zdonik. C-Store: A Column Oriented DBMS. In VLDB, 2005. http://db.csail.mit.edu/projects/cstore/vldb.pdf • Daniel J. Abadi, Samuel R. Madden, and Miguel C. Ferreira. Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, June 2006, Chicago, USA. http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf
