Presentation on Compression Store Dec 1, 2011
Agenda
• Overview of RAM Cloud
• Overview of HANA
• Overview of CS-B+ Tree
• Compression utilities for in-memory compressed blocks
• Prototype on HDFS
• FUSE filesystem and working of LZ77
RAM Cloud
• Problem: Disk-based systems cannot meet the needs of large-scale web applications
• Solution (3 options):
  - New approaches to disk-based storage
  - Replacing disks with flash memory devices
  - RAM Cloud
RAM Cloud contd …
• RAM Cloud stores the information in main memory
• RAM Cloud uses hundreds to thousands of servers to create a large-scale storage system
• Reduced latency: data in RAM is readily accessible, with much lower latency than disk
• Need for durability: RAM Cloud uses replication and backup to provide durability
RAM Cloud Concept (Scaling-Performance)
• Applications
  - Generating web pages
  - Enforcing business rules
• Storage
  - Shared storage for applications
  - Traditionally RDBMS, files
  - New storage models: Bigtable, memcached
How RAM Cloud differs from memcached
• Not a cache but the data store itself, so no caching policy is needed
• Disk is used as a backup device
• Reduced latency: I/O is on the order of microseconds, whereas for disk it is milliseconds
• A RAM Cloud server will typically have around 64 GB of DRAM
Motivation for RAM Cloud (Application/Storage)
• Motivation (Applications)
  - A single database cannot meet the needs of a popular web app
  - Data is partitioned into multiple databases to meet throughput requirements
  - As the workload increases, ad hoc techniques have to be adopted
• Case study (Facebook)
  - As of Aug 2009, Facebook deployed 2000 memcached servers on top of 4000 database servers to increase throughput
  - The DB is MySQL: 4000 DB servers, 50% of total DB servers
Motivation (Storage)
• New storage systems like Bigtable, Dynamo and PNUTS solve scalability issues
• These techniques give up the benefits of traditional DBs [Greedy techniques]
• RAM Cloud provides a generic storage solution
• Disk density has increased, but the access rate of disks has not increased much
• Disk is moving towards an archival role [Tapes]
• Accessing large blocks of data from disk is fine
• However, the size of a "large" block keeps growing: 100 KB was large in the 80s, now it is 10 MB, and it is likely to increase further
• Only video data gets better throughput with large disk blocks
Caching
• The effect of caching is getting diluted as more of the data is kept in RAM
• Applications like Facebook have little or no locality due to complex links between data
• Caching systems do not offer guaranteed performance; it varies with the hit and miss ratio of the cache. RAM Cloud costs more but offers guaranteed performance.
• Latency issues
  - DB queries that do not match the disk layout have to do numerous seeks (iterative searches like tree walks are very costly in disk accesses)
  - Specialized database architectures like array stores, column stores and stream processing engines have been developed to reduce latency for particular sets of queries
  - RAM Cloud can offer a generic layout because it provides low latency
• Flash-based technique
  - RAM Cloud can be built from flash, a cheaper solution than DRAM
  - However, DRAM gives higher throughput than flash, so it performs much better
  - New technologies like phase-change memory might be better than flash
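The dependence of cache performance on the hit and miss ratio can be made concrete with a small calculation. This is only a sketch: the 10 µs RAM cost and 10 ms disk cost are assumed round numbers in line with the "microseconds vs. milliseconds" figures on the earlier slide, not measurements from the presentation, and `average_access_us` is a hypothetical helper name.

```python
def average_access_us(hit_percent, ram_us, disk_us):
    """Average access time (microseconds) of a cache sitting in front of a disk."""
    miss_percent = 100 - hit_percent
    # Weighted average: hits are served from RAM, misses go to disk.
    return (hit_percent * ram_us + miss_percent * disk_us) / 100


RAM_US = 10        # assumed cost of a RAM access, in microseconds
DISK_US = 10_000   # assumed cost of a disk access (10 ms), in microseconds

for hit_percent in (90, 99, 100):
    print(hit_percent, average_access_us(hit_percent, RAM_US, DISK_US))
```

Under these assumed numbers, even a 99% hit ratio averages about 110 µs per access, roughly eleven times slower than serving everything from RAM, which is the gap the slide attributes to cache misses.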
Issues and disadvantages of RAM Cloud
• Low-latency RPC is required
  - Network switches are a bottleneck (switches introduce delays)
  - OS overhead of processing interrupts for the network stack
  - Virtualization makes it even slower
• Durability needs to be provided
  - Replicate all objects to the memories of several machines
  - Buffered logging could be one technique, where data is written to disk as a log (log-structured filesystem based recovery)
• Data model for RAM-based storage (3 aspects)
  - Nature of data objects, like blobs or fixed-size records, data structures in C++/Java
  - How the basic objects are organized into higher-level objects
    - Key-value stores do not provide any aggregation
    - RDBMS: rows are organized into tables; indices can be built on tables to speed up queries
  - Mechanism for naming and indexing
    - RDBMS: rows are identified by a value, the primary key
    - Key-value store: each object is identified by a key
• RDBMS has scaling issues: no database scales to 1000 servers. Key-value stores are highly scalable but not as feature-rich as RDBMS.
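The buffered logging idea from the durability bullets can be sketched roughly as follows. This is a toy, single-machine illustration and not RAM Cloud's actual implementation: the class name `BufferedLogStore`, the JSON-lines log format and the `buffer_limit` parameter are all assumptions made for the example.

```python
import json
import os
import tempfile


class BufferedLogStore:
    """Hypothetical in-memory store that appends writes to a disk log in batches."""

    def __init__(self, log_path, buffer_limit=3):
        self.log_path = log_path
        self.buffer_limit = buffer_limit
        self.memory = {}   # the data itself lives in RAM
        self.buffer = []   # writes not yet on disk

    def put(self, key, value):
        self.memory[key] = value
        self.buffer.append({"key": key, "value": value})
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Append the whole batch sequentially, then force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            for entry in self.buffer:
                log.write(json.dumps(entry) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.buffer.clear()

    @classmethod
    def recover(cls, log_path, buffer_limit=3):
        """Rebuild the in-memory state by replaying the log from the start."""
        store = cls(log_path, buffer_limit)
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as log:
                for line in log:
                    entry = json.loads(line)
                    store.memory[entry["key"]] = entry["value"]
        return store


log_path = os.path.join(tempfile.mkdtemp(), "ramstore.log")
store = BufferedLogStore(log_path, buffer_limit=2)
store.put("a", 1)
store.put("b", 2)   # buffer is full, both entries are written to the log
store.put("c", 3)   # still only in the buffer

recovered = BufferedLogStore.recover(log_path)
print(recovered.memory)
```

In this sketch the write of `"c"` never reaches the log, so a crash at that point would lose it. That window is why the slide pairs logging with replicating objects to the memories of several machines.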
Contd ..
• Proposed data model: an intermediate/hybrid approach where the data type is BLOB (it does not impose structure on the data but supports indexing); BLOB objects are kept in tables
• Data placement: newly created data needs to be placed on one of the servers in the cluster
  - Small tables should be kept on a single server
  - Big tables need to be evenly balanced across servers
• Addressing concurrency, transactions and consistency
  - ACID properties provided by RDBMS are not scalable
  - Bigtable does not support transactions involving more than one row
  - Dynamo does not guarantee immediate and consistent update of replicas
  - Due to reduced latency, the cost of executing a transaction won't be prohibitive, so ACID could be scalable on a RAM system?
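The data placement rule above (small tables on one server, big tables balanced across servers) could look something like the sketch below. The function `place_object`, the `small_table_rows` threshold and the use of integer object ids are assumptions for illustration; the slides do not say how RAM Cloud actually chooses servers.

```python
import zlib


def place_object(table_name, object_id, table_rows, servers, small_table_rows=1000):
    """Pick a server for one object (hypothetical placement policy)."""
    if table_rows <= small_table_rows:
        # Small table: every object goes to one server derived from the table name.
        index = zlib.crc32(table_name.encode("utf-8")) % len(servers)
    else:
        # Big table: spread objects evenly by taking the id modulo the server count.
        index = object_id % len(servers)
    return servers[index]


servers = ["server-0", "server-1", "server-2", "server-3"]

# A 50-row table stays on a single server.
small = {place_object("settings", i, 50, servers) for i in range(50)}

# A large table is balanced across all servers.
big = [place_object("posts", i, 1_000_000, servers) for i in range(4000)]
counts = {s: big.count(s) for s in servers}
print(small, counts)
```

Modulo placement is the simplest way to get an even spread with sequential ids; a real system would also have to handle servers joining or leaving, which this sketch ignores.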
HANA
• Key enablers of the SAP in-memory computing database
  - Large amount of addressable memory + growing processor cache: from 16 GB DIMMs to 32 GB DIMMs, from 24 MB to 30 MB of processor cache
  - Faster processing: clock cycles, Intel's Hyper-Threading architecture, from 8 cores to 10 cores
  - Faster interconnect between processors: Intel's QuickPath
SAP in-memory computing database
• Row-column store provides ACID guarantees
• Calculation and planning engine with data repository
• Data management services that include MDX and SQL interfaces
HANA based on H. Plattner paper
• Complex business requirements require transactional systems (OLTP)
• Analytical and financial applications require OLAP systems
• OLTP and OLAP systems are based on relational theory
• In OLTP systems tuples are arranged in rows, which are stored in blocks
• Indexing allows faster access to tuples
• Data access becomes slower as the number of tuples increases
• OLAP is organized in star schemas
• Optimization is to compress column with help o
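To illustrate why storing tuples column by column can compress better than storing them row by row, here is a toy comparison. Run-length encoding is used purely as a stand-in; the slide text is cut off before it names a compression scheme, so nothing here should be read as the method HANA uses. The sample rows are invented and deliberately chosen to have long runs within each column.

```python
from itertools import groupby

# Invented sample tuples: (country, status)
rows = [
    ("DE", "open"),
    ("DE", "open"),
    ("DE", "closed"),
    ("US", "closed"),
    ("US", "closed"),
    ("US", "closed"),
]


def rle(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Row layout: fields of one tuple are adjacent, so neighbours rarely repeat.
row_major = [field for row in rows for field in row]

# Column layout: all values of one attribute are adjacent.
columns = [list(col) for col in zip(*rows)]
col_major = [value for col in columns for value in col]

row_runs = rle(row_major)
col_runs = rle(col_major)
print(len(row_runs), len(col_runs))
```

On this sample the row layout produces 12 runs (no compression at all) while the column layout produces 4. Real ratios depend heavily on the data and on the compression scheme.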
Contd ..
• Column store is best suited for modern CPUs
• Enterprise applications are memory bound
• Vertical column compression has a better compression ratio than horizontal row compression
• Row store cannot compete with column store
• Based on “Cache Sensitive Search on B+ Tree”, which stores all the children of a node contiguously; the child nodes are accessed by storing the address of the first child and adding an offset to reach subsequent child nodes
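The child-addressing scheme in the last bullet can be sketched as below. This is a minimal illustration, assuming nodes live in one flat Python list so that a list index plays the role of a memory address; the `Node` layout and `search` function are hypothetical and cover lookup only, not insertion or node splits.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    keys: List[int]
    first_child: int = -1                 # index of the first child; -1 marks a leaf
    values: Optional[List[str]] = None    # payloads, present on leaves only


def search(nodes, key, root=0):
    """Look up a key, reaching child i of a node at first_child + i."""
    index = root
    while nodes[index].first_child != -1:
        node = nodes[index]
        # Number of separator keys <= key selects which child to descend into.
        offset = bisect_right(node.keys, key)
        index = node.first_child + offset
    leaf = nodes[index]
    if key in leaf.keys:
        return leaf.values[leaf.keys.index(key)]
    return None


# The root keeps a single child reference; its three children sit contiguously at 1..3.
nodes = [
    Node(keys=[10, 20], first_child=1),
    Node(keys=[1, 5], values=["one", "five"]),
    Node(keys=[10, 15], values=["ten", "fifteen"]),
    Node(keys=[20, 25], values=["twenty", "twenty-five"]),
]

print(search(nodes, 15), search(nodes, 20), search(nodes, 7))
```

Because each inner node stores one child reference instead of one per child, more keys fit into a node of the same size, which is the cache benefit the slide alludes to.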