Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19
Outline • Cassandra • Cassandra overview • Data model • Architecture • Read and write • Sigmod contest 2009 • Sigmod contest 2010
Cassandra overview • Highly scalable, distributed • Eventually consistent • Structured key-value store • Dynamo + Bigtable • P2P • Random reads and random writes • Written in Java
Data Model • Column families are declared upfront; columns are added and modified dynamically • ColumnFamily1 (Name: MailList, Type: Simple, Sort: Name): columns such as tid1..tid4, each with a binary value and a timestamp • ColumnFamily2 (Name: WordList, Type: Super, Sort: Time): SuperColumns such as "aloha" and "dude" are added and modified dynamically, as are the columns inside them • ColumnFamily3 (Name: System, Type: Super, Sort: Name): SuperColumns hint1..hint4, each holding a column list
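The slide's data model can be pictured as nested sorted maps. The sketch below is illustrative only (it is not Cassandra source code, and the class names are invented for this example): a row key maps to declared column families, each holding dynamically added columns sorted by name, with last-write-wins on timestamps.

```java
import java.util.*;

// Illustrative sketch of the data model: row key -> column family ->
// column name -> (binary value, timestamp). Not Cassandra's actual code.
class DataModelSketch {
    static class Column {
        final byte[] value;
        final long timestamp;
        Column(byte[] value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    // A simple column family: columns kept sorted by name.
    static class ColumnFamily {
        final SortedMap<String, Column> columns = new TreeMap<>();
        // Columns are added and modified dynamically; newest timestamp wins.
        void insert(String name, byte[] value, long ts) {
            Column old = columns.get(name);
            if (old == null || old.timestamp < ts) columns.put(name, new Column(value, ts));
        }
    }

    // Rows: key -> column families (declared upfront in the real system).
    final Map<String, Map<String, ColumnFamily>> rows = new HashMap<>();

    void insert(String key, String cf, String col, byte[] value, long ts) {
        rows.computeIfAbsent(key, k -> new HashMap<>())
            .computeIfAbsent(cf, c -> new ColumnFamily())
            .insert(col, value, ts);
    }

    Column get(String key, String cf, String col) {
        Map<String, ColumnFamily> r = rows.get(key);
        if (r == null) return null;
        ColumnFamily f = r.get(cf);
        return f == null ? null : f.columns.get(col);
    }
}
```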
Cassandra API • Data structures • Exceptions • Service API • ConsistencyLevel(4) • Retrieval methods(5) • Range query: returns matching keys(1) • Modification methods(3) • Others
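The shape of the service API can be sketched as a Java interface. This is a placeholder mirror of the categories listed above, not the actual Thrift-generated API; the method names and the particular four consistency levels shown are assumptions for illustration.

```java
// Illustrative sketch of the service API shape: four consistency levels,
// retrieval methods, a range query returning matching keys, and
// modification methods. Names are placeholders, not the real Thrift API.
interface CassandraLikeService {
    enum ConsistencyLevel { ZERO, ONE, QUORUM, ALL } // four levels (assumed names)

    // Retrieval: fetch a column's value at the given consistency level.
    byte[] get(String key, String columnPath, ConsistencyLevel cl);

    // Range query: returns the keys falling inside [start, end], up to limit.
    java.util.List<String> getKeyRange(String start, String end, int limit);

    // Modification: insert (or overwrite) and remove, both timestamped.
    void insert(String key, String columnPath, byte[] value, long timestamp,
                ConsistencyLevel cl);
    void remove(String key, String columnPath, long timestamp, ConsistencyLevel cl);
}
```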
Partitioning and replication(1) • Consistent hashing • DHT • Balance • Monotonicity • Spread • Load • Virtual nodes • Coordinator • Preference list
Partitioning and replication (2) • (Ring diagram: nodes A..F placed on the hash ring; h(key1) and h(key2) map keys onto the ring, and each key is stored on the next N=3 nodes clockwise.)
Data Versioning • Always writeable • Multiple versions of each object • put() may return before all replicas are updated • get() may see many versions • Vector clocks capture causality • Reconciliation during reads, done by clients
Vector clock • A list of (node, counter) pairs, e.g. [x,2][y,3] vs. [x,3][y,4][z,1] (the second descends from the first); [x,1][y,3] vs. [z,1][y,3] (concurrent) • Each entry also carries a timestamp, e.g. D([x,1]:t1, [y,1]:t2) • When the list reaches a threshold, the entry with the oldest timestamp is removed
Vector clock • A get() returns all the objects at the leaves of the version tree, e.g. divergent versions D3([Sx,2],[Sy,1]) and D4([Sx,2],[Sz,1]) • The client reconciles them into a single new version
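The comparison rule behind "descends" and "concurrent" can be sketched directly. A minimal sketch assuming a (node, counter) map; real systems also attach the timestamps mentioned above for truncation:

```java
import java.util.*;

// Vector clock as a (node -> counter) map; sketch of the comparison rule.
class VectorClock {
    final Map<String, Integer> counters = new HashMap<>();

    void increment(String node) { counters.merge(node, 1, Integer::sum); }

    // True if this clock dominates or equals the other:
    // every counter in `other` is <= the matching counter here.
    boolean descends(VectorClock other) {
        for (Map.Entry<String, Integer> e : other.counters.entrySet())
            if (counters.getOrDefault(e.getKey(), 0) < e.getValue()) return false;
        return true;
    }

    // Concurrent: neither descends from the other, so the versions
    // diverged and must be reconciled during a read.
    boolean concurrentWith(VectorClock other) {
        return !descends(other) && !other.descends(this);
    }
}
```

With the slide's examples: [x,3][y,4][z,1] descends from [x,2][y,3], so the older version can be dropped; [x,1][y,3] and [z,1][y,3] are concurrent, so both are returned to the client.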
Execution of operations • Two routing strategies • Through a generic load balancer that selects a node based on load: easy, since clients do not have to link any system-specific code • Directly to the coordinator node, via a partition-aware client: achieves lower latency
Put() operation • (Diagram: the client sends put() with the object and its vector clock to the coordinator, which writes locally and forwards the write to the other replicas P1..PN-1, answering the client once W-1 responses arrive.)
Cluster Membership • Gossip protocol • State is disseminated in O(log N) rounds • Every T seconds, each node increments its heartbeat counter and sends its membership list to another node • Lists are merged on receipt
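The per-round behaviour can be sketched as two operations: bump your own heartbeat, and merge a received list by keeping the larger counter per node. A minimal sketch (class and field names are invented for this example):

```java
import java.util.*;

// Gossip sketch: every T seconds a node increments its own heartbeat and
// sends its whole state list to one random peer; the receiver merges by
// keeping the maximum counter seen for each node.
class GossipState {
    final Map<String, Long> heartbeats = new HashMap<>();
    final String self;

    GossipState(String self) { this.self = self; heartbeats.put(self, 0L); }

    // Called once per gossip round: increment our own heartbeat counter.
    void tick() { heartbeats.merge(self, 1L, Long::sum); }

    // Merge a received state list: take the larger heartbeat per node.
    void merge(Map<String, Long> remote) {
        for (Map.Entry<String, Long> e : remote.entrySet())
            heartbeats.merge(e.getKey(), e.getValue(), Math::max);
    }
}
```

Because each round doubles (roughly) the number of nodes that have seen an update, the state spreads in O(log N) rounds, as the slide states.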
Failure handling • Data center failure: replicate across multiple data centers • Temporary failure: hinted handoff • Permanent failure: replica synchronization using Merkle trees
Bloom filter • A space-efficient probabilistic data structure • Used to test whether an element is a member of a set • May report false positives, but never false negatives
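A minimal sketch of the idea: k hash probes set bits on add, and a lookup answers "maybe" only if all k bits are set (the probe function here is a toy, not the hash family Cassandra uses):

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash probes into a bit array.
// Can return false positives, never false negatives.
class BloomFilter {
    private final BitSet bits;
    private final int size, k;

    BloomFilter(int size, int k) { this.bits = new BitSet(size); this.size = size; this.k = k; }

    // Toy i-th hash function; real filters use an independent hash family.
    private int probe(String key, int i) {
        int h = key.hashCode() * 31 + i * 0x9e3779b9;
        return Math.floorMod(h, size);
    }

    void add(String key) {
        for (int i = 0; i < k; i++) bits.set(probe(key, i));
    }

    // "Maybe present" only if every probed bit is set.
    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) if (!bits.get(probe(key, i))) return false;
        return true;
    }
}
```

In Cassandra's read path (next slide), the per-file Bloom filter lets a read skip data files that definitely do not contain the key, avoiding disk seeks.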
Compactions • (Diagram: several sorted data files, e.g. {K1,K2,K3}, {K4,K5,K10}, {K2,K10,K30}, are combined by merge sort into one sorted data file; deleted keys are dropped, and a new index file, a block index (K1/K5/K30 offsets, loaded in memory) and a Bloom filter are built for the merged file.)
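The core of the merge can be sketched in a few lines: apply the sorted runs oldest-first so the newest value per key wins, then purge tombstones (here a null value stands in for the "DELETED" marker; this is a sketch, not Cassandra's compaction code):

```java
import java.util.*;

// Compaction sketch: merge sorted runs so the newest value per key wins,
// then drop tombstoned (deleted) keys. Output stays sorted for the index.
class Compaction {
    static final String TOMBSTONE = null; // stand-in for the DELETED marker

    static SortedMap<String, String> compact(List<SortedMap<String, String>> runsOldestFirst) {
        SortedMap<String, String> merged = new TreeMap<>();
        for (SortedMap<String, String> run : runsOldestFirst)
            merged.putAll(run);                        // newer runs overwrite older entries
        merged.values().removeIf(Objects::isNull);     // purge deleted keys
        return merged;                                 // TreeMap keeps keys sorted
    }
}
```

A real implementation streams the runs with a k-way merge instead of materializing them, but the ordering and overwrite rules are the same.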
Write • A write for a key (column families CF1, CF2, CF3) is first appended, binary-serialized, to the commit log on a dedicated disk, then applied to the per-column-family memtables • When a memtable exceeds a threshold (data size, number of objects, lifetime), it is flushed to a data file on disk: records of the form <key name><size of key data><index of columns/supercolumns><serialized column family>, plus a block index (<key name> : offset, kept in memory, e.g. K128/K256/K384) and a Bloom filter
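The commit-log-then-memtable ordering above can be sketched as follows (a toy model with string values and an invented flush threshold, not Cassandra's storage engine):

```java
import java.util.*;

// Write path sketch: append to the commit log first (durability), then
// apply to the in-memory memtable; flush the memtable to an immutable
// sorted data file once it reaches a size threshold.
class WritePath {
    final List<String> commitLog = new ArrayList<>();            // sequential, append-only
    final SortedMap<String, String> memtable = new TreeMap<>();  // sorted by key
    final List<SortedMap<String, String>> dataFiles = new ArrayList<>();
    final int flushThreshold;

    WritePath(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void write(String key, String value) {
        commitLog.add(key + "=" + value);      // log before applying
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) flush();
    }

    void flush() {
        dataFiles.add(new TreeMap<>(memtable)); // snapshot as a sorted file
        memtable.clear();
    }
}
```

Writing the log sequentially on a dedicated disk is what makes writes fast: the random-access work is deferred to flushes and compactions.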
Read • The client sends a query to the Cassandra cluster • The closest replica answers the full data query, while the other replicas (e.g. B and C) answer digest queries • If the digests differ, read repair brings the stale replicas up to date before/after returning the result
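The digest-comparison step can be sketched like this. A simplification with invented names: replicas are plain maps, the digest is a hash of the value, and the closest replica's value is treated as authoritative (a real system compares timestamps to pick the newest):

```java
import java.util.*;

// Read sketch: full data from the closest replica, digests from the rest;
// any replica whose digest differs is repaired with the chosen value.
class ReadRepair {
    static String read(List<Map<String, String>> replicas, String key) {
        String value = replicas.get(0).get(key);     // closest replica: full result
        int digest = Objects.hashCode(value);        // stand-in for a real digest
        for (int i = 1; i < replicas.size(); i++) {  // others answer digest queries
            String other = replicas.get(i).get(key);
            if (Objects.hashCode(other) != digest)
                replicas.get(i).put(key, value);     // read repair the stale replica
        }
        return value;
    }
}
```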
Outline • Cassandra • Cassandra overview • Data model • Architecture • Read and write • Sigmod contest 2009 • Sigmod contest 2010
Sigmod contest 2009 • Task overview • API • Data structure • Architecture • Test
Task overview • An index system for main-memory data • Running on a multi-core machine • Many threads operating on multiple indices • Serialize execution of user-specified transactions • Basic functions: exact-match queries, range queries, updates, inserts, deletes
IdxState • Keeps track of an index • Created by openIndex(), destroyed by closeIndex() • Inherited by IdxStateType, which contains pointers to • a hashtable • a FixedAllocator • an Allocator • an array with the type of each action
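The layout described above can be pictured with a small sketch. This is purely illustrative: the contest code was not Java, and the placeholder fields below only mirror the pointers the slide lists.

```java
// Illustrative sketch of the per-index state: a hashtable, two allocators,
// and an array of action types, created/destroyed by open/close.
class IdxStateType {
    java.util.HashMap<String, byte[]> hashtable = new java.util.HashMap<>();
    Object fixedAllocator;  // placeholder for the FixedAllocator
    Object allocator;       // placeholder for the Allocator
    int[] actions = new int[0]; // type of each pending action

    static IdxStateType openIndex() { return new IdxStateType(); } // creates the state
    void closeIndex() { hashtable = null; actions = null; }        // destroys it
}
```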
Transactor • a HashOnlyGet object of type TxnState
Allocator • Allocates the memory for the payloads • Uses pools and linked lists • Pools are sized per payload length; the maximum payload length is 100 • Payloads of the same length are kept in the same list
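The pooling scheme can be sketched as one free list per payload length: freeing a payload pushes it onto its length's list, and a later allocation of that length pops it instead of allocating fresh memory. A sketch under those assumptions (names invented, and Java arrays stand in for the raw buffers the real allocator managed):

```java
import java.util.*;

// Pool-based allocator sketch: one free list per payload length (0..100);
// freed payloads of the same length land on the same list for reuse.
class PooledAllocator {
    static final int MAX_PAYLOAD = 100;
    private final List<Deque<byte[]>> pools = new ArrayList<>();

    PooledAllocator() {
        for (int i = 0; i <= MAX_PAYLOAD; i++) pools.add(new ArrayDeque<>());
    }

    byte[] allocate(int length) {
        Deque<byte[]> pool = pools.get(length);
        return pool.isEmpty() ? new byte[length] : pool.pop(); // reuse if possible
    }

    void free(byte[] payload) {
        pools.get(payload.length).push(payload); // same length -> same list
    }
}
```

Bounding the payload length at 100 is what makes a flat array of pools practical: lookup is O(1) and there is no size-class rounding.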
Unit Tests • Three threads run over three indices • The primary thread • creates the primary index • inserts, deletes and accesses data in the primary index • The second thread • simultaneously runs some basic tests over a separate index • The third thread • continuously queries the primary index • ensures the transactional guarantees
Outline • Cassandra • Cassandra overview • Data model • Architecture • Read and write • Sigmod contest 2009 • Sigmod contest 2010
Task overview • Implement a simple distributed query executor on top of the in-memory index • Given centralized query plans, translate them into distributed query plans • Given a parsed SQL query, return the correct results • Data is stored on disk; the indexes are all in memory • Scoring measures the total time cost
SQL query form • SELECT alias_name.field_name, ... FROM table_name AS alias_name, ... WHERE condition1 AND ... AND conditionN • Each condition is one of • alias_name.field_name = fixed value • alias_name.field_name > fixed value • alias_name.field_name1 = alias_name.field_name2
Tests • An initial computation • On synthetic and real-world datasets • Tested on a single machine • Tested on an ad-hoc cluster of peers • Teams that passed a collection of unit tests were provided an Amazon Web Services account worth 100 USD
Benchmarks (stage 1) • Assumes a partition always covers the entire table and the data is not replicated • Unit tests • Benchmarks • On a single node, selects with an equality condition on the primary key • On a single node, selects with an equality condition on an indexed field • On a single node, 2 to 5 joins on tables of different sizes • On a single node, 1 join and a "greater than" condition on an indexed field • On three nodes, one join on two tables of different sizes, the two tables being on two different nodes
Benchmarks (stage 2) • Tables are now stored on multiple nodes • Part of a table, or the whole table, may be replicated on multiple nodes • Queries are sent in parallel, with up to 50 simultaneous connections • Benchmarks • Selects with an equality condition on the primary key, the values being uniformly distributed • Selects with an equality condition on the primary key, the values being non-uniformly distributed • Multiple joins on tables spread across different nodes