
Cassandra and Sigmod contest

Cloud computing group. Haiping Wang, 2009-12-19.



  1. Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19

  2. Outline • Cassandra • Cassandra overview • Data model • Architecture • Read and write • Sigmod contest 2009 • Sigmod contest 2010

  3. Cassandra overview • Highly scalable, distributed • Eventually consistent • Structured key-value store • Dynamo + Bigtable • P2P • Random reads and random writes • Java

  4. Data Model • Each KEY maps to column families, which are declared upfront • ColumnFamily1: Name: MailList, Type: Simple, Sort: Name; columns tid1 to tid4, each with Name, Value: <Binary>, TimeStamp (t1 to t4) • ColumnFamily2: Name: WordList, Type: Super, Sort: Time; SuperColumns aloha and dude, each holding columns as (C, V, T) triples • ColumnFamily3: Name: System, Type: Super, Sort: Name; SuperColumns hint1 to hint4, each holding a <Column List> • Columns and SuperColumns are added and modified dynamically

  5. Cassandra Architecture

  6. Cassandra API • Data structures • Exceptions • Service API • ConsistencyLevel(4) • Retrieval methods(5) • Range query: returns matching keys(1) • Modification methods(3) • Others

  7. Cassandra commands

  8. Partitioning and replication(1) • Consistent hashing • DHT • Balance • Monotonicity • Spread • Load • Virtual nodes • Coordinator • Preference list

  9. Partitioning and replication(2) • [Figure: consistent-hashing ring over the range 0 to 1 with nodes A to F; keys are placed at h(key1) and h(key2); replication factor N=3]
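The partitioning ideas on these two slides (consistent hashing, virtual nodes, coordinator, preference list) can be sketched roughly as follows. This is an illustrative sketch, not Cassandra's partitioner: the `Ring` class, the `h` helper, the MD5-based hash and the default of 8 virtual nodes per node are all assumptions made for the example.

```python
import bisect
import hashlib


def h(value: str) -> int:
    """Map a string to a point on the ring (hypothetical hash choice)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class Ring:
    """Consistent-hash ring where each node owns several virtual nodes."""

    def __init__(self, nodes, vnodes=8):
        # Every physical node is placed on the ring `vnodes` times.
        self._points = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._hashes = [point for point, _ in self._points]

    def preference_list(self, key: str, n: int = 3):
        """Return the first n distinct nodes clockwise from h(key).

        The first entry plays the role of the coordinator.
        """
        start = bisect.bisect_right(self._hashes, h(key))
        owners = []
        for step in range(len(self._points)):
            node = self._points[(start + step) % len(self._points)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == n:
                    break
        return owners
```

Calling `Ring(list("ABCDEF")).preference_list("key1")` returns three distinct nodes for N=3. Adding a seventh node only moves keys onto that new node, which is the monotonicity property listed on the slide.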

  10. Data Versioning • Always writeable • Multiple versions • put() may return before the update reaches all replicas • get() may return many versions • Vector clocks • Reconciliation during reads by clients

  11. Vector clock • List of (node, counter) pairs • E.g. [x,2][y,3] vs. [x,3][y,4][z,1] (the first is an ancestor of the second) • [x,1][y,3] vs. [z,1][y,3] (conflicting versions) • Use timestamps, e.g. D([x,1]:t1,[y,1]:t2) • Remove the oldest entry when the clock reaches a size threshold

  12. Vector clock • Return all the objects at the leaves • D3,4([Sx,2],[Sy,1],[Sz,1]) • Single new version
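The comparison rule behind the two examples on the vector clock slide can be sketched as below. The function names `descends`, `compare` and `merge` are hypothetical, and clocks are modelled as plain dictionaries from node name to counter rather than any real Cassandra or Dynamo type.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify clock a relative to clock b."""
    if a == b:
        return "equal"
    if descends(a, b):
        return "after"
    if descends(b, a):
        return "before"
    return "concurrent"  # neither descends from the other: a conflict


def merge(a: dict, b: dict) -> dict:
    """Clock for a reconciled version: the per-node maximum of both clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}
```

With the slide's examples, `compare({"x": 2, "y": 3}, {"x": 3, "y": 4, "z": 1})` gives `"before"`, so the older version can be discarded, while `compare({"x": 1, "y": 3}, {"z": 1, "y": 3})` gives `"concurrent"`, so both versions are returned to the client for reconciliation.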

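  13. Executing operations • Two strategies for choosing a node • Route the request through a generic load balancer that picks a node based on load information • Easy: the client does not have to link any store-specific code • Send the request directly to the node • Achieves lower latency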

  14. Put() operation • [Figure: the client sends an object with its vector clock to the coordinator, which forwards it to replicas P1, P2, ..., PN-1 and waits for W-1 responses]

  15. Cluster Membership • Gossip protocol • State disseminated in O(log N) rounds • Every T seconds, each node increases its heartbeat counter and sends its membership list to another node • Merge operations
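A toy simulation of the heartbeat gossip described on this slide might look like the following. It is a sketch under simplifying assumptions: every node is modelled as a dictionary of heartbeat counters, one call to `gossip_round` stands for one period of T seconds, and the names `merge_views` and `gossip_round` are made up for the example.

```python
import random


def merge_views(local: dict, remote: dict) -> dict:
    """Merge two membership lists, keeping the highest heartbeat per node."""
    return {n: max(local.get(n, 0), remote.get(n, 0)) for n in local.keys() | remote.keys()}


def gossip_round(views: dict, rng: random.Random) -> None:
    """One round: each node bumps its own heartbeat and pushes its list to one random peer."""
    for node in list(views):
        views[node][node] += 1
        peer = rng.choice([p for p in views if p != node])
        views[peer] = merge_views(views[peer], views[node])
```

Starting from `views = {n: {n: 0} for n in range(16)}`, where each node knows only itself, repeated rounds spread every node's heartbeat to all the others; the number of rounds needed grows roughly with log N.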

  16. Failure • Data center(s) failure • Multiple data centers • Temporary failure • Permanent failure • Merkle tree

  17. Temporary failure

  18. Merkle tree
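As a rough illustration of why a Merkle tree helps when replicas must be resynchronized, the sketch below hashes a list of data blocks into a tree and then walks two trees top-down to find the blocks that differ. It assumes a power-of-two number of blocks and SHA-256 as the hash; `build` and `diff` are hypothetical helper names, not part of any Cassandra API.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build(blocks):
    """Return the tree as a list of levels: levels[0] holds leaf hashes, levels[-1] the root."""
    if not blocks or len(blocks) & (len(blocks) - 1):
        raise ValueError("this sketch needs a power-of-two number of blocks")
    level = [_h(block) for block in blocks]
    levels = [level]
    while len(level) > 1:
        # Each parent hashes the concatenation of its two children.
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff(a, b, level=None, idx=0):
    """Indices of leaf blocks that differ, descending only into mismatching subtrees."""
    if level is None:
        level = len(a) - 1  # start at the root
    if a[level][idx] == b[level][idx]:
        return []  # identical subtree: nothing below needs to be compared
    if level == 0:
        return [idx]
    return diff(a, b, level - 1, 2 * idx) + diff(a, b, level - 1, 2 * idx + 1)
```

If two replicas hold the same data, only the root hashes are compared. If one block differs, the walk touches a single path from root to leaf instead of every block.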

  19. Bloom filter • A space-efficient probabilistic data structure • Used to test whether an element is a member of a set • False positives are possible, false negatives are not
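A minimal Bloom filter, to make the properties on this slide concrete. The class name, the default of 1024 bits and 3 hash functions, and the use of salted SHA-256 are all choices made for this sketch, not details taken from Cassandra.

```python
import hashlib


class BloomFilter:
    """Bit array plus k hash functions; membership tests may give false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        # All k bits set: "possibly present". Any bit clear: "definitely absent".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

A store can consult such a filter before touching a data file on disk: a negative answer means the key is certainly not in that file, so the disk read can be skipped.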

  20. Compactions • [Figure: three sorted data files holding keys (K2, K10, K30), (K4, K5, K10) and (K1, K2, K3) are MERGE SORTed into one sorted data file K1, K2, K3, K4, K5, K10, K30; one entry is marked DELETED; the new file comes with a Bloom filter and an index file (K1 offset, K5 offset, K30 offset) that is loaded in memory]
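The merge step of a compaction can be sketched with a k-way merge over sorted runs. This sketch makes several assumptions that the slide does not state: runs are passed oldest first, the newest version of a duplicated key wins, a value of `None` marks a deleted key, and keys are unique within one run. The names `compact`, `_tag` and `TOMBSTONE` are hypothetical.

```python
import heapq

TOMBSTONE = None  # assumed marker for a deleted key


def _tag(run, age):
    """Attach the run's age so that newer runs sort first for equal keys."""
    for key, value in run:
        yield (key, -age, value)


def compact(runs):
    """Merge sorted runs (oldest first) into one sorted run without duplicates."""
    merged = heapq.merge(*(_tag(run, age) for age, run in enumerate(runs)))
    out = []
    last_key = object()  # sentinel that equals no real key
    for key, _neg_age, value in merged:
        if key == last_key:
            continue  # an older version of a key that was already handled
        last_key = key
        if value is not TOMBSTONE:
            out.append((key, value))
    return out
```

Using integer keys in place of K1 to K30, the three runs from the figure merge into a single run with keys 1, 2, 3, 4, 5, 10, 30, where keys 2 and 10 keep the value from the newer of the two files that contained them.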

  21. Write • A write for Key (CF1, CF2, CF3) is appended to the Commit Log (binary serialized, on a dedicated disk) and applied to the Memtable of each column family • A Memtable is FLUSHed based on data size, number of objects and lifetime • Data file on disk: <Key name><Size of key Data><Index of columns/supercolumns><Serialized column family> per key • Each data file has a Bloom filter and a block index of <Key Name> Offset pairs (K128 Offset, K256 Offset, K384 Offset) kept in memory
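The write path on this slide (commit log first, then memtable, then flush to a sorted data file with a sparse index) can be modelled in a few lines. Everything here is a simplification for illustration: the class `ColumnFamilyStore`, its parameters `flush_after` and `index_every`, and the in-memory lists standing in for disk files are assumptions, and only the "number of objects" flush trigger is modelled.

```python
class ColumnFamilyStore:
    """Toy write path for one column family."""

    def __init__(self, flush_after: int = 4, index_every: int = 2):
        self.flush_after = flush_after  # flush when this many keys are buffered
        self.index_every = index_every  # one index entry per this many rows
        self.commit_log = []            # append-only; stands in for the on-disk log
        self.memtable = {}              # key -> {column name: value}
        self.data_files = []            # flushed, immutable, sorted by key

    def write(self, key, columns):
        self.commit_log.append((key, dict(columns)))       # 1. durability first
        self.memtable.setdefault(key, {}).update(columns)  # 2. in-memory update
        if len(self.memtable) >= self.flush_after:
            self.flush()

    def flush(self):
        rows = sorted(self.memtable.items())
        # Sparse block index: remember the offset of every index_every-th key.
        index = [(key, offset) for offset, (key, _) in enumerate(rows)
                 if offset % self.index_every == 0]
        self.data_files.append({"rows": rows, "index": index})
        self.memtable = {}
```

Because a flush only ever writes a whole sorted file, disk writes are sequential, which is what makes random writes cheap in this design.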

  22. Read • The client sends a query to the Cassandra cluster • The closest replica (Replica A) returns the result • Replica B and Replica C receive a digest query and return digest responses • Read repair if digests differ
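The digest-based read with read repair can be sketched as follows. Replicas are modelled as dictionaries mapping a key to a `(timestamp, value)` pair, with the closest replica first; resolving a mismatch by highest timestamp and repairing before returning are assumptions of this sketch, as are the names `digest` and `read`.

```python
import hashlib


def digest(version) -> str:
    """Short fingerprint of a stored version; cheaper to ship than the data itself."""
    return hashlib.md5(repr(version).encode()).hexdigest()


def read(key, replicas):
    """Read from the closest replica and repair stale replicas if digests differ."""
    closest, others = replicas[0], replicas[1:]
    result = closest.get(key)
    if all(digest(r.get(key)) == digest(result) for r in others):
        return result  # every digest matches: the replicas agree
    # Digests differ: fetch the full versions and keep the newest one.
    versions = [r[key] for r in replicas if key in r]
    newest = max(versions, key=lambda version: version[0])
    for r in replicas:
        r[key] = newest  # read repair: bring every replica up to date
    return newest
```

In the common case only one replica sends the full value and the others send small digests, so the extra consistency check adds little network traffic.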

  23. Outline • Cassandra • Cassandra overview • Data model • Architecture • Read and write • Sigmod contest 2009 • Sigmod contest 2010

  24. Sigmod contest 2009 • Task overview • API • Data structure • Architecture • Test

  25. Task overview • Index system for main memory data • Running on multi-core machine • Many threads with multiple indices • Serialize execution of user-specified transactions • Basic functions: exact-match queries, range queries, updates, inserts, deletes

  26. API

  27. Record

  28. HashTable

  29. HashShared

  30. TxnState

  31. IdxState • Keeps track of an index • Created by openIndex() • Destroyed by closeIndex() • Inherited by IdxStateType • Contains pointers to • a hashtable • a FixedAllocator • an Allocator • an array with the type of action

  32. Architecture

  33. IndexManager

  34. DeadLockDetector

  35. Transactor • a HashOnlyGet object with type TxnState

  36. Allocator • Allocates the memory for the payloads • Uses pools and linked lists • Pools are sized by payload length; the max length of a payload is 100 • Payloads with the same length are in the same list

  37. Unit Tests • Three threads run over three indices • The primary thread • creates the primary index • inserts, deletes and accesses data in the primary index • The second thread • simultaneously runs some basic tests over a separate index • The third thread • ensures the transactional guarantees • continuously queries the primary index

  38. Outline • Cassandra • Cassandra overview • Data model • Architecture • Read and write • Sigmod contest 2009 • Sigmod contest 2010

  39. Task overview • Implement a simple distributed query executor with the help of the in-memory index • Given centralized query plans, translate them into distributed query plans • Given a parsed SQL query, return the right results • Data stored on disk, the indexes are all in memory • Measure the total time costs

  40. SQL query form • SELECT alias_name.field_name, ... FROM table_name AS alias_name, ... WHERE condition1 AND ... AND conditionN • Conditions • alias_name.field_name = fixed value • alias_name.field_name > fixed value • alias_name.field_name1 = alias_name.field_name2
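To show what evaluating this query form involves, here is a naive single-node evaluator that supports exactly the three condition shapes on the slide. It is only a sketch: the already-parsed representation (`select`, `frm`, `where`), the function name `run_query` and the nested-loop strategy are assumptions for illustration, and a contest entry would use the in-memory indexes instead of scanning every row combination.

```python
from itertools import product


def run_query(tables, select, frm, where):
    """Evaluate a parsed SELECT ... FROM ... WHERE c1 AND ... AND cN.

    tables: table name -> list of row dicts
    frm:    alias -> table name
    select: list of (alias, field) pairs
    where:  list of (left, op, right); left is (alias, field), op is "=" or ">",
            right is (alias, field) for a join or a plain constant (not a tuple)
    """
    aliases = list(frm)
    results = []
    # Nested-loop join: try every combination of one row per alias.
    for combo in product(*(tables[frm[alias]] for alias in aliases)):
        row = dict(zip(aliases, combo))

        def value(term):
            return row[term[0]][term[1]] if isinstance(term, tuple) else term

        if all(value(l) == value(r) if op == "=" else value(l) > value(r)
               for l, op, r in where):
            results.append(tuple(row[alias][field] for alias, field in select))
    return results
```

For example, a join with one "greater than" condition is expressed as `where=[(("p", "city_id"), "=", ("c", "id")), (("p", "age"), ">", 30)]`.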

  41. Initialization phase

  42. Connection phase

  43. Query phase

  44. Closing phase

  45. Tests • An initial computation • On synthetic and real-world datasets • Tested on a single machine • Tested on an ad-hoc cluster of peers • Teams that pass a collection of unit tests are provided with an Amazon Web Services account worth 100 USD

  46. Benchmarks (stage 1) • Assume a partition always covers the entire table and the data is not replicated • Unit tests • Benchmarks • On a single node, selects with an equal condition on the primary key • On a single node, selects with an equal condition on an indexed field • On a single node, 2 to 5 joins on tables of different size • On a single node, 1 join and a "greater than" condition on an indexed field • On three nodes, one join on two tables of different size, the two tables being on two different nodes

  47. Benchmarks (stage 2) • Tables are now stored on multiple nodes • Part of a table, or the whole table, may be replicated on multiple nodes • Queries will be sent in parallel over up to 50 simultaneous connections • Benchmarks • Selects with an equal condition on the primary key, the values being uniformly distributed • Selects with an equal condition on the primary key, the values being non-uniformly distributed • Multiple joins on tables separated on different nodes
