Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19
Outline • Cassandra • Cassandra overview • Data model • Architecture • Read and write • Sigmod contest 2009 • Sigmod contest 2010
Cassandra overview • Highly scalable, distributed • Eventually consistent • Structured key-value store • Dynamo + Bigtable • P2P • Random reads and random writes • Written in Java
Data Model • Column families are declared upfront; columns are added and modified dynamically • ColumnFamily1 (Name: MailList, Type: Simple, Sort: Name): columns such as tid1..tid4, each with a binary value and a timestamp • ColumnFamily2 (Name: WordList, Type: Super, Sort: Time): SuperColumns such as "aloha" and "dude" are added and modified dynamically, as are the columns inside them • ColumnFamily3 (Name: System, Type: Super, Sort: Name): SuperColumns hint1..hint4, each holding a column list
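The slide's data model can be pictured as nested sorted maps. The sketch below is illustrative only (it is not Cassandra source code, and the class names are invented for this example): a row key maps to declared column families, each holding dynamically added columns sorted by name, with last-write-wins on timestamps.

```java
import java.util.*;

// Illustrative sketch of the data model: row key -> column family ->
// column name -> (binary value, timestamp). Not Cassandra's actual code.
class DataModelSketch {
    static class Column {
        final byte[] value;
        final long timestamp;
        Column(byte[] value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    // A simple column family: columns kept sorted by name.
    static class ColumnFamily {
        final SortedMap<String, Column> columns = new TreeMap<>();
        // Columns are added and modified dynamically; newest timestamp wins.
        void insert(String name, byte[] value, long ts) {
            Column old = columns.get(name);
            if (old == null || old.timestamp < ts) columns.put(name, new Column(value, ts));
        }
    }

    // Rows: key -> column families (declared upfront in the real system).
    final Map<String, Map<String, ColumnFamily>> rows = new HashMap<>();

    void insert(String key, String cf, String col, byte[] value, long ts) {
        rows.computeIfAbsent(key, k -> new HashMap<>())
            .computeIfAbsent(cf, c -> new ColumnFamily())
            .insert(col, value, ts);
    }

    Column get(String key, String cf, String col) {
        Map<String, ColumnFamily> r = rows.get(key);
        if (r == null) return null;
        ColumnFamily f = r.get(cf);
        return f == null ? null : f.columns.get(col);
    }
}
```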
Cassandra API • Data structures • Exceptions • Service API • ConsistencyLevel(4) • Retrieval methods(5) • Range query: returns matching keys(1) • Modification methods(3) • Others
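The shape of the service API can be sketched as a Java interface. This is a placeholder mirror of the categories listed above, not the actual Thrift-generated API; the method names and the particular four consistency levels shown are assumptions for illustration.

```java
// Illustrative sketch of the service API shape: four consistency levels,
// retrieval methods, a range query returning matching keys, and
// modification methods. Names are placeholders, not the real Thrift API.
interface CassandraLikeService {
    enum ConsistencyLevel { ZERO, ONE, QUORUM, ALL } // four levels (assumed names)

    // Retrieval: fetch a column's value at the given consistency level.
    byte[] get(String key, String columnPath, ConsistencyLevel cl);

    // Range query: returns the keys falling inside [start, end], up to limit.
    java.util.List<String> getKeyRange(String start, String end, int limit);

    // Modification: insert (or overwrite) and remove, both timestamped.
    void insert(String key, String columnPath, byte[] value, long timestamp,
                ConsistencyLevel cl);
    void remove(String key, String columnPath, long timestamp, ConsistencyLevel cl);
}
```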
Partitioning and replication(1) • Consistent hashing • DHT • Balance • Monotonicity • Spread • Load • Virtual nodes • Coordinator • Preference list
Partitioning and replication (2) • (Ring diagram: nodes A..F placed on the hash ring; h(key1) and h(key2) map keys onto the ring, and each key is stored on the next N=3 nodes clockwise.)
Data Versioning • Always writeable • Multiple versions of each object • put() may return before all replicas are updated • get() may see many versions • Vector clocks capture causality • Reconciliation during reads, done by clients
Vector clock • A list of (node, counter) pairs, e.g. [x,2][y,3] vs. [x,3][y,4][z,1] (the second descends from the first); [x,1][y,3] vs. [z,1][y,3] (concurrent) • Each entry also carries a timestamp, e.g. D([x,1]:t1, [y,1]:t2) • When the list reaches a threshold, the entry with the oldest timestamp is removed
Vector clock • A get() returns all the objects at the leaves of the version tree, e.g. divergent versions D3([Sx,2],[Sy,1]) and D4([Sx,2],[Sz,1]) • The client reconciles them into a single new version
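The comparison rule behind "descends" and "concurrent" can be sketched directly. A minimal sketch assuming a (node, counter) map; real systems also attach the timestamps mentioned above for truncation:

```java
import java.util.*;

// Vector clock as a (node -> counter) map; sketch of the comparison rule.
class VectorClock {
    final Map<String, Integer> counters = new HashMap<>();

    void increment(String node) { counters.merge(node, 1, Integer::sum); }

    // True if this clock dominates or equals the other:
    // every counter in `other` is <= the matching counter here.
    boolean descends(VectorClock other) {
        for (Map.Entry<String, Integer> e : other.counters.entrySet())
            if (counters.getOrDefault(e.getKey(), 0) < e.getValue()) return false;
        return true;
    }

    // Concurrent: neither descends from the other, so the versions
    // diverged and must be reconciled during a read.
    boolean concurrentWith(VectorClock other) {
        return !descends(other) && !other.descends(this);
    }
}
```

With the slide's examples: [x,3][y,4][z,1] descends from [x,2][y,3], so the older version can be dropped; [x,1][y,3] and [z,1][y,3] are concurrent, so both are returned to the client.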
Execution of operations • Two routing strategies • Through a generic load balancer that selects a node based on load: easy, since clients do not have to link any system-specific code • Directly to the coordinator node, via a partition-aware client: achieves lower latency
Put() operation • (Diagram: the client sends put() with the object and its vector clock to the coordinator, which writes locally and forwards the write to the other replicas P1..PN-1, answering the client once W-1 responses arrive.)
Cluster Membership • Gossip protocol • State is disseminated in O(log N) rounds • Every T seconds, each node increments its heartbeat counter and sends its membership list to another node • Lists are merged on receipt
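The per-round behaviour can be sketched as two operations: bump your own heartbeat, and merge a received list by keeping the larger counter per node. A minimal sketch (class and field names are invented for this example):

```java
import java.util.*;

// Gossip sketch: every T seconds a node increments its own heartbeat and
// sends its whole state list to one random peer; the receiver merges by
// keeping the maximum counter seen for each node.
class GossipState {
    final Map<String, Long> heartbeats = new HashMap<>();
    final String self;

    GossipState(String self) { this.self = self; heartbeats.put(self, 0L); }

    // Called once per gossip round: increment our own heartbeat counter.
    void tick() { heartbeats.merge(self, 1L, Long::sum); }

    // Merge a received state list: take the larger heartbeat per node.
    void merge(Map<String, Long> remote) {
        for (Map.Entry<String, Long> e : remote.entrySet())
            heartbeats.merge(e.getKey(), e.getValue(), Math::max);
    }
}
```

Because each round doubles (roughly) the number of nodes that have seen an update, the state spreads in O(log N) rounds, as the slide states.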
Failure handling • Data center failure: replicate across multiple data centers • Temporary failure: hinted handoff • Permanent failure: replica synchronization using Merkle trees
Bloom filter • A space-efficient probabilistic data structure • Used to test whether an element is a member of a set • May report false positives, but never false negatives
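A minimal sketch of the idea: k hash probes set bits on add, and a lookup answers "maybe" only if all k bits are set (the probe function here is a toy, not the hash family Cassandra uses):

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash probes into a bit array.
// Can return false positives, never false negatives.
class BloomFilter {
    private final BitSet bits;
    private final int size, k;

    BloomFilter(int size, int k) { this.bits = new BitSet(size); this.size = size; this.k = k; }

    // Toy i-th hash function; real filters use an independent hash family.
    private int probe(String key, int i) {
        int h = key.hashCode() * 31 + i * 0x9e3779b9;
        return Math.floorMod(h, size);
    }

    void add(String key) {
        for (int i = 0; i < k; i++) bits.set(probe(key, i));
    }

    // "Maybe present" only if every probed bit is set.
    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) if (!bits.get(probe(key, i))) return false;
        return true;
    }
}
```

In Cassandra's read path (next slide), the per-file Bloom filter lets a read skip data files that definitely do not contain the key, avoiding disk seeks.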
Compactions • (Diagram: several sorted data files, e.g. {K1,K2,K3}, {K4,K5,K10}, {K2,K10,K30}, are combined by merge sort into one sorted data file; deleted keys are dropped, and a new index file, a block index (K1/K5/K30 offsets, loaded in memory) and a Bloom filter are built for the merged file.)
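The core of the merge can be sketched in a few lines: apply the sorted runs oldest-first so the newest value per key wins, then purge tombstones (here a null value stands in for the "DELETED" marker; this is a sketch, not Cassandra's compaction code):

```java
import java.util.*;

// Compaction sketch: merge sorted runs so the newest value per key wins,
// then drop tombstoned (deleted) keys. Output stays sorted for the index.
class Compaction {
    static final String TOMBSTONE = null; // stand-in for the DELETED marker

    static SortedMap<String, String> compact(List<SortedMap<String, String>> runsOldestFirst) {
        SortedMap<String, String> merged = new TreeMap<>();
        for (SortedMap<String, String> run : runsOldestFirst)
            merged.putAll(run);                        // newer runs overwrite older entries
        merged.values().removeIf(Objects::isNull);     // purge deleted keys
        return merged;                                 // TreeMap keeps keys sorted
    }
}
```

A real implementation streams the runs with a k-way merge instead of materializing them, but the ordering and overwrite rules are the same.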
Write • A write for a key (column families CF1, CF2, CF3) is first appended, binary-serialized, to the commit log on a dedicated disk, then applied to the per-column-family memtables • When a memtable exceeds a threshold (data size, number of objects, lifetime), it is flushed to a data file on disk: records of the form <key name><size of key data><index of columns/supercolumns><serialized column family>, plus a block index (<key name> : offset, kept in memory, e.g. K128/K256/K384) and a Bloom filter
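The commit-log-then-memtable ordering above can be sketched as follows (a toy model with string values and an invented flush threshold, not Cassandra's storage engine):

```java
import java.util.*;

// Write path sketch: append to the commit log first (durability), then
// apply to the in-memory memtable; flush the memtable to an immutable
// sorted data file once it reaches a size threshold.
class WritePath {
    final List<String> commitLog = new ArrayList<>();            // sequential, append-only
    final SortedMap<String, String> memtable = new TreeMap<>();  // sorted by key
    final List<SortedMap<String, String>> dataFiles = new ArrayList<>();
    final int flushThreshold;

    WritePath(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void write(String key, String value) {
        commitLog.add(key + "=" + value);      // log before applying
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) flush();
    }

    void flush() {
        dataFiles.add(new TreeMap<>(memtable)); // snapshot as a sorted file
        memtable.clear();
    }
}
```

Writing the log sequentially on a dedicated disk is what makes writes fast: the random-access work is deferred to flushes and compactions.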
Read • The client sends a query to the Cassandra cluster • The closest replica answers the full data query, while the other replicas (e.g. B and C) answer digest queries • If the digests differ, read repair brings the stale replicas up to date before/after returning the result
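The digest-comparison step can be sketched like this. A simplification with invented names: replicas are plain maps, the digest is a hash of the value, and the closest replica's value is treated as authoritative (a real system compares timestamps to pick the newest):

```java
import java.util.*;

// Read sketch: full data from the closest replica, digests from the rest;
// any replica whose digest differs is repaired with the chosen value.
class ReadRepair {
    static String read(List<Map<String, String>> replicas, String key) {
        String value = replicas.get(0).get(key);     // closest replica: full result
        int digest = Objects.hashCode(value);        // stand-in for a real digest
        for (int i = 1; i < replicas.size(); i++) {  // others answer digest queries
            String other = replicas.get(i).get(key);
            if (Objects.hashCode(other) != digest)
                replicas.get(i).put(key, value);     // read repair the stale replica
        }
        return value;
    }
}
```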
Outline • Cassandra • Cassandra overview • Data model • Architecture • Read and write • Sigmod contest 2009 • Sigmod contest 2010
Sigmod contest 2009 • Task overview • API • Data structure • Architecture • Test
Task overview • An index system for main-memory data • Running on a multi-core machine • Many threads operating on multiple indices • Serialize execution of user-specified transactions • Basic functions: exact-match queries, range queries, updates, inserts, deletes
IdxState • Keeps track of an index • Created by openIndex(), destroyed by closeIndex() • Inherited by IdxStateType, which contains pointers to • a hashtable • a FixedAllocator • an Allocator • an array with the type of each action
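The layout described above can be pictured with a small sketch. This is purely illustrative: the contest code was not Java, and the placeholder fields below only mirror the pointers the slide lists.

```java
// Illustrative sketch of the per-index state: a hashtable, two allocators,
// and an array of action types, created/destroyed by open/close.
class IdxStateType {
    java.util.HashMap<String, byte[]> hashtable = new java.util.HashMap<>();
    Object fixedAllocator;  // placeholder for the FixedAllocator
    Object allocator;       // placeholder for the Allocator
    int[] actions = new int[0]; // type of each pending action

    static IdxStateType openIndex() { return new IdxStateType(); } // creates the state
    void closeIndex() { hashtable = null; actions = null; }        // destroys it
}
```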
Transactor • a HashOnlyGet object of type TxnState
Allocator • Allocates the memory for the payloads • Uses pools and linked lists • Pools are sized per payload length; the maximum payload length is 100 • Payloads of the same length are kept in the same list
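The pooling scheme can be sketched as one free list per payload length: freeing a payload pushes it onto its length's list, and a later allocation of that length pops it instead of allocating fresh memory. A sketch under those assumptions (names invented, and Java arrays stand in for the raw buffers the real allocator managed):

```java
import java.util.*;

// Pool-based allocator sketch: one free list per payload length (0..100);
// freed payloads of the same length land on the same list for reuse.
class PooledAllocator {
    static final int MAX_PAYLOAD = 100;
    private final List<Deque<byte[]>> pools = new ArrayList<>();

    PooledAllocator() {
        for (int i = 0; i <= MAX_PAYLOAD; i++) pools.add(new ArrayDeque<>());
    }

    byte[] allocate(int length) {
        Deque<byte[]> pool = pools.get(length);
        return pool.isEmpty() ? new byte[length] : pool.pop(); // reuse if possible
    }

    void free(byte[] payload) {
        pools.get(payload.length).push(payload); // same length -> same list
    }
}
```

Bounding the payload length at 100 is what makes a flat array of pools practical: lookup is O(1) and there is no size-class rounding.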
Unit Tests • Three threads run over three indices • The primary thread • creates the primary index • inserts, deletes and accesses data in the primary index • The second thread • simultaneously runs some basic tests over a separate index • The third thread • continuously queries the primary index • ensures the transactional guarantees
Outline • Cassandra • Cassandra overview • Data model • Architecture • Read and write • Sigmod contest 2009 • Sigmod contest 2010
Task overview • Implement a simple distributed query executor on top of the in-memory index • Given centralized query plans, translate them into distributed query plans • Given a parsed SQL query, return the correct results • Data is stored on disk; the indexes are all in memory • Scoring measures the total time cost
SQL query form • SELECT alias_name.field_name, ... FROM table_name AS alias_name, ... WHERE condition1 AND ... AND conditionN • Each condition is one of • alias_name.field_name = fixed value • alias_name.field_name > fixed value • alias_name.field_name1 = alias_name.field_name2
Tests • An initial computation • On synthetic and real-world datasets • Tested on a single machine • Tested on an ad-hoc cluster of peers • Teams that passed a collection of unit tests were provided an Amazon Web Services account worth 100 USD
Benchmarks (stage 1) • Assumes a partition always covers the entire table and the data is not replicated • Unit tests • Benchmarks • On a single node, selects with an equality condition on the primary key • On a single node, selects with an equality condition on an indexed field • On a single node, 2 to 5 joins on tables of different sizes • On a single node, 1 join and a "greater than" condition on an indexed field • On three nodes, one join on two tables of different sizes, the two tables being on two different nodes
Benchmarks (stage 2) • Tables are now stored on multiple nodes • Part of a table, or the whole table, may be replicated on multiple nodes • Queries are sent in parallel, with up to 50 simultaneous connections • Benchmarks • Selects with an equality condition on the primary key, the values being uniformly distributed • Selects with an equality condition on the primary key, the values being non-uniformly distributed • Multiple joins on tables spread across different nodes