Moshe Shadmon ScaleDB

Scaling MySQL in the Cloud

Shared Disk vs. Shared Nothing
[Diagram: a shared-disk cluster shown beside a shared-nothing deployment of masters and slaves]

Shared Disk Advantages
• Start small, grow incrementally
• Scalable AND highly available
• Add capacity on demand with zero downtime
• Simplicity
  • No need to partition data
  • No need for master-slave

The Virtualized Cloud Database
[Diagram: VMs on Server 1 and Server 2, each running an OSS DBMS (MySQL Server) with ScaleDB as the storage engine. Labels: Local Disk, Shared Storage, Shared Disk, Shared Nothing]

ScaleDB As the Storage Engine
[Diagram: MySQL Server, with MySQL at the database management level and the ScaleDB storage engine at the storage engine level]

ScaleDB’s Internal Architecture
[Diagram components]
• ScaleDB Node: ScaleDB API, Transaction Manager, Threads Manager, Lock Manager, Local Lock Manager, Log Manager, Index Manager, Data Manager, Buffer Manager, Local Sync Coordinator, Recovery Manager, Storage Manager
• ScaleDB Cluster Manager: Global Lock Manager, Global Sync Manager, Global Recovery Manager
• ScaleDB Storage System: Cache & Storage Devices

Deploying ScaleDB
[Diagram: three layers]
• Application Layer: the application
• Database Layer (physical or VM nodes): Node 1, Node 2, … Node N, each running a DBMS with ScaleDB, coordinated by the ScaleDB Cluster Manager
• Storage Layer: ScaleDB shared storage

The Storage Engine
• Pluggable storage engine
• Transactional storage engine
• Supports the MySQL Storage Engine API
• Reads/writes are done over the network to shared storage
• Maintains a local cache
• Local Lock Manager – manages locking at the node level
• Connector to Cluster Manager – synchronizes operations at the cluster level

The Cluster Manager
• Distributed Lock Manager – manages cluster-level locks
  • Locks can be held over any type of resource:
    • DBMS, Table, Partition, File, Block, Row, etc.
  • Supports multiple lock modes:
    • Read, Read/Write, Exclusive, etc.
  • Synchronizes state using messaging
• Local Lock Manager – manages locks at the node level
  • Maintains locks at the node level
  • Synchronizes state using shared memory
• Identifies node failures and manages recovery
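The lock modes and resource types above can be illustrated with a minimal sketch. The compatibility rule (only Read is compatible with Read) and the names `Mode`, `LockTable`, and `try_acquire` are assumptions made for illustration; the slides do not specify ScaleDB's actual compatibility matrix or API.

```python
from enum import Enum


class Mode(Enum):
    READ = "read"
    READ_WRITE = "read_write"
    EXCLUSIVE = "exclusive"


# Assumed compatibility rule: two holders may coexist only if both hold READ.
COMPATIBLE = {(Mode.READ, Mode.READ)}


class LockTable:
    """Toy lock table keyed by resource, e.g. ("table", "calls") or ("row", "calls", 42)."""

    def __init__(self):
        self._held = {}  # resource -> list of (node, mode)

    def try_acquire(self, node, resource, mode):
        holders = self._held.setdefault(resource, [])
        for other_node, other_mode in holders:
            if other_node != node and (other_mode, mode) not in COMPATIBLE:
                return False  # conflict: caller must wait or retry
        holders.append((node, mode))
        return True

    def release(self, node, resource):
        # Drop every lock this node holds on the resource.
        self._held[resource] = [
            (n, m) for n, m in self._held.get(resource, []) if n != node
        ]
```

The sketch treats every resource independently; a real lock manager would also have to relate a table lock to the rows and blocks beneath it.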

The Cluster Manager
• Distributed Lock Manager
  • Synchronizes conflicting processes between nodes in the cluster
  • Example: two nodes need to update the same resource at the same time
• The challenge:
  • Requests are made over the network, which can be expensive:
    • Internal operations may take nanoseconds; network operations take milliseconds
• The solution:
  • Requests are sent only when conflicts occur
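One way to send requests only when conflicts occur is for each node to remember the grants it already holds and contact the Cluster Manager only when it lacks one. The sketch below assumes that mechanism; the slides do not describe how ScaleDB does it, and `ClusterManagerStub`, `Node`, and `revoke` are hypothetical names.

```python
class ClusterManagerStub:
    """Stands in for the remote Distributed Lock Manager and counts round trips."""

    def __init__(self):
        self.owner = {}    # resource -> node currently holding the grant
        self.messages = 0  # number of simulated network requests

    def request(self, node, resource):
        self.messages += 1
        previous = self.owner.get(resource)
        if previous is not None and previous is not node:
            previous.revoke(resource)  # conflict: take the grant back
        self.owner[resource] = node


class Node:
    def __init__(self, name, manager):
        self.name = name
        self.manager = manager
        self.granted = set()  # grants remembered locally

    def lock(self, resource):
        # Fast path: the grant is already held, so no network message is sent.
        if resource not in self.granted:
            self.manager.request(self, resource)
            self.granted.add(resource)

    def revoke(self, resource):
        self.granted.discard(resource)
```

With this scheme a node that locks the same block a thousand times in a row costs one message; traffic is generated only when another node asks for that block.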

The Storage
• Independent storage nodes
• Accessible via the network
• Each node has a Cache Layer and a Persistent Layer
• Database nodes can force a write to disk based on transactional requirements
• Data can be distributed over multiple storage nodes
• Each Storage Node can be mirrored
• Each Storage Node may have a Hot Backup Node

The Storage Node
• Manages the data in cache and flushes to disk when required
• Supports the storage engine calls for Read, Write, etc.
• Supports calls pushed down from the storage engine, such as Count Rows, Search, etc.
• Each node is a Linux machine; no need for a Network File System (NFS)
[Diagram: Storage Node = interface to storage, LRU-based cache, disks]
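A minimal sketch of an LRU block cache that flushes to disk on eviction or on demand, as the Storage Node slide describes. The class `LRUBlockCache`, the `force` flag, and the use of a dict as the persistent layer are assumptions made for illustration, not ScaleDB's implementation.

```python
from collections import OrderedDict


class LRUBlockCache:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk             # dict standing in for the persistent layer
        self.blocks = OrderedDict()  # block_id -> (data, dirty), oldest first

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return self.blocks[block_id][0]
        data = self.disk[block_id]             # cache miss: read from disk
        self._put(block_id, data, dirty=False)
        return data

    def write(self, block_id, data, force=False):
        # force=True models a database node forcing the write to disk.
        if force:
            self.disk[block_id] = data
        self._put(block_id, data, dirty=not force)

    def _put(self, block_id, data, dirty):
        self.blocks[block_id] = (data, dirty)
        self.blocks.move_to_end(block_id)
        while len(self.blocks) > self.capacity:
            old_id, (old_data, old_dirty) = self.blocks.popitem(last=False)
            if old_dirty:
                self.disk[old_id] = old_data   # flush dirty block on eviction
```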

Scaling the Storage Tier
[Diagram: database layer (physical or VM nodes 1 to N), each node running a DBMS with ScaleDB and a local cache, coordinated by the ScaleDB Cluster Manager. The nodes reach the storage layer over TCP/UDP. Each shared storage node has its own cache, and together these caches form the global cache]

Global Cache
• Guarantees cache coherency
• Manages caching of shared data
• Minimizes access time to data that is not in the local cache and would otherwise be read from disk
• Implements fast direct memory access over high-speed interconnects for all data blocks and types
• Uses an efficient and scalable messaging protocol

HA of the Storage Tier
[Diagram: database layer (physical or VM nodes 1 to N, each running a DBMS with ScaleDB) and the ScaleDB Cluster Manager above a ScaleDB storage layer made up of shared storage, mirrored storage, and a hot backup]

Scaling the Storage Tier
[Diagram: database layer (physical or VM nodes 1 to N, each running a DBMS with ScaleDB) and the ScaleDB Cluster Manager above storage partitions 1, 2, … Q. Each partition has partitioned storage, a partitioned mirror, and a partitioned hot backup]

Scaling the Storage Tier
• Read
  • From local cache
  • From main or mirror
    • Get from cache
    • Get from storage
• Write
  • To local cache
  • At end of transaction
    • Multicast to main and mirror
    • Optional acknowledgement:
      • After receive
      • After write
[Diagram: database node N (MySQL, ScaleDB, local cache) and the ScaleDB Cluster Manager above main and mirror storage nodes, each with a cache and storage]
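The read and write paths above can be sketched as follows. The classes `StorageNode` and `DatabaseNode`, the `ack` parameter values, and the choice to always read from the main node are assumptions for illustration; in particular, the "after receive" mode here simply leaves data in the storage cache and does not model the later flush.

```python
class StorageNode:
    def __init__(self):
        self.cache = {}
        self.disk = {}

    def get(self, key):
        if key in self.cache:
            return self.cache[key], "storage-cache"
        value = self.disk[key]
        self.cache[key] = value
        return value, "storage-disk"

    def receive(self, updates, write_through):
        self.cache.update(updates)
        if write_through:             # acknowledge only after the disk write
            self.disk.update(updates)
        return True                   # acknowledgement


class DatabaseNode:
    def __init__(self, main, mirror):
        self.local_cache = {}
        self.pending = {}             # writes of the open transaction
        self.main, self.mirror = main, mirror

    def read(self, key):
        if key in self.local_cache:
            return self.local_cache[key], "local-cache"
        value, source = self.main.get(key)  # the mirror could serve this too
        self.local_cache[key] = value
        return value, source

    def write(self, key, value):
        self.local_cache[key] = value       # writes go to the local cache first
        self.pending[key] = value

    def commit(self, ack="after_write"):
        # At end of transaction, send pending writes to both main and mirror.
        write_through = ack == "after_write"
        acks = [
            node.receive(dict(self.pending), write_through)
            for node in (self.main, self.mirror)
        ]
        self.pending.clear()
        return all(acks)
```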

Traditional Query Processing
[Diagram: query "What were yesterday's sales?"]
• The DBMS server asks the storage array for the sales table
• The entire sales table is retrieved
• The DBMS server processes the table data

ScaleDB Query Processing
[Diagram: query "What were yesterday's sales?"]
• The DBMS server sends "Get October 15 sales" to each of the storage nodes

Scaling the Storage Tier
• Advantages
  • Parallel processing:
    • I/O calls are executed simultaneously on multiple Storage Nodes
  • Logic pushed to the storage layer: “SELECT customer_name FROM calls WHERE amount > 200”
    • Traditional approach – return all rows to the database
    • ScaleDB storage – return only the selected rows to the database
  • Leverages the cache on multiple storage nodes
  • Storage layer can be expanded without downtime
  • Data is mirrored
  • Support for hot backup
  • Low cost
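The example query above can be used to sketch pushdown with parallel scans: each storage node filters and projects its own partition, and only matching rows travel back. The sample rows, `storage_scan`, and `pushed_down_query` are hypothetical; threads stand in for separate storage nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up sample data: one list of rows per storage node.
PARTITIONS = [
    [{"customer_name": "Ann", "amount": 250}, {"customer_name": "Bob", "amount": 50}],
    [{"customer_name": "Cy", "amount": 900}],
    [{"customer_name": "Di", "amount": 120}],
]


def storage_scan(rows, predicate, columns):
    """Runs on a storage node: filter and project before returning anything."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


def pushed_down_query(partitions, predicate, columns):
    """Runs on the database node: scan all partitions in parallel and merge."""
    count = len(partitions)
    with ThreadPoolExecutor(max_workers=count) as pool:
        parts = pool.map(
            storage_scan, partitions, [predicate] * count, [columns] * count
        )
        return [row for part in parts for row in part]


# SELECT customer_name FROM calls WHERE amount > 200
result = pushed_down_query(
    PARTITIONS, lambda row: row["amount"] > 200, ["customer_name"]
)
```

Here two of the four rows match, so only two single-column rows are returned to the database node instead of the whole table.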

High Availability
• Failure of a node
  • Detected by the Cluster Manager
  • A surviving node is requested to undo uncommitted transactions
• Failure of the Cluster Manager
  • Detected by the Standby Cluster Manager
  • Requests all nodes to undo uncommitted transactions
• Failure of a Storage Node
  • Continue with a mirrored storage – or –
  • Use the Storage Node Log to recover
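Undoing uncommitted transactions is commonly done by walking a log backwards and restoring before-images. The sketch below assumes a simple log record format (`op`, `txn`, `key`, `before`); the slides do not describe ScaleDB's log layout, so treat the record shape and the function `undo_uncommitted` as illustrative only.

```python
def undo_uncommitted(log, data):
    """Roll back every update whose transaction never logged a commit."""
    committed = {rec["txn"] for rec in log if rec["op"] == "commit"}
    for rec in reversed(log):  # newest first, so the oldest before-image wins
        if rec["op"] == "update" and rec["txn"] not in committed:
            if rec["before"] is None:
                data.pop(rec["key"], None)  # the key did not exist before
            else:
                data[rec["key"]] = rec["before"]
    return data


# Hypothetical log left behind by a failed node: t1 committed, t2 did not.
log = [
    {"op": "update", "txn": "t1", "key": "a", "before": 1},
    {"op": "commit", "txn": "t1"},
    {"op": "update", "txn": "t2", "key": "a", "before": 2},
    {"op": "update", "txn": "t2", "key": "b", "before": None},
]
recovered = undo_uncommitted(log, {"a": 3, "b": 9})
```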

Performance / Tuning
• Contention occurs when two or more nodes want the same resource at the same time
• Types of contention:
  • Read/Read contention – never a problem because of the shared disk system
  • Read/Write contention – the reader is requested to release the block and the grant is provided to the writer
  • Write/Read or Write/Write –
    • The writer sends the block to the global cache layer
    • A buffer invalidate message is sent to the other nodes
    • The requestor receives the grant
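The three steps for Write/Write contention can be sketched as a toy protocol. The classes `GlobalCache` and `DbNode` and the method `request_write` are hypothetical, and direct method calls stand in for the messages that a real cluster would exchange.

```python
class DbNode:
    def __init__(self, global_cache):
        self.local = {}  # block_id -> data held in this node's local cache
        global_cache.nodes.append(self)


class GlobalCache:
    def __init__(self):
        self.blocks = {}  # block_id -> latest data shipped by a writer
        self.nodes = []
        self.writer = {}  # block_id -> node holding the write grant

    def request_write(self, requester, block_id):
        holder = self.writer.get(block_id)
        if holder is not None and holder is not requester:
            # Step 1: the current writer sends its block to the global cache.
            self.blocks[block_id] = holder.local.pop(block_id)
        # Step 2: buffer invalidate message to every other node.
        for node in self.nodes:
            if node is not requester:
                node.local.pop(block_id, None)
        # Step 3: the requestor receives the grant and the current block.
        self.writer[block_id] = requester
        if block_id in self.blocks:
            requester.local[block_id] = self.blocks[block_id]
```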

Performance / Tuning
• Fast network between the nodes
  • Two logical networks:
    • Between the database nodes and the Cluster Manager
    • Between the database nodes and the storage
  • Optimize socket receive buffers (256 KB – 1 MB)
• Partition requests to maintain locality of data
  • Send requests that update/query the same data to the same node
    • By database
    • By table
    • By table with PK
  • Logic can change dynamically to adapt to changes
    • Changes in data distribution
    • Changes in user behavior
    • Additional DBMS nodes
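Routing requests by database, by table, or by table with PK can be sketched with a stable hash over the chosen key. The function `route`, its `level` parameter, and the node names are hypothetical; this would sit in an application or load-balancing tier, which the slides do not detail.

```python
import hashlib


def route(nodes, database, table=None, pk=None, level="table_pk"):
    """Pick a database node so that requests for the same data land together."""
    if level == "database":
        key = (database,)
    elif level == "table":
        key = (database, table)
    else:  # "table_pk"
        key = (database, table, pk)
    # hashlib gives a hash that is stable across processes, unlike built-in
    # hash() on strings, which is randomized per interpreter run.
    digest = hashlib.sha256(repr(key).encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


nodes = ["db1", "db2", "db3"]
target = route(nodes, "shop", "calls", 42)
```

Plain modulo hashing remaps many keys when a node is added, so a scheme that adapts to additional DBMS nodes would need something more gradual, such as a lookup table or consistent hashing.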

ScaleDB: Elastic/Enterprise Database

Value Proposition
• Runs on low-cost cloud infrastructures (e.g. Amazon)
• High availability, no single point of failure
• Dramatically easier set-up and maintenance
  • No partitioning/repartitioning
  • No slave and replication headaches
  • Simplified tuning
• Scales up/down without interrupting your application
• Lower TCO
