Scalable, Consistent, and Elastic Database Systems for Cloud Platforms
Sudipto Das
Computer Science, UC Santa Barbara
sudipto@cs.ucsb.edu
Web replacing Desktop
Paradigm shift in Infrastructure
Cloud computing
• Computing infrastructure and solutions delivered as a service
• An industry worth USD 150 billion by 2014*
• Contributors to success
  • Economies of scale
  • Elasticity and pay-per-use pricing
• Popular paradigms
  • Infrastructure as a Service (IaaS)
  • Platform as a Service (PaaS)
  • Software as a Service (SaaS)
*http://www.crn.com/news/channel-programs/225700984/cloud-computing-services-market-to-near-150-billion-in-2014.htm
Databases for cloud platforms
• Data is central to applications
• DBMSs are a mission-critical component of the cloud software stack
  • Manage petabytes of data, drive revenue
  • Serve a variety of applications (multitenancy)
• Data needs for cloud applications
  • OLTP systems: store and serve data
  • Data analysis systems: decision support, intelligence
Application landscape
• Social gaming
• Rich content and mash-ups
• Managed applications
• Cloud application platforms
Challenges for OLTP systems
• Scalability
  • While ensuring efficient transaction execution!
• Lightweight elasticity
  • Scale on-demand!
• Self-manageability
  • Intelligence without a human controller!
Two approaches to scalability
• Scale-up
  • Preferred in the classical enterprise setting (RDBMS)
  • Flexible ACID transactions
  • Transactions access a single node
• Scale-out
  • Cloud friendly (key-value stores)
  • Execution at a single server
  • Limited functionality and guarantees
  • No multi-row or multi-step transactions
Why care about transactions?

    confirm_friend_request(user1, user2) {
      begin_transaction();
      update_friend_list(user1, user2, status.confirmed);
      update_friend_list(user2, user1, status.confirmed);
      end_transaction();
    }

Simplicity in application design with ACID transactions
    confirm_friend_request_A(user1, user2) {
      try {
        update_friend_list(user1, user2, status.confirmed);
      } catch (exception e) {
        report_error(e);
        return;
      }
      try {
        update_friend_list(user2, user1, status.confirmed);
      } catch (exception e) {
        revert_friend_list(user1, user2);  // undo the first update by hand
        report_error(e);
        return;
      }
    }

    confirm_friend_request_B(user1, user2) {
      try {
        update_friend_list(user1, user2, status.confirmed);
      } catch (exception e) {
        report_error(e);
        add_to_retry_queue(operation.updatefriendlist, user1, user2, current_time());
      }
      try {
        update_friend_list(user2, user1, status.confirmed);
      } catch (exception e) {
        report_error(e);
        add_to_retry_queue(operation.updatefriendlist, user2, user1, current_time());
      }
    }

Application logic gets too complicated with reduced consistency guarantees
Challenge: Transactions at Scale
[Figure: key-value stores provide scale-out, RDBMSs provide ACID transactions; the challenge is a system that provides both.]
Challenge: Lightweight Elasticity
• Provision on-demand and not for peak; optimize operating cost!
[Figure: capacity vs. demand over time. Traditional infrastructures provision for peak capacity, leaving unused resources; deployment in the cloud lets capacity track demand. Slide credits: Berkeley RAD Lab]
Challenge: Self-Manageability
• Managing a large distributed system
  • Detecting failures and recovering
  • Coordination and synchronization
  • Provisioning
  • Capacity planning
  • …
  • “A large distributed system is a Zoo”
• Cloud platforms are inherently multitenant
  • Balance conflicting goals
  • Minimize operating cost while ensuring good performance
Contributions for OLTP systems
• Transactions at scale
  • ElasTraS [HotCloud 2009, UCSB TR 2010]
  • G-Store [SoCC 2010]
• Lightweight elasticity
  • Albatross [VLDB 2011]
  • Zephyr [SIGMOD 2011]
• Self-manageability
  • Pythia [in progress]
Contributions
[Figure: a map of contributions under Data Management.]
• Analytics: Ricardo [SIGMOD ’10]; Anonimos [ICDE ’10, TKDE]
• Novel architectures: Hyder [CIDR ’11, Best Paper]; MD-HBase [MDM ’11, Best Paper Runner-up]; CoTS [ICDE ’09, VLDB ’09]; TCAM [DaMoN ’08]
• Transaction processing (this talk)
  • Static partitioning: ElasTraS [HotCloud ’09, TR ’10]
  • Dynamic partitioning: G-Store [SoCC ’10]
  • Albatross [VLDB ’11], Zephyr [SIGMOD ’11]
  • Pythia [in progress]
Transactions at Scale
[Figure repeated: key-value stores provide scale-out, RDBMSs provide ACID transactions; the goal is to provide both.]
Scale-out with static partitioning
• Table-level partitioning (range, hash)
  • Distributed transactions
• Partitioning the database schema
  • Co-locate data items accessed together
  • Goal: minimize distributed transactions
• Systems that scale out with static partitioning (a minimal routing sketch follows below)
  • ElasTraS [HotCloud 2009, TR 2010]
  • Cloud SQL Server [ICDE 2011]
  • Megastore [CIDR 2011]
  • Relational Cloud [CIDR 2011]
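To make the static-partitioning idea concrete, here is a minimal sketch in Python of hash partitioning with schema co-location. The two-table schema (users and orders, both keyed by user_id) and the node names are hypothetical illustrations, not taken from any of the systems above.

    NODES = ["node-0", "node-1", "node-2", "node-3"]

    def node_for(user_id):
        # All tables in the partition group (users, orders) are keyed by
        # user_id, so the rows a single-user transaction touches co-locate
        # on exactly one node.
        return NODES[hash(user_id) % len(NODES)]

    # A single-user transaction is local to one node; a transaction that
    # touches two users (e.g., confirm_friend_request) may still span nodes,
    # which is exactly the case static partitioning cannot eliminate:
    print(node_for(42), node_for(1042))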
Dynamically formed partitions
• Access patterns change, often rapidly
  • Online multi-player gaming applications
  • Collaboration-based applications
  • Scientific computing applications
• Not amenable to static partitioning
• How to get the benefit of partitioning when accesses do not statically partition?
  • Ours is the first solution to allow that
Online Multi-player Games
• Execute transactions on player profiles while the game is in progress
• Partitions/groups are dynamic
• Hundreds of thousands of concurrent groups
[Figure: player profiles grouped dynamically as game instances form.]
Data Fusion for dynamic partitions [G-Store, SoCC 2010]
• Transactional access to a group of data items formed on-demand
• Challenge: avoid distributed transactions!
• The Key Group abstraction
  • Groups are small
  • Groups execute a non-trivial number of transactions
  • Groups are dynamic and on-demand
  • Groups are dynamically formed tenant databases
Transactions on Groups, without distributed transactions
• Grouping protocol: ownership of all keys in a Key Group at a single node
  • One key selected as the leader
  • Followers transfer ownership of their keys to the leader
Why is group formation hard?
• Guarantee the contract between leaders and followers in the presence of:
  • Leader and follower failures
  • Lost, duplicated, or re-ordered messages
  • Dynamics of the underlying system
• How to ensure efficient and ACID execution of transactions?
Grouping protocol
• Conceptually akin to “locking”; locks are held by groups
• Log entries record each protocol step
[Figure: timeline of group creation and deletion. On a create request the leader logs L(Creating) and sends join requests (J) to the followers; a follower logs L(Joining), replies with a join-ack (JA) transferring ownership, and logs L(Joined) once the leader acknowledges (JAA). Deletion mirrors this with delete (D) and delete-ack (DA) messages and the L(Deleting)/L(Deleted) entries, returning followers to L(Free).]
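One way to read the figure is as a write-ahead-logged state machine on each follower. The sketch below (Python) borrows the message and state names from the figure, but the code structure is my own illustration, not G-Store's implementation; logging before acting is what lets a node recover its position in the protocol, and ignoring unknown (state, message) pairs is what tolerates duplicated or re-ordered messages.

    from enum import Enum

    class FState(Enum):
        FREE = "Free"; JOINING = "Joining"; JOINED = "Joined"; DELETED = "Deleted"

    # Transition table: (current state, incoming message) -> next state.
    TRANSITIONS = {
        (FState.FREE,    "J"):   FState.JOINING,   # leader's join request
        (FState.JOINING, "JAA"): FState.JOINED,    # leader acknowledged our JA
        (FState.JOINED,  "D"):   FState.DELETED,   # delete request: keys return
    }

    def on_message(state, msg, log):
        nxt = TRANSITIONS.get((state, msg))
        if nxt is None:
            return state                   # duplicate/re-ordered message: ignore
        log.append("L(%s)" % nxt.value)    # log the transition before acting on it
        return nxt

    log, s = [], FState.FREE
    for m in ["J", "J", "JAA", "D"]:       # the duplicated join request is ignored
        s = on_message(s, m, log)
    print(s, log)   # FState.DELETED ['L(Joining)', 'L(Joined)', 'L(Deleted)']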
Efficient transaction processing
• How does the leader execute transactions?
  • Caches data for the group members; the underlying data store is treated as the equivalent of a disk
  • Transaction logging for durability
  • Cache asynchronously flushed to propagate updates
  • Guaranteed update propagation
[Figure: the leader runs a transaction manager with a log and a cache manager; updates propagate asynchronously to the followers.]
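A minimal sketch of the leader's execution path, assuming a hypothetical key-value store client with get/put methods; the class and method names are illustrative, not G-Store's API.

    class GroupLeader:
        def __init__(self, kv, log):
            self.kv, self.log = kv, log     # kv: underlying store, used like a disk
            self.cache, self.dirty = {}, set()

        def read(self, key):
            if key not in self.cache:       # read-through on a cache miss
                self.cache[key] = self.kv.get(key)
            return self.cache[key]

        def write(self, txn_id, key, value):
            self.log.append((txn_id, key, value))  # log first: durability
            self.cache[key] = value                # then apply to the cache
            self.dirty.add(key)

        def flush(self):
            # Runs asynchronously in the background: propagates committed
            # updates to the store, so update propagation is guaranteed
            # without putting the store on the commit path.
            while self.dirty:
                key = self.dirty.pop()
                self.kv.put(key, self.cache[key])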
Prototype: G-Store [SoCC 2010], an implementation over key-value stores
• A grouping middleware layer resident on top of a key-value store
• Application clients get transactional multi-key access
[Figure: each node runs a grouping layer with a transaction manager above the key-value store logic, over distributed storage; together these form G-Store.]
G-Store Evaluation
• Implemented using HBase
  • Added the middleware layer: ~10,000 LOC
• Experiments in Amazon EC2
  • Benchmark: an online multi-player game
  • Cluster size: 10 nodes
  • Data size: ~1 billion rows (>1 TB)
• For groups with 100 keys
  • Group creation latency: ~10–100 ms
  • More than 10,000 groups concurrently created
G-Store Evaluation
[Graphs: group creation latency and group creation throughput.]
Lightweight Elasticity
• Provision on-demand and not for peak; optimize operating cost!
[Figure repeated: capacity vs. demand over time in traditional infrastructures vs. cloud deployments. Slide credits: Berkeley RAD Lab]
Elasticity in the Database tier
[Figure: a load balancer in front of the application/web/caching tier, which sits above the database tier.]
Live database migration
• Migrate a database partition (or tenant) in a live system
  • Optimize operating cost
  • Resource orchestration in multitenant systems
• Different from
  • Migration between software versions
  • Migration in case of schema evolution
VM migration for DB elasticity
• One DB partition per VM
  • Pros: allows fine-grained load balancing
  • Cons: performance overhead; poor consolidation ratio [Curino et al., CIDR 2011]
• Multiple DB partitions in a VM
  • Pros: good performance
  • Cons: must migrate all partitions together, so load balancing is coarse-grained
Live database migration
• Multiple partitions share the same database process
  • Shared process multitenancy
• Migrate individual partitions on-demand in a live system
  • Virtualization in the database tier
• Straightforward solution
  • Stop serving the partition at the source
  • Copy to the destination
  • Start serving at the destination
  • Expensive!
Migration cost measures (a sketch of how they could be computed follows below)
• Service unavailability
  • Time the partition is unavailable
• Number of failed requests
  • Number of operations failing / transactions aborting
• Performance overhead
  • Impact on response times
• Additional data transferred
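As a rough illustration, the first three measures could be computed from a client-side request trace like this; the trace format (timestamp, status, latency in ms) is hypothetical and not from the papers.

    def migration_cost(trace, baseline_ms):
        # trace: list of (timestamp_sec, status, latency_ms), status "ok"/"failed"
        failed = sum(1 for _, status, _ in trace if status == "failed")
        ok = [(t, lat) for t, status, lat in trace if status == "ok"]
        # service unavailability: longest gap between successful requests
        times = sorted(t for t, _ in ok)
        unavail = max((b - a for a, b in zip(times, times[1:])), default=0.0)
        # performance overhead: mean latency inflation over the baseline
        lats = [lat for _, lat in ok]
        overhead = (sum(lats) / len(lats) - baseline_ms) if lats else 0.0
        return {"failed_requests": failed,
                "unavailability_s": unavail,
                "latency_overhead_ms": overhead}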
Two common DBMS architectures
• Decoupled storage architectures
  • ElasTraS, G-Store, Deuteronomy, MegaStore
  • Persistent data is not migrated
  • Albatross [VLDB 2011]
• Shared-nothing architectures
  • SQL Azure, Relational Cloud, MySQL Cluster
  • Persistent data must be migrated
  • Zephyr [SIGMOD 2011]
Why is live DB migration hard?
• Persistent data must be migrated (GBs)
  • How to ensure no downtime?
• Nodes can fail during migration
  • How to guarantee correctness during failures?
  • Transaction atomicity and durability
  • Recover migration state after failure
• Transactions execute during migration
  • How to guarantee serializability?
  • Transaction correctness equivalent to normal operation
Our approach: Zephyr [SIGMOD 2011]
• Migration executed in phases
  • Starts with the transfer of minimal information to the destination (the “wireframe”)
• Database pages used as the granule of migration
  • Unique page ownership
• Source and destination concurrently execute transactions in one migration phase
  • Minimal transaction synchronization
• Guaranteed serializability
  • Logging and handshaking protocols
Simplifying assumptions
• For this talk
  • Transactions access a single partition
  • No replication
  • No structural changes to indices
• Extensions in the paper [SIGMOD 2011] relax these assumptions
Design overview
[Figure: the source owns all pages P1…Pn and runs the active transactions TS1,…,TSk; the destination owns nothing yet. Legend: pages owned by a node vs. pages not owned by a node.]
Init mode
• Freeze the indices and migrate the wireframe
[Figure: the source still owns pages P1…Pn and runs transactions TS1,…,TSk; the destination now holds un-owned entries for the same pages.]
What is an index wireframe?
• The internal routing structure of the index, transferred without the data pages it points to; after Init, source and destination hold matching wireframes (a sketch follows below)
[Figure: matching index wireframes at the source and the destination.]
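A sketch of the wireframe idea for a B+-tree index: internal nodes (the routing structure) are copied to the destination up front, while leaf pages become un-owned placeholders to be filled in later. The class and field names are illustrative, not Zephyr's actual data structures.

    class LeafPage:
        def __init__(self, pid, rows):
            self.pid, self.rows = pid, rows
            self.owned = True                 # set to False in a wireframe copy

    class InternalNode:
        def __init__(self, keys, children):
            self.keys, self.children = keys, children   # routing info only

    def wireframe(node):
        # Copy the internal (routing) structure; leaves become un-owned
        # placeholders holding no data, to be pulled or pushed later.
        if isinstance(node, LeafPage):
            copy = LeafPage(node.pid, rows=None)
            copy.owned = False
            return copy
        return InternalNode(list(node.keys),
                            [wireframe(c) for c in node.children])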
Dual mode
• Old, still-active transactions (TSk+1,…,TSl) run at the source; new transactions (TD1,…,TDm) run at the destination
• When a destination transaction TDi accesses a page such as P3, the page is pulled from the source
• Requests for un-owned pages can block
• Index wireframes remain frozen
[Figure: ownership of individual pages moves from source to destination on demand.]
Finish mode
• Transactions at the source have completed; new transactions (TDm+1,…,TDn) run at the destination
• The source pushes the remaining pages (P1, P2, …) to the destination
• Pages can still be pulled by the destination, if needed
Normal operation
• The destination owns all pages P1…Pn and runs transactions TDn+1,…,TDp
• Index wireframe un-frozen
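Putting the modes together, destination-side page access in dual mode might look like the following sketch; the pull_page RPC and the owned flag are illustrative stand-ins for Zephyr's ownership machinery, not its actual interface.

    def access_page(local_pages, source, pid):
        page = local_pages.get(pid)
        if page is not None and page.owned:
            return page                    # fast path: page already owned here
        # Un-owned page: the request blocks while the page is pulled from the
        # source, which relinquishes ownership as part of the transfer, so
        # each page moves exactly once and has exactly one owner.
        page = source.pull_page(pid)
        page.owned = True
        local_pages[pid] = page
        return page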
Artifacts of this design
• Once migrated, pages are never pulled back by the source
  • Transactions at the source that access migrated pages are aborted
• No structural changes to indices during migration
  • Transactions (at either node) that would make structural changes to indices are aborted
• The destination “pulls” pages on-demand
  • Transactions at the destination experience higher latency than during normal operation
Serializability
• Only “dual mode” is of concern
  • In Init and Finish modes, only one node executes transactions
• Local predicate locking of the internal index plus exclusive page ownership ⇒ no phantoms
• Strict 2PL ⇒ transactions are locally serializable
• Pages are transferred only once
  • No T_dest → T_source conflict dependency
• Together: guaranteed serializability
Recovery
• Transaction recovery
  • For every database page, T_src → T_dst: transactions are replayed in conflict order (a sketch follows below)
• Migration recovery
  • Atomic transitions between migration modes
  • Developed logging and handshake protocols
  • Every page has exactly one owner
  • Bookkeeping at the index level
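Because each page moves exactly once, conflict-order replay reduces to a per-page rule: replay all source log records for a page before any destination records for it. A minimal sketch, with an illustrative log-record format (a dict with a "page" field) that is not Zephyr's actual log layout:

    def replay(source_log, dest_log, apply):
        by_page = {}
        for rec in source_log + dest_log:       # source records listed first, so
            by_page.setdefault(rec["page"], []).append(rec)
        for recs in by_page.values():
            for rec in recs:                    # per page: T_src before T_dst
                apply(rec)                      # redo the logged update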