Scalable, Consistent, and Elastic Database Systems for Cloud Platforms
Sudipto Das
Computer Science, UC Santa Barbara
sudipto@cs.ucsb.edu
Web replacing Desktop
Paradigm shift in Infrastructure
Cloud computing
• Computing infrastructure and solutions delivered as a service
• An industry worth USD 150 billion by 2014*
• Contributors to success
  • Economies of scale
  • Elasticity and pay-per-use pricing
• Popular paradigms
  • Infrastructure as a Service (IaaS)
  • Platform as a Service (PaaS)
  • Software as a Service (SaaS)
*http://www.crn.com/news/channel-programs/225700984/cloud-computing-services-market-to-near-150-billion-in-2014.htm
Databases for cloud platforms
• Data is central to applications
• DBMSs are a mission-critical component of the cloud software stack
  • Manage petabytes of data, drive revenue
  • Serve a variety of applications (multitenancy)
• Data needs for cloud applications
  • OLTP systems: store and serve data
  • Data analysis systems: decision support, intelligence
Application landscape
• Social gaming
• Rich content and mash-ups
• Managed applications
• Cloud application platforms
Challenges for OLTP systems
• Scalability
  • While ensuring efficient transaction execution!
• Lightweight elasticity
  • Scale on-demand!
• Self-manageability
  • Intelligence without a human controller!
Two approaches to scalability
• Scale-up
  • Preferred in the classical enterprise setting (RDBMS)
  • Flexible ACID transactions
  • Transactions access a single node
• Scale-out
  • Cloud friendly (key-value stores)
  • Execution at a single server
  • Limited functionality and guarantees
  • No multi-row or multi-step transactions
Why care about transactions?

    confirm_friend_request(user1, user2) {
      begin_transaction();
      update_friend_list(user1, user2, status.confirmed);
      update_friend_list(user2, user1, status.confirmed);
      end_transaction();
    }

Simplicity in application design with ACID transactions
    confirm_friend_request_A(user1, user2) {
      try {
        update_friend_list(user1, user2, status.confirmed);
      } catch (exception e) {
        report_error(e);
        return;
      }
      try {
        update_friend_list(user2, user1, status.confirmed);
      } catch (exception e) {
        revert_friend_list(user1, user2);  // undo the first update by hand
        report_error(e);
        return;
      }
    }

    confirm_friend_request_B(user1, user2) {
      try {
        update_friend_list(user1, user2, status.confirmed);
      } catch (exception e) {
        report_error(e);
        add_to_retry_queue(operation.updatefriendlist, user1, user2, current_time());
      }
      try {
        update_friend_list(user2, user1, status.confirmed);
      } catch (exception e) {
        report_error(e);
        add_to_retry_queue(operation.updatefriendlist, user2, user1, current_time());
      }
    }

Application logic gets too complicated with reduced consistency guarantees
Challenge: Transactions at Scale
[Figure: key-value stores provide scale-out, RDBMSs provide ACID transactions; the challenge is a system that provides both.]
Challenge: Lightweight Elasticity
• Provision on-demand and not for peak; optimize operating cost!
[Figure: capacity vs. demand over time. Traditional infrastructures provision for peak capacity, leaving unused resources; deployment in the cloud lets capacity track demand. Slide credits: Berkeley RAD Lab]
Challenge: Self-Manageability
• Managing a large distributed system
  • Detecting failures and recovering
  • Coordination and synchronization
  • Provisioning
  • Capacity planning
  • …
  • “A large distributed system is a Zoo”
• Cloud platforms are inherently multitenant
  • Balance conflicting goals
  • Minimize operating cost while ensuring good performance
Contributions for OLTP systems
• Transactions at scale
  • ElasTraS [HotCloud 2009, UCSB TR 2010]
  • G-Store [SoCC 2010]
• Lightweight elasticity
  • Albatross [VLDB 2011]
  • Zephyr [SIGMOD 2011]
• Self-manageability
  • Pythia [in progress]
Contributions
[Figure: a map of contributions under Data Management.]
• Analytics: Ricardo [SIGMOD ’10]; Anonimos [ICDE ’10, TKDE]
• Novel architectures: Hyder [CIDR ’11, Best Paper]; MD-HBase [MDM ’11, Best Paper Runner-up]; CoTS [ICDE ’09, VLDB ’09]; TCAM [DaMoN ’08]
• Transaction processing (this talk)
  • Static partitioning: ElasTraS [HotCloud ’09, TR ’10]
  • Dynamic partitioning: G-Store [SoCC ’10]
  • Albatross [VLDB ’11], Zephyr [SIGMOD ’11]
  • Pythia [in progress]
Transactions at Scale
[Figure repeated: key-value stores provide scale-out, RDBMSs provide ACID transactions; the goal is to provide both.]
Scale-out with static partitioning
• Table-level partitioning (range, hash)
  • Distributed transactions
• Partitioning the database schema
  • Co-locate data items accessed together
  • Goal: minimize distributed transactions
• Systems that scale out with static partitioning (a minimal routing sketch follows below)
  • ElasTraS [HotCloud 2009, TR 2010]
  • Cloud SQL Server [ICDE 2011]
  • Megastore [CIDR 2011]
  • Relational Cloud [CIDR 2011]
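To make the static-partitioning idea concrete, here is a minimal sketch in Python of hash partitioning with schema co-location. The two-table schema (users and orders, both keyed by user_id) and the node names are hypothetical illustrations, not taken from any of the systems above.

    NODES = ["node-0", "node-1", "node-2", "node-3"]

    def node_for(user_id):
        # All tables in the partition group (users, orders) are keyed by
        # user_id, so the rows a single-user transaction touches co-locate
        # on exactly one node.
        return NODES[hash(user_id) % len(NODES)]

    # A single-user transaction is local to one node; a transaction that
    # touches two users (e.g., confirm_friend_request) may still span nodes,
    # which is exactly the case static partitioning cannot eliminate:
    print(node_for(42), node_for(1042))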
Dynamically formed partitions
• Access patterns change, often rapidly
  • Online multi-player gaming applications
  • Collaboration-based applications
  • Scientific computing applications
• Not amenable to static partitioning
• How to get the benefit of partitioning when accesses do not statically partition?
  • Ours is the first solution to allow that
Online Multi-player Games
• Execute transactions on player profiles while the game is in progress
• Partitions/groups are dynamic
• Hundreds of thousands of concurrent groups
[Figure: player profiles grouped dynamically as game instances form.]
Data Fusion for dynamic partitions [G-Store, SoCC 2010]
• Transactional access to a group of data items formed on-demand
• Challenge: avoid distributed transactions!
• The Key Group abstraction
  • Groups are small
  • Groups execute a non-trivial number of transactions
  • Groups are dynamic and on-demand
  • Groups are dynamically formed tenant databases
Transactions on Groups, without distributed transactions
• Grouping protocol: ownership of all keys in a Key Group at a single node
  • One key selected as the leader
  • Followers transfer ownership of their keys to the leader
Why is group formation hard?
• Guarantee the contract between leaders and followers in the presence of:
  • Leader and follower failures
  • Lost, duplicated, or re-ordered messages
  • Dynamics of the underlying system
• How to ensure efficient and ACID execution of transactions?
Grouping protocol
• Conceptually akin to “locking”; locks are held by groups
• Log entries record each protocol step
[Figure: timeline of group creation and deletion. On a create request the leader logs L(Creating) and sends join requests (J) to the followers; a follower logs L(Joining), replies with a join-ack (JA) transferring ownership, and logs L(Joined) once the leader acknowledges (JAA). Deletion mirrors this with delete (D) and delete-ack (DA) messages and the L(Deleting)/L(Deleted) entries, returning followers to L(Free).]
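One way to read the figure is as a write-ahead-logged state machine on each follower. The sketch below (Python) borrows the message and state names from the figure, but the code structure is my own illustration, not G-Store's implementation; logging before acting is what lets a node recover its position in the protocol, and ignoring unknown (state, message) pairs is what tolerates duplicated or re-ordered messages.

    from enum import Enum

    class FState(Enum):
        FREE = "Free"; JOINING = "Joining"; JOINED = "Joined"; DELETED = "Deleted"

    # Transition table: (current state, incoming message) -> next state.
    TRANSITIONS = {
        (FState.FREE,    "J"):   FState.JOINING,   # leader's join request
        (FState.JOINING, "JAA"): FState.JOINED,    # leader acknowledged our JA
        (FState.JOINED,  "D"):   FState.DELETED,   # delete request: keys return
    }

    def on_message(state, msg, log):
        nxt = TRANSITIONS.get((state, msg))
        if nxt is None:
            return state                   # duplicate/re-ordered message: ignore
        log.append("L(%s)" % nxt.value)    # log the transition before acting on it
        return nxt

    log, s = [], FState.FREE
    for m in ["J", "J", "JAA", "D"]:       # the duplicated join request is ignored
        s = on_message(s, m, log)
    print(s, log)   # FState.DELETED ['L(Joining)', 'L(Joined)', 'L(Deleted)']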
Efficient transaction processing
• How does the leader execute transactions?
  • Caches data for the group members; the underlying data store is treated as the equivalent of a disk
  • Transaction logging for durability
  • Cache asynchronously flushed to propagate updates
  • Guaranteed update propagation
[Figure: the leader runs a transaction manager with a log and a cache manager; updates propagate asynchronously to the followers.]
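A minimal sketch of the leader's execution path, assuming a hypothetical key-value store client with get/put methods; the class and method names are illustrative, not G-Store's API.

    class GroupLeader:
        def __init__(self, kv, log):
            self.kv, self.log = kv, log     # kv: underlying store, used like a disk
            self.cache, self.dirty = {}, set()

        def read(self, key):
            if key not in self.cache:       # read-through on a cache miss
                self.cache[key] = self.kv.get(key)
            return self.cache[key]

        def write(self, txn_id, key, value):
            self.log.append((txn_id, key, value))  # log first: durability
            self.cache[key] = value                # then apply to the cache
            self.dirty.add(key)

        def flush(self):
            # Runs asynchronously in the background: propagates committed
            # updates to the store, so update propagation is guaranteed
            # without putting the store on the commit path.
            while self.dirty:
                key = self.dirty.pop()
                self.kv.put(key, self.cache[key])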
Prototype: G-Store [SoCC 2010], an implementation over key-value stores
• A grouping middleware layer resident on top of a key-value store
• Application clients get transactional multi-key access
[Figure: each node runs a grouping layer with a transaction manager above the key-value store logic, over distributed storage; together these form G-Store.]
G-Store Evaluation
• Implemented using HBase
  • Added the middleware layer: ~10,000 LOC
• Experiments in Amazon EC2
  • Benchmark: an online multi-player game
  • Cluster size: 10 nodes
  • Data size: ~1 billion rows (>1 TB)
• For groups with 100 keys
  • Group creation latency: ~10–100 ms
  • More than 10,000 groups concurrently created
G-Store Evaluation
[Graphs: group creation latency and group creation throughput.]
Lightweight Elasticity
• Provision on-demand and not for peak; optimize operating cost!
[Figure repeated: capacity vs. demand over time in traditional infrastructures vs. cloud deployments. Slide credits: Berkeley RAD Lab]
Elasticity in the Database tier
[Figure: a load balancer in front of the application/web/caching tier, which sits above the database tier.]
Live database migration
• Migrate a database partition (or tenant) in a live system
  • Optimize operating cost
  • Resource orchestration in multitenant systems
• Different from
  • Migration between software versions
  • Migration in case of schema evolution
VM migration for DB elasticity
• One DB partition per VM
  • Pros: allows fine-grained load balancing
  • Cons: performance overhead; poor consolidation ratio [Curino et al., CIDR 2011]
• Multiple DB partitions in a VM
  • Pros: good performance
  • Cons: must migrate all partitions together, so load balancing is coarse-grained
Live database migration
• Multiple partitions share the same database process
  • Shared process multitenancy
• Migrate individual partitions on-demand in a live system
  • Virtualization in the database tier
• Straightforward solution
  • Stop serving the partition at the source
  • Copy to the destination
  • Start serving at the destination
  • Expensive!
Migration cost measures (a sketch of how they could be computed follows below)
• Service unavailability
  • Time the partition is unavailable
• Number of failed requests
  • Number of operations failing / transactions aborting
• Performance overhead
  • Impact on response times
• Additional data transferred
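As a rough illustration, the first three measures could be computed from a client-side request trace like this; the trace format (timestamp, status, latency in ms) is hypothetical and not from the papers.

    def migration_cost(trace, baseline_ms):
        # trace: list of (timestamp_sec, status, latency_ms), status "ok"/"failed"
        failed = sum(1 for _, status, _ in trace if status == "failed")
        ok = [(t, lat) for t, status, lat in trace if status == "ok"]
        # service unavailability: longest gap between successful requests
        times = sorted(t for t, _ in ok)
        unavail = max((b - a for a, b in zip(times, times[1:])), default=0.0)
        # performance overhead: mean latency inflation over the baseline
        lats = [lat for _, lat in ok]
        overhead = (sum(lats) / len(lats) - baseline_ms) if lats else 0.0
        return {"failed_requests": failed,
                "unavailability_s": unavail,
                "latency_overhead_ms": overhead}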
Two common DBMS architectures
• Decoupled storage architectures
  • ElasTraS, G-Store, Deuteronomy, MegaStore
  • Persistent data is not migrated
  • Albatross [VLDB 2011]
• Shared-nothing architectures
  • SQL Azure, Relational Cloud, MySQL Cluster
  • Persistent data must be migrated
  • Zephyr [SIGMOD 2011]
Why is live DB migration hard?
• Persistent data must be migrated (GBs)
  • How to ensure no downtime?
• Nodes can fail during migration
  • How to guarantee correctness during failures?
  • Transaction atomicity and durability
  • Recover migration state after failure
• Transactions execute during migration
  • How to guarantee serializability?
  • Transaction correctness equivalent to normal operation
Our approach: Zephyr [SIGMOD 2011]
• Migration executed in phases
  • Starts with the transfer of minimal information to the destination (the “wireframe”)
• Database pages used as the granule of migration
  • Unique page ownership
• Source and destination concurrently execute transactions in one migration phase
  • Minimal transaction synchronization
• Guaranteed serializability
  • Logging and handshaking protocols
Simplifying assumptions
• For this talk
  • Transactions access a single partition
  • No replication
  • No structural changes to indices
• Extensions in the paper [SIGMOD 2011] relax these assumptions
Design overview
[Figure: the source owns all pages P1…Pn and runs the active transactions TS1,…,TSk; the destination owns nothing yet. Legend: pages owned by a node vs. pages not owned by a node.]
Init mode
• Freeze the indices and migrate the wireframe
[Figure: the source still owns pages P1…Pn and runs transactions TS1,…,TSk; the destination now holds un-owned entries for the same pages.]
What is an index wireframe?
• The internal routing structure of the index, transferred without the data pages it points to; after Init, source and destination hold matching wireframes (a sketch follows below)
[Figure: matching index wireframes at the source and the destination.]
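A sketch of the wireframe idea for a B+-tree index: internal nodes (the routing structure) are copied to the destination up front, while leaf pages become un-owned placeholders to be filled in later. The class and field names are illustrative, not Zephyr's actual data structures.

    class LeafPage:
        def __init__(self, pid, rows):
            self.pid, self.rows = pid, rows
            self.owned = True                 # set to False in a wireframe copy

    class InternalNode:
        def __init__(self, keys, children):
            self.keys, self.children = keys, children   # routing info only

    def wireframe(node):
        # Copy the internal (routing) structure; leaves become un-owned
        # placeholders holding no data, to be pulled or pushed later.
        if isinstance(node, LeafPage):
            copy = LeafPage(node.pid, rows=None)
            copy.owned = False
            return copy
        return InternalNode(list(node.keys),
                            [wireframe(c) for c in node.children])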
Dual mode
• Old, still-active transactions (TSk+1,…,TSl) run at the source; new transactions (TD1,…,TDm) run at the destination
• When a destination transaction TDi accesses a page such as P3, the page is pulled from the source
• Requests for un-owned pages can block
• Index wireframes remain frozen
[Figure: ownership of individual pages moves from source to destination on demand.]
Finish mode
• Transactions at the source have completed; new transactions (TDm+1,…,TDn) run at the destination
• The source pushes the remaining pages (P1, P2, …) to the destination
• Pages can still be pulled by the destination, if needed
Normal operation
• The destination owns all pages P1…Pn and runs transactions TDn+1,…,TDp
• Index wireframe un-frozen
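Putting the modes together, destination-side page access in dual mode might look like the following sketch; the pull_page RPC and the owned flag are illustrative stand-ins for Zephyr's ownership machinery, not its actual interface.

    def access_page(local_pages, source, pid):
        page = local_pages.get(pid)
        if page is not None and page.owned:
            return page                    # fast path: page already owned here
        # Un-owned page: the request blocks while the page is pulled from the
        # source, which relinquishes ownership as part of the transfer, so
        # each page moves exactly once and has exactly one owner.
        page = source.pull_page(pid)
        page.owned = True
        local_pages[pid] = page
        return page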
Artifacts of this design
• Once migrated, pages are never pulled back by the source
  • Transactions at the source that access migrated pages are aborted
• No structural changes to indices during migration
  • Transactions (at either node) that would make structural changes to indices are aborted
• The destination “pulls” pages on-demand
  • Transactions at the destination experience higher latency than during normal operation
Serializability
• Only “dual mode” is of concern
  • In Init and Finish modes, only one node executes transactions
• Local predicate locking of the internal index plus exclusive page ownership ⇒ no phantoms
• Strict 2PL ⇒ transactions are locally serializable
• Pages are transferred only once
  • No T_dest → T_source conflict dependency
• Together: guaranteed serializability
Recovery
• Transaction recovery
  • For every database page, T_src → T_dst: transactions are replayed in conflict order (a sketch follows below)
• Migration recovery
  • Atomic transitions between migration modes
  • Developed logging and handshake protocols
  • Every page has exactly one owner
  • Bookkeeping at the index level
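Because each page moves exactly once, conflict-order replay reduces to a per-page rule: replay all source log records for a page before any destination records for it. A minimal sketch, with an illustrative log-record format (a dict with a "page" field) that is not Zephyr's actual log layout:

    def replay(source_log, dest_log, apply):
        by_page = {}
        for rec in source_log + dest_log:       # source records listed first, so
            by_page.setdefault(rec["page"], []).append(rec)
        for recs in by_page.values():
            for rec in recs:                    # per page: T_src before T_dst
                apply(rec)                      # redo the logged update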