Doh! Reliability in cloud and mobile apps http://www.flickr.com/photos/johanl/4934459020
Traditional client-server vs cloud • Traditional client-server • Usually highly-reliable server available on demand • Cloud • Garbage hardware that could fail at any time • Challenge: ensure reliability of apps nonetheless
BTW, cloud servers aren’t necessarily very well-configured, either • Example GAE servers • 128MB-1GB RAM; 600MHz-4.8GHz CPU https://developers.google.com/appengine/docs/java/config/backends • Amazon EC2 “medium” servers • 3.75 GB RAM; 2.0-2.4 GHz 2007 Opteron CPU http://aws.amazon.com/ec2/instance-types/ • My cheap, busted up, 4-year old laptop • 3 GB RAM; 2x2.4 GHz Intel (Core Duo) CPU cores May 14, 2012
Yet, from the current GAE Service Level Agreement (SLA) • They really mean to be highly reliable! • So how do they do it? • How can you make the most of it? https://developers.google.com/appengine/sla
SLAs often quote reliability as “nines” • Two nines: 99%, 3.65 days downtime every year • Easy to do with cheap hardware + backup • Three nines: 99.9%, about 8.8 hours every year • Can be done with reasonably good hardware • Four nines: 99.99%, < 1 hour every year • Not all systems can do this • Five nines: 99.999%, about 5 minutes every year • Very hard to achieve, and very expensive • Each “nine” approximately doubles the cost
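The downtime figures follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the class and method names are made up for illustration):

```java
import java.time.Duration;

final class NinesDowntime {
    /** Allowed downtime per 365-day year for an availability with the given number of nines. */
    static Duration downtimePerYear(int nines) {
        double unavailability = Math.pow(10, -nines);          // 2 nines -> 0.01, 3 nines -> 0.001
        double seconds = 365.0 * 24 * 3600 * unavailability;   // seconds in a year times the failure share
        return Duration.ofSeconds(Math.round(seconds));
    }

    public static void main(String[] args) {
        for (int nines = 2; nines <= 5; nines++) {
            Duration d = downtimePerYear(nines);
            System.out.printf("%d nines: %d h %d min per year%n",
                    nines, d.toHours(), d.toMinutesPart());
        }
    }
}
```

Each extra nine divides the allowed downtime by ten, which is why the jump from four to five nines leaves only a few minutes per year.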
Key reliability principles • Replication • Provide a means for monitoring • Consider using a hybrid cloud
Replication of computation • GAE will automatically copy your code • Starting up multiple servers to handle requests • If your server generally responds quickly to requests • And there is extra hardware available at the moment • Automatically balancing load Replication Monitoring Hybridize
Data also needs replication • You can control the level of replication • Old-fashioned (traditional client-server) • Set up a “master” database server • Configure the master to copy its data to “slaves” (e.g., every night) • Cloud-based approach • Let the infrastructure replicate data automatically • GAE: You have two options… master/slave, and high-replication datastore
High-replication datastore (HRD) vs master/slave datastore (MSD) • HRD makes backup copies across datacenters (and > 2 copies; MSD has only 2 copies) • HRD includes a more sophisticated algorithm for resolving errors on (some) servers • MSD: writes all go to the master (if available); master copied to slaves; reads all go to the master (if available) [Deprecated!] • HRD: more sophisticated algorithm where the different servers (no master) form a consensus
Pros of using HRD • Pro: Reliability is vastly improved • Largely due to replication of data across datacenters • Pro: support for cross-group transactions in Python • Apparently? Test before relying on it! • Maybe available in Java? • Config change needed? https://developers.google.com/appengine/docs/python/datastore/overview#Cross_Group_Transactions
Cons: Latency and eventual consistency • Con: Latency can be pretty big (> 1 second) • Writes (and reads) go to multiple servers, multiple datacenters • Con: Data just written might not appear in a read • GAE might write to server X but then read from Y • Data on X might not be copied to Y right away
Coping with problem #1, latency • Cache a copy of data on client • Eliminates the need to hit the server • Bonus: improves reliability when server is offline • Write a copy to memcache • So you can read back faster • Only do this for data you read a lot, of course
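The "write a copy to memcache" idea can be sketched without any GAE dependency. In the sketch below a plain HashMap stands in for memcache, and the `Store` interface is a hypothetical stand-in for the datastore, not a GAE API:

```java
import java.util.HashMap;
import java.util.Map;

final class CachedReader {
    /** Hypothetical stand-in for the (slow) datastore; not a GAE interface. */
    interface Store {
        String read(String key);
        void write(String key, String value);
    }

    private final Map<String, String> cache = new HashMap<>(); // stand-in for memcache
    private final Store store;

    CachedReader(Store store) { this.store = store; }

    /** Write to the store, then keep a copy in the cache for fast read-back. */
    void put(String key, String value) {
        store.write(key, value);
        cache.put(key, value);
    }

    /** Serve from the cache when possible; fall back to the store on a miss. */
    String get(String key) {
        String cached = cache.get(key);
        if (cached != null) return cached;
        String value = store.read(key);
        if (value != null) cache.put(key, value); // remember it for the next read
        return value;
    }
}
```

A real cache would also need expiry and invalidation when other clients write the same key; this sketch leaves both out.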
Coping with problem #2, writes not appearing on read • Don’t assume that an entity you just wrote will immediately appear in a query (in HRD) • Wait a few seconds to read back • Or automatically append the written entity to the query results if you don’t see it
Example pseudocode (must be fancier for sorted queries)
    Course mycourse = create a new entity
    pm.makePersistent(mycourse)
    List<Course> courses = query for courses
    boolean sawit = false
    foreach (Course course in courses)
        if (course.id == mycourse.id) { sawit = true; break; }
    if (!sawit) courses.add(mycourse)
    foreach (Course course in courses)
        do something with course
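A runnable version of the same idea, as a sketch only: the datastore query is replaced by a plain list, and the `Course` class here is a minimal stand-in for the entity in the pseudocode.

```java
import java.util.ArrayList;
import java.util.List;

final class ReadYourWrites {
    /** Minimal stand-in for the Course entity in the slides. */
    static final class Course {
        final long id;
        final String title;
        Course(long id, String title) { this.id = id; this.title = title; }
    }

    /** Returns the query results, plus the just-written entity if the query missed it. */
    static List<Course> includeJustWritten(List<Course> queryResults, Course justWritten) {
        List<Course> merged = new ArrayList<>(queryResults); // leave the query result untouched
        boolean sawIt = false;
        for (Course course : merged) {
            if (course.id == justWritten.id) { sawIt = true; break; }
        }
        if (!sawIt) merged.add(justWritten); // appended at the end, so sort order is not preserved
        return merged;
    }
}
```

For a sorted query the entity would have to be inserted at the right position (or the merged list re-sorted) instead of appended.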
Coping with another reliability problem (#3), exception on commit • If you use transactions (locks), you will get exceptions on multiple simultaneous writes • True for MSD, HRD, or any other platform that relies on optimistic locking • Use a try/catch/retry approach • Repeatedly try to write your updates if they fail on the first try
Example pseudocode
    int retries = 10
    while (--retries >= 0) {
        try {
            start transaction
            Course mycourse = get the course entity
            make modifications to mycourse
            pm.makePersistent(mycourse)
            commit transaction
            break   // success: stop retrying
        } catch (JDOException e) {
            log the exception
            roll back the transaction if it is still active
        }
    }
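The try/catch/retry loop can be factored into a small helper. The sketch below makes two assumptions that are not in the slides: `ConcurrentModificationException` stands in for the JDO exception raised on a commit conflict, and a short, growing pause is added between attempts so that competing writers are less likely to collide again.

```java
import java.util.ConcurrentModificationException;
import java.util.function.Supplier;

final class RetryingWriter {
    /**
     * Runs the transaction until it succeeds or maxAttempts is reached.
     * ConcurrentModificationException is a stand-in for a commit-conflict exception.
     */
    static <T> T withRetries(int maxAttempts, long baseDelayMillis, Supplier<T> transaction) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        ConcurrentModificationException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return transaction.get();               // success: stop retrying
            } catch (ConcurrentModificationException conflict) {
                last = conflict;                        // remember the failure for the caller
                if (attempt == maxAttempts) break;      // out of attempts, no point in sleeping
                try {
                    Thread.sleep(baseDelayMillis * attempt); // wait a little longer each time
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while retrying", interrupted);
                }
            }
        }
        throw last; // every attempt hit a conflict
    }
}
```

The transaction body passed in should start, modify, and commit in one go, so that each retry re-reads fresh data rather than re-submitting stale values.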
Monitoring • You should provide a means of monitoring your system’s uptime • Common approach: Script on client elsewhere • Could be another cloud service (e.g., EC2) • Script accesses the server • Client tracks success rate + latency
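A minimal sketch of the bookkeeping such a monitoring script needs. The probe is passed in as a `BooleanSupplier`, an assumption made so the example stays self-contained; in a real script it would issue an HTTP request to the server and report whether the response was acceptable.

```java
import java.util.function.BooleanSupplier;

final class UptimeMonitor {
    private int attempts;
    private int successes;
    private long totalNanos;

    /** Runs one probe, timing it and counting an exception as a failure. */
    void probe(BooleanSupplier check) {
        long start = System.nanoTime();
        boolean ok;
        try {
            ok = check.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false;                       // a timeout or connection error is a failed probe
        }
        totalNanos += System.nanoTime() - start;
        attempts++;
        if (ok) successes++;
    }

    /** Fraction of probes that succeeded, or 0.0 before the first probe. */
    double successRate() {
        return attempts == 0 ? 0.0 : (double) successes / attempts;
    }

    /** Mean probe latency in milliseconds, or 0.0 before the first probe. */
    double meanLatencyMillis() {
        return attempts == 0 ? 0.0 : totalNanos / 1e6 / attempts;
    }
}
```

Running `probe` on a fixed schedule from a machine outside the monitored platform gives the success rate and latency history the slide describes.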
What to monitor • The services of the application itself • You probably need to include some test data • Also three other “dummy” services • One that just returns • One that reads from datastore • One that writes to datastore and reads back
Things you can do with data • Detect when one/some of your application’s services have crashed • Or are getting slow • Detect if any problems are your fault • i.e., one of your own application’s services has failed but the dummy services are working • Decide whether/when/how to redesign • Changes to your own application • Integrate a different cloud platform
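The "is it your fault?" check can be written as a small decision rule over the probe results. This is a sketch with made-up names; a failing dummy service only suggests a platform problem, it does not prove one.

```java
final class OutageDiagnosis {
    enum Verdict { HEALTHY, APPLICATION_FAULT, PLATFORM_FAULT }

    /**
     * appOk:       the application's own service responded correctly
     * pingOk:      the dummy service that just returns
     * readOk:      the dummy service that reads from the datastore
     * writeReadOk: the dummy service that writes to the datastore and reads back
     */
    static Verdict diagnose(boolean appOk, boolean pingOk, boolean readOk, boolean writeReadOk) {
        boolean dummiesOk = pingOk && readOk && writeReadOk;
        if (!dummiesOk) return Verdict.PLATFORM_FAULT;   // even the trivial services are failing
        return appOk ? Verdict.HEALTHY : Verdict.APPLICATION_FAULT;
    }
}
```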
Consider using a hybrid cloud • Distributing code and data across platforms • Example: EC2 + GAE • Example: EC2 + your own servers • Ways that hybrid can help • Taking advantage of specialized APIs • Fail-over when one platform fails • Protecting access to data
Hybrid cloud scenario #1 • Your application analyzes some binary files. The analyzer code only runs on Windows. Unfortunately, Azure is very expensive. • Solution: • Deploy the analyzer on Azure • Expose its functionality via network calls • Deploy most of the code on GAE (nice and cheap) • The GAE part of the application calls the Azure part of the application and stores result in GAE
Hybrid cloud scenario #2 • Your application is on EC2 and has demonstrated high performance + reliability. But the outage a few years back scared your manager. • Solution: • Tweak the application to run on GoGrid (very similar to EC2) • But continue hosting on EC2, where your application has shown excellent performance. • Tweak your client so that if your EC2 server stops responding, then it calls GoGrid instead • Write scripts on GoGrid and EC2 to sync data.
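The client-side tweak in this scenario amounts to trying endpoints in order of preference. A sketch, assuming each endpoint is wrapped as a `Supplier` that throws when the server does not respond (the wrapping and the names are illustrative, not part of any provider's SDK):

```java
import java.util.List;
import java.util.function.Supplier;

final class FailoverClient {
    /** Calls each endpoint in order and returns the first successful result. */
    static <T> T callWithFailover(List<Supplier<T>> endpoints) {
        RuntimeException lastFailure = new IllegalStateException("no endpoints configured");
        for (Supplier<T> endpoint : endpoints) {
            try {
                return endpoint.get();            // first endpoint that answers wins
            } catch (RuntimeException e) {
                lastFailure = e;                  // remember why it failed, then try the next one
            }
        }
        throw lastFailure;                        // every endpoint failed
    }
}
```

Usage would look like `callWithFailover(List.of(ec2Call, goGridCall))`, with the preferred platform listed first. Fail-over is only safe if the data-sync scripts keep the backup reasonably current.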
Hybrid cloud scenario #3 • Some of your data is very sensitive and cannot be trusted to cloud providers. Other data and associated computations are not sensitive and have periodic demand spikes. • Solution: • Deploy the sensitive data on your server and the not-so-sensitive data+computation on cloud. • In your client, invoke the company server for computations on sensitive data and invoke cloud servers for not-so-sensitive data+computation.
Key reliability principles • Replicate • Replicate your code • Use the high-replication datastore • Be prepared to cope with problems • Replicate data to client and memcache • Detect and handle writes-not-appearing-on-read • Try/catch/retry approach to handle failure • Provide a means for monitoring • Consider using a hybrid cloud • For APIs, fail-over, securing data