Infrastructure Strategies for Success Behind the University of Florida's Sakai Implementation Chris Cuevas, Systems Administrator (airwalk@ufl.edu) Martin Smith, Systems Administrator (smithmb@ufl.edu)
What is a... design pattern? "A general reusable solution to a commonly occurring problem." [1] [1] http://en.wikipedia.org/wiki/Design_pattern_%28computer_science%29 12th Sakai Conference – Los Angeles, California – June 14-16
Patterns for… Change control, build promotion, deployment
Pattern: Baseline set of artifacts for a change What do we consider a complete build? • Version number • Readme file • Change log • SQL scripts • Sakai 'binary' distribution • Reduces ambiguity and recovery time, and improves the chance of catching errors early
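The artifact checklist above lends itself to a mechanical gate before promotion. Below is a minimal sketch of such a check; the file names and directory layout are illustrative assumptions, not UF's actual build layout.

```python
from pathlib import Path

# Artifacts every promoted build must ship with (names are hypothetical).
REQUIRED_ARTIFACTS = [
    "VERSION",           # version number
    "README.txt",        # readme file
    "CHANGELOG.txt",     # change log
    "sql",               # directory of SQL scripts for this change
    "sakai-bin.tar.gz",  # the Sakai 'binary' distribution
]

def missing_artifacts(build_dir):
    """Return the required artifacts absent from build_dir."""
    root = Path(build_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (root / name).exists()]
```

A promotion script would refuse to deploy any build where `missing_artifacts()` returns a non-empty list, catching incomplete builds early.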
Pattern: Build promotion process • All changes are load tested and functionally tested against monitoring scripts (i.e. our test cluster is the same size as our prod cluster, and it is monitored like prod) • All changes require a full two weeks of testing time, a go/no-go decision at least 4 days in advance (this lets us announce the change), and at least a 2-hour maintenance window
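The timing rules on this slide (two weeks of testing, go/no-go at least four days out) can be expressed as simple date arithmetic. This is an illustrative sketch, not UF's actual tooling; the helper name and example date are invented.

```python
from datetime import date, timedelta

TESTING_DAYS = 14       # full two weeks of testing time
GO_NO_GO_LEAD_DAYS = 4  # decision at least 4 days before the change

def promotion_schedule(change_date):
    """Derive the latest allowed milestones for a planned change date."""
    return {
        "testing_starts_by": change_date - timedelta(days=TESTING_DAYS),
        "go_no_go_by": change_date - timedelta(days=GO_NO_GO_LEAD_DAYS),
        "change_date": change_date,
    }
```

For a change planned on 2011-06-14, testing would have to start by 2011-05-31 and the go/no-go call would be due by 2011-06-10.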
Pattern: Maintenance for a new build During a deployment/build promotion, we have two strategies: • Rolling restart: quiesce nodes, upgrade them, and reintroduce them • Full outage: stop all nodes, upgrade in chunks, apply any SQL, and start them all • Session replication is key to seamless upgrades (and with Sakai, we don't have it)
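The rolling-restart strategy boils down to a drain/upgrade/enable loop over the cluster. The sketch below shows the shape of that loop; the three callbacks stand in for site-specific tooling (load-balancer API, package manager, service scripts) and are assumptions, not UF's actual scripts.

```python
def rolling_restart(nodes, drain, upgrade, enable):
    """Upgrade cluster nodes one at a time so the service stays up."""
    for node in nodes:
        drain(node)    # quiesce: stop routing new sessions to the node
        upgrade(node)  # deploy the new build while the node is idle
        enable(node)   # reintroduce the node to the pool
```

Because Sakai sessions are not replicated, draining has to wait for existing sessions to end (or sacrifice them), which is why the full-outage strategy remains on the table.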
Patterns for… Other software (OS/DB/etc. patches, updates)
Pattern: Other updates • High-risk packages are identified and only updated by those who know the application best • All other packages are updated (at least) quarterly • Database patches are applied best-effort (for now) • Rarely, an infrastructure-wide change will affect one service more severely than others • We reserve a weekly maintenance window • This area is the least well understood at this time
Patterns for… Traffic management
Pattern: Application stack User traffic dispatching: • Sticky TCP traffic to Apache httpd frontends based on perceived health • Cookie-based routing from httpd to Tomcat, with the ability to select a node • Neither of these fails over session information well • We're considering a design pattern where we combine the httpd+Tomcat stack and do full NAT dispatching so that we get more change flexibility • Compare other architectures
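The cookie-based httpd-to-Tomcat routing typically works by appending a jvmRoute suffix to the session cookie (e.g. `JSESSIONID=abc123.tomcat2`) and having the frontend pick the matching worker. A minimal sketch of that lookup, assuming invented worker names and addresses:

```python
# Map of Tomcat worker names to AJP backends (names/addresses are invented).
WORKERS = {"tomcat1": "10.0.0.11:8009", "tomcat2": "10.0.0.12:8009"}

def route_for_session(session_id, default="tomcat1"):
    """Pick the backend implied by the jvmRoute suffix of a session cookie."""
    if "." in session_id:
        route = session_id.rsplit(".", 1)[1]
        if route in WORKERS:
            return route  # sticky: keep the user on "their" Tomcat
    return default  # no (or unknown) route: fall back to a default worker
```

The failover weakness the slide mentions is visible here: if the named worker dies, the fallback sends the user to a node that has none of their session state.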
Current cluster layout
Current cluster layout as two sites
Site-local dispatching
Combining more of the stack
Pattern: Resource clustering • Database failover is now automatic with Oracle & JDBC • The file tier still doesn't fail over in any nice way • The application+web tier no longer has complex dependencies • (All state for a user now lives on a single server) • Presence is split across two sites for the database (Data Guard), file storage (EMC Celerra), and app/web tier (VMware)
Patterns for… Monitoring and logging
Pattern: System health checks Overall: • Fully synthetic login to Sakai • Cluster checks on Apache and Tomcat (more than X out of Y servers in the cluster in a bad state) • Wget? • Individual server checks for the web, app, and db tiers • Database connection pool • Clock, SNMP, ping, disk • Java processes, Apache configtest • AJP and web response time and status codes • Replication health, available storage growth
Pattern: Interventions • Fully automated functional test that authenticates and requests some course sites • Response time is as important as success or failure • We're hesitant to automatically restart application nodes, since session replication isn't available – this would be a major interruption for our users
Pattern: Collecting data • Collect the usual suspects • Sakai events, automatic (?) thread dumps to detect stuck processes, server-status results • Sakai health: a .jsp file that dumps many data points (JVM memory, Ehcache stats, database pools, etc.) • For anything we can pull from the JVM or Sakai APIs, we use that .jsp file and collectd
Pattern: Application responsiveness Also known as "get close to the user": • Bug reports are aggregated in a shared mailbox; we send daily/weekly/yearly reports with buckets for browser, user, course site, tool, stack trace hash, etc. • Redirect 4XX/5XX HTTP status codes as much as possible, with explanations • Time out long-running activities, so traffic isn't waiting forever • Watch for AJP errors from specific application servers
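The "stack trace hash" bucket in these reports works by digesting each trace so identical failures group together regardless of who reported them. A minimal sketch, using SHA-1 to match the 40-hex digests visible in the sample report below; the trace text is invented.

```python
import hashlib
from collections import Counter

def trace_digest(stack_trace):
    """Stable 40-hex bucket key for a stack trace."""
    return hashlib.sha1(stack_trace.encode("utf-8")).hexdigest().upper()

def bucket_reports(traces):
    """Count reports per digest, most common first."""
    return Counter(trace_digest(t) for t in traces).most_common()
```

In practice one would normalize the trace first (strip line numbers, timestamps, user data) so cosmetic differences don't split a bucket.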
Summary of weekly Sakai bug reports for 2011-06-12:
user => count:
  atorres78 (Alina Torres) => 32
  lisareeve (Lisa Jacobs) => 26
  ziggy41 (Stefan Katz) => 15
  ngrosztenger (Nathalie Grosz-Tenger) => 14
  agabriel2450 (Gabriel Arguello) => 12
stack-trace-digest => count:
  41D7C94702B20B270953EBB00ECA9F5C1388A393 => 180
  DEB88C2307DA572C9C1EFE1E8E17828DC29A7C00 => 154
  A600DAE1792C82B1472C9980EED8938E5F39B4F0 => 88
  15963E2F2314286E1BC1A24DF953560B7845BDCE => 33
  042CF39E8D34570CD3D79152B757A090AB6AB39F => 24
app-server => count:
  sakaiapp-prod06.osg.ufl.edu => 154
  sakaiapp-prod02.osg.ufl.edu => 146
  sakaiapp-prod04.osg.ufl.edu => 118
  sakaiapp-prod05.osg.ufl.edu => 96
  sakaiapp-prod03.osg.ufl.edu => 83
browser-id => count:
  Mac-Mozilla => 377
  Win-InternetExplorer => 356
  Win-Mozilla => 194
  UnknownBrowser => 33
  empty => 12
service-version => count:
  [r329] => 967
  empty => 8
Patterns for… Backup and recovery
Pattern: Backing up for DR • The file tier is backed up every 4 hours, with a 2-week retention window • The database tier is backed up daily, with archived redo logs every 4 hours and a 2-week retention window
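One way to reason about this schedule is the worst-case data loss (RPO) it implies: if the site dies just before the next backup, at most one interval of work is at risk on each tier. The numbers below mirror the slide; the helper itself is illustrative.

```python
FILE_BACKUP_INTERVAL_H = 4   # file tier backed up every 4 hours
DB_REDO_SHIP_INTERVAL_H = 4  # archived redo logs every 4 hours
RETENTION_DAYS = 14          # both tiers keep two weeks of history

def worst_case_rpo_hours():
    """Hours of recent work at risk if the site dies just before a backup."""
    return max(FILE_BACKUP_INTERVAL_H, DB_REDO_SHIP_INTERVAL_H)
```

Keeping the two intervals equal matters: a point-in-time restore is only consistent back to the older of the two tiers' last copies.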
Pattern: Backing up user data • We hope this will come from application-specific operations to back up, restore (and delete!) user-specific data • You can't do a full restore of your files and database every time a user deletes a site by accident • Strive for reasonable retention windows (e.g. hardware, software, application-level data) • This is supposedly coming in Sakai 2.x
Pattern: Multi-site replication The database and file tiers are both replicated to a second site; the file tier is also internally redundant, though some manual intervention is still required there
Pattern: Bringing production to test • We use 'snapshot standby' in Oracle RDBMS to take read-consistent copies of production for reloading test and development copies • We use rsync to copy over the file storage tier • With our full set of build artifacts from earlier, we can always rebuild a complete version of what's in prod
Questions? Thank you!