Strategies for Fault-Tolerant Enterprise Apps

Building Fault-Tolerant Enterprise Applications Greg Hinkle Chariot Solutions chariotsolutions.com Adapted from original presentation by: Erin Mulder & Brian McCallister

Agenda Goals of Fault Tolerance User Recoverable Errors Expected Application Errors System Failure Useful Strategies Discussion

Goals of Fault Tolerance What are we really worried about? • Availability • Integrity • Confidentiality • Usability • Cost

Goals of Fault Tolerance What can go wrong? • User Error • Concurrent Changes • Bugs • Resource Failure/Downtime • System Overload • Misconfiguration • Sabotage

Goals of Fault Tolerance Themes we’ll keep visiting… • Prevention • Code Guidelines & Reviews • Automated Validation & Regression Testing • Performance / Stress Testing • Negative / Security Testing • Detection • Logging and Auditing • Validation Patterns • Monitoring • Recovery • Exception handling patterns • Error feedback loop • Redundancy

User Recoverable Errors Simple validation error • What do you do when the user: • Leaves a required field blank • Enters a value too big for the database field • Types letters in a numeric field • Selects inconsistent options • Tries to do things in the wrong order

User Recoverable Errors Simple validation error • Fault tolerance is more than detection • Prevent the user from making errors • Set maxlengths on input fields • Use character masks • Specify units • Show example input • Don’t allow the selection of inconsistent options • Don’t present navigation options that aren’t meant to be followed • Guide the user through longer processes

User Recoverable Errors Simple validation error • Help the user recover quickly • Highlight all errors clearly • Show help text and examples for invalid fields • If some other action is required first, launch it instead of interrupting the flow with frustrating errors • Perception is everything! • Log the error for later analysis • Save enough information to recreate • Start automatically handling common mistakes

User Recoverable Errors Optimistic concurrency clash • Everything looks good until the save • Then… • Item has just gone out of stock • Another user has just updated the same document • Time has passed and action is no longer allowed

User Recoverable Errors Optimistic concurrency clash • Increase save points • Alert user to potential risk: • Low stock • Another user just accessed this record • Another user has “soft lock” on record • Offer useful options for resolving collision: • Merge changes • Backorder • Automatically retry later • “Email me when it is available” • Give tips for avoiding future collisions
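One common way to detect the collision described above is a version number checked at save time. The following is only a minimal sketch under that assumption: `VersionedStore`, `Doc`, and `StaleVersionException` are hypothetical names, and an in-memory map stands in for a database table with a version column.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store; a real system would keep the version in the database row.
class VersionedStore {
    static final class Doc {
        final String text;
        final int version;
        Doc(String text, int version) { this.text = text; this.version = version; }
    }

    // Carries the current record so the UI can offer a merge or a retry.
    static final class StaleVersionException extends RuntimeException {
        final Doc current;
        StaleVersionException(Doc current) {
            super("Record changed since it was loaded");
            this.current = current;
        }
    }

    private final Map<String, Doc> docs = new HashMap<>();

    synchronized void create(String id, String text) { docs.put(id, new Doc(text, 1)); }

    synchronized Doc load(String id) { return docs.get(id); }

    // Saves only if the caller still holds the latest version; otherwise reports the clash.
    synchronized Doc save(String id, String newText, int expectedVersion) {
        Doc current = docs.get(id);
        if (current == null || current.version != expectedVersion) {
            throw new StaleVersionException(current);
        }
        Doc updated = new Doc(newText, expectedVersion + 1);
        docs.put(id, updated);
        return updated;
    }
}
```

The exception carries the winning record so that the caller can present options such as merging changes instead of a bare error.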

User Recoverable Errors Bookmarks, back buttons and browsers • User escapes normal page flow • Bookmarks login page or internal page • Uses back button • Opens a new window within same session • Session times out • Missing context from previous requests • Next click is like bookmark to internal page • Other browser oddities • Double-clicking submit buttons • Pressing stop button in the middle of a request

User Recoverable Errors Bookmarks, back buttons and sessions • Prevention is difficult – the user is in control • JavaScript can sometimes help • JavaScript can sometimes hurt • Plan for and test each of these scenarios • Plan for handling out-of-sequence requests • Limit state, or key it uniquely

User Recoverable Errors Bookmarks, back buttons and sessions • To seamlessly handle session timeouts and out-of-sequence requests, consider: • Persistent sessions (saved to database) • Passing state in every request (form fields or URL rewriting) • Storing state in custom cookies • Adding custom logic to recover from timed-out sequences • Resubmit requests after re-authentication • To simply detect and alert, consider: • Using listener to catch session expiration • Using state validation to catch out-of-sequence requests • Redirecting user to session expiration page • To improve process: • Log session losses (requests within expired session) • Consider increasing session timeout • Consider using prevention techniques described above

User Recoverable Errors Bookmarks, back buttons and sessions • To minimize impact of back button, consider: • Techniques described for out-of-sequence requests • Redirecting to GETs instead of returning responses to POSTs • To work around double submissions, consider: • Using unique transaction identifiers stored in session • Forwarding action submissions to separate response pages • Displaying the response page automatically on double submit • To handle multiple windows, consider: • Passing state in every request • Passing state in hidden fields throughout a wizard • Adapting web frameworks to map state (e.g. Struts form beans) by primary key or request ID instead of a static name
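The unique transaction identifier idea for double submissions can be sketched roughly as follows. `SubmitTokens` is a hypothetical class that would live in the user's session; it is not part of the servlet API or Struts, and the token is assumed to travel in a hidden form field.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical per-session helper; one instance would be stored in each user's session.
class SubmitTokens {
    private final Set<String> outstanding = new HashSet<>();

    // Called when the form is rendered; the token is written into a hidden field.
    synchronized String issue() {
        String token = UUID.randomUUID().toString();
        outstanding.add(token);
        return token;
    }

    // Returns true only for the first submission carrying this token.
    synchronized boolean consume(String token) {
        return outstanding.remove(token);
    }
}
```

When `consume` returns false, the request is a repeat (double click, back button, or resubmitted POST), and the handler can show the already-generated response page instead of running the action again.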

Expected Application Errors Resource is unavailable… • Database is down for maintenance • No connection to integrated partner service • Resource is overloaded: • Out of DB connections • JMS Queue full

Expected Application Errors Resource is unavailable… • To prevent, consider: • Coordinating maintenance schedules • Planning for failover at the resource level • Increasing hardware budget  • Increasing transaction timeout seconds (caution – last resort) • To handle, analyze transactional requirements: • Is immediate user response necessary? • Can the resource access be handled asynchronously with an extended, logical transaction? • Plan rollbacks carefully to allow for retries (consider idempotence, sub-transactions) • Alert operator/admin if out of SLA • Log all outages (study for patterns)
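A bounded retry with a growing delay is one way to act on the "allow for retries" advice above. This is a sketch only: `Retry.withBackoff` is a hypothetical helper, and it is safe to use only when the wrapped call is idempotent, which is the point the slide raises.

```java
import java.util.function.Supplier;

class Retry {
    // Runs the call up to maxAttempts times, doubling the pause after each failure.
    // Assumption: the call is idempotent, so repeating it cannot apply an effect twice.
    static <T> T withBackoff(Supplier<T> call, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts; surface the last failure
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
                delay *= 2;
            }
        }
        throw last;
    }
}
```

The final failure is rethrown so that the caller can still log the outage and alert an operator.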

Expected Application Errors Application is overloaded… • Mentioned on CNBC • Linked from Slashdot • Denial of Service

Expected Application Errors Application is overloaded… • Test under heavy load • Plan for growth • Tune hot spots • Run with excess capacity • Throttle at network level • Use JMS and other asynchronous technologies to throttle on backend • Tune application server to degrade gracefully • Monitor carefully • Be prepared to scale out, not just up
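Graceful degradation inside the application can be approximated with a concurrency limit that rejects excess work quickly instead of queueing it without bound. The sketch below uses `java.util.concurrent.Semaphore`; the `Throttle` class and its fallback parameter are hypothetical, and the slide's own suggestions (network-level throttling, JMS) are different mechanisms.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class Throttle {
    private final Semaphore permits;

    Throttle(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the work if a permit is free; otherwise returns the fallback immediately.
    <T> T call(Supplier<T> work, T busyFallback) {
        if (!permits.tryAcquire()) {
            return busyFallback; // shed load rather than pile up threads
        }
        try {
            return work.get();
        } finally {
            permits.release(); // always give the permit back, even on failure
        }
    }
}
```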

Expected Application Errors Bugs and other undocumented features… • Friendly bug: • Triggers invalid state • Causes VM or app server to throw exception • Greedy bug: • Monopolizes resources • Leaks connections • Silent and deadly bug: • Corrupts data

Expected Application Errors Bugs and other undocumented features… • To handle friendly bugs: • Bulletproof your transactions & rollbacks • Write coding and design guidelines • Conduct peer code reviews (share best practices) • For client applications, catch Throwable • Map exception handling in server container • The finally clause is your friend • Display sanitized errors to user • Give enough information to map back to logs • Log carefully to allow easy debugging • Configure timestamp, thread id output • Log data together not individually • Alert operator/administrator
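Several of the points above (catch at the top, use `finally`, show a sanitized error, give enough information to map back to the logs) can be combined in one small boundary. This is a sketch under assumptions: `ErrorBoundary` is a hypothetical helper, and the reference ID scheme is illustrative.

```java
import java.util.UUID;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

class ErrorBoundary {
    private static final Logger LOG = Logger.getLogger(ErrorBoundary.class.getName());

    // Runs the action; on failure, logs full detail and returns a sanitized message.
    static String handle(Supplier<String> action, Runnable cleanup) {
        try {
            return action.get();
        } catch (RuntimeException e) {
            // The reference ID appears in both the log and the user message,
            // so support staff can find the stack trace without exposing it.
            String ref = UUID.randomUUID().toString();
            LOG.log(Level.SEVERE, "Unhandled error, ref=" + ref, e);
            return "Sorry, something went wrong. Reference: " + ref;
        } finally {
            cleanup.run(); // release connections and other resources on every path
        }
    }
}
```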

Expected Application Errors Bugs and other undocumented features… • To handle greedy bugs: • Reduce transaction timeout seconds • Handle timeouts in the same way as friendly bugs • Monitor carefully • Log statistics (# of transaction timeouts, CPU usage, memory usage, GC, network traffic, stuck threads…) • Automate log analysis • Trigger a thread dump (kill -3) during hot spots • Alert operator/administrator to hot spots • Use clustering to contain damage

Expected Application Errors Bugs and other undocumented features… • To handle silent and deadly bugs: • Bulletproof transaction settings • Validate on multiple levels, use referential integrity • Audit everything • Unless performance/cost prohibits, keep a complete audit trail on every table (easy with triggers, aspects or code generators), try to include transaction ID • Flush caches regularly • After a save, load the record from the database and display back to the user • Run periodic audits with human review • Plan for how to use audit trail to recover from data corruption • Early detection is key… escalate user concerns!

System Failure Never have an “unplanned” outage • Determine acceptable downtime • Plan clustering / failover accordingly • Monitor carefully so outages are detected immediately • Be ready with a tiny “planned outage” page and server in advance • Consider offsite host • Build this functionality into non-Web clients at development time • Plan for transaction recovery • Plan for JMS recovery • Use “quiescing” load balancing to bring servers offline for maintenance

System Failure Sabotage • Encrypt data in database • Security through obscurity • Key entry on startup • Credit cards should be two-way encrypted (resist the urge to Rot13) • Passwords should be one-way hashed • Create new temporary passwords for “forgotten pass” • SQL Injection Prevention • Don’t dynamically generate SQL with user input • Use prepared statements • Cross-site scripting • Cleanse any user data republished on a site • Don’t publish extra information • Turn off server headers, require SSL on login or throughout • Create a DMZ • Two firewalls • Use SSL between tiers
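For the "passwords should be one-way hashed" point, a salted, iterated hash from the standard library is one option. The sketch below uses PBKDF2 through `javax.crypto.SecretKeyFactory`; the choice of PBKDF2, the iteration count, and the `PasswordHasher` class are assumptions for illustration and are not taken from the slides.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

class PasswordHasher {
    private static final int ITERATIONS = 100_000; // illustrative value, tune for your hardware
    private static final int KEY_BITS = 256;

    // A fresh random salt per password; store it next to the hash.
    static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    // One-way derivation: the password cannot be recovered from the result.
    static byte[] hash(char[] password, byte[] salt) {
        try {
            PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_BITS);
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                    .generateSecret(spec)
                    .getEncoded();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("PBKDF2 is not available on this JVM", e);
        }
    }

    // Re-derives the hash from the attempt and compares in constant time.
    static boolean matches(char[] attempt, byte[] salt, byte[] storedHash) {
        return MessageDigest.isEqual(hash(attempt, salt), storedHash);
    }
}
```

Because the hash cannot be reversed, a forgotten password is handled by issuing a new temporary one, as the slide suggests.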

Useful Strategies Be sure that you develop guidelines for: • Error Messages • Validation (format, business rules, size, cleansing…) • Logging (when, where, what…) • Auditing • Monitoring (level of automation, alerts) • Transactions (who rolls back, checked vs. unchecked…) • Sessions & Caching (request vs. session, flushing…) • Clustering

Useful Strategies Error Messages • For validation errors, be sure to: • Include format and size hints • Show examples • Give more information than the basic field label • Mention the error at the top of the screen and highlight the field • Catch all errors at the same time • For other user-recoverable errors: • Let the user know what to do next • If the user can’t recover: • Apologize • Give no details • Suggest workarounds • (Silently log and alert!)

Useful Strategies Validation • If possible, validate at all levels • Common strategies: • Externalize validation rules and use a framework that supports rich validation • Clearly define which layers are responsible for which types of validation. For example: • All format errors handled in web tier • All business rule violations handled in application tier • All field lengths enforced at data tier
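As an illustration of web-tier format validation that reports every problem in one pass, with hints and examples in the messages, consider the sketch below. `OrderFormValidator`, its field names, and its limits are hypothetical; a validation framework with externalized rules, as the slide recommends, would replace the hard-coded checks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OrderFormValidator {
    // Returns field name -> message for every problem found; an empty map means valid.
    static Map<String, String> validate(String name, String quantity) {
        Map<String, String> errors = new LinkedHashMap<>();
        if (name == null || name.trim().isEmpty()) {
            errors.put("name", "Name is required, for example: Jane Smith");
        } else if (name.length() > 50) {
            errors.put("name", "Name must be 50 characters or fewer");
        }
        // Checked independently of the name so both errors are reported together.
        if (quantity == null || !quantity.matches("\\d{1,4}")) {
            errors.put("quantity", "Quantity must be a whole number from 0 to 9999, for example: 12");
        }
        return errors;
    }
}
```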

Useful Strategies Logging • Log in all tiers • Define logging levels and when they are used • Log user failures at different levels than system failures • Include: timestamp, user, thread ID, transaction ID, etc. • Don’t make logs a source of failure (watch disk space, JMS load, etc.) • Log information in a single call • Aggregate server logs • Socket appender • Scripts and mounting • Bad: log.trace("Searching: " + keyword); log.trace("Found: " + results.size()); • Good: log.trace("Searching: " + keyword + ", found: " + results.size());

Useful Strategies Auditing • Audit operations where possible • Provides accountability • Easier to support users • Easier to debug • Easier to recover from disaster • Easier to detect attacks • Include: • Timestamp • Current User • Some sort of thread ID, transaction ID, etc. • Complete data record or diff
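The "complete data record or diff" item can be sketched as a field-by-field comparison of the record before and after a change. `AuditDiff` is a hypothetical helper, and representing a record as a map of strings is a simplification; the timestamp, user, and transaction ID listed above would be stored alongside the result.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.TreeSet;

class AuditDiff {
    // Returns field -> "old -> new" for every field whose value changed.
    static Map<String, String> diff(Map<String, String> before, Map<String, String> after) {
        TreeSet<String> fields = new TreeSet<>(before.keySet());
        fields.addAll(after.keySet()); // include fields that were added or removed
        Map<String, String> changes = new TreeMap<>();
        for (String field : fields) {
            String oldValue = before.get(field);
            String newValue = after.get(field);
            if (!Objects.equals(oldValue, newValue)) {
                changes.put(field, oldValue + " -> " + newValue);
            }
        }
        return changes;
    }
}
```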

Useful Strategies Monitoring • Common strategies include: • 24/7 operations center • Business hours operation center • Automated, redundant processes that analyze logs and raise alerts to on-call administrators • SNMP and monitors • Logs show more than critical errors • Ideally, mine them for clues on usability, performance problems and attacks • JMX clients

Useful Strategies Monitoring - Tools • Free • Nagios (Host, Network, Service monitoring) • Groundwork Monitor • MC4J • EJTools • Cost • AdventNet • OpenView

Useful Strategies Transactions • Top server-side tier creates a user transaction, catches all errors and then determines its fate • Container-managed transactions with session façade: • Top level methods responsible for rollbacks • Business methods responsible for rollbacks • Unchecked exceptions not recommended with EJB • Unchecked exceptions with Spring

Useful Strategies Sessions and Caching • Use session sparingly • Common strategies: • Hidden form fields • Cookies (encrypted) • URL rewriting • HTTP Session • Shared caches (OSCache, Tangosol) • When to flush cache? • Caches can mask data problems • Data should have timeouts • Shared caches should limit usage (LRU)
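The last two points (timeouts on cached data, LRU limits on shared caches) can be sketched with `java.util.LinkedHashMap` in access order. `LruCache` and `Slot` are hypothetical names, and the clock is passed in as a parameter purely to keep the example deterministic; products such as OSCache provide their own configuration for this.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> {
    // A cached value together with the time after which it must be reloaded.
    private static final class Slot<V> {
        final V value;
        final long expiresAt;
        Slot(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final long ttlMillis;
    private final Map<K, Slot<V>> map;

    LruCache(final int maxEntries, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder=true keeps the least recently used entry first.
        this.map = new LinkedHashMap<K, Slot<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Slot<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized void put(K key, V value, long nowMillis) {
        map.put(key, new Slot<>(value, nowMillis + ttlMillis));
    }

    // Returns null on a miss or when the entry has timed out.
    synchronized V get(K key, long nowMillis) {
        Slot<V> slot = map.get(key);
        if (slot == null) {
            return null;
        }
        if (nowMillis >= slot.expiresAt) {
            map.remove(key); // stale data must not mask changes in the database
            return null;
        }
        return slot.value;
    }
}
```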

Useful Strategies Clustering • Why use clusters? • Availability • Scalability • Will this application need a cluster? • Can you take it offline for maintenance? • Can you take it offline to scale it up? • Are you sure you won’t need to scale it out? • Can be expensive and complicated • Can require more expensive licensing • Requires serializable data in session • Limit the use of session and re-put objects on edit • Requires more testing (test failover conditions)

Useful Strategies Clustering • JBoss & Tomcat have limited cluster sizes • Multicast can require network and operating system changes • Multiple JVMs and log files to monitor • Configuration management issues • Synchronizing updates • Custom settings per instance

Discussion Get the slides online at: http://www.chariotsolutions.com/slides
