360 likes | 380 Views
Bristlecone is a language designed to address current software instability issues by decoupling operations, specifying task dependencies, and using transactions for error recovery. It enhances robustness and enables automated recovery from critical failures. Learn about the web server example, object tagging, and task specifications in Bristlecone language.
E N D
Bristlecone: A Language for Robust Software Systems Brian Demsky Alokika Dash University of California, Irvine
Current Software is All or Nothing • Most current software either executes perfectly or fails completely • Small errors cause catastrophic failures • Violate fundamental developer assumptions • Violated assumptions prevent continued execution • No clean way to recover from errors • Unclear what parts of the program are affected • Failure may leave key data structures partially updated
Degraded Service Can Be Desirable • Consider a bug that affects an embedded application in a single web page • Current browsers often close all browser windows and exit • Users find this behavior frustrating • Better option is to isolate the failure and halt only the embedded application component
Request r=receiverequest(); logRequest(r); processRequest(r); Motivating Example: Web Server
Request r=receiverequest(); logRequest(r); processRequest(r); Motivating Example: Web Server • Failure in log operation • Prevents serving this request • If logging failure is independent of request, potentially causes system to fail to serve any requests CRASH
Real World Example • First Flight of the Ariane 5 Rocket • Uncaught integer overflow in computation that computed horizontal bias • Overflow shutdown the inertial reference system • Inertial reference sent debug information to the guidance system • Guidance system used these invalid values to set incorrect nozzle deflections • $120 Million rocket crashed • Horizontal bias value is not even used! • Lesson: Critical system operations coupled to non-critical operations
Observations about Recovery • Challenging to recover from failure with traditional program structures • Unclear what code was doing • Are data structures consistent? • What depends on the failed code? What is still safe to do? • Code structure introduces artificial dependences • In the absence of precise dependence information, we must assume the worst case • Failures can propagate through artificial dependences small errors can cause catastrophic failures
Where do we lose information? • Specifications describe functionality requirements • Architecture/implementation phases map requirements into sequence of operations • Mapping process loses information: • Boundaries of operations (What is A and what is B?) • Temporal dependences (Does B require A?) • Data dependences (Does B use data produced by A?) • Lost information introduces artificial dependences
Designing for Robustness • Underlying assumption: All code contains bugs • Goal: Mitigate the consequences • Approach: • Decompose application into many small tasks • Specify dependences between these tasks • Data dependences • Control dependences • Use transactions to prevent failures from exposing partially updated data structures • Use dependence information to continue past failures
Bristlecone Language • Program is specified as a set of tasks • Task specifications describe task dependences • Tasks have transactional semantics • Runtime system reasons about dependences to execute past failures (automated recovery)
Web Server Example • Decoupled operations • Log Request and Send Page tasks are independent • Failure of one does not affect the other Accept Connection Read Request Log Request Send Page
Specifying Object States • Different object states support different functionality (Type State) • Use flag construct to label conceptual object states • Use these flags to determine when to perform operations • Can differentiate between operations that have true data dependences and operations that just operate on same objects class WebRequest { Flag initialized; Flag send_page; Flag write_log; …}
Tagging Objects • Motivation: Consider the web server example • Each connection has: • A Socket object that provides communication • A WebRequest object that stores application specific state • Need to pair the correct Socket and WebRequest objects together • Solution: Tag the group of objects Connection Tag Socket Object WebRequest Object
Tagging Objects • Tags group object instances • Tags provide mechanism • Tags have types • Can create many instances of a tag type • Each instance defines a group • Can bind tag instances to objects • Tags can specify that task parameters must be in the same group
Task Specifications • Describe data dependences of tasks • Describe affect of tasks on objects /* This task reads a request from a client. */ task readRequest(WebRequest w in initialized with connection t, Socket s in IO_Pending with connection t) { ... taskexit(w: initialized:=false, send_page:=true, write_log:=true); }
Bristlecone Task Semantics • Runtime invokes tasks • Tasks can be invoked when objects are available in the heap that satisfy the task’s parameter guards • Task have transactional memory semantics • All operations are executed or none • Task execution appears to occur in a single instance • Failures cause transactions to abort and restore consistency
Failure-free Execution Accept Connection Read Request Log Request Send Page
Failure-free Execution Accept Connection Read Request Log Request Send Page
Failure-free Execution Accept Connection Read Request Log Request Send Page
Failure-free Execution Accept Connection Read Request Log Request Send Page
Error Detection • Catching operating system signals • Arithmetic exceptions • Null pointer exceptions • Library signals • Socket errors • … • Runtime language checks • Array out of bounds exceptions • Assertions • Imperative consistency checks • Declarative data structure specifications
Failure Recovery • Transactions restore data structures to previous consistent state • Problem: Re-executing the same task will likely result in the same failure • Solution: Use task specifications to determine what other tasks can be safely executed
Automatic Recovery Accept Connection Read Request CRASH logRequest Send Page
Automatic Recovery Accept Connection Read Request Log Request Send Page
Automatic Recovery Summary Accept Connection Read Request Log Request Send Page
Language Benefits • Use specifications to understand failure in a meaningful way • Use task specifications to reason how to recover from failures • Task specifications eliminate artificial dependences
Task Dispatch • Goal: Determine which parameter objects satisfy task guards • Problem: Brute force search can be expensive • Our Approach maintains: • Parameter set of objects that satisfy an individual parameter’s guard • Active task queue of sets of parameter objects that collectively satisfy all of task’s guards
Task Dispatch • Precisely maintain parameter sets • If an object is in a parameter set • It satisfy the flag component of the guard • Is bound to the correct types of tags • All objects that satisfy parameter’s guard are in parameter set • Active task queue is conservative • If a set of objects could potentially satisfy all of task’s guards, it is in the task queue • Must check that set of objects in a task queue invocation satisfies guards before invoking task
Task Dispatch • When a new object is added to parameter set, create corresponding task queue invocations • Search for objects that satisfy tag guards • Idea: Use tags to prune search • When we add an object with a tag guard to the set, use tags to prune search of other parameter objects that must be bound to the same tag
Task Binding Iteration • Structure computation as a list of iterators over tags and objects • Multiple types of iterators: • Over tags bound to object • Over objects bound to tag • Over objects in parameter set • Want to prune search early – ordering is important • Statically generate iterator orderings for each parameter set of each task
Initial Experiences • Implemented Bristlecone compiler and runtime • Have evaluated system on several benchmarks including: • Web Server • Web Spider • Chat Server • Developed a Bristlecone and Java versions of each • Java versions were designed to use threads to provide resilience to failures • Randomly injected failures into executions
Web Spider • Workload is a set of 100 web pages • Java version implemented using a thread pool architecture • 100 trials on each version • Randomly injected 3 halting failures into each execution • With injected failures • Java version fetched average of 6 pages • Bristlecone version fetched average of 91 pages
Web Server • Web Server with support for e-commerce transactions • Java version spawns a thread for each connection • 200 trials on each version • Randomly injected 50 halting failures into each execution • With injected failures • Java failed to serve inventory requests in 4.5% of trials, Bristlecone failed in 1.5% • Java had correct inventory responses in 68.6%, Bristlecone in 100%
Chat Server • Chat server allows multiple users to chat • Java version spawns a thread for each connection • 100 trials on each version • Workload sent 800 messages • Randomly injected 10 failures into each execution • With injected failures • Java version failed to serve 39.9% of messages • Bristlecone version failed to serve 19.3% of messages
Related Work • Traditional fault tolerance • N-version programming • Recovery blocks • Exception handlers • Languages • Linda / Tuple spaces • Orc • Actors • Argus • Oz • Erlang • Software and Hardware Transactional Memory
Conclusions • Bristlecone is a exciting approach to improve application reliability • Initial experiences promising