Systems Seminar Schedule
• 1 October - Douglas Thain
  • “Error Management in a Virtual Operating System”
• 15 October - Andrea Arpaci-Dusseau
  • “Information and Control in Gray-Box Systems”
• 29 October - John Bent
  • “Creating Communities for Grid I/O”
• 12 November - Open
• 26 November - Open
• 10 December - Open
Error Management in a Virtual Operating System
Douglas Thain
Condor Project
University of Wisconsin
What is a Virtual OS?
[Diagram: layered stack. Labels: App 1, App 2, App 3, App 4; Virtual OS 1 and Virtual OS 2, each with Device Drivers; Operating System; Device Drivers; Hardware.]
Why Use a Virtual OS?
• To test and deploy software that would otherwise require destructive changes. (Wine, User Mode Linux)
• To improve indirection or fault-tolerance. (Rocks, Socks, Grid Console)
• To transparently harness exterior resources. (UFO, Condor, PFS)
Harness the Grid
[Diagram labels: Virtual OS 1, Virtual OS 2; App 1, App 2, App 3, App 4.]
In a Standard OS, Errors are not Difficult
• Layers are members of a unified engineering effort.
• A standard namespace and scheme are used end-to-end.
• Most interfaces closely resemble the underlying implementation.
• Most catastrophic failures are coordinated.
[Diagram: App, Standard Library, OS Kernel, File System, Device Driver, with errno passed between each pair of adjacent layers.]
Handling Errors is a Serious Problem on the Grid
• It is an important problem to solve:
  • As systems grow more complex, MTBF -> 0.
  • Failures are generally uncoordinated.
  • Propagating knowledge of failure is more important than increasing likelihood of success.
• It is a difficult problem to solve:
  • Theoretical: Matching different abstractions.
  • Technical: Mating different languages and conventions.
  • Social: Coordinating distinct engineering efforts.
Error Management: A Problem of Depth
[Diagram: a deep stack of layers and interfaces. Labels: App; POSIX; Virtual OS; DDI; Disk Cache; FTP Driver; Globus; Globus FTP Library; OS; FTP; FTP Server; Unitree; Tape Archive.]
A Problem of Width
[Diagram: App, errno, Virtual Operating System, above a row of drivers: UNIX Driver, SRB Driver, FTP Driver, NeST Driver, Kangaroo Driver, Globus GASS Driver.]
An Alphabet Soup of Protocols, APIs, Systems, Authorities, and Authors
A Problem of Design Direction
[Diagram: two stacks compared. Bottom Up Design labels: App, Application Library, Standard Library, OS Kernel, with errno between layers. Outside In Design labels: App, ???, Virtual OS, DDI, FTP Driver, Globus, FTP Library.]
How do we correctly represent errors in a virtual operating system?
Spirit of this Talk
• Software design involves striking balances -- there is no trivial answer.
• Concentrate on presenting several concrete problems and working solutions.
• Given these “data points,” I will present some reasonable generalizations.
• Languages and conventions are ancillary issues.
  • e.g. Exceptions vs. signals vs. errnos
• Discussion and disagreement are welcome!
The Pluggable File System
[Diagram labels: App; Bypass; The Pluggable File System; drivers: Local, GridFTP, SRB, Kangaroo, NeST, HTTP; libraries: GridFTP, SRB, Kangaroo, NeST, HTTP; Grid Services; Host Operating System.]
Examples of PFS
% vi /gsiftp/vulture.cs.wisc.edu/etc/hosts
% grep phone /http/www.cs.wisc.edu/
% gcc /nest/turkey.cs.wisc.edu/input.c -o /kangaroo/khaki.ncsa.uiuc.edu/output
A Kernel on Top of a Kernel
[Diagram of the Pluggable File System internals. Labels: File Descriptors (table slots 0 to 12); namei; File Pointers; Current Working Directory; File Objects (sample paths under /tmp, /gsiftp, /srb, and /kangaroo); Mount Table; Local, GridFTP, SRB, Kangaroo, NeST, and HTTP drivers; Host Operating System.]
Not a Complete Virtual OS
• Does not address process management, synchronization, etc...
• Complete enough to be put to good use with real, non-trivial applications.
  • Gaussian - atomic model simulation
  • CMSIM - simulation of CERN LHC
  • POVray - ray tracing software
• Structure and concept are developed enough to explore other OS issues… others welcome!
Top-Level Error Space
• A single namespace of integer errors that apply to all levels of the system.
• Any call is free to return any possible error. (124)
• General vs specific:
  • ENOENT vs ECHILD
• Some artifacts:
  • EACCES vs EPERM
  • EADV and EDOTDOT
EPERM    1  /* Operation not permitted */
ENOENT   2  /* No such file or directory */
ESRCH    3  /* No such process */
EINTR    4  /* Interrupted system call */
EIO      5  /* I/O error */
ENXIO    6  /* No such device or address */
E2BIG    7  /* Arg list too long */
ENOEXEC  8  /* Exec format error */
EBADF    9  /* Bad file number */
ECHILD  10  /* No child processes */
EAGAIN  11  /* Try again */
ENOMEM  12  /* Out of memory */
EACCES  13  /* Permission denied */
..
Concrete Problems and Solutions
• Too little information - file transfer replies (FTP)
  • Stick your head in the sand.
  • Grope in the dark.
  • Never forget a face.
• Too much information - infinite namespace (SRB)
  • Divide and conquer.
  • Appeal to a higher power.
• New failure modes - login errors (Globus)
  • Take it easy.
  • Split hairs.
Too Little Information: FTP Replies
• Integer codes indicate the severity of a response to an action.
• Many transfer problems are identified, but few file system problems are.
• Third digit specified infrequently, and for wide classes of errors.
First digit:
100 - Positive Preliminary
200 - Positive Completion
300 - Positive Intermediate
400 - Transient Negative
500 - Permanent Negative
Second digit:
000 - Syntax
010 - Information
020 - Connections
030 - Authentication
040 - Unspecified
050 - File System
550: “e.g. File not found, no access”
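A minimal sketch of how a driver might split a reply code into the two digits described on this slide. The function name `classify_reply` and the table names are hypothetical; only the digit meanings come from the slide. Note that even a fully decoded 550 says only "permanent, file system", which is the point of the slide.

```python
# First digit: severity of the reply (labels as listed on the slide).
SEVERITY = {
    1: "Positive Preliminary",
    2: "Positive Completion",
    3: "Positive Intermediate",
    4: "Transient Negative",
    5: "Permanent Negative",
}

# Second digit: broad functional category (labels as listed on the slide).
CATEGORY = {
    0: "Syntax",
    1: "Information",
    2: "Connections",
    3: "Authentication",
    4: "Unspecified",
    5: "File System",
}


def classify_reply(reply_line):
    """Return (code, severity, category) for a reply such as '550: ...'."""
    code = int(reply_line[:3])           # replies start with a 3-digit code
    severity = SEVERITY[code // 100]     # hundreds digit
    category = CATEGORY[(code // 10) % 10]  # tens digit
    return code, severity, category
```

For example, `classify_reply("550: File not found, no access")` yields the code plus "Permanent Negative" and "File System", with no way to tell a missing file from a permission problem.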
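Too Little Information: FTP Replies
[Diagram: the App issues “open datafile” to the Virtual OS; the FTP Driver sends “GET datafile” to the FTP Server; the server replies “550: Pas de tellement lime ou repertoire...”. Which error should the application see: ENOENT, EACCES, EISDIR...?]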
Too Little Information: “Stick your head in the sand”
• If you don’t understand the failure, keep trying until the result is acceptable.
• Might work for transient errors.
• Might even work for the savvy user that can identify and fix problems.
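A rough sketch of this strategy as a blind retry loop. The slide says only "keep trying"; the bounded `attempts` count, the `delay` parameter, and the helper `make_flaky` (a stand-in for an operation that fails transiently) are all assumptions added for illustration.

```python
import time


def retry_until_ok(operation, attempts=5, delay=0.0):
    """Call operation() repeatedly, ignoring what the failure means."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except OSError as err:
            last_error = err      # not interpreted, just remembered
            time.sleep(delay)
    raise last_error              # give up: the failure was not transient


def make_flaky(failures):
    """Hypothetical operation that fails `failures` times, then succeeds."""
    remaining = [failures]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise OSError("transient failure")
        return "ok"

    return operation
```

The loop succeeds on transient failures and wastes every attempt on permanent ones, since it never looks at the error.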
Too Little Information: “Grope in the Dark”
if GET succeeds
    return success
else
    if CHDIR succeeds
        return EISDIR
    else
        if LIST succeeds
            return EACCES
        else
            return ENOENT
        end
    end
end
[Diagram labels: GET, LIST, CHDIR, EACCES]
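The probing sequence above can be sketched in runnable form. `FakeFtpServer` is a hypothetical in-memory stand-in, not a real FTP client; its three methods simply report whether GET, CHDIR, or LIST would succeed for a path.

```python
import errno


class FakeFtpServer:
    """Hypothetical stand-in for an FTP connection (illustration only)."""

    def __init__(self, readable=(), directories=(), listed=()):
        self.readable = set(readable)        # GET succeeds on these
        self.directories = set(directories)  # CHDIR succeeds on these
        self.listed = set(listed)            # LIST shows these names

    def get(self, path):
        return path in self.readable

    def chdir(self, path):
        return path in self.directories

    def list(self, path):
        return path in self.listed


def grope(server, path):
    """Probe with extra commands to work out why GET failed."""
    if server.get(path):
        return 0                 # success
    if server.chdir(path):
        return errno.EISDIR      # it is a directory
    if server.list(path):
        return errno.EACCES      # it exists but cannot be read
    return errno.ENOENT          # no trace of it
```

Each failed open may cost up to three round trips, and the answer can be stale by the time it is returned because the probes are not atomic.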
Too Little Information: “Never Forget a Face”
• Each error condition has a signature:
  • Server identifier: “wuftpd 4.1 ftp.cs”
  • Operation attempted: “GET”
  • Message in reply: “550: Pas de tallenmand...”
• First “Grope” and then cache the determined error along with the signature.
• Problems:
  • Server must be consistent
  • Groping is not atomic
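One possible shape for such a cache, keyed on the three-part signature from the slide. The class name `ErrorCache`, the `probe` callback, and the `probes` counter are hypothetical; `probe` stands for whatever "groping" routine determines the real error.

```python
import errno


class ErrorCache:
    """Remember which error a (server, operation, reply) signature meant."""

    def __init__(self, probe):
        self._probe = probe   # expensive routine: path -> errno value
        self._known = {}      # signature -> errno value
        self.probes = 0       # how many times we had to grope

    def resolve(self, server_id, operation, reply, path):
        signature = (server_id, operation, reply)
        if signature not in self._known:
            self.probes += 1
            self._known[signature] = self._probe(path)
        return self._known[signature]


# Usage with a trivial probe that always concludes "no such file".
cache = ErrorCache(lambda path: errno.ENOENT)
first = cache.resolve("wuftpd 4.1 ftp.cs", "GET", "550: ...", "/data/a")
second = cache.resolve("wuftpd 4.1 ftp.cs", "GET", "550: ...", "/data/b")
```

The second lookup reuses the cached answer without probing, which is only valid if the server always uses the same reply text for the same condition.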
Too Much Info: SRB Replies
• Multiplexes many server backends into one client interface.
• Error space is an amalgam of all back end error spaces.
• Any call may return any error.
• 1026 and growing!
UNIX_EPERM             -1301
UNIX_ENOENT            -1302
. . .
UNIX_EDEADLOCK         -1356
HPSS_EPERM             -1401
HPSS_ENOENT            -1402
. . .
HPSS_NOCOS             -1499
SQL_RSLT_TOO_LONG      -1600
HTTP_ERR_BAD_PATH      -1700
MCAT_OPEN_ERROR        -3001
MCAT_CONNECT_ERROR     -3002
. . .
MCAT_USER_NOT_IN_DOMN  -3032
Too Much Information: “Divide and Conquer”
[Diagram: each SRB error code below is mapped onto one of a small set of targets: EPERM, ENOENT, ESRCH, EINTR, EIO, EACCES, EISDIR, or OTHER.]
UNIX_EPERM             -1301
UNIX_ENOENT            -1302
. . .
UNIX_EDEADLOCK         -1356
HPSS_EPERM             -1401
HPSS_ENOENT            -1402
. . .
HPSS_NOCOS             -1499
SQL_RSLT_TOO_LONG      -1600
HTTP_ERR_BAD_PATH      -1700
MCAT_OPEN_ERROR        -3001
MCAT_CONNECT_ERROR     -3002
. . .
MCAT_USER_NOT_IN_DOMN  -3032
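A small sketch of such a mapping table. The numeric SRB codes are the ones shown on the slide; which target each code maps to is an assumption (only the four codes with an obvious errno counterpart are mapped here), and the `OTHER` sentinel value is hypothetical.

```python
import errno

OTHER = -1  # hypothetical sentinel for "no sensible errno equivalent"

# SRB code -> errno. Only codes with an obvious counterpart are listed;
# the choice of targets is an assumption, not taken from the slide.
SRB_TO_ERRNO = {
    -1301: errno.EPERM,   # UNIX_EPERM
    -1302: errno.ENOENT,  # UNIX_ENOENT
    -1401: errno.EPERM,   # HPSS_EPERM
    -1402: errno.ENOENT,  # HPSS_ENOENT
}


def convert(srb_code):
    """Map an SRB error code to an errno value, or OTHER if unknown."""
    return SRB_TO_ERRNO.get(srb_code, OTHER)
```

Codes such as HPSS_NOCOS (-1499) fall through to `OTHER`, which is where the next strategy takes over.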
“Appeal to a Higher Power”
• Throw an exception.
• Kill the process.
• Dump core.
• A)bort R)etry F)ail?
[Diagram: the App issues “open datafile” to the Virtual OS; the SRB Driver forwards it to the SRB Server, which returns HPSS_NOCOS (“Cannot assign a COS.”). EACCES, ENOENT, or EISDIR? None fits, so the error becomes OTHER and is handed to a Human.]
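New Failure Modes: Login Errors
[Diagram: the App issues “open datafile” to the Virtual OS; before “GET datafile” can reach the GSI Resource, the GSI Driver passes through Find Identity, Protocol Negotiation, Authentication, and Authorization. Other label: Identify Certificate. Which error should the application see: EPERM, EACCES, EPROTO...?]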
New Failure Modes: Login Errors
• Hierarchy of error objects, much like Java.
• Errors may be identified by individual type or their membership in a more general type.
class Error {
    Error trigger;
    Module place_in_code;
    Object thing_in_question;
    String message;
};
[Diagram: tree rooted at Error, with Authentication, Authorization, and Communication as general types and No Creds, Expired Creds, and No Trust as specific types.]
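A sketch of the same idea using Python exception classes. The four fields mirror the class on the slide; the class names are adapted from the diagram labels, and placing the three credential errors under authentication is an assumption about the tree's shape. `general_type` is a hypothetical helper.

```python
class Error(Exception):
    """Base error carrying the four fields named on the slide."""

    def __init__(self, message, trigger=None, place_in_code=None,
                 thing_in_question=None):
        super().__init__(message)
        self.message = message
        self.trigger = trigger                    # lower-level Error, if any
        self.place_in_code = place_in_code        # module that raised it
        self.thing_in_question = thing_in_question


class AuthenticationError(Error): pass
class AuthorizationError(Error): pass
class CommunicationError(Error): pass

# Assumed to be children of AuthenticationError.
class NoCredentials(AuthenticationError): pass
class ExpiredCredentials(AuthenticationError): pass
class NoTrust(AuthenticationError): pass


def general_type(err):
    """Name the most general category an error belongs to."""
    for kind in (AuthenticationError, AuthorizationError, CommunicationError):
        if isinstance(err, kind):
            return kind.__name__
    return Error.__name__
```

A caller that understands only the general types can still react to `ExpiredCredentials`, while a more specific caller can match the exact type.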
New Failure Modes: “Take it Easy”
• Easy for program to interpret and react.
• Difficult for a human to debug.
[Diagram: Not Authorized, Couldn’t Authenticate, Protocol Not Supp., and No identity all map to EACCES.]
New Failure Modes: “Split Hairs”
• Preserves unique error types for the savvy user.
• Program may not be prepared to react to arbitrary error values.
[Diagram: Not Authorized maps to EACCES; Couldn’t Authenticate maps to EPERM; Protocol Not Supp. maps to EPROTO; No identity maps to ESRCH.]
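The two conversions, “Take it Easy” and “Split Hairs”, can be written side by side as lookup tables. The condition names and errno targets are the ones on the slides; the table and function names are hypothetical.

```python
import errno

CONDITIONS = (
    "Not Authorized",
    "Couldn't Authenticate",
    "Protocol Not Supp.",
    "No identity",
)

# "Take it Easy": every login failure collapses to one familiar code.
TAKE_IT_EASY = {condition: errno.EACCES for condition in CONDITIONS}

# "Split Hairs": each failure keeps a distinct code.
SPLIT_HAIRS = {
    "Not Authorized": errno.EACCES,
    "Couldn't Authenticate": errno.EPERM,
    "Protocol Not Supp.": errno.EPROTO,
    "No identity": errno.ESRCH,
}


def convert(condition, table):
    """Look up the errno value a login failure is reported as."""
    return table[condition]
```

With `TAKE_IT_EASY` a program needs to handle one value; with `SPLIT_HAIRS` it may receive ESRCH from an `open` call, which it likely never expected.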
New Failure Modes: Rocks Solution
• “Reliable Sockets” by Vic Zandy
• Give a general error code along the standard channel.
• Give a detailed message along a back channel.
[Diagram: App above Reliable Sockets above Standard Sockets. Labels: Connection Lost; rserrno; Reconnection Timeout Expired; Connection Refused.]
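A toy illustration of the two-channel idea. This is not the Rocks API: the class `ReliableSocket`, its `fail` method, and the choice of ECONNRESET as the general code are assumptions; only the name `rserrno` and the detail strings come from the slide.

```python
import errno


class ReliableSocket:
    """Hypothetical wrapper showing a standard channel plus a back channel."""

    def __init__(self):
        self.rserrno = None  # back channel: detail of the last failure

    def fail(self, detail):
        """Record the detail, return a general errno-style code."""
        self.rserrno = detail
        return errno.ECONNRESET  # assumed general code for a lost connection


sock = ReliableSocket()
code = sock.fail("Reconnection Timeout Expired")
```

An unmodified program sees only `code`; a program that knows about the back channel can also read `sock.rserrno`.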
A Toolbox for Error Conversions
• Simple Conversions:
• “Take it Easy”
• “Split Hairs”
• “Divide and Conquer”
• “Grope in the Dark”
• “Never Forget a Face”
• “Appeal to a Higher Power”
• “Stick your Head in the Sand”
[Arrow alongside the list: Increasing Cost]
Error Accuracy can be a Performance Concern
• We can always find some way to produce a correct -- even if undesired -- execution.
• But -
  • An “Appeal to a Higher Power” causes badput.
  • “Groping in the Dark” yields high latencies.
  • “Head in the Sand” may keep trying when no automatic recovery is possible.
  • ...or, a failure to retry results in unnecessary user interaction.
Hints for Error Design
1 - Express errors in terms of the interface.
2 - Assume the audience is a program.
3 - Leave room to expand, but avoid using it.
4 - Give the essence, not the details.
1 - Express Errors in Terms of the Interface
• Essence of separation of interface and implementation.
• The user of an interface should not see a “moving target” as the implementation changes.
[Diagram: Application above File Interface, which is backed by Disk Impl, Network Impl, Memory Impl, or ???.]
2 - Assume the Audience is a Program
• A computer-readable error can be used as the basis for a decision at any level.
• A human-readable error can only result in a blind retry or an Appeal.
• Computer-readable errors are easily made human-readable.
[Diagram: Layer 0 emits either an Error Code or Error Text. With a code, Layer 1, Layer 2, and the Human can each make a Decision; with text, Layer 1 and Layer 2 show ??? and only the Human can decide.]
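A brief illustration of the last bullet, using Python's standard `errno` and `os` modules. The helper names and the choice of which codes count as retryable are assumptions made for the example.

```python
import errno
import os


def describe(code):
    """Turn a computer-readable code into human-readable text."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


def should_retry(code):
    """A decision a program can make from the code alone."""
    return code in (errno.EAGAIN, errno.EINTR)
```

Going from code to text is a table lookup; going from free-form text back to a decision would require parsing messages whose wording varies by platform and locale.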
3 - Leave Room to Expand... but Avoid Using It
• Any significantly different implementation of an interface will introduce new failure modes.
• Possibilities for a new failure:
  • Best case: fit it into an existing error.
  • Medium case: return “unknown error.”
  • Worst case: “Appeal to a Higher Power.”
4 - Give the Essence, not the Details
• The details distract the caller from the nature of the problem and result in cascading “Appeals.”
• Example in file systems:
  • “Fell off the end of the directory linked list.”
  • or “No file by that name.”
• Example in networking:
  • “Timer went off, but no network interrupt received.”
  • or “Connection lost.”
• Example in security:
  • “Failure in PEM_do_header while reading password.”
  • or “You have no credentials.”
• A restatement of hint #1.
Hall of Fame
• All authors remain anonymous.
• “Error in return value.”
• “A system call failed!”
• “Could not execute job. Reason: Success”
In Summary...
• Error management is part of the “art” of software engineering.
• The importance and the difficulty of error management are magnified in a virtual operating system.
• All errors have some value, but low-signal errors result in performance problems.
• Hints for error interface design.
Contact Info
• Douglas Thain
  • thain@cs.wisc.edu
• Software and other info:
  • http://www.cs.wisc.edu/condor/pfs
• Questions and discussion?