
Errors, Status, and Asynchrony Discussion Session

PPDG Data Replication Meeting, 10 January 2002. Douglas Thain, Condor Project, University of Wisconsin.


Presentation Transcript


  1. Errors, Status, and Asynchrony Discussion Session. PPDG Data Replication Meeting, 10 January 2002. Douglas Thain, Condor Project, University of Wisconsin

  2. Agenda • A Working Model • Two Error-Management Issues • Thinking of Data-Movement as “Jobs” • Reconciling Error Representations • Example Problem • Discussion • Open Issues: • Hints and Absolutes in Replica Management • Tradeoff between consistency and availability

  3. Discussion Points • Data Job Management and Fault-Tolerance • What faults do we intend to tolerate/expose/ignore? • Can we develop a general transaction infrastructure for replication-related activities? • How should we evaluate designs that may be error sensitive? (design review, stress testing) • Error Identification and Representation • Should we have a uniform error space? • Is it feasible to translate between existing error spaces? • What systems have unusual error modes that outsiders may not expect? • How do we deal with unusual errors that must pass through existing APIs?

  4. A Working Model: Giggle (Diagram: Replica Site A; a GRIN listing L1 at B, L2 at B, L3 at B; Replica Site B listing L1 as P1, L2 as P2, L3 as P3.) Foster, Iamnitchi, Ripeanu, Chervenak, Deelman, Kesselman, Hoschek, Kunszt, Stockinger, Stockinger, Tierney, “Giggle: A Framework for Constructing Scalable Replica Location Services”

  5. The Problem • Replication systems will be subject to a wide variety of errors. • How do we build systems that maintain consistency in the face of errors? • Answer: Use transactions to manage jobs, but... • How do we build systems that make reasonable performance decisions in the face of errors? • Answer: Informative errors, but…

  6. Fault Tolerance Terminology • Failure • An externally-visible deviation from specifications. • Error • An internal data state that leads to a failure. • Fault • An external event that creates an error. A. Avizienis and J.C. Laprie, Dependable computing: From concepts to design diversity, Proc IEEE 74, 5 (May) 629-638

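  7. Example (Diagram: the Client asks the Server “What is sqrt(4)?” A FAULT strikes; the Server, which was thinking “Hmm, sqrt(4) is...”, is now thinking “Hmm, sqrt(9) is...”, which is the ERROR; it replies “Answer: 3”, which is the FAILURE.)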

  8. Silent errors (failures) • The system claims to have reached a valid result, but an auditor claims it is invalid. • Explicit errors (failures) • The system tells us it cannot complete the desired action. • Escaping errors (failures) • The system detects an error, but has no method of reporting it, so it escapes by an alternate route -- drop connection, core dump, kernel panic. (exception) John B. Goodenough, Exception Handling: Issues and a Proposed Notation, CACM 18:12 (1975), pp 683-696.

  9. What Errors to Expect in a Replication System? • Errors of communication: • File transfer was broken between bytes. • Collection transfer was broken between files. • Errors of omission: • Requested some files, but response was slow, so the caller gave up and left. (with/out abort?) • Errors in configuration: • Space at target server can’t admit all incoming data at once.

  10. What Must Be Consistent? • Giggle does not require that a GRIN be up-to-date, but it is useful to consider. • Index of files and the files themselves must be kept consistent. (Diagram: Replica Site A; Replica Catalog: L1 B, L2 B, L3 B; Replica Site B: L1 P1 P3, L2 P2 P2, L3 P3 P1.)

  11. Data Movement as a Job • Each request issued for replication must have a past, present, and future: • Who issued it, and why? • What is it doing now? • Is it done? Did it succeed? • Enough information to roll back after a failure. • A complete program execution: • data jobs + cpu jobs + dependencies = DAGMan/DaPMan

  12. Job Management • Primary technique for reliably interacting with the job queue: transaction. • ACID Test: Atomicity, Consistency, Isolation, Durability. • Of course, the natural interface to a db, but not all participants are a full db. • Interface: • 2PL and friends • Implementation: • Logging, shadowing, a real db?

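  13. Two-Phase Commit (Diagram: the Client sends prepare(data) to the Server and receives an id (tid) or failure; the Client then sends commit(tid) and receives ok. The Server's Stable Storage holds a Work Space and an Archival Space.) J. Eliot Moss, Nested Transactions: An Approach to Reliable Distributed Computing, MIT Press, 1985.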

  14. Two-Phase Commit (Diagram: PREPARE phase: the Client calls begin() and receives tid, calls add(tid,data) and receives ok, then calls end(tid) and receives ok. COMMIT phase: the Client calls commit(tid) and receives ok. The Server's Stable Storage holds a Work Space and an Archival Space.) James Frey, Todd Tannenbaum, Ian Foster, Miron Livny, and Steven Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids", Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10), 2001.

  15. Transactions and Status • The transaction ID then becomes a persistent “job number” for later queries: • Success, failure, abort, timeout… • unknown-past, unknown-future. • For this status to be useful, a record of the job must be kept around for a certain period of time. • Also ok to time out, cancel, or otherwise remove data movement jobs. • But, a committed transaction must be kept. • Can’t re-use a job number!
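
A minimal Python sketch of the prepare/commit exchange and the status rules above. The class name JobServer, the Status values, and the in-memory dictionaries are all assumptions made for illustration; a real server would keep these records in stable storage so they survive a reboot.

```python
import itertools
from enum import Enum


class Status(Enum):
    WORKING = "working"   # prepared, not yet committed
    SUCCESS = "success"   # committed; this record must be kept
    ABORTED = "aborted"   # cancelled before commit
    UNKNOWN = "unknown"   # id never issued, or record no longer held


class JobServer:
    """Hypothetical in-memory stand-in for the server on the slides."""

    def __init__(self):
        self._next_id = itertools.count(1)  # only counts upward: no id is reused
        self._work = {}      # work space: prepared, uncommitted data
        self._archive = {}   # archival space: committed data
        self._status = {}    # job number -> Status

    def prepare(self, data):
        tid = next(self._next_id)
        self._work[tid] = data
        self._status[tid] = Status.WORKING
        return tid

    def commit(self, tid):
        if self._status.get(tid) is not Status.WORKING:
            raise ValueError(f"transaction {tid} is not awaiting commit")
        self._archive[tid] = self._work.pop(tid)
        self._status[tid] = Status.SUCCESS
        return "ok"

    def abort(self, tid):
        # Only uncommitted jobs may be removed; committed ones are kept.
        if self._status.get(tid) is Status.WORKING:
            del self._work[tid]
            self._status[tid] = Status.ABORTED

    def status(self, tid):
        return self._status.get(tid, Status.UNKNOWN)
```

A client would call prepare(), keep the returned job number, and later call status() with it, even long after the commit.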

  16. Transaction Implementations • Logging • Keep a log of all actions, new and old values. • Read forward to redo, backwards to undo. • Shadowing • Add changed data to unallocated space. • Atomically commit new pointers to data. (Diagram: a block M and data blocks D, labeled “Atomic pointer update”.)
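
The logging bullets can be illustrated with a short Python sketch. The log format, a list of (key, old value, new value) tuples with None standing for "did not exist", is an assumption chosen for brevity and is not taken from the talk.

```python
def apply_redo(store, log):
    """Read the log forward, installing each new value."""
    for key, _old, new in log:
        store[key] = new
    return store


def apply_undo(store, log):
    """Read the log backwards, restoring each old value."""
    for key, old, _new in reversed(log):
        if old is None:
            store.pop(key, None)  # the key did not exist before this action
        else:
            store[key] = old
    return store
```

Replaying the same log forward after a crash reproduces the committed state; replaying it backwards removes the effects of an aborted job.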

  17. Transaction Implementations • If a standard file system is the underlying storage, then shadowing is a natural fit. • Most metadata updates are designed to be atomic and synchronous. • Most large data updates are designed to provide good throughput, but are asynchronous and not guaranteed until after an explicit commit.

  18. Atomic File Update
        fd = creat("file.tmp")
        write(fd,data,length)
        fsync(fd)
        close(fd)
        rename("file.tmp","file")
      On success: Done.
      On failure or abort: unlink("file.tmp")
      On reboot: unlink("*.tmp")
      (Technique used on Condor checkpoint servers and scheduler processes.)
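
The same sequence can be written as a runnable Python sketch using the os module's thin wrappers over the POSIX calls. The function names atomic_write and cleanup_after_reboot are illustrative, and os.replace is used in place of rename because it overwrites the target on every platform Python supports; on POSIX systems it is the same rename call.

```python
import glob
import os


def atomic_write(path, data):
    """Write data so that readers see either the old file or the new one."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                      # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # force the data to stable storage
    except BaseException:
        os.close(fd)
        os.unlink(tmp)                   # on failure or abort
        raise
    os.close(fd)
    os.replace(tmp, path)                # on success: atomic rename


def cleanup_after_reboot(directory):
    """Remove temporary files whose writers never reached the rename."""
    for leftover in glob.glob(os.path.join(directory, "*.tmp")):
        os.unlink(leftover)
```

The cleanup step assumes that no writer is active while it runs, which holds at boot time but not in general.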

  19. Unifying Storage Services: An Alphabet Soup of Protocols, APIs, Systems, Authorities, and Authors (Diagram: App; POSIX; Virtual Operating System; drivers: UNIX Driver, SRB Driver, GridFTP Driver, NeST Driver, Kangaroo Driver, GASS Driver.)

  20. Error Representation: A Problem of Depth (Diagram labels: App, Tape Archive, POSIX, Bypass Agent, ???, Disk Cache, PPDG API, Win32, Replica Access Library, FTP Server, Replica Catalog, RAP, FTP, Replica Server, Replica Server, RMP, RMP.)

  21. A Problem of Design Direction (Diagram comparing Bottom Up Design and Outside In Design; labels: App, App, ???, POSIX, Application Library, Virtual OS, ANSI, PPDG API, Standard Library, Replica Access, POSIX, SRB, OS Kernel, Replica Server.)

  22. The End-to-End Argument • In complex software, the outermost layer has the ultimate responsibility for interpreting and recovering from errors. • Recovery in a lower layer is an optimization of performance or convenience. • If the possibility of error is very high, lower-level recovery is needed for good performance. Saltzer, Reed, and Clark, End-to-End Arguments in System Design, ACM Transactions on Computer Systems 2:4, pp 277-288, 1984.

  23. UNIX Errnos • A single namespace of integer errors that apply to all levels of the system. • Any call is free to return any possible error. (124) • General vs specific: • ENOENT vs ECHILD • Some artifacts: • EACCES vs EPERM • EADV and EDOTDOT
        EPERM     1  /* Operation not permitted */
        ENOENT    2  /* No such file or directory */
        ESRCH     3  /* No such process */
        EINTR     4  /* Interrupted system call */
        EIO       5  /* I/O error */
        ENXIO     6  /* No such device or address */
        E2BIG     7  /* Arg list too long */
        ENOEXEC   8  /* Exec format error */
        EBADF     9  /* Bad file number */
        ECHILD   10  /* No child processes */
        EAGAIN   11  /* Try again */
        ENOMEM   12  /* Out of memory */
        EACCES   13  /* Permission denied */
        ..

  24. FTP Reply Codes • Integer codes indicate the severity of a response to an action. • Many transfer problems are identified, but few file system problems are. • Third digit specified infrequently, and for wide classes of errors.
      First digit:
        100 - Positive Preliminary
        200 - Positive Completion
        300 - Positive Intermediate
        400 - Transient Negative
        500 - Permanent Negative
      Second digit:
        000 - Syntax
        010 - Information
        020 - Connections
        030 - Authentication
        040 - Unspecified
        050 - File System
      550: “e.g. File not found, no access”
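
A small Python sketch of how a caller might decode a reply code using the two tables on this slide. The function names and the "Unknown" fallback for second digits that the slide does not list are assumptions.

```python
SEVERITY = {
    1: "Positive Preliminary",
    2: "Positive Completion",
    3: "Positive Intermediate",
    4: "Transient Negative",
    5: "Permanent Negative",
}

CATEGORY = {
    0: "Syntax",
    1: "Information",
    2: "Connections",
    3: "Authentication",
    4: "Unspecified",
    5: "File System",
}


def classify_reply(code):
    """Split a three-digit FTP reply code into (severity, category)."""
    if not 100 <= code <= 599:
        raise ValueError(f"not an FTP reply code: {code}")
    first, second = code // 100, (code // 10) % 10
    return SEVERITY[first], CATEGORY.get(second, "Unknown")


def should_retry(code):
    """Only transient negative replies (4xx) are worth retrying unchanged."""
    return code // 100 == 4
```

For the slide's example, classify_reply(550) yields ("Permanent Negative", "File System"), which shows the limitation noted above: the code cannot distinguish "file not found" from "no access".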

  25. SRB Reply Codes • Error space is an amalgam of all back end error spaces. • Pros: No information is ever lost in translation. • Cons: Very difficult to write code that switches on the error number (1026 cases.)
        UNIX_EPERM             -1301
        UNIX_ENOENT            -1302
        . . .
        UNIX_EDEADLOCK         -1356
        HPSS_EPERM             -1401
        HPSS_ENOENT            -1402
        . . .
        HPSS_NOCOS             -1499
        SQL_RSLT_TOO_LONG      -1600
        HTTP_ERR_BAD_PATH      -1700
        MCAT_OPEN_ERROR        -3001
        MCAT_CONNECT_ERROR     -3002
        . . .
        MCAT_USER_NOT_IN_DOMN  -3032

  26. Globus Error Objects • Pros: • Errors may be identified at varying levels of granularity. • Easily expandable. • Lots of debug info. • Cons: • Can be difficult to decide in which class to place an external error. • In practice, most errors are returned as objects of type “string”. (Diagram: a class tree rooted at Error, with nodes String, Authentication, Authorization, Communication, No Creds, Expired Creds, No Trust.)

  27. Translation Can be Done… to a Point (Diagram: the SRB codes UNIX_EPERM -1301, UNIX_ENOENT -1302, . . . UNIX_EDEADLOCK -1356; HPSS_EPERM -1401, HPSS_ENOENT -1402, . . . HPSS_NOCOS -1499; SQL_RSLT_TOO_LONG -1600; HTTP_ERR_BAD_PATH -1700; MCAT_OPEN_ERROR -3001, MCAT_CONNECT_ERROR -3002, . . . MCAT_USER_NOT_IN_DOMN -3032 are mapped onto EPERM, ENOENT, ESRCH, EINTR, EIO, EACCES, EISDIR, and OTHER.)
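
A hedged Python sketch of such a translation table. It includes only the four codes whose names on the slide make the errno equivalent obvious; the pairings and the OTHER value are assumptions, since the arrows of the original diagram did not survive in this transcript.

```python
import errno

OTHER = -1  # hypothetical catch-all for codes with no errno equivalent

# SRB codes copied from the slide; the pairings are illustrative guesses
# based on the code names alone.
SRB_TO_ERRNO = {
    -1301: errno.EPERM,    # UNIX_EPERM
    -1302: errno.ENOENT,   # UNIX_ENOENT
    -1401: errno.EPERM,    # HPSS_EPERM
    -1402: errno.ENOENT,   # HPSS_ENOENT
}


def translate(srb_code):
    """Map an SRB reply code to an errno value, or OTHER if nothing fits."""
    return SRB_TO_ERRNO.get(srb_code, OTHER)
```

Codes such as HPSS_NOCOS (-1499) fall through to OTHER, which is the "to a point" of the slide title: the caller learns that something failed but not what.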

  28. Grope in the Dark
        if GET succeeds
          return success
        else
          if CHDIR succeeds
            return EISDIR
          else
            if LIST succeeds
              return EACCES
            else
              return ENOENT
            end
          end
        end
      (Diagram: GET, LIST, and CHDIR probes; result EACCES.)
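
The probing strategy on this slide, written as runnable Python. The server object and its get/chdir/list methods are hypothetical, as is the FakeServer class, which exists only so the function can be exercised without a network.

```python
import errno


def probe_error(server, path):
    """Guess why GET failed by trying other operations on the same path.

    `server` is a hypothetical object whose get/chdir/list methods return
    True on success and False on failure. Returns 0 for success,
    otherwise an errno value.
    """
    if server.get(path):
        return 0
    if server.chdir(path):
        return errno.EISDIR   # it is a directory, so GET could not fetch it
    if server.list(path):
        return errno.EACCES   # it exists, but we may not read it
    return errno.ENOENT       # nothing by that name is visible


class FakeServer:
    """In-memory stand-in used only to exercise probe_error."""

    def __init__(self, files, dirs, unreadable):
        self.files, self.dirs, self.unreadable = files, dirs, unreadable

    def get(self, path):
        return path in self.files and path not in self.unreadable

    def chdir(self, path):
        return path in self.dirs

    def list(self, path):
        return path in self.files or path in self.dirs
```

Each failed GET can cost up to two further round trips, and the answer is still a guess, since the path's state may change between probes.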

  29. Error Identification is a Performance Concern • We can always find some way to produce an execution that avoids a silent failure. • Pass all errors up one level. • Retry all errors until time expires. • Abort process completely. • But a known, finite space allows the caller to make targeted decisions about what to do next: • “Not Authorized” -- best to pass up one level. • “Operation Interrupted” -- best to retry here.
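
A minimal Python sketch of such a targeted decision, assuming errors arrive as OSError with a meaningful errno. The function name, the retry count, and the choice of which errnos count as transient are assumptions.

```python
import errno

# Errors assumed to be transient and therefore worth retrying locally.
RETRY_HERE = {errno.EINTR, errno.EAGAIN}


def call_with_policy(operation, attempts=3):
    """Run operation(), retrying only errors known to be transient."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as exc:
            if exc.errno in RETRY_HERE and attempt < attempts:
                continue  # "Operation Interrupted": retry here
            raise         # "Not Authorized", unknown, or retries exhausted
```

Without a known error space the only safe policies are the blunt ones listed above: retry everything, or pass everything up.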

  30. Give the Essence or Give the Details? • Example in file systems: • “Fell off the end of the directory linked list.” • or “No file by that name.” • Example in networking: • “Timer went off, but no network interrupt received.” • or “Connection lost.” • Example in security: • “Failure in PEM_do_header while reading password.” • or “You have no credentials.” • Example in Storage: • HPSS_NOCOS • or ?????

  31. Example and Discussion

  32. Example • Goal: • User requests replication of a file from B to A. • Data Structures at each Node: • A persistent map from LFNs to PFNs. • A persistent store for transactions. • A persistent store for data. • Assumptions: • Files are read-only, no need for invalidation. • All nodes must survive reboot cleanly. • File transfers may be resumed from any point.

  33. (Diagram. Participants: Client; Replica Site A; Replica Catalog listing L1 B, L2 B, L3 B; Replica Site B listing L1 P1, L2 P2, L3 P3. Messages shown: “I want LFN 2”, “Where is LFN 2?”, “At site B.”, “Get LFN 2”, “Got it.”)

  34. (Diagram. The Client sends prepare(get L2) to the Server at Replica Site A and receives T53, then sends commit(T53) and receives ok. The LFN-to-TRN map holds L2 T53. Transaction records shown in sequence: T53.tmp with LFN = L2, PFN = P16, State = Working; T53 with LFN = L2, PFN = P16, State = Working; T53.tmp with LFN = L2, PFN = P16, State = Done; T53 with LFN = L2, PFN = P16, State = Done. P16 is the Physical Data File.)

  35. More Issues • Cleanup at Reboot: • Remove uncommitted transactions. • Jobs in progress: Update LFN->TRN entry. • Client Status Check: • Requesting client examines state of transaction. • Or, other clients indirect through LFN entry. • Notification of Status Change: • Unreliable -- Server sends messages to client. • Reliable -- Server must do transaction to client. • (See Condor-G Paper)
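
One possible reading of the reboot cleanup, sketched in Python. It assumes, hypothetically, that each transaction record is a JSON file named after its id (as in T53 on the previous slide), that a ".tmp" suffix marks an uncommitted record, and that the LFN->TRN map should point only at jobs still in the Working state. None of these details is confirmed by the talk.

```python
import json
import os


def recover_after_reboot(txn_dir):
    """Drop uncommitted records and rebuild the LFN -> TRN map."""
    lfn_to_trn = {}
    for name in sorted(os.listdir(txn_dir)):
        path = os.path.join(txn_dir, name)
        if name.endswith(".tmp"):
            os.unlink(path)                # never committed: remove it
            continue
        with open(path) as f:
            record = json.load(f)
        if record["state"] == "Working":   # job still in progress
            lfn_to_trn[record["lfn"]] = name
    return lfn_to_trn
```

Committed records in the Done state are left on disk, in line with the earlier rule that a committed transaction must be kept.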
