Parrot: Transparent User-Level Middleware for Data-Intensive Computing
Douglas Thain
Condor Project, University of Wisconsin
Workshop on Adaptive Grid Middleware
28 September 2003
The Reality of the Grid
[Cartoon: a proof that reads “afwuhweiuhsdvxmndf (and then a miracle happens) P=NP”.]
• “Look at my new proof!”
• “I think you have a problem here...”
[Figure: two views of the User’s App, with Parrot drawn alongside it.]
• I/O Interface (open, close, read, write, lseek): “access data” on the Local Operating System or on a Storage Server via Chirp, FTP, NeST, RFIO, or DCAP.
• Process Interface (main, exit, abort, kill, sleep): “run this batch job” on the Local Operating System or under Condor, PBS, NQE, LSF, or Load Leveler.
Applications of Parrot
• Interactive Browsing
  • tcsh, tar, gzip, make, acroread, gv, xv...
• Improved Reliability
  • Transparent retry/reassignment/reallocation
  • Files, sockets, even repair broken apps.
• Private Namespaces
  • Make /home/thain appear the same everywhere.
  • Make /usr/data/calibration different everywhere.
• Dynamic/Distributed Program Construction
  • Remote link, remote exec, remote eval...
• Profiling and Debugging
  • Users may not know low-level I/O patterns.
Challenges
• Technical Methods of Interposition
• Semantic Differences
• Error Management
• CPU – I/O Integration
• Performance
• The butterfly effect:
  • Subtle underlying differences can have large effects on performance and usability.
Internal Techniques
• Binary Rewriting
• Polymorphic Extension
• Static or Dynamic Re-Linking
[Figure: one panel per technique. Labels: App Code, New Code, Standard Library, Library (M1, M2, NEW), New Library.]
External Techniques
• Debugger Trap
• Remote Filesystem
• Kernel Callout
[Figure: one panel per technique. Labels: App, Agent, Kernel, and the filesystems NFS, LFS, FFS, USR.]
Hole Detection Matters
• Dynamic Linking
  • Bypass Toolkit, ca. 2000
  • Works with some standard tools.
  • Many still crash in strange ways.
  • Doesn’t apply to statically linked executables; always a surprise.
• Debugger Trap
  • Parrot: Coding began in May of 2003.
  • Works reliably with almost everything in /usr/bin.
  • Caveat #1: Twice as much code
  • Caveat #2: Higher latency
Debugger Trap
• For the rest of this talk, we select the debugger trap for completeness and reliability. Much of the discussion still applies to the other techniques too.
• Some technical details in the paper:
  • Only on Linux.
  • Must manage process ancestry.
  • Must fudge some broken ptrace behavior.
  • Cannot write directly to the process; must take a roundabout path through a temp file.
[Figure: Parrot internals, from the user process down to the device drivers.]
• User Process: SYS_open, SYS_read, SYS_write (debugger trap)
• Parrot entry points: parrot_open, parrot_read, parrot_write
• File Descr.: 0 1 2 3 4 5 6 7 8 9 ...
• File Pointers: pos: 100, pos: 0, pos: 0, pos: 1 MB, pos: 42
• File Objects: “outfile”, “infile”, “config”, “data”
• Name resolver: mount list driver, chirp lookup driver
• Device Drivers: Local Driver, Chirp Driver, FTP Driver, NeST Driver, RFIO Driver, DCAP Driver
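The layering on this slide (file descriptors, file pointers, file objects, device drivers) can be sketched roughly as follows. This is a minimal illustration and not Parrot's code: only the names parrot_open and parrot_read come from the slide, while MemoryDriver, parrot_dup, and every class name are hypothetical stand-ins invented for the example.

```python
class MemoryDriver:
    """Hypothetical stand-in for a device driver (Local, Chirp, FTP, ...)."""

    def __init__(self, files):
        self.files = files  # maps file name -> bytes

    def pread(self, name, length, offset):
        # Positioned read: the driver itself keeps no seek position.
        return self.files[name][offset:offset + length]


class FileObject:
    """One open file: its name plus the driver that serves it."""

    def __init__(self, name, driver):
        self.name = name
        self.driver = driver


class FilePointer:
    """A seek position layered on top of a file object."""

    def __init__(self, file_object):
        self.file_object = file_object
        self.pos = 0


class DescriptorTable:
    """Maps small integers (file descriptors) to file pointers."""

    def __init__(self):
        self.slots = {}

    def _lowest_free(self):
        fd = 0
        while fd in self.slots:
            fd += 1
        return fd

    def parrot_open(self, name, driver):
        fd = self._lowest_free()
        self.slots[fd] = FilePointer(FileObject(name, driver))
        return fd

    def parrot_dup(self, fd):
        # Assumed behavior: a duplicate descriptor shares the same pointer.
        new_fd = self._lowest_free()
        self.slots[new_fd] = self.slots[fd]
        return new_fd

    def parrot_read(self, fd, length):
        pointer = self.slots[fd]
        obj = pointer.file_object
        data = obj.driver.pread(obj.name, length, pointer.pos)
        pointer.pos += len(data)  # advance the shared position
        return data
```

Keeping the position in a separate pointer object is what lets two descriptors share one offset while a second open of the same file gets its own.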
Adaptation
• On distant host: /mydata -> /ftp/host2/opt/DAT
  • App calls open(“/mydata/foo”); Parrot (Local, Chirp, FTP drivers) reaches ftpd serving /opt/DAT.
• On nearby host: /mydata -> /chirp/host1/usr/mydata
  • App calls open(“/mydata/foo”); Parrot (Local, Chirp, FTP drivers) reaches chirpd.
• On same host: /mydata -> /usr/data
  • App calls open(“/mydata/foo”); Parrot (Local, Chirp, FTP drivers) reads /usr/data locally.
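The per-host mappings on this slide amount to a mount list consulted before each open. A minimal sketch of such a lookup is given here, assuming longest-prefix matching on path components; the function name resolve and the list layout are illustrative and not taken from Parrot.

```python
# Mount lists for the three hosts on the slide: (logical prefix, physical target).
DISTANT_HOST = [("/mydata", "/ftp/host2/opt/DAT")]
NEARBY_HOST = [("/mydata", "/chirp/host1/usr/mydata")]
SAME_HOST = [("/mydata", "/usr/data")]


def resolve(path, mount_list):
    """Rewrite a logical path using the longest matching mount prefix.

    Prefixes are assumed to be written without a trailing slash.
    Paths that match no entry are returned unchanged.
    """
    best = None
    for prefix, target in mount_list:
        # Match whole path components only, so /mydata does not match /mydatax.
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, target)
    if best is None:
        return path
    prefix, target = best
    return target + path[len(prefix):]
```

With this sketch, the same call open(“/mydata/foo”) would be routed to /ftp/host2/opt/DAT/foo, /chirp/host1/usr/mydata/foo, or /usr/data/foo depending only on which mount list the host loads; the application never changes.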
What Protocol?
• File Transfer Protocol:
  • Internet standard, many implementations.
  • High-bandwidth sequential access.
• NeST:
  • General-purpose storage appliance from UW.
  • Virtual users, namespace, and allocation.
• RFIO:
  • Remote I/O protocol used with CERN CASTOR.
  • UNIX-like; most ops require a new TCP connection.
• DCAP:
  • Remote I/O protocol used with Fermi D-Cache.
  • UNIX-like, WORM semantics, no directories, caching.
• Chirp:
  • Protocol developed at UW for Parrot.
  • Corresponds very closely to UNIX, including errnos.
Small Details Matter
• Standard tools need to know subtle details; otherwise, they break:
  • ls -lR performs getdents(“foo”)
    • on success: descend
    • on ENOTDIR: display and continue
    • on ENOENT: display error and stop.
• FTP does not provide this detail:
  • Failed LIST -> error 550
  • Failed GET -> error 550
  • Failed CWD -> error 550
• Simple assignment doesn’t work:
  • Making 550=ENOENT breaks many tools.
Example Solution
[Figure: decision diagram that issues LIST “foo”, then CWD “foo”, then SIZE “foo”, branching on reply codes 200, 550, and other. Outcomes: Success, Transient Error, Not a dir., Access denied., No such entry.]
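One plausible reading of this diagram is sketched below. The edges are an assumption reconstructed from the labels, not a transcription of the slide: a 550 on LIST leads to CWD, a 550 on CWD leads to SIZE, and any other unexpected reply is treated as transient. The callable send_command is hypothetical, the slide's simplified reply codes (200 and 550) are kept as-is, and the use of EAGAIN for "Transient Error" is likewise a choice made for the example.

```python
import errno


def classify_listing(send_command, name):
    """Return 0 on success or an errno describing why LIST failed.

    send_command(verb, name) is a hypothetical callable that sends one
    FTP command and returns the numeric reply code.
    """
    code = send_command("LIST", name)
    if code == 200:
        return 0                 # Success
    if code != 550:
        return errno.EAGAIN      # Transient Error (assumed mapping)

    # LIST was refused; ask whether the name is a directory at all.
    code = send_command("CWD", name)
    if code == 200:
        return errno.EACCES      # A directory we may enter but not list
    if code != 550:
        return errno.EAGAIN

    # Not enterable as a directory; ask whether it exists as a file.
    code = send_command("SIZE", name)
    if code == 200:
        return errno.ENOTDIR     # Not a dir.
    if code == 550:
        return errno.ENOENT      # No such entry.
    return errno.EAGAIN
```

Under this reading, ls -lR receives ENOTDIR for a plain file and ENOENT for a missing name, which is the distinction a blanket 550=ENOENT mapping loses.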
CPU-IO Integration
• Errors that cannot be expressed in the client’s interface must be passed to a higher level (the batch system).
• Simple options:
  • kill -9 application (retry app elsewhere)
  • exit(1) application (don’t retry app)
• Complex options: (Condor only)
  • restart with (Subnet!=“128.101.175”)
  • restart with (CurrentTime>5pm)
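The two simple options rely on the batch system telling a signal death apart from an ordinary exit. A rough sketch of that distinction follows, seen from a supervising process; the function names and the returned labels are hypothetical, and treating every nonzero exit code like exit(1) is an assumption of the example.

```python
import subprocess


def classify_exit(returncode):
    """Map a job's exit status to the supervisor's next step (assumed policy)."""
    if returncode == 0:
        return "done"
    if returncode < 0:
        # subprocess reports death by signal N as -N, so kill -9 yields -9.
        return "retry elsewhere"
    # An ordinary nonzero exit such as exit(1): the job itself failed.
    return "do not retry"


def run_job(argv):
    """Run one job to completion and decide what to do next."""
    return classify_exit(subprocess.run(argv).returncode)
```

The Condor-only options go further by attaching a constraint to the restart, which this sketch does not attempt to model.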
Bandwidth by Protocol
[Figure: bandwidth plotted for each protocol, with markers labeled “unix default hint” and “parrot default hint”.]
Andrew-Like Benchmark
• The original Andrew benchmark is no longer appropriate, so its source tree is replaced with the Parrot source: 296 files, 955 KB.
• Copy the source to a remote device, then manipulate it in five stages:
  • copy: cp -rp
  • list: ls -lR
  • scan: grep searchstring -r *
  • make: make
  • delete: rm -rf *
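The five stages could be timed with a small harness like the one below. This is only a sketch: the paths SRC and DST are hypothetical placeholders, the commands are adapted to take explicit directories instead of running inside the target, and nothing here reflects how the measurements in the talk were actually taken.

```python
import subprocess
import time

SRC = "parrot-src"            # hypothetical local source tree
DST = "/tmp/andrew-like-demo"  # hypothetical target on the device under test

STAGES = [
    ("copy", f"cp -rp {SRC} {DST}"),
    ("list", f"ls -lR {DST}"),
    ("scan", f"grep -r searchstring {DST}"),
    ("make", f"make -C {DST}"),
    ("delete", f"rm -rf {DST}"),
]


def run_stages(stages, run=None, clock=time.perf_counter):
    """Run each stage in order and return {stage name: elapsed seconds}."""
    if run is None:
        # Default runner: execute through the shell and ignore exit codes
        # (grep, for example, exits nonzero when nothing matches).
        def run(command):
            subprocess.run(command, shell=True, check=False)

    timings = {}
    for name, command in stages:
        start = clock()
        run(command)
        timings[name] = clock() - start
    return timings
```

Passing a different run callable makes it possible to dry-run the stage list or to wrap each command so that it executes under the interposition agent being measured.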
Moral of the story:
• The butterfly effect: Small underlying differences can have big effects on performance and reliability.
• Examples in interposition:
  • Dynamic linking: fast but poor hole detection.
  • Debugger trap: slow but good hole detection.
• Examples in protocols:
  • Chirp: UNIX semantics restrict bandwidth.
  • FTP: Need for multiple ops increases latency.
  • NeST: Powerful virtualization increases latency.
  • RFIO: Connection per op doesn’t scale.
For more info...
• Douglas Thain
  • thain@cs.wisc.edu
• Miron Livny
  • miron@cs.wisc.edu
• Software, manuals, more info:
  • http://www.cs.wisc.edu/condor/parrot
• The Condor Project:
  • http://www.cs.wisc.edu/condor