Real-life experiences with grids: It’s not as easy as it looks
Alain Roy
roy@cs.wisc.edu
University of Wisconsin-Madison
Condor Team
Grid Experiences
Who Am I?
• Member of Condor Team
  • Experience with Condor
  • Experience with grid deployment
• Developer of Virtual Data Toolkit
  • Used by GriPhyN, EDG, LCG…
  • Packaging of Globus, Condor, etc.
• Collaborator with INFN
  • Working with Paolo Mazzanti
  • In Bologna for four weeks
Italy
• Italy is beautiful
• The food is wonderful
• The people are friendly
Background
• Condor’s environment is a little like a grid
  • Not all computers (grid sites) are under Condor’s control
  • Computers (grid sites) disappear at the owner’s whim
  • Everything changes constantly
• Condor was built to deal with this dynamic environment
• Grid software needs to do the same
Background
• Late 1980s until today
  • Condor developed and deployed on hundreds of sites
  • Condor built to deal with failures
• Recently
  • Condor-G: your window to the grid
  • Condor team has helped deploy grid technology for real use, not just experiments
Background: Condor
• Condor is a batch job system
• Goal: High-throughput computing
  • Different from high-performance computing
• Goal: High reliability
• Goal: Support distributed ownership
High-Throughput Computing
• Worry about FLOPS/year, not FLOPS/second
• Use all resources effectively
  • Dedicated clusters
  • Non-dedicated computers (desktops)
Effective Resource Use
• Requires high reliability
  • Computers come and go; your jobs shouldn’t
  • Checkpointing
  • Be prepared for everything breaking
• Requires distributed ownership
Condor-G
• Condor-G submits Globus jobs
• Jobs are kept in a persistent queue
  • Unlike globus-job-run
• Jobs are retried on system failures
• Jobs are held on some failures
• Condor-G makes it easy to submit grid jobs
Background: USCMS
• CMS:
  • Detector online in 2007
  • Needs to simulate & reconstruct millions of events
• USCMS testbed
  • Joint PPDG/GriPhyN effort
  • Integrate CMS tools with grid tools
    • Globus
    • Condor-G
  • Contribute real work to CMS
Background: USCMS
• 7 sites, 250+ CPUs
• Spring 2002: Deploy & test
• Fall 2002
  • Last-minute production
  • 150,000 events in two weeks
  • Successful, but lots of work
• Today:
  • Wider deployment & use
Background: DØ
• Experiment at Fermilab
• Already doing real production, real analysis
• Deploying on grid sites today
  • Condor-G
  • Globus
  • SAM
DØ: Condor-G
• They liked Condor-G
• But Condor-G was missing a feature:
  • Deciding which grid site to use
  • SAM (data handling software) knows where data is located
• SAMGrid:
  • Condor-G asks SAM for advice
  • Condor-G decides where to run jobs
DØ: deployment
• Spring: Beginning of deployment
• Late summer: production
• Early results:
  • It looks good
  • We have more work to do
    • Better error reporting
    • Better matchmaking
• What will we learn later?
Problems & Lessons
• During our experiences, we’ve:
  • Encountered many problems
  • Developed solutions to these problems
  • Learned many lessons about grids
• This talk:
  • Shares some interesting problems
  • Gives some advice & solutions
Problem: Taking a taxi
• How do you take a taxi in Paestum, Italy?
• We didn’t need one to get there: we walked the 4 km
• The ruins were lovely
• The ruins were outside
• It was about 35°C
• My wife is pregnant
Lesson: Use all your resources
• Walk up to storekeeper
• Ask: “Dovay Ooon Taxi?” (Dov’è un taxi?)
• Be patient: Wait ten minutes
• Take taxi
• I assumed my resources (local knowledge, Italian) were insufficient, but they saved me time when I used them
Lesson: Use all your resources
• Condor:
  • Uses dedicated machines (I can walk)
  • Uses non-dedicated machines (I can sometimes ask for help)
• Grids:
  • Connect your machine rooms
  • Can you take advantage of other resources?
• Avoid the mentality “I must control all resources” and you will prosper
Grid: distributed machine room?
• You can have good control
  • You can pre-install applications
  • You know how everything works
BUT…
• You lose flexibility
  • How quickly can you upgrade sites?
  • Did they install everything correctly?
  • Can you use new grid sites easily?
Grid: Use all resources
• Assume: basic grid software is installed
• Assume: nothing else is installed
• Bring your software with you
  • Submit one job: install software
  • Submit N jobs: use software
• You control software
• You ensure correct installation
• Easy to use any grid site
Problem: Long-running programs
• Long-running programs crash
• Condor has daemons on each machine:
  • User (job) agent
  • Machine agent
  • Matchmaker
• They crash:
  • Programming errors
  • Network failures
  • Disk failures
  • …
Lesson: Watch programs
• Condor master
  • Small program, rarely changed
  • Runs Condor daemons
• When daemon crashes:
  • Restart daemon, send email
  • If it crashes again, restart after backoff
• Result:
  • Many errors are silently fixed
  • Yet we don’t just ignore crashes
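The restart policy on this slide can be sketched in a few lines. This is an illustration only, not the condor_master implementation: the function name `supervise`, its parameters, the doubling delays, and the restart limit are all assumptions, since the talk gives no numbers.

```python
import time
from typing import Callable, List


def supervise(run_daemon: Callable[[], int],
              notify: Callable[[str], None],
              max_restarts: int = 5,
              base_delay: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> List[float]:
    """Run a daemon and restart it when it exits abnormally.

    Returns the delays that were waited before each restart.
    """
    delays: List[float] = []
    restarts = 0
    while True:
        exit_code = run_daemon()            # blocks until the daemon exits
        if exit_code == 0:
            return delays                   # clean shutdown: stop supervising
        notify(f"daemon exited with code {exit_code}")  # report every crash
        if restarts == max_restarts:
            notify("giving up after repeated crashes")
            return delays
        # First crash: restart at once. Later crashes: wait 1, 2, 4... units.
        delay = 0.0 if restarts == 0 else base_delay * 2 ** (restarts - 1)
        delays.append(delay)
        sleep(delay)
        restarts += 1
```

Every crash is reported through `notify` (email in the talk), so crashes are repaired automatically but never hidden.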
Problem: Short-running programs
• Short-running programs crash/hang
• Example: globus-url-copy
  • USCMS testbed: staging data
  • Some fraction of copies hang or fail
  • Programming error + delicate network
  • Hard to reproduce and fix
Lesson: Watch programs
• When copy exceeds timeout, kill and retry
• Possible to do in shell scripting languages, but not easy
• Use Fault Tolerant Shell to watch programs
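For readers without ftsh, the "kill and retry" rule can be approximated with a general-purpose language. The sketch below uses Python's standard `subprocess.run`, which kills the child process when the timeout expires; `run_with_timeout` is a hypothetical helper, and the command shown is a stand-in for a real copy such as globus-url-copy. It kills only the direct child, not any processes that child started.

```python
import subprocess
import sys
from typing import Sequence


def run_with_timeout(cmd: Sequence[str], timeout_s: float, attempts: int) -> bool:
    """Run cmd; kill it if it exceeds timeout_s; retry up to `attempts` times."""
    for _ in range(attempts):
        try:
            # On timeout, subprocess.run kills the child and raises TimeoutExpired.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            continue                        # hung copy was killed: try again
        if result.returncode == 0:
            return True                     # success
    return False                            # every attempt failed or hung


# Stand-in command that succeeds immediately (replace with the real copy).
ok = run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30.0, attempts=3)
```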
Fault Tolerant Shell
• Shell language built for coping with errors

    try for 30 minutes
        wget http://www.example.com/file.tar.gz
        gunzip file.tar.gz
        tar xf file.tar
    end

• Exponential backoff on failure: wait {1, 2, 4…} seconds × rand in [1,2]
FTSH: exponential backoff
• Why exponential backoff?
  • What if 100 ftsh scripts are executing?
  • Avoid synchronization: reduce load, increase chance of success
  • Similar to Ethernet
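The delay rule quoted for ftsh (wait 1, 2, 4… seconds, multiplied by a random factor between 1 and 2) is easy to write down. The function below is a sketch of that formula only; the name `backoff_delays` and its parameters are assumptions, and it does not claim to match ftsh's internals.

```python
import random
from typing import Callable, List


def backoff_delays(attempts: int,
                   base: float = 1.0,
                   rand: Callable[[], float] = random.random) -> List[float]:
    """Delay before each retry: base * 2**n, times a random factor between 1 and 2."""
    return [base * (2 ** n) * (1.0 + rand()) for n in range(attempts)]
```

The random factor is what breaks synchronization: 100 scripts that fail at the same moment spread their retries over an interval instead of all retrying in the same instant.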
Fault Tolerant Shell
• Easier to cope with failures:

    try 5 times
        wget http://www.example.com/file.tar.gz
    catch
        rm -f file.tar.gz
        failure
    end

• The catch block cleans up the partially downloaded file, if it exists
Fault Tolerant Shell
• Flexible

    try for 30 minutes
        try for 5 minutes
            wget http://example.com/file.tar.gz
        end
        try for 1 minute or 3 times
            gunzip file.tar.gz
            tar xf file.tar
        catch
            rm -rf file.tar
        end
    end

• First inner try (wget): cope with network failure
• Second inner try (gunzip, tar): cope with disk failure
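The "for N minutes or M times" limits and the catch block can be imitated outside ftsh. The following is a rough sketch, not a translation of ftsh semantics: `try_until` is a hypothetical helper, and unlike ftsh it checks the time budget only between attempts rather than interrupting an action that is still running. If neither limit is given it retries forever.

```python
import time
from typing import Callable, Optional


def try_until(action: Callable[[], None],
              max_seconds: Optional[float] = None,
              max_attempts: Optional[int] = None,
              on_failure: Optional[Callable[[], None]] = None,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> None:
    """Retry `action` until it succeeds or a limit is hit, whichever comes first.

    On final failure, run `on_failure` (cleanup) and re-raise the last error.
    """
    start = clock()
    attempt = 0
    while True:
        attempt += 1
        try:
            action()
            return
        except Exception:
            out_of_tries = max_attempts is not None and attempt >= max_attempts
            out_of_time = max_seconds is not None and clock() - start >= max_seconds
            if out_of_tries or out_of_time:
                if on_failure is not None:
                    on_failure()            # e.g. remove a partial file
                raise
            sleep(min(2 ** (attempt - 1), 60))  # capped exponential backoff
```

Nesting works as in the slide: an outer `try_until(..., max_seconds=1800)` can wrap a function that itself calls `try_until` for the download and for the unpack step.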
FTSH: More information
• Work of Doug Thain
  • thain@cs.wisc.edu
• Excellent paper:
  • The Ethernet Approach to Grid Computing, by Doug Thain
  • Available from: http://www.cs.wisc.edu/~thain
• Even if you don’t use FTSH, read this paper!
Problem: Whose error is it?
• The source of an error is not always obvious
• The source of an error influences how you react to the error
• Example: Java universe in Condor
Java Universe
• Users submit Java jobs to Condor
• Whose error is it? Check result code:
  • 1: Program dereferenced NULL pointer → Job shouldn’t run again
  • 1: Job’s image is corrupt → Job shouldn’t run again
  • 1: VM doesn’t have enough memory to run program → Try another machine with more memory
  • 1: Java installation is misconfigured → Don’t use this machine for Java
Lesson: Don’t trust configuration
• User tells Condor: “Java is installed”
  • This is just a hint!
• Condor verifies Java configuration
  • Run simple job, verify output
• If Java works, Condor advertises that Java can be used
• If Java fails, error is reported, Java can’t be used
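The "treat configuration as a hint, then verify" idea can be shown with a small probe. This is a sketch under assumptions: `probe_works`, `build_advertisement`, and the `HasRuntime` attribute are hypothetical names, and the Python interpreter stands in for the Java VM so the example runs anywhere.

```python
import subprocess
import sys
from typing import Sequence


def probe_works(cmd: Sequence[str], expected_output: str,
                timeout_s: float = 30.0) -> bool:
    """Run a tiny test job and check that it prints exactly the expected output."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False                        # missing binary or hung runtime
    return result.returncode == 0 and result.stdout.strip() == expected_output


def build_advertisement(configured: bool, probe_cmd: Sequence[str],
                        expected: str) -> dict:
    """Advertise the capability only if it is configured AND the probe passes."""
    return {"HasRuntime": configured and probe_works(probe_cmd, expected)}


# Stand-in probe: a real deployment would run a trivial Java program instead.
ad = build_advertisement(True, [sys.executable, "-c", "print('hello')"], "hello")
```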
Lesson: Look for error scope
• Add Java wrapper to all Java jobs
  • Run program
  • Examine return code/exception
  • Write all details to file
• Examine output of wrapper, or exception from JVM
  • We know if job is bad
  • We know if JVM is insufficient for job
  • We know if JVM is bad
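A wrapper of this kind replaces a bare exit code with a scope that tells the scheduler how to react. The sketch below mirrors the three outcomes on the slide; it is written in Python rather than Java, and the mapping from specific exception types to scopes is an assumption chosen for illustration, not Condor's actual rules.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Scope(Enum):
    JOB = "job is bad"
    RESOURCE = "machine is insufficient for this job"
    MACHINE = "machine is misconfigured"


# Reaction per scope, following the cases listed in the talk.
REACTION: Dict[Scope, str] = {
    Scope.JOB: "do not run the job again",
    Scope.RESOURCE: "try another machine with more memory",
    Scope.MACHINE: "do not use this machine for this kind of job",
}


def classify(job: Callable[[], None]) -> Optional[Scope]:
    """Run the job inside a wrapper; return None on success, else the error scope."""
    try:
        job()
        return None
    except MemoryError:
        return Scope.RESOURCE               # not enough memory on this machine
    except ImportError:
        return Scope.MACHINE                # stand-in for a broken installation
    except Exception:
        return Scope.JOB                    # the program's own bug
```

With the scope in hand, `REACTION[scope]` gives a different response for each case, which a single exit code of 1 cannot do.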
Error Scope
• We could have an entire talk on error scope
• Excellent paper:
  • Error Scope on a Computational Grid: Theory and Practice, by Doug Thain
• Useful paper even if you don’t use Condor or Java
Problem: Many layers in a grid
[Diagram labels, in the order they appear: condor_submit, Globus GRAM, Condor-G job agent, inetd, Globus gatekeeper, Globus jobmanager, condor_submit, Condor job agent, Condor matchmaker, Execution computer]
We forgot inetd
• We submitted 300 jobs at once
• inetd noticed many connections per second
• inetd presumed there was a denial-of-service attack and refused connections for five minutes
• Lots of debugging!
There are more layers!
USCMS Testbed Architecture (a bit dated)
[Diagram labels: Master Site, Worker, Impala, Globus, MOP, Batch System (Condor, PBS), DAGMan, Real Work, Condor-G]
More layers than that!
USCMS Testbed Architecture (a bit dated)
[Diagram labels, in the order they appear: MCRunJob, Impala, MOP, condor_schedd, DAGMan, Condor-G, condor_schedd, condor_gridmanager, gahp_server, globus-gatekeeper, globus-job-manager, globus-job-manager-script.pl, local batch system submit, local batch system execute, MOP wrapper, Impala wrapper, actual job]
• This disregards inetd, network, file servers, file transfers…
Lesson: Recovery at multiple levels
• Fault tolerance and recovery are built in at many levels:
  • condor_master: restart daemons
  • condor_schedd: job queue
  • DAGMan: checkpoint DAG of jobs
  • gahp_server: isolate Globus libraries
  • And others…
Lesson: Allocate debugging time
• Allocate lots of debugging time
• It is very hard to propagate errors
• How does a user find a remote error?
  • Call system administrator
  • Admin looks through log files for each layer (not accessible to user)
• We need better debugging methods
Problem: Everything will fail (Everything)
• In the USCMS testbed production:
  • Power outage for several hours
  • Network outages lasting from a few minutes to 11 hours
  • Failed configuration change
  • Site upgraded
  • Jobs accidentally removed
  • Software bugs everywhere
How do you cope?
• Condor-G:
  • “Error: job cannot run” is not good enough
  • Resubmit jobs that can be resubmitted, perhaps after a delay
  • Put jobs on hold in queue:
    • User examines hold reason (e.g. proxy is expired)
    • User fixes error
    • User restarts job
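The resubmit-or-hold policy amounts to a small state machine in which a failed job never simply disappears. The sketch below is illustrative only: the class and function names, the failure labels, and the choice of which failures count as retryable are all assumptions, not Condor-G's actual behavior.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"                           # queued, waiting to be (re)submitted
    HELD = "held"                           # waiting for the user to fix something


class Job:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state = State.IDLE
        self.hold_reason: Optional[str] = None
        self.retry_after_s = 0.0


# Hypothetical failure labels that are safe to retry without user action.
RETRYABLE = {"network outage", "remote site restarted"}


def handle_failure(job: Job, failure: str, delay_s: float = 60.0) -> None:
    """Keep the job in the queue: resubmit later if possible, otherwise hold it."""
    if failure in RETRYABLE:
        job.state = State.IDLE              # stays queued for resubmission
        job.retry_after_s = delay_s
    else:
        job.state = State.HELD              # needs a human
        job.hold_reason = failure           # tell the user why


def release(job: Job) -> None:
    """Called after the user has fixed the problem named in hold_reason."""
    job.state = State.IDLE
    job.hold_reason = None
    job.retry_after_s = 0.0
```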
Problem: Everything will fail (Even the little things)
• Condor Matchmaker:
  • Collects descriptions of machines & jobs
  • Soft state in matchmaker (push smarts to edge, like Internet)
• UDP packets to advertise machines
  • Less overhead than many TCP connections
  • Works great in a LAN
• But…
Everything will fail: UDP
• But you lose some UDP packets
  • Send packets every five minutes
  • Keep stale information for 15 minutes
  • Be prepared to cope with stale information
• This has worked for years in Condor
• DØ: matchmaking on grid
  • UDP packets from Korea to Chicago were completely lost on weekdays
  • Added TCP option
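Soft state with these two timers can be sketched as a table whose entries expire unless refreshed. The five-minute and fifteen-minute values come from the slide; the class `SoftStateTable` and its methods are hypothetical and are not the matchmaker's code.

```python
from typing import Dict, List, Tuple

ADVERTISE_INTERVAL_S = 5 * 60               # machines re-send their description
MAX_AGE_S = 15 * 60                         # older entries are dropped


class SoftStateTable:
    """Machine descriptions that expire unless they are refreshed."""

    def __init__(self) -> None:
        self._ads: Dict[str, Tuple[float, dict]] = {}

    def receive(self, machine: str, ad: dict, now: float) -> None:
        self._ads[machine] = (now, ad)      # a newer ad replaces the old one

    def live_machines(self, now: float) -> List[str]:
        # Drop anything not refreshed within MAX_AGE_S, then list the rest.
        self._ads = {m: (t, ad) for m, (t, ad) in self._ads.items()
                     if now - t <= MAX_AGE_S}
        return sorted(self._ads)
```

With these numbers a machine survives two consecutive lost advertisements. If every packet is lost, as on the Korea to Chicago link, the machine vanishes from the table, which is why a TCP option was needed.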
Lesson: Be prepared
• Assume everything will fail
  • Have recovery at multiple levels
  • Understand scope of errors
• Don’t trust configuration:
  • Verify it
  • Install & configure software “on the fly”
• Assume bugs are everywhere
  • Build software to cope with errors