Real-life experiences with grids: It’s not as easy as it looks

Alain Roy (roy@cs.wisc.edu)
University of Wisconsin-Madison, Condor Team
Grid Experiences

Who Am I?
• Member of Condor Team
  • Experience with Condor
  • Experience with grid deployment
• Developer of Virtual Data Toolkit
  • Used by GriPhyN, EDG, LCG…
  • Packaging of Globus, Condor, etc.
• Collaborator with INFN
  • Working with Paolo Mazzanti
  • In Bologna for four weeks

Italy
• Italy is beautiful
• The food is wonderful
• The people are friendly

Background
• Condor’s environment is a little like a grid
  • Not all computers (grid sites) are under Condor’s control
  • Computers (grid sites) disappear at the owner’s whim
  • Everything changes constantly
• Condor was built to deal with this dynamic environment
• Grid software needs to do the same

Background
• Late 1980s until today
  • Condor developed and deployed on hundreds of sites
  • Condor built to deal with failures
• Recently
  • Condor-G: your window to the grid
  • Condor team has helped deploy grid technology for real use, not just experiments

Background: Condor
• Condor is a batch job system
• Goal: High-throughput computing
  • Different from high-performance computing
• Goal: High reliability
• Goal: Support distributed ownership

High-Throughput Computing
• Worry about FLOPS/year, not FLOPS/second
• Use all resources effectively
  • Dedicated clusters
  • Non-dedicated computers (desktop)

Effective Resource Use
• Requires high reliability
  • Checkpointing
  • Be prepared for everything breaking
• Requires distributed ownership
Computers come and go; your jobs shouldn’t.

Condor-G
• Condor-G submits Globus jobs
• Jobs are in a persistent queue
  • Unlike globus-job-run
• Jobs are retried on system failures
• Jobs are held on some failures
• Condor-G makes it easy to submit grid jobs

Background: USCMS
• CMS:
  • Detector online in 2007
  • Needs to simulate & reconstruct millions of events
• USCMS testbed
  • Joint PPDG/GriPhyN effort
  • Integrate CMS tools with grid tools
    • Globus
    • Condor-G
  • Contribute real work to CMS

Background: USCMS
• 7 sites, 250+ CPUs
• Spring 2002: Deploy & test
• Fall 2002
  • Last-minute production
  • 150,000 events in two weeks
  • Successful, but lots of work
• Today:
  • Wider deployment & use

Background: DØ
• Experiment at Fermilab
• Already doing real production, real analysis
• Deploying on grid sites today
  • Condor-G
  • Globus
  • SAM

DØ: Condor-G
• They liked Condor-G
• Condor-G was missing a feature:
  • Deciding which grid site to use
• SAM (data handling software) knows where data is located
• SAMGrid:
  • Condor-G asks SAM for advice
  • Condor-G decides where to run jobs

DØ: deployment
• Spring: Beginning of deployment
• Late summer: production
• Early results:
  • It looks good
  • We have more work to do
    • Better error reporting
    • Better matchmaking
• What will we learn later?

Problems & Lessons
• During our experiences, we’ve:
  • Encountered many problems
  • Developed solutions to these problems
  • Learned many lessons about grids
• This talk:
  • Shares some interesting problems
  • Gives some advice & solutions

Problem: Taking a taxi
• How do you take a taxi in Paestum, Italy?
• We don’t need one to get there: walk 4 km
• The ruins were lovely
• The ruins were outside
• It was about 35°C
• Wife is pregnant

Lesson: Use all your resources
• Walk up to storekeeper
• Ask: Dovay Ooon Taxi? (Dove un taxi? That is, “Where is a taxi?”)
• Be patient: Wait ten minutes
• Take taxi
• I assumed my resources (local knowledge, Italian) were insufficient, but they saved me time when I used them

Lesson: Use all your resources
• Condor:
  • Uses dedicated machines (I can walk)
  • Uses non-dedicated machines (I can sometimes ask for help)
• Grids:
  • Connect your machine rooms
  • Can you take advantage of other resources?
• Avoid the mentality “I must control all resources” and you will prosper

Grid: distributed machine room?
• You can have good control
• You can pre-install applications
• You know how everything works
BUT…
• You lose flexibility
• How quickly can you upgrade sites?
• Did they install everything correctly?
• Can you use new grid sites easily?

Grid: Use all resources
• Assume: basic grid software is installed
• Assume: nothing else is installed
• Bring your software with you
  • Submit one job: install software
  • Submit N jobs: use software
• You control software
• You ensure correct installation
• Easy to use any grid site

Problem: Long-running programs
• Long-running programs crash
• Condor has daemons on each machine:
  • User (job) agent
  • Machine agent
  • Matchmaker
• They crash:
  • Programming errors
  • Network failures
  • Disk failures
  • …

Lesson: Watch programs
• Condor master
  • Small program, rarely changed
  • Runs Condor daemons
• When daemon crashes:
  • Restart daemon, send email
  • If it crashes again, restart after backoff
• Result:
  • Many errors are silently fixed
  • Yet we don’t just ignore crashes
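The restart-with-backoff behaviour described on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not how condor_master is implemented: the function name `supervise`, the `max_restarts` limit, the injectable `run`, `sleep` and `notify` hooks (with `notify` standing in for the email) and the doubling delay are all assumptions made for the example.

```python
import subprocess
import time


def supervise(cmd, max_restarts=5, base_delay=1.0,
              run=subprocess.call, sleep=time.sleep, notify=print):
    """Run cmd and restart it whenever it exits with a non-zero status.

    The first restart is immediate; each later restart waits twice as long
    as the previous one. Returns the number of restarts that were needed.
    All names and defaults here are hypothetical.
    """
    restarts = 0
    delay = 0.0  # no wait before the first restart
    while True:
        exit_code = run(cmd)
        if exit_code == 0:
            return restarts  # clean exit: nothing left to repair
        notify(f"{cmd[0]} exited with code {exit_code}")  # do not hide the crash
        if restarts == max_restarts:
            raise RuntimeError(f"{cmd[0]} keeps crashing; giving up")
        sleep(delay)
        delay = base_delay if delay == 0 else delay * 2
        restarts += 1
```

The crash is always reported before the restart, which matches the point of the slide: errors get fixed automatically, but they are not ignored.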

Problem: Short-running programs
• Short-running programs crash/hang
• Example: globus-url-copy
  • USCMS testbed: staging data
  • Some fraction of copies hang or fail
  • Programming error + delicate network
  • Hard to reproduce and fix

Lesson: Watch programs
• When copy exceeds timeout, kill and retry
• Possible to do in shell scripting languages, but not easy
• Use Fault Tolerant Shell to watch programs
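For readers who do not use the Fault Tolerant Shell, the kill-and-retry idea can be approximated with Python's standard `subprocess` module. This is a minimal sketch: the function name `run_with_timeout` and its parameters are made up for the example, and the command being run is a harmless stand-in rather than a real copy tool.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, attempts):
    """Run cmd; kill it if it exceeds timeout_s; retry up to `attempts` times.

    Returns True as soon as one attempt exits with status 0, else False.
    """
    for _ in range(attempts):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the timeout elapses.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            continue  # hung: treat it like a failure and try again
        if result.returncode == 0:
            return True
    return False


# Example: a stand-in "copy" command that finishes immediately.
ok = run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30, attempts=3)
```

A production version would normally wait between attempts instead of retrying immediately.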

Fault Tolerant Shell
• Shell language built for coping with errors

    try for 30 minutes
      wget http://www.example.com/file.tar.gz
      gunzip file.tar.gz
      tar xf file.tar
    end

Exponential backoff on failure: Wait {1, 2, 4…} seconds * rand in [1,2]

FTSH: exponential backoff
• Why exponential backoff?
  • What if 100 ftsh scripts are executing?
  • Avoid synchronization → reduce load, increase chance of success
  • Similar to Ethernet
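The wait rule quoted for FTSH ({1, 2, 4…} seconds multiplied by a random factor in [1,2]) can be written out as a short Python generator. The name `backoff_delays` and its parameters are illustrative only; this is not FTSH's own code.

```python
import random


def backoff_delays(attempts, base=1.0, rng=random.random):
    """Yield the wait before each retry: base * 2**n times a random factor in [1, 2]."""
    for n in range(attempts):
        # rng() is in [0, 1), so the multiplier spreads retries over [1, 2]
        # and keeps many clients from retrying at the same instant.
        yield base * (2 ** n) * (1.0 + rng())
```

The random factor is what breaks the synchronization: 100 scripts that fail together will not all retry together.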

Fault Tolerant Shell
• Easier to cope with failures:

    try 5 times
      wget http://www.example.com/file.tar.gz
    catch
      rm -f file.tar.gz
      failure
    end

Clean up partially downloaded file, if it exists

Fault Tolerant Shell
• Flexible

    try for 30 minutes
      try for 5 minutes
        wget http://example.com/file.tar.gz
      end
      try for 1 minute or 3 times
        gunzip file.tar.gz
        tar xf file.tar
      catch
        rm -rf file.tar
      end
    end

• The wget block: Cope with network failure
• The gunzip/tar block: Cope with disk failure

FTSH: More information
• Work of Doug Thain
  • thain@cs.wisc.edu
• Excellent paper:
  • The Ethernet Approach to Grid Computing, by Doug Thain
  • Available from: http://www.cs.wisc.edu/~thain
• Even if you don’t use FTSH, read this paper!

Problem: Whose error is it?
• The source of an error is not always obvious
• The source of an error influences how you react to the error
• Example: Java universe in Condor

Java Universe
• Users submit Java jobs to Condor
• Whose error is it? Check result code:
  • 1: Program dereferenced NULL pointer → Job shouldn’t run again
  • 1: Job’s image is corrupt → Job shouldn’t run again
  • 1: VM doesn’t have enough memory to run program → Try another machine with more memory
  • 1: Java installation is misconfigured → Don’t use this machine for Java

Lesson: Don’t trust configuration
• User tells Condor: “Java is installed”
  • This is just a hint!
• Condor verifies Java configuration
  • Run simple job, verify output
• If Java works, Condor advertises that Java can be used
• If Java fails, error is reported, Java can’t be used
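The "configuration is just a hint" check can be sketched as a probe job whose output is verified before anything is advertised. The sketch below is not Condor's code: the function names, the `HasRuntime` attribute and the use of the current Python interpreter in place of a JVM are all assumptions chosen so that the example runs anywhere.

```python
import subprocess
import sys


def runtime_works(cmd, expected_output, timeout_s=30):
    """Run a tiny probe job and check that it prints exactly what we expect."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False  # missing binary or hung probe: treat as unusable
    return result.returncode == 0 and result.stdout.strip() == expected_output


def build_advertisement(configured, probe_cmd, expected_output):
    """Advertise the capability only if it is both configured and verified."""
    return {"HasRuntime": configured and runtime_works(probe_cmd, expected_output)}


# Example: use the current Python interpreter as the stand-in runtime.
probe = [sys.executable, "-c", "print('hello')"]
ad = build_advertisement(True, probe, "hello")
```

The administrator's setting only decides whether the probe is attempted; the probe result decides what is advertised.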

Lesson: Look for error scope
• Add Java wrapper to all Java jobs
  • Run program
  • Examine return code/exception
  • Write all details to file
• Examine output of wrapper, or exception from JVM
  • We know if job is bad
  • We know if JVM is insufficient for job
  • We know if JVM is bad
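The wrapper idea can be illustrated by analogy in Python: run the job inside a wrapper, look at what kind of exception escaped, and map it to a scope that says how to react. The mapping below (MemoryError for "machine too small", ImportError for "installation is broken") is an assumption made for the sketch; it is not the actual Java wrapper used by Condor.

```python
from enum import Enum


class Scope(Enum):
    JOB = "job is bad: do not run it again"
    MACHINE_TOO_SMALL = "machine lacks resources: try another machine"
    MACHINE_BROKEN = "runtime is broken here: stop using this machine"


def run_wrapped(job):
    """Run job() and return (result, None) on success or (None, Scope) on failure."""
    try:
        return job(), None
    except MemoryError:
        return None, Scope.MACHINE_TOO_SMALL
    except ImportError:
        # Stand-in for a misconfigured installation (missing runtime pieces).
        return None, Scope.MACHINE_BROKEN
    except Exception:
        # Anything else is treated as a bug in the job itself.
        return None, Scope.JOB
```

Without the wrapper, all three cases would collapse into the same non-zero exit code, which is exactly the problem the Java universe slide describes.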

Error Scope
• We could have an entire talk on error scope
• Excellent paper: Error Scope on a Computational Grid: Theory and Practice, by Doug Thain
• Useful paper even if you don’t use Condor or Java

Problem: Many layers in a grid
[Diagram. Components shown: condor_submit, Globus GRAM, Condor-G job agent, inetd, Globus gatekeeper, Globus jobmanager, condor_submit, Condor job agent, Condor matchmaker, Execution computer]

We forgot inetd
• We submitted 300 jobs at once
• Inetd noticed many connections per second
• Inetd presumed there was a denial of service attack and refused connections for five minutes
• Lots of debugging!

There are more layers!
USCMS Testbed Architecture (A bit dated)
[Diagram. Labels shown: Master Site, Worker, Impala, Globus, MOP, Batch System (Condor, PBS), DAGMan, Real Work, Condor-G]

More layers than that!
USCMS Testbed Architecture (A bit dated)
[Diagram. Layers shown, in the order they appear on the slide:]
• MCRunJob
• Impala
• MOP
• condor_schedd
• DAGMan
• Condor-G
• condor_schedd
• condor_gridmanager
• gahp_server
• globus-gatekeeper
• globus-job-manager
• globus-job-manager-script.pl
• local batch system submit
• local batch system execute
• MOP wrapper
• Impala wrapper
• actual job
This disregards inetd, network, file servers, file transfers…

Lesson: Recovery at multiple levels
• Fault-tolerance and recovery is built in at many levels:
  • Condor_master: restart daemons
  • Condor_schedd: job queue
  • DAGMan: checkpoint DAG of jobs
  • Gahp_server: isolate Globus libraries
  • And others…

Lesson: Allocate debugging time
• Allocate lots of debugging time
• It is very hard to propagate errors
• How does a user find a remote error?
  • Call system administrator
  • Admin looks through log files for each layer (not accessible to user)
• We need better debugging methods

Problem: Everything will fail (Everything)
• In the USCMS testbed production:
  • Power outage for several hours
  • Network outages: from a few minutes to 11 hours
  • Failed configuration change
  • Site upgraded
  • Jobs accidentally removed
  • Software bugs everywhere

How do you cope?
• Condor-G:
  • “Error: job cannot run” is not good enough
  • Resubmit jobs that can be resubmitted, perhaps after a delay
  • Put jobs on hold in queue:
    • User examines hold reason (proxy is expired)
    • User fixes error
    • User restarts job
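The resubmit-or-hold policy can be sketched with a small job record. The class and function names, the string statuses and the `transient` flag are hypothetical and chosen for illustration; Condor-G's real queue is considerably richer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    name: str
    status: str = "idle"               # "idle" (queued) or "held"
    hold_reason: Optional[str] = None
    retries: int = 0


def handle_failure(job, error, transient):
    """Requeue the job after a transient error; otherwise hold it with a reason."""
    if transient:
        job.retries += 1
        job.status = "idle"            # stays in the queue, to be resubmitted
    else:
        job.status = "held"
        job.hold_reason = error        # shown to the user, who fixes the cause


def release(job):
    """The user has fixed the problem: put the held job back in the queue."""
    job.status = "idle"
    job.hold_reason = None
```

In both cases the job stays in the queue: a failure changes its state instead of discarding it.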

Problem: Everything will fail (Even the little things)
• Condor Matchmaker:
  • Collects descriptions of machines & jobs
  • Soft state in matchmaker (push smarts to edge, like Internet)
  • UDP packets to advertise machines
    • Less overhead than many TCP connections
    • Works great in a LAN
• But…

Everything will fail: UDP
• But you lose some UDP packets
  • Send packets every five minutes
  • Keep stale information for 15 minutes
  • Be prepared to cope with stale information
  • This has worked for years in Condor
• DØ: matchmaking on grid
  • UDP packets from Korea to Chicago were completely lost on weekdays
  • Added TCP option
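The soft-state idea (machines re-advertise every five minutes, the collector forgets anything older than 15 minutes) can be sketched as a table with expiry. The class name `SoftStateTable` and the injectable clock are assumptions for the example; this is not the matchmaker's actual data structure.

```python
import time


class SoftStateTable:
    """Keep the latest ad per machine and forget ads that are not refreshed."""

    def __init__(self, max_age_s=15 * 60, clock=time.monotonic):
        self.max_age_s = max_age_s
        self.clock = clock
        self._ads = {}  # machine name -> (time received, ad)

    def update(self, machine, ad):
        """Record a fresh advertisement, replacing any older one."""
        self._ads[machine] = (self.clock(), ad)

    def live_ads(self):
        """Return the ads that were refreshed within max_age_s seconds."""
        now = self.clock()
        # Drop entries whose last refresh is older than max_age_s.
        self._ads = {m: (t, ad) for m, (t, ad) in self._ads.items()
                     if now - t <= self.max_age_s}
        return {m: ad for m, (t, ad) in self._ads.items()}
```

With a five-minute send interval and a 15-minute lifetime, an entry survives two consecutive lost packets; a machine that stops advertising simply ages out without any explicit removal message.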

Lesson: Be prepared
• Assume everything will fail
  • Have recovery at multiple levels
  • Understand scope of errors
• Don’t trust configuration:
  • Verify it
  • Install & configure software “on the fly”
• Assume bugs are everywhere
  • Build software to cope with errors
