Workflow Tools Used in the SCEC CyberShake Project
Scott Callaghan
Southern California Earthquake Center, University of Southern California
Gateway Workflow Survey, December 11, 2009
CyberShake Science • What will peak ground motion be over the next 50 years? • Used in building codes, insurance, government, planning • Probabilistic Seismic Hazard Analysis (PSHA) • Communicated via hazard curves and maps
[Figure: hazard curve for downtown LA — probability of exceeding 0.1 g in 50 years; the 2%-in-50-years level corresponds to 0.6 g]
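The "2% in 50 years" level on a hazard curve can be converted to an annual exceedance rate under the Poisson assumption standard in PSHA. This is textbook PSHA arithmetic, not something stated on the slides:

```python
import math

def annual_rate(prob, years):
    """Annual exceedance rate lambda for a Poisson model:
    P(at least one exceedance in t years) = 1 - exp(-lambda * t)."""
    return -math.log(1.0 - prob) / years

lam = annual_rate(0.02, 50)   # 2% probability of exceedance in 50 years
return_period = 1.0 / lam     # mean time between exceedances
print(round(return_period))   # ~2475 years, the familiar "2475-year" ground motion
```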
Computational Requirements (per science site) • SGT Creation • Post-Processing
CyberShake Dependencies (per science site)
Mesh generation (×1) → SGT simulation (×2) → extract SGT (×7,000) → Seismogram synthesis (×415,000) → PSA (×415,000)
Software Requirements • High throughput • Large number of short-running jobs • Data management • 840,000 output files per science site • Stage-in and -out • Resource provisioning • Acquire grid resources for execution • Possibly multiple execution sites • Error identification and recovery • Use community account
Workflow Tools • Pegasus/Condor/Globus stack • Create workflow description in Pegasus • Abstract workflow (DAX) with logical names • Plan workflow for a specific execution site • Concrete workflow (DAG) with physical paths • Adds stage-in and stage-out of data • Wraps jobs in Kickstart • Can be mined with the NetLogger toolkit • Bundles multiple tasks into a single job • Easier to change execution sites
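A DAX is an XML description of jobs, logical files, and dependencies. The fragment below hand-builds a drastically simplified two-job example with Python's standard library; the element names follow the general shape of a Pegasus DAX but are illustrative rather than schema-exact, and the job/file names are made up:

```python
import xml.etree.ElementTree as ET

# Minimal abstract-workflow sketch: two jobs linked by a logical file.
# Note only *logical* names appear -- planning into a concrete DAG is
# what resolves them to physical paths on the chosen execution site.
adag = ET.Element("adag", name="cybershake-site")

extract = ET.SubElement(adag, "job", id="ID0000001", name="extract_sgt")
ET.SubElement(extract, "uses", file="sgt_data", link="input")
ET.SubElement(extract, "uses", file="rupture_sgt", link="output")

synth = ET.SubElement(adag, "job", id="ID0000002", name="seismogram_synthesis")
ET.SubElement(synth, "uses", file="rupture_sgt", link="input")
ET.SubElement(synth, "uses", file="seismogram", link="output")

# Dependency: synthesis runs after extraction
child = ET.SubElement(adag, "child", ref="ID0000002")
ET.SubElement(child, "parent", ref="ID0000001")

print(ET.tostring(adag, encoding="unicode"))
```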
Workflow Tools (2) • Submit workflow via Pegasus • Submits to Condor DAGMan • Tracks dependencies • Matches jobs to resources • Jobs enter queue on local host • Communicates via Globus to remote system • On job failure • Retry job • Write rescue DAG as checkpoint
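The concrete workflow DAGMan executes is a plain-text DAG file. A minimal hand-written example (the job and submit-file names here are invented for illustration) shows the dependency and retry structure; on an unrecoverable failure, DAGMan writes a rescue DAG so a resubmission skips already-completed jobs:

```text
# example.dag -- illustrative DAGMan input file
JOB  extract  extract.sub
JOB  synth    synth.sub
PARENT extract CHILD synth
RETRY  synth 3        # retry a failed job up to 3 times before giving up
```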
Software Requirements Fulfilled • High throughput • Data management • Pegasus adds staging automatically, tracks input and output files for jobs • Resource provisioning • Condor supports remote submission to batch queues • Also glideins (temporary Condor pool) • Error identification and recovery • Automatic retries, rescue DAGs
CyberShake Map Calculation • Calculated Southern California hazard map on Ranger (223 sites) • 1200 wallclock hours • 189 million tasks (43 tasks/sec) • 3.9 million Condor jobs • Averaged 4424 cores, 14544 at peak (23% of Ranger) • 2.1 TB of output in 36,000 staged (zipped) files
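The throughput figure follows directly from the task count and wallclock time, and the quoted 23% peak matches Ranger's 62,976 total cores — a system size taken from TACC's published specifications, not from these slides:

```python
tasks = 189_000_000
hours = 1200
rate = tasks / (hours * 3600)
print(f"{rate:.1f} tasks/sec")             # ~43.8, quoted as 43 tasks/sec

peak_cores = 14_544
ranger_cores = 62_976                      # Ranger: 3,936 nodes x 16 cores
print(f"{peak_cores / ranger_cores:.0%}")  # 23% of Ranger at peak
```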
Experiences from Map Calculation • For CyberShake, configurability is very important • Adjusted many Condor scheduler parameters • Modified Pegasus bundling factors • Used sub-workflows to manage load • Experimenting with priorities • So is automation • Set up workflow hierarchies to submit multiple workflows • Cron job monitored queue to submit new science sites when needed • Corral automated glidein submission and monitoring
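Bundling (Pegasus calls it job clustering) is typically controlled through profiles attached to transformations. The fragment below sketches the idea in the text-format transformation catalog; the exact syntax varies across Pegasus versions, and the transformation name and path are invented, so treat this as illustrative only:

```text
# Transformation-catalog entry (illustrative; format is version-dependent)
tr scec::seismogram_synthesis:1.0 {
  site ranger {
    pfn "/path/to/synth"
    profile pegasus "clusters.size" "50"   # bundle 50 tasks per Condor job
  }
}
```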
Experiences (cont.) • Can be hard to understand impact of parameters • Performing parameter studies • With high job counts, takes some work to reduce local and remote load • Bundling, glideins, breaking up workflows • A higher-level monitoring tool is needed • Too many log files to tail • Designed a Run Manager to track workflow status
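The Run Manager itself is not described further here, but its core idea — summarize many job logs instead of tailing each one — can be sketched. The event format below is invented for illustration and is not Pegasus's actual log schema:

```python
from collections import Counter

def summarize(log_lines):
    """Count terminal job states from event lines like 'JOB <name> <STATE>'.
    (Hypothetical log format, for illustration only.)"""
    states = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == "JOB":
            states[parts[2]] += 1
    return states

log = [
    "JOB synth_0001 SUCCESS",
    "JOB synth_0002 SUCCESS",
    "JOB synth_0003 FAILURE",
]
print(summarize(log))  # Counter({'SUCCESS': 2, 'FAILURE': 1})
```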