SGE Training NASA LaRC ASDC Delivered May 5,6,7 2009 Chris Dwan Bioteam. Bioteam Inc. Independent Consulting Shop Vendor/technology agnostic Staffed by: Scientists forced to learn High Performance IT Many years of industry & academic experience
SGE Training
NASA LaRC ASDC
Delivered May 5, 6, 7 2009
Chris Dwan, Bioteam
cdwan@bioteam.net
Bioteam Inc.
• Independent Consulting Shop
• Vendor/technology agnostic
• Staffed by:
  • Scientists forced to learn High Performance IT
  • Many years of industry & academic experience
• Our specialty: Bridging the gap between Science & IT
Session Goals
• Introduce ASDC systems
  • Detailed introduction to the IBM system
• Deliver Sun Grid Engine Training
• Encourage follow-up
Interactive / Small Group Goals
• 1-2 hours
• 1-5 people
• Users log into systems.
• Users type examples, run jobs.
• If code is available, bring it.
• If specific use cases exist, bring them.
Selected ASDC Systems
• Apple Cluster
  • Online and in use at SCF since 2007
  • ~40 dual processor OS X systems (80+ CPUs)
  • Access through manila and corregidor
• Magneto
  • ~28 quad core Linux servers (100+ CPUs)
  • Online and in production use since 2006
• New Magneto (ORR May 15)
  • Large, mixed purpose Linux cluster / file store
  • 176 CPUs dedicated to SCF
  • 576 CPUs dedicated to production
  • Disk based archive: 1.1PB
Apple Cluster
• Access:
  • LDAP account
  • manila or corregidor
NASA LaRC Science Directorate
• Picture taken 9/2/08
• 1.2PB usable space
• Fibre connected (384+ fibre ports)
• 2,560 individual disk drives
  • 16 disks per chassis
  • 10 chassis per rack
  • 16 racks of disks
• IBM Linux servers, mixed P6 and x86 CPUs to support legacy codes
• Filesystem: IBM GPFS
Operational Readiness Review: Mid May 2009. Stay Tuned.
Interactive hosts
• bc201: instrument1-blue
• bc202: instrument2-blue
• bc203: erbe-blue
• bc204: tisa1-blue
• bc205: srb1-blue
• bc206: srb2-blue
• bc207: power1-blue
• bc208: power2-blue
• bc209: sarba-blue
• bc210: consodine-blue
• bc211: sofa-blue
• bc212: cloudsa-blue
• bc213: cloudsb-blue
• bc214: inversion-blue
Sun Grid Engine Technical Introduction
Most “grids” look like this on paper…
[Diagram labels: Dedicated file services; Portal node(s); Local Area Network; Private Network; Compute Nodes]
info@bioteam.net
… and in reality:
Sun Grid Engine History
http://blogs.sun.com/templedf/entry/a_little_history_lesson
• 1996:
  • Codine 4.02
  • Grid Resource Director (GRD) 1.0
• 2000:
  • SGE 5.2. Sun acquires Gridware Inc.
• 2001:
  • SGE 5.3. Sun releases source code
  • Last version called GRD
• 2004:
  • SGE(EE) vs. SGE N1GE vs. SGE
Sun Grid Engine References
• http://gridengine.sunsource.net/
  • Generally, the user manuals are awful
• http://gridengine.info/
  • Very useful blog run by Chris Dagdigian
• My slides / examples are going to be online in-house.
• Deep in-house expertise.
Compute Farm Logical View
[Diagram labels: Distributed Resource Manager; User 1; User N; Cluster Network]
Grid Engine does the following:
• Accepts work requests (jobs) from users
• Puts jobs in a pending area
• Sends jobs from the pending area to the best available machine
• Manages the job while it runs
• Returns results and logs accounting data when the job is finished
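The lifecycle above can be sketched as a toy model in Python. This is not how Grid Engine is implemented; it is a minimal illustration that assumes "best available machine" means "the host with the most free slots", and the class and host names (`ToyScheduler`, `Host`, `node1`) are made up for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Host:
    """A compute node with a fixed number of job slots (hypothetical model)."""
    name: str
    slots: int
    running: list = field(default_factory=list)

    @property
    def free_slots(self):
        return self.slots - len(self.running)


class ToyScheduler:
    """Toy stand-in for a distributed resource manager."""

    def __init__(self, hosts):
        self.hosts = hosts
        self.pending = deque()   # the "pending area"
        self.accounting = []     # records written when jobs finish
        self._next_id = 1

    def submit(self, command):
        """Accept a work request and park it in the pending area."""
        job_id = self._next_id
        self._next_id += 1
        self.pending.append((job_id, command))
        return job_id

    def dispatch(self):
        """Send pending jobs to the host with the most free slots."""
        placed = []
        while self.pending:
            best = max(self.hosts, key=lambda h: h.free_slots)
            if best.free_slots <= 0:
                break  # cluster is full; remaining jobs stay pending
            job_id, command = self.pending.popleft()
            best.running.append((job_id, command))
            placed.append((job_id, best.name))
        return placed

    def finish(self, job_id, exit_status=0):
        """Remove a running job and log an accounting record for it."""
        for host in self.hosts:
            for job in host.running:
                if job[0] == job_id:
                    host.running.remove(job)
                    record = {"job_id": job_id, "host": host.name,
                              "command": job[1], "exit_status": exit_status}
                    self.accounting.append(record)
                    return record
        raise KeyError(f"job {job_id} is not running")
```

With two hosts offering three slots in total, a fourth submitted job stays pending until `finish()` frees a slot and `dispatch()` is called again, which mirrors the pending-then-run behaviour users see on the real cluster.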
Huh?
• What you need to know:
  • Don’t worry about queues or specific machines. All you need to do when submitting a job is describe the resources your job will need to run successfully.
  • Grid Engine will take care of the rest
  • The ‘default’ settings are good enough for most cases
Most useful SGE commands
• qsub / qdel
  • Submit jobs & delete jobs
• qstat & qhost
  • Status info for queues, hosts and jobs
• qacct
  • Summary info and reports on completed jobs
• qrsh
  • Get an interactive shell on a cluster node
  • Quickly run a command on a remote host
• qmon
  • Launch the X11 GUI
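When scripting around these commands, the usual first step is to capture the job ID that `qsub` prints and reuse it with `qstat`, `qdel` and `qacct`. The sketch below assumes the submission message has the common form `Your job 42 ("name") has been submitted` (or `Your job-array 43.1-10:1 (...)` for array jobs); the exact wording can vary between Grid Engine versions, so treat the pattern as an assumption to check on the cluster. The helper names are hypothetical.

```python
import re

# Assumed shape of the qsub confirmation message; verify on your install.
_SUBMIT_RE = re.compile(
    r'Your job(?:-array)? (\d+)(?:\.(\d+)-(\d+):(\d+))? \("([^"]*)"\) '
    r'has been submitted'
)


def parse_qsub_output(text):
    """Extract job ID, job name and (for array jobs) the task range."""
    match = _SUBMIT_RE.search(text)
    if match is None:
        raise ValueError(f"unrecognised qsub output: {text!r}")
    tasks = None
    if match.group(2) is not None:
        # (first task, last task, step)
        tasks = (int(match.group(2)), int(match.group(3)), int(match.group(4)))
    return {"job_id": int(match.group(1)), "name": match.group(5), "tasks": tasks}


def follow_up_commands(job_id):
    """Build argv lists for the commands typically run after submission."""
    jid = str(job_id)
    return {
        "status": ["qstat", "-j", jid],      # details for one pending/running job
        "delete": ["qdel", jid],             # remove the job
        "accounting": ["qacct", "-j", jid],  # report once the job has finished
    }
```

The argv lists can be handed to `subprocess.run` on a submit host; they are returned rather than executed here so the sketch runs anywhere.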
Examples
Live Examples
• Single job
• Single job with resource requirements
• Job dependency
• Task array job
• Demand a whole compute node
• Consumable resources
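The live examples were typed at the terminal and are not reproduced in the slides. As a rough guide, the sketch below builds plausible `qsub` command lines for each case using standard options (`-cwd`, `-N`, `-l`, `-hold_jid`, `-t`). The script names, the job ID, and the resource names `exclusive` and `matlab_lic` are assumptions: exclusive-node and consumable resources only exist if the site administrator has defined them, and the names used at ASDC may differ.

```python
def qsub_argv(script, name=None, resources=None, hold_jid=None, tasks=None):
    """Build a qsub argument list; nothing is executed."""
    argv = ["qsub", "-cwd"]  # -cwd: run the job from the current directory
    if name:
        argv += ["-N", name]
    if resources:
        # -l takes comma-separated name=value resource requests
        argv += ["-l", ",".join(f"{k}={v}" for k, v in resources.items())]
    if hold_jid:
        # -hold_jid: wait until the listed jobs have finished
        argv += ["-hold_jid", ",".join(str(j) for j in hold_jid)]
    if tasks:
        first, last = tasks
        argv += ["-t", f"{first}-{last}"]  # task array; each task sees $SGE_TASK_ID
    argv.append(script)
    return argv


# One command line per live example (all script and resource names hypothetical).
EXAMPLES = {
    "single job": qsub_argv("hello.sh"),
    "resource requirements": qsub_argv("hello.sh", resources={"h_rt": "01:00:00"}),
    "job dependency": qsub_argv("postprocess.sh", hold_jid=[42]),
    "task array": qsub_argv("sweep.sh", tasks=(1, 10)),
    # Assumes an admin-defined boolean resource named "exclusive".
    "whole node": qsub_argv("big.sh", resources={"exclusive": "true"}),
    # Assumes an admin-defined consumable named "matlab_lic".
    "consumable": qsub_argv("fit.sh", resources={"matlab_lic": 1}),
}

if __name__ == "__main__":
    for label, argv in EXAMPLES.items():
        print(f"{label:>22}: {' '.join(argv)}")
```

Running `qconf -sc` on a submit host lists the resource names that are actually defined on a given cluster.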