  1. ULGrid – Experiments in providing a campus grid Ian C. Smith

  2. Overview • Current Liverpool systems • PC Condor pool • Job management in ULGrid using Condor-G • The ULGrid portal • Storage Resource Broker • Future developments • Questions

  3. Current Liverpool campus systems • ulgbc1 • 24 dual processor Athlon nodes, 0.5 TB storage, GigE • ulgbc2 • 38 single processor nodes, 0.6 TB storage, GigE • ulgbc3 / lv1.nw-grid.ac.uk • NW-GRID - 44 dual-core, dual-processor nodes, 3 TB storage, GigE • HCC - 35 dual-core, dual-processor nodes, 5 TB storage, InfiniPath • ulgbc4 / lv2.nw-grid.ac.uk • 94 single core nodes, 8 TB RAID storage, Myrinet • PC Condor pool • ~300 Managed Windows Service (MWS) PCs

  4. High Capability Cluster and NW-GRID (front-end node: ulgbc3 or lv1.nw-grid.ac.uk) [diagram: two clusters behind a shared front-end. HCC: 35 dual-processor, dual-core nodes (140 cores), 2.4 GHz, 8 GB RAM, 200 GB disk, InfiniPath interconnect, SATA RAID disk subsystem (5.2 TB). NW-GRID: 44 dual-processor, dual-core nodes (176 cores), 2.2 GHz, 8 GB RAM, 146 GB disk, Gigabit Ethernet interconnect, Panasas disk subsystem (3 TB).]

  5. PC Condor Pool • allows jobs to be run remotely on MWS teaching centre PCs at times when they would otherwise be idle (~300 machines currently) • provides high throughput computing rather than high performance computing (maximises the number of jobs which can be processed in a given time) • only suitable for DOS-based applications running in batch mode • no communication between processes possible (“pleasantly parallel” applications only) • statically linked executables work best (although DLLs can be handled) • can access application files on a network-mapped drive • long-running jobs need to use Condor DAGMan • authentication of users prior to job submission via ordinary University security systems (NIS+/LDAP)
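A pool job of this kind is described to Condor with an ordinary submit file. The sketch below is illustrative only and is not taken from the slides: the executable and file names are hypothetical, and it assumes Windows XP teaching-centre machines and Condor's own file transfer rather than a mapped network drive.

    # minimal sketch of a vanilla-universe job for the MWS Condor pool
    # (hypothetical executable and file names)
    universe     = vanilla
    executable   = model.exe
    arguments    = params.txt
    # stage files through Condor instead of a network-mapped drive
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = params.txt
    # run only on Windows XP teaching-centre PCs
    requirements = (OpSys == "WINNT51") && (Arch == "INTEL")
    output       = model.out
    error        = model.err
    log          = model.log
    notification = never
    queue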

  6. Condor and power saving • power saving employed on all teaching centre PCs by default • machines power down automatically if idle for > 30 min with no user logged in but ... • ... will remain powered up until any running Condor job completes • NIC remains active, allowing remote wake-on-LAN • the submit host detects when the no. of idle jobs exceeds the no. of idle machines and wakes up the pool as necessary • a couple of teaching centres remain "always available" for testing etc.

  7. [Diagram: PC Condor pool architecture – user login, Condor portal, Condor view server, Condor submit host and Condor central manager, with the Condor pool spanning Teaching Centre 1, Teaching Centre 2 and other centres.]

  8. Condor research applications • molecular statics and dynamics (Engineering) • prediction of shapes and properties of molecules using quantum mechanics (Chemistry) • modelling of avian influenza propagation in poultry flocks (Vet Science) • modelling of E. Coli propagation in dairy cattle (Vet Science) • model parameter optimization using Genetic Algorithms (Electronic Engineering) • computational fluid dynamics (Engineering) • numerical simulation of ocean current circulation (Earth and Ocean Science) • numerical simulation of geodynamo magnetic field (Earth and Ocean Science)

  9. [Figure: Boundary layer fluctuations induced by freestream streamwise vortices (flow direction indicated).]

  10. [Figure: Boundary layer ‘streaky structures’ induced by freestream streamwise vortices (flow direction indicated).]

  11. ULGrid aims • provide a user-friendly single point of access to cluster resources • Globus-based with authentication through UK e-Science certificates • job submission should be no more difficult than using a conventional batch system • users should be able to determine easily which resources are available • meta-scheduling of jobs • users should be able to monitor the progress of all jobs easily • jobs can be single process or MPI • job submission from either the command line (qsub-style script) or the web

  12. ULGrid implementation • originally tried Transfer-queue-over-Globus (ToG) from EPCC for job submission but ... • messy to integrate with SGE • limited reporting of job status • no meta-scheduling possible • decided to switch to Condor-G • Globus monitoring and discovery service (MDS) originally used to publish job status and resource info but ... • very difficult to configure • hosts mysteriously vanish because of timeouts (processor overload? network delays? who knows) • all hosts occasionally disappear after a single cluster reboot • eventually used Apache web servers to publish information in the form of Condor ClassAds
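To give a flavour of what is published, a resource advertisement served by one of these web servers might look roughly like the fragment below. This is a sketch only: Name and gatekeeper_url are the attributes matched in the Condor-G submission file shown later, the gatekeeper contact string is an assumed value (the usual host/jobmanager-sge form for an SGE jobmanager), TotalCpus simply reflects the ulgbc1 description above, and the real advertisements will carry more attributes than this.

    MyType         = "Machine"
    Name           = "ulgbc1.liv.ac.uk"
    gatekeeper_url = "ulgbc1.liv.ac.uk/jobmanager-sge"
    TotalCpus      = 48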

  13. Condor-G pros • familiar and reliable interface for job submission and monitoring • very effective at hiding the Globus middleware layer • meta-scheduling possible through the use of ClassAds • automatic renewal of proxies on remote machines • proxy expiry handled gracefully • workflows can be implemented using DAGMan • nice sysadmin features e.g. • fair-share scheduling • changeable user priorities • accounting
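As an illustration of the meta-scheduling point, a Condor-G job can be left to the matchmaker rather than being pinned to a named cluster as in the submission file shown later. The sketch below is not from the slides: the FreeCpus attribute in the rank expression is hypothetical and assumes the published cluster ClassAds advertise such a figure, and the executable name is a placeholder.

    universe        = globus
    globusscheduler = $$(gatekeeper_url)
    x509userproxy   = /opt2/condor_data/ulgrid/certs/bonarlaw.cred
    # match any cluster that advertises a gatekeeper, rather than a named host
    requirements    = ( TARGET.gatekeeper_url =!= UNDEFINED )
    # prefer the cluster advertising the most free CPUs
    # (FreeCpus is a hypothetical ClassAd attribute)
    rank            = TARGET.FreeCpus
    executable      = myjob
    output          = myjob.out
    error           = myjob.err
    log             = myjob.log
    notification    = never
    queue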

  14. Condor-G cons • user interface is different from SGE, PBS, etc. • limited file staging facilities • limited reporting of remote job status • the user still has to deal directly with Globus certificates • matchmaking can be slow

  15. Local enhancements to Condor-G • extended resource specifications – e.g. parallel environment, queue • extended file staging • ‘Virtual Console’ - streaming of output files from remotely running jobs • reporting of remote job status (e.g. running, idle, error) • modified version of LeSC SGE jobmanager runs on all clusters • web interface • MyProxy server for storage/retrieval of e-Science certificates • automatic proxy certificate renewal using MyProxy server

  16. Specifying extended job attributes
  • without RSL schema extensions:
        globusrsl = ( environment = (transfer_input_files file1,file2,file3)\
                      (transfer_output_files file4,file5)\
                      (parallel_environment mpi2) )
  • with RSL schema extensions:
        globusrsl = (transfer_input_files = file1, file2, file3)\
                    (transfer_output_files = file4, file5)\
                    (parallel_environment = mpi2)
    or ...
        globusrsl = (parallel_environment = mpi2)
        transfer_input_files = file1, file2, file3
        transfer_output_files = file4, file5
    or ...
        globusrsl = (parallel_environment = mpi2)
        transfer_input_files = file1, file2, file3

  17. Typical Condor-G job submission file
        universe        = globus
        # gatekeeper_url is filled in from the matched resource ClassAd
        globusscheduler = $$(gatekeeper_url)
        x509userproxy   = /opt2/condor_data/ulgrid/certs/bonarlaw.cred
        # match only resources that advertise a gatekeeper, pinned here to ulgbc1
        requirements    = ( TARGET.gatekeeper_url =!= UNDEFINED ) && \
                          ( name == "ulgbc1.liv.ac.uk" )
        output          = condori_5e_66_cart.out
        error           = condori_5e_66_cart.err
        log             = condori_5e_66_cart.log
        executable      = condori_5e_66_cart_$$(Name)
        globusrsl       = ( input_working_directory = $ENV(PWD) )\
                          ( job_name = condori_5e_66_cart )( job_type = script )\
                          ( stream_output_files = pcgamess.out )
        transfer_input_files = pcgamess.in
        notification    = never
        queue

  18. [Diagram: ULGrid job submission architecture – user login via the MyProxy server and the Condor-G portal to the Condor-G submit host and Condor-G central manager; jobs are dispatched, with Globus file staging, to the CSD-Physics cluster (ulgbc2), the CSD AMD cluster (ulgbc1), the NW-GRID cluster (ulgbc3) and the NW-GRID/POL cluster (ulgp4), each of which publishes Condor ClassAds.]

  19. Storage Resource Broker (SRB) • open source grid middleware developed by the San Diego Supercomputer Center allowing distributed storage of data • absolute filenames reflect the logical structure of data rather than its physical location (unlike NFS) • meta-data allows annotation of files so that results can be searched easily at a later date • high-speed data movement through parallel transfers • several interfaces available: shell (Scommands), Windows GUI (InQ), X Windows GUI, web browser (MySRB); APIs also available for C/C++, Java and Python • provides most of the functionality needed to build a data grid • many other features

  20. [Diagram: SRB in ULGrid – the Condor-G central manager/submit host stages files via Globus to the CSD-Physics cluster (ulgbc2), the CSD AMD cluster (ulgbc1), the NW-GRID cluster (ulgbc3) and the NW-GRID/POL cluster (ulgp4); meta-data is held by the SRB MCAT server while the ‘real’ data resides in SRB data vaults (distributed storage).]

  21. Future developments • make increased use of SRB for file staging and archiving of results in ULGrid • expand job submission to other NW-GRID sites (and NGS?) • encourage use of Condor-G for job submission on ULGrid/NW-GRID • incorporate more applications into the portal • publish more information in Condor-G ClassAds • provide better support for long-running jobs via the portal and improved reporting of job status

  22. Further Information http://www.liv.ac.uk/e-science/ulgrid
