Parag Mhashilkar Fermi National Accelerator Laboratory Condor Week 2006 April 25, 2006. Resource Selection in OSG & SAM-On-The-Fly. Resource Selection in OSG. Overview Why Resource Selection Service? Resource Selection Service in OSG Collaborators Involved
Resource Selection in OSG & SAM-On-The-Fly
Parag Mhashilkar, Fermi National Accelerator Laboratory
Condor Week 2006, April 25, 2006
Resource Selection in OSG
• Overview
• Why Resource Selection Service?
• Resource Selection Service in OSG
• Collaborators Involved
• Resource Selection Service Architecture
• Current Status
• Future Work
Why Resource Selection Service?
[Figure: a job is handed to the Resource Selection Service, which chooses among resources 1, 2, 3, ..., N.]
• A job can have special requirements. Example: disk > 1 GB, memory > 256 MB.
• Resources can provide special services. Example: disk > 5 GB, memory > 512 MB, software toolkit-X installed, etc.
• Without a resource selection service, the user has to keep track of the availability of every resource that can run the job.
• A resource selection service can
  • gather information about the job and the resources and decide where the job should run;
  • dereference abstract attributes and bind them to the job at match-making or execution time.
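The selection step above can be sketched in a few lines of Python. This is a minimal illustration, not the ReSS implementation: the `Resource` and `Job` classes, the attribute names (`disk_gb`, `memory_mb`) and the cluster names are all hypothetical, and the job's requirements are modelled as a plain predicate rather than a classad expression.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Resource:
    name: str
    attrs: Dict[str, object]  # hypothetical attribute names, e.g. disk_gb


@dataclass
class Job:
    name: str
    # Predicate over a resource's attributes; stands in for a requirements expression.
    requirements: Callable[[Dict[str, object]], bool]


def select_resource(job: Job, resources: List[Resource]) -> Optional[Resource]:
    """Return the first resource that satisfies the job's requirements, or None."""
    for resource in resources:
        if job.requirements(resource.attrs):
            return resource
    return None


resources = [
    Resource("cluster-1", {"disk_gb": 0.5, "memory_mb": 512}),
    Resource("cluster-2", {"disk_gb": 6, "memory_mb": 1024}),
]

# The slide's example requirement: disk > 1 GB and memory > 256 MB.
job = Job("analysis", lambda a: a.get("disk_gb", 0) > 1 and a.get("memory_mb", 0) > 256)

chosen = select_resource(job, resources)
print(chosen.name if chosen else "no match")
```

Without such a service, the user would have to maintain the `resources` list and run this check by hand for every submission.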
Resource Selection Service (ReSS) in OSG
• The Resource Selector is a component of the OSG Job Management Infrastructure.
• Sponsored by PPDG, the project started in Sep 2005 with the aim of developing and deploying a Resource Selection Service usable by VOs whose job management requirements are similar to DZero's.
• Requirements that ReSS should support:
  • a community of 100 users, submitting jobs to 10 job schedulers;
  • 10,000 jobs per day, with bursts of 2,000 per hour;
  • 100 clusters;
  • job and resource descriptions in classad format with 200 attributes and 5Kb of information.
• With ReSS
  • the emphasis is on supporting several Virtual Organizations (VOs) based on policies;
  • VOs can tag the resources that are certified to run their jobs, making resource selection more manageable.
Collaborators Involved
• VOs
• DZero
• ATLAS
• LIGO
• FermiGrid
• Fermilab
• OSG TG-MIG group
• CEMon group from INFN
• Condor group from UW Madison
• GLUE group from INFN
Resource Selection Service Architecture
[Diagram: a job asks "What Gate?". The Info Gatherer collects classads from the CEMon at each gate (Gate1, Gate2, Gate3) and passes them to the Condor Match Maker; the Condor Scheduler then sends the job to the selected gate (Gate 3 in the example). Each gate exchanges jobs and info with a CE running GIP, whose job-managers front a cluster.]
Architecture …
• The Generic Information Provider (GIP) describes resources in LDIF format using the GLUE Schema.
• CEMon provides a flexible plug-in mechanism to translate this information into classads.
• Information Gatherer (IG)
  • subscribes to several CEMons to gather information about the CEs and advertises it to several Condor pools;
  • acts as an adapter between the CEMons and the Condor matchmaker.
• Support for callouts to external match-making functions, which make match-making more extensible.
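To make the LDIF-to-classad translation concrete, here is a rough Python sketch. It is an illustration only and not the CEMon plug-in: the function names are hypothetical, the sample attributes and host name are made up for the example, and the parser ignores LDIF features such as continuation lines, base64 values and multi-valued attributes.

```python
def ldif_to_attrs(ldif_text):
    """Parse simple 'key: value' LDIF lines into a dict (last value wins)."""
    attrs = {}
    for line in ldif_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        attrs[key.strip()] = value.strip()
    return attrs


def attrs_to_classad(attrs):
    """Render attributes as 'Name = value' lines, quoting non-integer values."""
    lines = []
    for key, value in attrs.items():
        try:
            int(value)
            rendered = value  # integers stay unquoted
        except ValueError:
            rendered = '"' + value.replace('"', '\\"') + '"'
        lines.append(f"{key} = {rendered}")
    return "\n".join(lines)


# Illustrative input only; the attribute values are invented for this sketch.
sample = """\
dn: GlueCEUniqueID=gate1.example.org
GlueCEUniqueID: gate1.example.org
GlueCEStateFreeCPUs: 8
"""

print(attrs_to_classad(ldif_to_attrs(sample)))
```

In this picture the Information Gatherer would receive the already translated classads from each CEMon and forward them to the Condor pools.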
Current Status
• The first release of ReSS is scheduled to be included in OSG ITB-0.5.0.
• The focus is on testing the functionality, scalability and stress behavior of the Information Gatherer (IG):
  • Validate classads from different sites so they can be used for common resource selection criteria.
  • Study scalability: investigate how the IG handles O(10) CEMon registrations and the processing and transfer of O(100) classads to the condor_collector.
  • Stress test the IG: simulate the load of the production environment by increasing tenfold the frequency of classad publication by the O(10) CEMons.
• Stress test the match-making infrastructure by submitting O(1) job/sec for 1 hour and, in particular, push the limits …
• Evaluate the efficiency of the condor_negotiator when using call-outs to external code for match-making.
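As a quick sanity check on the stress-test plan, the planned submission rate can be compared with the ReSS requirements quoted earlier (10,000 jobs per day, bursts of 2,000 per hour). The sketch below assumes that "O(1) job/sec" means exactly 1 job per second; the variable names are illustrative.

```python
# Requirements quoted earlier in the talk.
JOBS_PER_DAY = 10_000
BURST_PER_HOUR = 2_000

# Assumption: "O(1) job/sec for 1 hour" is taken as exactly 1 job per second.
STRESS_RATE_PER_SEC = 1
STRESS_DURATION_SEC = 3600

avg_per_sec = JOBS_PER_DAY / 86_400        # required average submission rate
burst_per_sec = BURST_PER_HOUR / 3600      # required burst submission rate
stress_total = STRESS_RATE_PER_SEC * STRESS_DURATION_SEC

print(f"required average: {avg_per_sec:.2f} jobs/sec")
print(f"required burst:   {burst_per_sec:.2f} jobs/sec")
print(f"stress test:      {stress_total} jobs in one hour "
      f"({stress_total / BURST_PER_HOUR:.1f}x the burst requirement)")
```

Under that assumption the test submits 3,600 jobs in the hour, which is 1.8 times the required burst rate and roughly nine times the required daily average.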
Future Work
• Work on deployment procedures for OSG production in the context of VDT.
• Work with other VOs whose requirements are similar to those mentioned earlier, and extend ReSS support to them.
• Improve the scalability of ReSS beyond the RunII experiments.
• Have end-to-end Samgrid-OSG integration by OSG 0.6.0.
SAM-on-the-fly
• Overview
• What is SAM?
• Why SAM-on-the-fly?
• Addressing the Challenges
• Current Status
What is SAM?
• Samgrid consists of
  • Job Management (JIM)
  • Data Management (SAM)
• SAM stands for "Sequential Access via Metadata".
• The project was started in 1997 by DZero.
• SAM is organized around the concept of a dataset (a catalog of file metadata).
• Experiments: DZero, CDF, MINOS
Why SAM-on-the-fly?
• Sites have resources that are available for extended periods. For example, a cluster at UW has 1 TB of disk available to DZero users for the next 2 months.
• SAM-on-the-fly tries to address the problem of making such resources available to users dynamically.
• Before DZero users can use this resource, there is a need to
  • deploy and configure SAM services such as
    • Station (a collection of resources controlled by the SAM system)
    • Stager (a service that handles staging of files on disk used by SAM)
    • FSS (a service that interfaces with the FS)
    • file transfer services such as gridftp, sam_fcp, etc.
  • register the SAM services with the central SAM DB;
  • start and stop the SAM station services;
  • do the cleanup when the lease period expires;
  • handle firewall and security configurations.
Addressing the Challenges
[Diagram: Job, Resource]
• Deploy and configure SAM.
• Register SAM services with the SAM system.
• Start SAM services for the duration of the lease.
• When the lease expires, stop SAM.
• Do the cleanup.
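The lease lifecycle above maps naturally onto a setup/teardown pattern. The following Python sketch only records the steps in a list; `sam_lease` and the site name are hypothetical, and no real SAM commands or APIs are invoked.

```python
from contextlib import contextmanager


@contextmanager
def sam_lease(site, log):
    """Hypothetical lease wrapper: set up SAM, yield for the lease, then tear down."""
    log.append(f"deploy and configure SAM at {site}")
    log.append(f"register SAM services for {site}")
    log.append("start SAM services")
    try:
        yield  # jobs run here for the duration of the lease
    finally:
        # Teardown runs even if a job raises, mirroring lease expiry.
        log.append("stop SAM services")
        log.append("clean up")


events = []
with sam_lease("uw-cluster", events):
    events.append("run jobs")

for event in events:
    print(event)
```

Putting the stop and cleanup steps in `finally` reflects the requirement that the resource be returned in a clean state whenever the lease ends, whether or not the jobs succeeded.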
Current Status
• Automated the product deployment steps.
• Semi-automated the SAM services registration steps.
• Automated starting and stopping of SAM services.
• This project is a work in progress.
• People:
  • Fermi National Accelerator Laboratory
  • University of Wisconsin Madison: Alain Roy and Hidayat Teonadi
References
• Resource Selection Service for OSG
  • http://www.opensciencegrid.org
  • http://osg.ivdgl.org/twiki/bin/view/ResourceSelection/WebHome
• SAM
  • http://projects.fnal.gov/samgrid
Thanks to Miron and the Condor Group for all the support!
Questions?