The SAM-Grid and the use of Condor-G as a grid job management middleware Gabriele Garzoglio for the SAM-Grid Team Fermilab, Computing Division
Overview • Computation in High Energy Physics • The SAM-Grid computing infrastructure • The Job Management and Condor-G • Real life experience • Future work
High Energy Physics Challenges • High Energy Physics studies the fundamental interactions of Nature. • A few laboratories around the world each provide unique facilities (accelerators) to study particular aspects of the field, so the collaborations are geographically distributed. • Experiments become more challenging and expensive every decade, so the collaborations are large groups of people. • The phenomena studied are statistical in nature and the events of interest are very rare, so a lot of data is needed.
FNAL Run II detectors DZero
DZero and CDF Institutions The Size of the D0 Collaboration • ~500 Physicists • 72 institutions • 18 Countries
Data size for the D0 Experiment • Detector Data • 1,000,000 Channels • Event size 250 KB • Event rate ~50 Hz • On-line Data Rate 12 MB/s • 100 TB/year • Total data • detector, reconstructed, simulated • 400 TB/year
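The detector figures above can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check: the event size, event rate, and yearly volume come from the slide, while the decimal unit convention and the "implied live time" are assumptions and derived quantities, not numbers stated by the authors.

```python
# Back-of-the-envelope check of the detector data rates quoted on the slide.
EVENT_SIZE_KB = 250       # from the slide
EVENT_RATE_HZ = 50        # from the slide (approximate)
YEARLY_VOLUME_TB = 100    # from the slide


def online_rate_mb_per_s(event_size_kb, event_rate_hz):
    """Raw data rate in MB/s, assuming decimal units (1 MB = 1000 KB)."""
    return event_size_kb * event_rate_hz / 1000


def implied_live_seconds(yearly_volume_tb, rate_mb_per_s):
    """Seconds of data taking needed to accumulate the yearly volume
    (assumes 1 TB = 1,000,000 MB)."""
    return yearly_volume_tb * 1_000_000 / rate_mb_per_s


rate = online_rate_mb_per_s(EVENT_SIZE_KB, EVENT_RATE_HZ)
live = implied_live_seconds(YEARLY_VOLUME_TB, rate)
print(f"on-line rate: {rate} MB/s")
print(f"implied live time: {live:.1e} s/year")
```

Under these assumptions the product of event size and event rate is 12.5 MB/s, consistent with the roughly 12 MB/s quoted, and 100 TB/year corresponds to about 8 million seconds of data taking per year.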
Overview • Computation in High Energy Physics • The SAM-Grid computing infrastructure • The Job Management and Condor-G • Real life experience • Future work
The SAM-Grid Project • Mission: enable fully distributed computing for DZero and CDF • Strategy: enhance the distributed data handling system of the experiments (SAM) by incorporating standard Grid tools and protocols and by developing new solutions for Grid computing (JIM) • History: SAM since 1997, JIM since the end of 2001 • Funding: the Particle Physics Data Grid (US) and GridPP (UK) • People: Computer scientists and Physicists from Fermilab and the collaborating Institutions
Overview • Computation in High Energy Physics • The SAM-Grid computing infrastructure • The Job Management and Condor-G • Real life experience • Future work
Job Management: Requirements • Foster site autonomy • Operate in batch mode: submit and disconnect • Reliability: handle the job request persistently; execute it and retrieve output and/or errors. • Flexible automatic resource selection: optimization of various metrics/policies • Fault tolerance: transient service disruption; automatic rematching and resubmitting capabilities • Automatic execution of complex interdependent job structures.
Service Architecture [Diagram. Legend: flow of data, meta-data, and jobs. Components shown: User Interface, Submission, Global Job Queue, Resource Selector, Match Making, Global DH Services, Grid Client, Info Gatherer, SAM Naming Server, Info Manager, Info Collector, SAM Log Server, Resource Optimizer, Site Data Handling, XML DB server, Cluster, Site Conf., SAM Station (+other servs), Glob/Loc JID map, Local Job Handling, Grid Gateway, SAM DB Server, Web Serv, SAM Stager(s), Local Job Handler (CAF, D0MC, BS, ...), MDS, RC MetaData Catalog, Grid Monitoring, Info Providers, Bookkeeping Service, Cache, MSS, JIM Advertise, User Tools, Worker Nodes, Dist.FS, AAA, and three Sites.]
Technological choices (2001) • Low level resource management: Globus GRAM. Clearly not enough... • Condor-G: right components and functionalities, but not enough in 2001... • DZero and the Condor Team have been collaborating ever since, under the auspices of PPDG, to address the requirements of a large distributed system with distributively owned and shared resources.
Condor-G: added functionalities I • Use of the Condor Match Making Service (MMS) as Grid Resource Selector • Advertisement of grid site capabilities to the MMS • Dynamic $$(gatekeeper) selection for jobs specifying requirements on grid sites • Concurrent submission of multiple jobs to the same grid resource • At any given moment, a grid site is capable of accepting up to N jobs • The MMS was modified to push up to N jobs to the same site in the same negotiation cycle
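The "up to N jobs per site per negotiation cycle" idea can be illustrated with a toy matchmaker. This is a minimal sketch, not Condor code: the function `negotiate`, the job dictionaries, and the per-site slot counts are hypothetical names introduced only for illustration.

```python
from collections import defaultdict


def negotiate(jobs, sites):
    """Run one toy negotiation cycle.

    jobs:  list of dicts with 'id' and 'station' (the site the job requires)
    sites: dict mapping station name -> number of jobs it can accept now
           (the 'N' on the slide)
    Returns {station: [ids of the jobs matched in this cycle]}.
    """
    free = dict(sites)                  # copy, so the caller's dict is untouched
    matches = defaultdict(list)
    for job in jobs:
        station = job["station"]
        if free.get(station, 0) > 0:    # the site can still accept a job
            matches[station].append(job["id"])
            free[station] -= 1
    return dict(matches)


jobs = [{"id": i, "station": "ccin2p3-analysis"} for i in range(5)]
print(negotiate(jobs, {"ccin2p3-analysis": 3}))
```

With five queued jobs and N = 3, the first three jobs are matched to the same site in a single cycle and the remaining two wait for the next one.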
Condor-G: added functionalities II • Flexible Match Making logic • The job/resource match criteria should be arbitrarily complex (based on more info than what fits in the ClassAd), stateful (remembering match history), and “pluggable” (by administrators and users) • Example: send the job where most of the data are. The MMS contacts the site data handling service to rank a job/site match • This leads to a very thin and flexible “grid broker”
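The "send the job where most of the data are" example can be sketched as a pluggable rank function. Everything here is illustrative: `data_locality_rank`, `best_site`, and the `cached_files` dictionary (which stands in for a query to each site's data handling service) are assumptions, not part of Condor-G or SAM.

```python
def data_locality_rank(job, site, cached_files):
    """Fraction of the job's input files already cached at the site.

    cached_files stands in for a query to the site's data handling service.
    """
    wanted = set(job["files"])
    if not wanted:
        return 0.0
    return len(wanted & cached_files.get(site, set())) / len(wanted)


def best_site(job, sites, rank_fn, **context):
    """Pick the site with the highest rank; rank_fn is the pluggable part."""
    return max(sites, key=lambda site: rank_fn(job, site, **context))


cache = {"lyon": {"f1", "f2", "f3"}, "manchester": {"f1"}}
job = {"files": ["f1", "f2"]}
print(best_site(job, ["manchester", "lyon"], data_locality_rank, cached_files=cache))
```

Because the broker only calls `rank_fn`, an administrator or user could swap in a different policy without touching the selection code, which is the sense in which the broker stays thin.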
Condor-G: added functionalities III • Light clients • A user should be able to submit a job from a laptop and turn it off • Client software (condor_submit, etc.) and queuing service (condor_schedd) should be on different machines • This leads to a three-tier architecture for Condor-G: client, queuing, execution sites. Security was implemented via X.509.
Condor-G: added functionalities IV • Resubmission/Rematching logic • If the MMS matches a job to a site that still cannot accept it after N submission attempts, the job should be rematched to a different site • Flexible penalization of already failed matches
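The resubmission and rematching logic can be sketched as a loop that penalizes sites after failed matches. This is a toy model under stated assumptions: `run_with_rematch`, the linear penalty, and the `submit` callback are hypothetical and do not reproduce the actual Condor-G implementation.

```python
def run_with_rematch(job, sites, submit, max_attempts=3, max_matches=5, penalty=10.0):
    """Match a job to the best-ranked site, retrying and rematching on failure.

    sites:  dict mapping site name -> base rank (higher is better)
    submit: callable(job, site) -> bool, True if the site accepted the job
    Returns the accepting site, or None if no match succeeded.
    """
    failed_matches = {site: 0 for site in sites}   # match history for this job
    for _match in range(max_matches):
        # Rank each site, lowering the rank of sites that already failed.
        site = max(sites, key=lambda s: sites[s] - penalty * failed_matches[s])
        for _attempt in range(max_attempts):
            if submit(job, site):
                return site
        failed_matches[site] += 1                  # penalize, then rematch
    return None


attempts = []


def flaky_submit(job, site):
    """Hypothetical submitter: the site 'lyon' always refuses the job."""
    attempts.append(site)
    return site != "lyon"


result = run_with_rematch("job-304", {"lyon": 5.0, "manchester": 4.0}, flaky_submit)
print(result, attempts)
```

In this run the job is first matched to the higher-ranked site, fails three submissions there, and is then rematched to the other site. A penalty that lowers the rank, rather than a permanent blacklist, lets a site become eligible again if every alternative also fails.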
Overview • Computation in High Energy Physics • The SAM-Grid computing infrastructure • The Job Management and Condor-G • Real life experience • Future work
Job Management
[Diagram components: User Interface, Submission Client, Match Making Service, Broker ext. logic, Queuing System, Information Collector, JOB, Data Handling System, Execution Site #1 ... Execution Site #n, Computing Element, Storage Element, Grid Sensors.]

Job description:
    job_type = montecarlo
    station_name = ccin2p3-analysis
    runjob_requestid = 11866
    runjob_numevts = 10000
    d0_release_version = p14.05.01
    jobfiles_dataset = san_jobset2
    minbias_dataset = ccin2p3_minbias_dataset
    sam_experiment = d0
    sam_universe = prd
    group = test
    instances = 1

Job ClassAd:
    MyType          "Job"
    TargetType      "Machine"
    ClusterId       304
    JobType         "montecarlo"
    GlobusResource  "$$(gatekeeper_url_)"
    Requirements    (TARGET.station_name_ == "ccin2p3-analysis" && ...)
    Rank            0.000000
    station_univ    "prd"
    station_ex      "d0"
    RequestId       "11866"
    ProjectId       "sam_ccd0_012457_25321_0"
    DbURL           "$$(DbURL)"
    cert_subject    "/DC=org/DC=doegrids/OU=People/CN=Aditya Nishandar ..."
    Env             "MATCH_RESOURCE_NAME=$$(name);\
                     SAM_STATION=$$(station_name_);\
                     SAM_USER_NAME=aditya;..."
    Args            "--requestId=11866" "--gridId=sam_ccd0_012457" ...
    ...

Machine ClassAd:
    MyType                 "Machine"
    TargetType             "Job"
    Name                   "ccin2p3-analysis.d0.prd.jobmanager-runjob"
    gatekeeper_url_        "ccd0.in2p3.fr:2119/jobmanager-runjob"
    DbURL                  "http://ccd0.in2p3.fr:7080/Xindice"
    sam_nameservice_       "IOR:000000000000002a49444c3........."
    station_name_          "ccin2p3-analysis"
    station_experiment_    "d0"
    station_universe_      "prd"
    cluster_architecture_  "Linux+2.4"
    cluster_name_          "LyonsGrid"
    local_storage_path_    "/samgrid/disk"
    local_storage_node_    "ccd0.in2p3.fr"
    schema_version_        "1_1"
    site_name_             "ccin2p3"
    ...
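The `$$(...)` references in the job ClassAd are filled in from the attributes of the machine ClassAd the job is matched to. The sketch below imitates that substitution on plain Python dictionaries. It is a simplified illustration, not the Condor implementation: `expand_match_macros` is a hypothetical helper, and only a few attributes from the slide are reproduced.

```python
import re

# Matches $$(attribute_name) and captures the attribute name.
MACRO = re.compile(r"\$\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def expand_match_macros(job_ad, machine_ad):
    """Replace each $$(attr) in the job ad's string values with the value of
    attr from the matched machine ad. Attribute names are compared without
    regard to case; unknown attributes are left unexpanded."""
    lowered = {key.lower(): value for key, value in machine_ad.items()}

    def lookup(match):
        return str(lowered.get(match.group(1).lower(), match.group(0)))

    return {
        key: MACRO.sub(lookup, value) if isinstance(value, str) else value
        for key, value in job_ad.items()
    }


machine_ad = {
    "Name": "ccin2p3-analysis.d0.prd.jobmanager-runjob",
    "gatekeeper_url_": "ccd0.in2p3.fr:2119/jobmanager-runjob",
    "station_name_": "ccin2p3-analysis",
}
job_ad = {
    "ClusterId": 304,
    "GlobusResource": "$$(gatekeeper_url_)",
    "Env": "MATCH_RESOURCE_NAME=$$(name);SAM_STATION=$$(station_name_)",
}
expanded = expand_match_macros(job_ad, machine_ad)
print(expanded["GlobusResource"])
print(expanded["Env"])
```

After the match, `GlobusResource` holds the gatekeeper URL of the selected site, so the user never has to name a gatekeeper in the job description.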
Montecarlo Production Statistics • Started at the beginning of 2004. Ramped up in March. • 3 Sites: Wisconsin (...via Miron), Manchester, Lyon. New sites are joining (UTA, LU, OU, LTU,...) • Inefficiency due to the Grid infrastructure ≪ 5% • 30 GB/week = 80,000 events/week (about 1/4 of total production)
Overview • Computation in High Energy Physics • The SAM-Grid computing infrastructure • The Job Management and Condor-G • Real life experience • Future work
Future work of DZero with Condor • Use of DAGMan to automate the management of interdependent grid job structures. • Address potential scalability limits. • Investigate non-central brokering service via grid flocking. • Integrate/Implement a proxy management infrastructure (e.g. MyProxy). • All the rest (...fix bugs, improve error reporting, hand holding, sailing...)
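Managing interdependent job structures amounts to running each job only after the jobs it depends on have finished. DAGMan has its own DAG description format. The Python sketch below is not DAGMan and only illustrates the ordering problem, using the standard-library `graphlib` module (Python 3.9+). The job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job structure: each job maps to the set of jobs it depends on.
dependencies = {
    "simulate": {"generate"},
    "reconstruct": {"simulate"},
    "merge": {"reconstruct"},
}


def submission_order(deps):
    """Return one order in which every job comes after its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


order = submission_order(dependencies)
print(order)
```

For this linear chain there is only one valid order. With branching dependencies, several orders are valid and independent jobs could be submitted concurrently.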
Conclusions • The collaboration between DZero and the Condor team has been very fruitful since 2001. • DZero has worked together with Condor to enhance the Condor-G framework, in order to address the requirements on distributed computing of a large HEP experiment. • DZero is running “production” jobs on the Grid.
Acknowledgments • Condor Team • PPDG • DZero • CDF
More info at… • http://www-d0.fnal.gov/computing/grid/ • http://samgrid.fnal.gov:8080/ • http://d0db.fnal.gov/sam/