Grid Testbed Activities in US-CMS
Rick Cavanaugh, University of Florida
1. Infrastructure  2. Highlights of Current Activities  3. Future Directions
NSF/DOE Review, LBNL, Berkeley, 14 January 2003
US-CMS Development Grid Testbed
• Fermilab: 1+5 dual 0.700 GHz PIII machines
• Caltech: 1+3 dual 1.6 GHz AMD machines
• San Diego: 1+3 single 1.7 GHz PIV machines
• Florida: 1+5 dual 1 GHz PIII machines
• Wisconsin: 5 single 1 GHz PIII machines
• Total: ~41 dedicated ~1 GHz processors
• Operating system: Red Hat 6 (required for Objectivity)
US-CMS Integration Grid Testbed
• Fermilab: 40 dual 0.750 GHz PIII machines
• Caltech: 20 dual 0.800 GHz machines + 20 dual 2.4 GHz machines
• San Diego: 20 dual 0.800 GHz machines + 20 dual 2.4 GHz machines
• Florida: 40 dual 1 GHz PIII machines
• CERN (LCG site): 72 dual 2.4 GHz machines
• Total: 240 ~0.85 GHz processors (Red Hat 6) and 152 2.4 GHz processors (Red Hat 7)
DGT Participation by Other CMS Institutes Encouraged!
[Map: current DGT sites: Wisconsin, Fermilab, Caltech, UCSD, Florida]
• Expressions of interest: MIT, Rice, Minnesota, Belgium, Brazil, South Korea
Grid Middleware
• Testbed based on the Virtual Data Toolkit 1.1.3
  • VDT Client: Globus Toolkit 2.0, Condor-G 6.4.3
  • VDT Server: Globus Toolkit 2.0, mkgridmap, Condor 6.4.3, ftsh, GDMP 3.0.7
• Virtual Organisation Management
  • LDAP server deployed at Fermilab, containing the DNs of all US-CMS Grid users
  • GroupMAN (from PPDG and adapted from EDG) used to manage the VO (see the sketch below)
  • Investigating/evaluating the use of VOMS from the EDG
  • Use D.O.E. Science Grid certificates; accept EDG and Globus certificates
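This VO management chain ultimately materialises as a Globus grid-mapfile on each VDT Server, mapping certificate subject DNs onto local accounts. A minimal sketch of the kind of file GroupMAN/mkgridmap maintains from the LDAP server; the DNs and account names below are hypothetical:

    # /etc/grid-security/grid-mapfile -- regenerated from the VO LDAP server, not edited by hand
    # "certificate subject DN"                                   local account
    "/DC=org/DC=DOEGrids/OU=People/CN=Jane Physicist 123456"     uscms01
    "/O=Grid/O=Globus/OU=phys.ufl.edu/CN=John Developer"         uscms02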
Non-VDT Software Distribution
• DAR (can be installed "on the fly")
  • CMKIN, CMSIM, ORCA/COBRA
  • Represents a crucial step forward in CMS distributed computing!
• Working to deploy US-CMS Pacman caches for:
  • CMS software (DAR, etc.)
  • All other non-VDT software required for the Testbed: GAE/CAIGEE (Clarens, etc.), GroupMAN, etc. (see the example below)
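Once such caches exist, pulling a package onto a Testbed node reduces to a single Pacman fetch. The cache and package names below are placeholders for illustration, not the final US-CMS cache layout:

    # Hypothetical fetch from a US-CMS Pacman cache (cache and package names are placeholders)
    pacman -get USCMS:DAR-ORCA
    pacman -get USCMS:Clarens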
Monitoring and Information Services
• MonaLisa (Caltech)
  • Currently deployed on the Testbed
  • Dynamic information/resource discovery mechanism using agents
  • Implemented in Java/Jini with interfaces to SNMP, MDS, Ganglia, and Hawkeye, plus WSDL/SOAP with UDDI
  • Aim to incorporate it into a "Grid Control Room" service for the Testbed
Other Monitoring and Information Services
• Information service and configuration monitoring: MDS (Globus)
  • Currently deployed on the Testbed in a hierarchical fashion (see the query example below)
  • Aim to deploy the GLUE schema when released by iVDGL/DataTAG
  • Developing APIs to and from MonaLisa
• Health monitoring: Hawkeye (Condor)
  • Leverages the ClassAd system for collecting dynamic information on large pools
  • Will soon incorporate heartbeat monitoring of Grid services
  • Currently deployed at Wisconsin and Florida
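Because MDS is LDAP-based, the hierarchy can be inspected directly with the standard Globus 2.0 tooling. A minimal query against one site's GRIS (2135 is the default MDS port; the hostname is a placeholder), with a GIIS higher in the hierarchy aggregating the per-site answers:

    grid-info-search -h tier2.phys.ufl.edu -p 2135 -b "mds-vo-name=local, o=grid" "(objectclass=*)"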
Existing US-CMS Grid Testbed Client-Server Scheme
• User → VDT Client → VDT Server, with monitoring alongside:
  • Performance: MonaLisa
  • Information & Configuration: MDS
  • Health: Hawkeye
• VDT Client side, Executor: DAGMan, Condor-G / Globus (see the submit sketch after this list)
• VDT Server side:
  • Compute Resource: Globus GRAM / Condor pool
  • Reliable Transfer: ftsh-wrapped GridFTP
  • Storage Resource: local Grid storage
  • Replica Management: Replica Catalogue, GDMP
• MOP instantiation: mop_submitter on the VDT Client drives DAGMan / Condor-G
• Virtual Data System instantiation: Virtual Data Catalogue with Abstract and Concrete Planners feeding DAGMan / Condor-G
• Analysis instantiation: Clarens handles data movement and ROOT/Clarens the data analysis, over a Storage Resource holding ROOT files and a relational database, monitored by MonaLisa
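In the executor layer above, DAGMan hands each workflow node to Condor-G, which submits it through Globus GRAM to the remote site. A minimal Condor-G submit description of the kind generated on the VDT Client; the gatekeeper contact string and script names are placeholders:

    # Minimal Condor-G submit description (contact string and executable are hypothetical)
    universe        = globus
    globusscheduler = tier2.phys.ufl.edu/jobmanager-condor
    executable      = run_cmsim.sh
    arguments       = run0042
    output          = run0042.out
    error           = run0042.err
    log             = run0042.log
    queue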
Commissioning the Development Grid Testbed with "Real Production"
• MOP (from PPDG) interfaces the following into a complete prototype:
  • IMPALA/MCRunJob CMS production scripts
  • Condor-G/DAGMan
  • GridFTP
  • (mop_submitter is generic)
• Using MOP to "commission" the Testbed:
  • Require large-scale, production-quality results!
  • Run until the Testbed "breaks"
  • Fix the Testbed with middleware patches
  • Repeat the procedure until the entire production run finishes!
• Discovered/fixed many fundamental grid software problems in Globus and Condor-G (in close cooperation with Condor/Wisconsin); a huge success from this point of view alone
• [Diagram: MCRunJob (ScriptGen, Linker, Config, Req., Self Desc.) feeds mop_submitter and DAGMan/Condor-G on the VDT Client, which submit via Globus and GridFTP to VDT Servers 1..N (Globus, Condor, GridFTP); see the DAG sketch below]
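Each MOP assignment is handed to DAGMan as a DAG whose nodes stage software and data in, run the production executable, and stage results back out. An illustrative fragment of a DAGMan input file; the node and submit-file names are hypothetical, not MOP's actual naming:

    JOB  stagein   stagein_run0042.sub
    JOB  run       cmsim_run0042.sub
    JOB  stageout  stageout_run0042.sub
    PARENT stagein CHILD run
    PARENT run     CHILD stageout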
Integration Grid Testbed Success Story
• Production run status for the IGT MOP production:
  • Assigned 1.5 million events for "eGamma Bigjets"
  • ~500 sec per event on a 750 MHz processor; all production stages from simulation to ntuple
  • 2 months of continuous running across 5 testbed sites
  • Demonstrated at Supercomputing 2002
• 1.5 million events produced! (nearly 30 CPU years)
Interoperability Work with EDG/DataTAG
• MOP worker site configuration file for Padova (WorldGrid):
  (1-1) Stage-in/out jobmanager: grid015.pd.infn.it/jobmanager-fork (SE) or grid011.pd.infn.it/jobmanager-lsf-datatag (CE)
  (1-2) GLOBUS_LOCATION=/opt/globus
  (1-3) Shared directory for MOP files: /shared/cms/MOP (on the SE and NFS-exported to the CE)
  (2-1) Run jobmanager: grid011.pd.infn.it/jobmanager-lsf-datatag
  (2-2) Location of CMS DAR installation: /shared/cms/MOP/DAR
  (3-1) GDMP install directory: /opt/edg
  (3-2) GDMP flat file directory: /shared/cms
  (3-3) GDMP Objectivity file directory (not needed for CMSIM production)
  (4-1) GDMP jobmanager: grid015.pd.infn.it/jobmanager-fork
• MOP jobs successfully sent from a U.S. VDT WorldGrid site to the Padova EDG site
• EU CMS production jobs successfully sent from an EDG site to a U.S. VDT WorldGrid site
• ATLAS Grappa jobs successfully sent from the US to an EU Resource Broker and run on a US-CMS VDT WorldGrid site
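The gatekeeper contacts listed above can be sanity-checked independently of MOP with plain Globus 2.0 tools, assuming a valid grid proxy:

    # Verify authentication and the jobmanager contact before sending MOP jobs
    grid-proxy-init
    globusrun -a -r grid011.pd.infn.it/jobmanager-lsf-datatag
    globus-job-run grid011.pd.infn.it/jobmanager-lsf-datatag /bin/hostname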
Chimera: The GriPhyN Virtual Data System
• Chimera currently provides the following prototypes:
  • Virtual Data Language (VDL): describes virtual data products
  • Virtual Data Catalogue (VDC): used to store VDL
  • Abstract Job Flow Planner: creates a logical DAG (in XML) called a DAX
  • Concrete Job Flow Planner: interfaces with a Replica Catalogue and provides a physical DAG submission file to Condor-G/DAGMan
• Generic and flexible: multiple ways to use Chimera
  • as a toolkit and/or a framework
  • in a Grid environment or just locally
• [Pipeline: VDL → VDC → Abstract Planner → logical DAX → Concrete Planner (+ Replica Catalogue) → physical DAG → DAGMan]
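For a flavour of the approach, a transformation/derivation pair in the spirit of the textual VDL is sketched below; it follows the TR/DV structure of the language, but the exact punctuation and parameter conventions shown here are illustrative assumptions rather than verbatim Chimera syntax:

    /* Illustrative only -- not verbatim Chimera VDL */
    TR cmkin( output ntpl, none run_number ) {
      argument = "-run "${run_number};
      argument = " -o "${output:ntpl};
    }
    DV run42->cmkin( ntpl=@{output:"eg_bigjets_run42.ntpl"}, run_number="42" );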
Direction of US-CMS Chimera Work
• Monte Carlo production integration
  • RefDB/MCRunJob
  • Already able to perform all production steps
• "Chimera Regional Centre"
  • For quality assurance and scalability testing
  • To be used with low-priority actual production assignments
• User analysis integration
  • GAE/CAIGEE work (Web Services, Clarens)
  • Other generic data analysis packages
• Two equal motivations:
  • Test a generic product which CMS (and ATLAS, etc.) will find useful!
  • Experiment with Virtual Data and Data Provenance: CMS is an excellent use-case!
• Encouraging and inviting more CMS input
  • Ensure that the Chimera effort fits within CMS efforts and solves real (current and future) CMS needs!
• [Diagram: CMS production chain, each step consuming params/executables/data: Generator → Simulator → Formator → Reconstructor → ESD/AOD → Analysis]
Building a Grid-enabled Physics Analysis Desktop
• Data processing tools: interactive visualisation and data analysis (ROOT, etc.)
• Data catalog browser: allows a physicist to find collections of data at the object level
• Data mover: embedded window allowing a physicist to customise data movement
• Network performance monitor: allows a physicist to optimise data movement by dynamically monitoring network conditions
• Computation resource browser, selector and monitor: allows a physicist to view available resources (primarily for development stages of the Grid)
• Storage resource browser: enables a physicist to ensure that enough disk space is available
• Log browser: enables a physicist to get direct feedback from jobs indicating success/failure, etc.
• Many promising alternatives: currently in the process of prototyping and choosing (see the client sketch below)
• [Figure (Koen Holtman and Conrad Steenberg; see Julian Bunn's talk): the production system and data repositories feed TAG/AOD extraction, conversion and transport (Clarens) into ORCA or PROOF-style analysis farms, RDBMS-based data warehouses, and query/data-extraction Web services, reached from local analysis tools (PAW/ROOT) and Web browsers on the user's desktop via Clarens-based plug-in modules]
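Clarens exposes such services over HTTP as XML-RPC (and SOAP) methods, so a thin desktop client can be very small. The sketch below uses Python's standard XML-RPC client; the server URL and the catalog-browsing method name are assumptions for illustration only:

    # Minimal sketch of a Clarens-style XML-RPC client (URL and service method are hypothetical)
    import xmlrpc.client

    server = xmlrpc.client.ServerProxy("http://clarens.example.edu:8080/clarens/")
    print(server.system.listMethods())      # standard XML-RPC introspection, where supported
    print(server.file.ls("/store/tags"))    # hypothetical catalog-browsing call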
How CAIGEE Plans to Use the Testbed
• Based on a client-server scheme:
  • one or more inter-communicating servers
  • a small set of clients logically associated with each server
• Scalable tiered architecture:
  • servers can delegate execution to another server (same or higher level) on the Grid
• Servers offer "web-based services", with the ability to dynamically add or improve them
• [Diagram: Web clients → Grid Services Web Server (catalogs, Abstract and Concrete Planners, Virtual Data Catalogue, Materialised Data Catalogue, monitoring, GDMP, Execution Priority Manager) → Grid-wide Execution Service and Grid processes]
High Speed Data Transport
• R&D work from Caltech, SLAC and DataTAG on data transport is approaching ~1 Gbit/sec per GbE port over long distance networks
• Expect to deploy (including disk-to-disk) on the US-CMS Testbed in 4-6 months
• Anticipate progressing from 10 to 100 MByte/sec, and eventually 1 GByte/sec, over long distance networks (RTT = 60 msec across the US)
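For scale, 1 Gbit/sec at RTT = 60 msec corresponds to a bandwidth-delay product of roughly 7.5 MByte, so a single TCP stream needs a window of that order (or several parallel streams). Memory-to-memory tests of this kind are commonly driven with a tool such as iperf, sketched here with placeholder hostnames and options:

    iperf -s -w 8M                                   # on the receiving host
    iperf -c receiver.example.net -w 8M -t 60        # single stream, 8 MByte window
    iperf -c receiver.example.net -P 4 -w 2M -t 60   # four parallel streams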
Future R&D Directions
• Workflow generation/planning (DISPRO)
• Grid-wide scheduling
• Strengthen the monitoring infrastructure
• VO policy definition and enforcement
• Data analysis framework (CAIGEE)
• Data derivation and data provenance (Chimera)
• Peer-to-peer collaborative environments
• High speed data transport
• Operations (what does it mean to operate a Grid?)
• Interoperability tests between E.U. and U.S. solutions
Conclusions
• US-CMS Grid activities are reaching a healthy "critical mass" in several areas:
  • Testbed infrastructure (VDT, VO, monitoring, etc.)
  • MOP has been (and continues to be) enormously successful
  • US/EU interoperability is beginning to be tested
  • Virtual Data is beginning to be seriously implemented/explored
  • Data analysis efforts are rapidly progressing and being prototyped
  • Interaction with computer scientists has been excellent!
• Much of the work is being done in preparation for the LCG 24x7 production Grid milestone
• We have a lot of work to do, but we feel we are making excellent progress and learning a lot!
Question: Data Flow and Provenance
• Provenance of a data analysis
• "Check-point" a data analysis
• Audit a data analysis
• [Diagram: real and simulated data → Raw → ESD → AOD → TAG → plots, tables, fits → comparisons]