US Grid Initiatives for Particle Physics. Richard P. Mount SLAC HEPCCC SLAC, July 7, 2000. Focus on “Petascale Data Grid and Tier 2 Computing” for LHC, LIGO, SDSS Components: NSF ITR proposal focused on needed Computer Science and middleware (“Virtual Data Toolkit”);
US Grid Initiatives for Particle Physics
Richard P. Mount, SLAC
HEPCCC, SLAC, July 7, 2000
GriPhyN and PPDG
Grid Physics Network (GriPhyN), NSF:
• Focus on “Petascale Data Grid and Tier 2 Computing” for LHC, LIGO, SDSS;
• Components:
  • NSF ITR proposal focused on needed Computer Science and middleware (“Virtual Data Toolkit”);
  • Tier 2 hardware and manpower funding for LHC in the context of US Atlas/CMS computing plans (plus LIGO, SDSS in a separate context).
Particle Physics Data Grid (PPDG), DoE:
• Short-term focus on making existing middleware useful for Run 2, BaBar, RHIC etc.:
  • High-speed data transfer;
  • Cached access to remote data.
• Longer-term focus (was APOGEE):
  • Instrumentation and Monitoring;
  • Modeling distributed data management systems;
  • Agents and virtual data.
• Funding does not include networks or (much) hardware.
Collaborators in GriPhyN and PPDG
[Diagram labels: University Scientists; DoE Laboratory Scientists; Scientists with University and Lab Appointments; Computer Scientists; Physicists (HEP, Gravity Wave, Astronomy); Physicists (HENP); Computer Scientists Supporting Physics]
• Relationship:
  • Significant overlap of GriPhyN and PPDG senior scientists;
  • Coordinated R&D planned.
PPDG Collaborators
Columns: Particle Physics | Accelerator Laboratory | Computer Science
ANL: X X
LBNL: X X
BNL: X X x
Caltech: X X
Fermilab: X X x
Jefferson Lab: X X x
SLAC: X X x
SDSC: X
Wisconsin: X
Sites Participating in PPDG and GriPhyN/LHC
[Map: sites Wisconsin, Fermilab, Boston, BNL, LBNL/UCB, ANL, SLAC, JLAB, Indiana, Caltech, Florida and SDSC, each labelled with its networks, drawn from CalREN, NTON, ESNet, Abilene and MREN]
Management Issues
• In GriPhyN/PPDG there are many collaborating Universities/Labs;
• The funding per institute is (or will be) modest;
Hence:
• PPDG has appointed Doug Olson as full-time “Project Coordinator”;
• GriPhyN plans a full-time project coordinator;
• GriPhyN-PPDG management will be coordinated;
• GriPhyN-PPDG co-PIs include leading members of ‘customer’ experiments;
• GriPhyN-PPDG deliverables (as a function of time) to be agreed with management of ‘customer’ experiments.
Longer-Term Vision Driving PPDG and GriPhyN
• Agent Computing on Virtual Data
Why Agent Computing?
• LHC Grid Hierarchy Example
  • Tier0: CERN
  • Tier1: National “Regional” Center
  • Tier2: Regional Center
  • Tier3: Institute Workgroup Server
  • Tier4: Individual Desktop
• Total 5 Levels
Why Virtual Data?
Typical particle physics experiment in 2000-2005: one year of acquisition and analysis of data
Access rates (aggregate, average):
• 100 Mbytes/s (2-5 physicists)
• 1000 Mbytes/s (10-20 physicists)
• 2000 Mbytes/s (~100 physicists)
• 4000 Mbytes/s (~300 physicists)
Data sets:
• Raw Data: ~1000 Tbytes
• Reco-V1, Reco-V2: ~1000 Tbytes each
• ESD-V1.1, ESD-V1.2, ESD-V2.1, ESD-V2.2: ~100 Tbytes each
• AOD (nine sets): ~10 Tbytes each
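The storage argument behind this slide can be checked with a few lines of arithmetic. The sketch below simply adds up the data-set sizes quoted on the slide; the dictionary layout and function names are illustrative assumptions, not part of the talk.

```python
# Data-set sizes in Tbytes, copied from the slide. Grouping them into a
# dictionary keyed by processing stage is an illustrative assumption.
DATASETS_TB = {
    "Raw": [1000],
    "Reco": [1000, 1000],          # Reco-V1, Reco-V2
    "ESD": [100, 100, 100, 100],   # ESD-V1.1 ... ESD-V2.2
    "AOD": [10] * 9,               # nine AOD sets
}


def total_tb(datasets):
    """Total volume in Tbytes if every data set is stored in full."""
    return sum(sum(sizes) for sizes in datasets.values())


def derived_fraction(datasets):
    """Fraction of the total volume that is derived (non-raw) data."""
    total = total_tb(datasets)
    raw = sum(datasets["Raw"])
    return (total - raw) / total


if __name__ == "__main__":
    print("Total if everything is stored:", total_tb(DATASETS_TB), "Tbytes")
    print("Derived share of that total: %.0f%%" % (100 * derived_fraction(DATASETS_TB)))
```

Storing every version in full comes to roughly 3.5 Petabytes, of which about 70% is derived data that could in principle be recomputed rather than kept.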
GriPhyN-PPDG Direction-Setting Vision: Agent Computing on Virtual Data
• Perform all tasks at the ‘best’ place in the Grid;
  • ‘Best’ implies optimization based on cost, throughput, scientific policy, local policy (e.g. ownership), etc.
• At least 90% of HENP analysis accesses derived data;
• Derived data may be computed:
  • In advance of access, or
  • On the fly;
• Derived data may be stored:
  • Nowhere, or
  • As one or many distributed copies.
• Maximize analysis capabilities per $ spent on storage, network and CPU.
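One way to picture “virtual data” is a catalog that knows how to produce each derived data set and may or may not keep a copy. The sketch below is a minimal in-memory illustration under that assumption; `VirtualDataCatalog`, its methods and the `keep_copy` flag are hypothetical names, not an API from GriPhyN or PPDG.

```python
from typing import Callable, Dict


class VirtualDataCatalog:
    """Hypothetical catalog: each derived data set has a recipe and,
    optionally, a stored copy."""

    def __init__(self) -> None:
        self._recipes: Dict[str, Callable[[], bytes]] = {}
        self._keep_copy: Dict[str, bool] = {}
        self._stored: Dict[str, bytes] = {}
        self.computations = 0  # counts how often a recipe was run

    def register(self, name: str, recipe: Callable[[], bytes], keep_copy: bool) -> None:
        """Declare a derived data set without computing it."""
        self._recipes[name] = recipe
        self._keep_copy[name] = keep_copy

    def materialize(self, name: str) -> None:
        """Compute in advance of access and store the result."""
        self._stored[name] = self._run(name)

    def get(self, name: str) -> bytes:
        """Return a stored copy if one exists, else compute on the fly."""
        if name in self._stored:
            return self._stored[name]
        data = self._run(name)
        if self._keep_copy[name]:
            self._stored[name] = data
        return data

    def _run(self, name: str) -> bytes:
        self.computations += 1
        return self._recipes[name]()


if __name__ == "__main__":
    catalog = VirtualDataCatalog()
    catalog.register("AOD", lambda: b"aod-bytes", keep_copy=True)
    catalog.register("ESD", lambda: b"esd-bytes", keep_copy=False)
    catalog.get("AOD"); catalog.get("AOD")   # computed once, then served from the copy
    catalog.get("ESD"); catalog.get("ESD")   # stored nowhere, recomputed each time
    print("recipes run:", catalog.computations)
```

The caller asks for data by name only; whether it was computed in advance, on the fly, or served from a copy is a policy decision hidden behind `get`.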
Towards the Goals
• Evaluation and exploitation of computer-science and commercial ‘products’ (Globus, SRB, Grand Challenge, OOFS …);
• Instrumentation and monitoring at all levels;
• Modeling of distributed data management systems (especially failure modes);
• Testing everything in the environment of real physics experiments;
• Major computer-science developments in:
  • Information models;
  • Resource management and usage optimization models;
  • Workflow management models;
  • Distributed service models.
Funding Needs and Perspectives
• GriPhyN NSF ITR proposal:
  • $2.5M/year for 5 years;
  • Status: proposal appears to have reviewed well … awaiting final decision;
• Tier 2 centers and network enhancements:
  • Plans being developed (order of magnitude $60M);
  • Discussions with NSF.
• PPDG project:
  • Funded at $1.2M in August 1999 (DoE/OASCR/MICS NGI);
  • Plus $1.2M in June 2000 (DoE/OASCR/MICS + DoE/HENP);
  • Heavy leverage of facilities and personnel supporting current HEP experiments.
• PPDG future:
  • FY 2001 onwards: needs in the range $3M to $4M per year.
First Year PPDG Deliverables
Implement and run two services in support of the major physics experiments at BNL, FNAL, JLAB, SLAC:
• “High-Speed Site-to-Site File Replication Service”: data replication up to 100 Mbytes/s;
• “Multi-Site Cached File Access Service”: based on deployment of file-cataloging, and transparent cache-management and data movement middleware.
  • First year: optimized cached read access to files in the range of 1-10 Gbytes, from a total data set of order one Petabyte;
  • Using middleware components already developed by the proponents.
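To get a feel for these targets, the sketch below computes how long transfers would take at the quoted 100 Mbytes/s. It assumes decimal units (1 Gbyte = 10^9 bytes, 1 Petabyte = 10^15 bytes) and a sustained rate with no protocol overhead; both are simplifying assumptions, and the function name is illustrative.

```python
MB = 10**6   # decimal units assumed throughout
GB = 10**9
PB = 10**15

TARGET_RATE = 100 * MB  # bytes per second, the replication target on the slide


def transfer_seconds(size_bytes, rate_bytes_per_s=TARGET_RATE):
    """Idealized transfer time: size divided by a sustained rate."""
    return size_bytes / rate_bytes_per_s


if __name__ == "__main__":
    for size_gb in (1, 10):
        print(f"{size_gb} Gbyte file: {transfer_seconds(size_gb * GB):.0f} s")
    days = transfer_seconds(1 * PB) / 86400
    print(f"1 Petabyte data set: {days:.0f} days")
```

Individual files in the 1-10 Gbyte range move in seconds to minutes at the target rate, while replicating the whole Petabyte-scale data set would take on the order of months, which is one reason cached access to selected files matters.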
PPDG Site-to-Site Replication Service
[Diagram: PRIMARY SITE (Data Acquisition, CPU, Disk, Tape Robot) and SECONDARY SITE (CPU, Disk, Tape Robot)]
• Network protocols tuned for high throughput;
• Use of DiffServ for (1) predictable high-priority delivery of high-bandwidth data streams; (2) reliable background transfers;
• Use of integrated instrumentation to detect/diagnose/correct problems in long-lived high-speed transfers [NetLogger + DoE/NGI developments];
• Coordinated reservation/allocation techniques for storage-to-storage performance.
PPDG Multi-site Cached File Access System
[Diagram: one PRIMARY SITE (Data Acquisition, Tape, CPU, Disk, Robot), three Satellite Sites (Tape, CPU, Disk, Robot each) and three Universities (CPU, Disk, Users each)]
First Year PPDG “System” Components
Middleware Components (Initial Choice): see PPDG Proposal, page 15
• Object and File-Based Application Services: Objectivity/DB (SLAC enhanced); GC Query Object, Event Iterator, Query Monitor; FNAL SAM System
• Resource Management: start with human intervention (but begin to deploy resource discovery & management tools)
• File Access Service: components of OOFS (SLAC)
• Cache Manager: GC Cache Manager (LBNL)
• Mass Storage Manager: HPSS, Enstore, OSM (site-dependent)
• Matchmaking Service: Condor (U. Wisconsin)
• File Replication Index: MCAT (SDSC)
• Transfer Cost Estimation Service: Globus (ANL)
• File Fetching Service: components of OOFS
• File Mover(s): SRB (SDSC); site-specific
• End-to-end Network Services: Globus tools for QoS reservation
• Security and authentication: Globus (ANL)
Fig 1: Architecture for the general scenario - needed APIs
[Diagram components: Application (data request); Client (file request); Request Interpreter; Request Manager; Resource Planner; Cache Manager; Local Site Manager; Local Resource Manager; Logical Index service; Properties, Events, Files Index; File Replica Catalog; Matchmaking Service; File Access service; Storage Access service; Storage Reservation service; Remote Services; GLOBUS Services Layer; To Network]
[Diagram messages: logical request (property predicates / event set); files to be retrieved {file: events}; request to move files {file: from, to}; request to reserve space {cache_location: # bytes}]
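The message shapes named in the figure ({file: events}, {file: from, to}, {cache_location: # bytes}) suggest a simple request flow, sketched below with plain dictionaries. The ordering of the steps, the function names, the choice of the cheapest replica, and all example values are assumptions made for illustration; the figure itself does not specify them.

```python
from typing import Dict, List, Set, Tuple


def interpret_request(index: Dict[str, Set[int]], wanted: Set[int]) -> Dict[str, Set[int]]:
    """Request Interpreter (assumed role): map wanted events to the files
    holding them. Returns {file: events}."""
    files = {}
    for file_name, events in index.items():
        hits = events & wanted
        if hits:
            files[file_name] = hits
    return files


def plan_moves(files: Dict[str, Set[int]],
               replica_catalog: Dict[str, List[str]],
               cost: Dict[str, float],
               destination: str) -> Dict[str, Tuple[str, str]]:
    """Request Manager (assumed role): pick the cheapest replica of each
    file. Returns {file: (from, to)}."""
    return {
        file_name: (min(replica_catalog[file_name], key=cost.__getitem__), destination)
        for file_name in files
    }


def reserve_space(moves: Dict[str, Tuple[str, str]],
                  sizes: Dict[str, int],
                  cache_location: str) -> Dict[str, int]:
    """Build the space reservation request. Returns {cache_location: bytes}."""
    return {cache_location: sum(sizes[f] for f in moves)}


if __name__ == "__main__":
    # All values below are made up for illustration.
    index = {"f1": {1, 2, 3}, "f2": {4, 5}, "f3": {6}}
    replicas = {"f1": ["SLAC", "LBNL"], "f2": ["SLAC"], "f3": ["ANL", "LBNL"]}
    cost = {"SLAC": 5.0, "LBNL": 2.0, "ANL": 1.0}
    sizes = {"f1": 2 * 10**9, "f2": 10**9, "f3": 5 * 10**9}

    files = interpret_request(index, wanted={2, 6})
    moves = plan_moves(files, replicas, cost, destination="Wisconsin")
    print(moves)
    print(reserve_space(moves, sizes, cache_location="Wisconsin"))
```

In a real deployment the per-site cost would come from a transfer cost estimation service and the replica list from the replica catalog, rather than from static dictionaries.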
PPDG First Year Progress
• Demonstration of multi-site cached file access based mainly on SRB (LBNL, ANL, U. Wisconsin);
• Evaluation and development of bulk-transfer tools (gsiftp, bbftp, sfcp …);
• Modest-speed site-to-site transfer services, e.g. SLAC-Lyon, Fermilab to Indiana;
• Valiant attempts (continuing) to establish a multiple OC12 path between SLAC and Caltech.
http://www-user.slac.stanford.edu/rmount/public/PPDG_HENP_april00_public.doc
Progress: Multi-site Cached File Access
• Exploratory installations of components of Globus at Fermilab, Wisconsin, ANL, SLAC, Caltech;
• Exploratory installations of SRB at LBNL, Wisconsin, ANL, Fermilab;
• SRB used in successful demonstration of Wisconsin and Fermilab accessing files, via ANL cache, originating in the LBNL HPSS.
Progress: 100 Mbytes/s Site-to-Site
• Focus on SLAC-Caltech over NTON;
• Fibers in place;
• SLAC Cisco 12000 with OC48 and 2 × OC12 in place;
• 300 Mbits/s single stream achieved recently;
• Lower-speed Fermilab-Indiana trials.
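The slide mixes Mbits/s and Mbytes/s, so a quick unit conversion helps place the 300 Mbits/s result against the 100 Mbytes/s goal. The sketch below uses the standard SONET line rates for OC3, OC12 and OC48; these are raw line rates, and usable payload is lower, so the figures are upper bounds. The names are illustrative.

```python
# Standard SONET line rates in Mbit/s (OC-n = n x 51.84 Mbit/s).
# These are raw line rates; usable payload throughput is lower.
OC_LINE_RATE_MBIT = {"OC3": 155.52, "OC12": 622.08, "OC48": 2488.32}

GOAL_MBYTES_PER_S = 100.0  # site-to-site target from the slide


def mbit_to_mbyte(mbit_per_s):
    """Convert Mbit/s to Mbyte/s (8 bits per byte)."""
    return mbit_per_s / 8.0


def fraction_of_goal(mbit_per_s, goal=GOAL_MBYTES_PER_S):
    """Share of the Mbyte/s goal reached by a rate given in Mbit/s."""
    return mbit_to_mbyte(mbit_per_s) / goal


if __name__ == "__main__":
    print("300 Mbit/s single stream =", mbit_to_mbyte(300), "Mbyte/s")
    print("2 x OC12 line rate = %.2f Mbyte/s" % mbit_to_mbyte(2 * OC_LINE_RATE_MBIT["OC12"]))
    print("OC48 line rate = %.2f Mbyte/s" % mbit_to_mbyte(OC_LINE_RATE_MBIT["OC48"]))
```

A single 300 Mbits/s stream is 37.5 Mbytes/s, a bit over a third of the goal, while two OC12 links offer at most about 155 Mbytes/s before overhead, so the target needs most of that capacity to be usable in practice.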
PPDG Work at Caltech (High-Speed File Transfer)
• Work on the NTON connections between Caltech and SLAC:
  • Test with 8 OC3 adapters on the Caltech Exemplar multiplexed across to a SLAC Cisco GSR router. Limited throughput due to small MTU in the GSR.
  • Purchased a Dell dual Pentium III based server with two OC12 ATM cards. Configured to allow aggregate transfer of more than 100 Mbytes/s in both directions between Caltech and SLAC.
• Monitoring tools installed at Caltech/CACR:
  • PingER installed to monitor WAN HEP connectivity;
  • A Surveyor device will be installed soon, for very precise measurement of network traffic speeds.
• Investigations into a distributed resource management architecture that co-manages processors and data.
Towards Serious Deployment
• Agreement by CDF and D0 to make a serious effort to use PPDG services.
• Rapidly rising enthusiasm in BaBar: the SLAC-CCIN2P3 “Grid” MUST be made to work.
A Global HEP Grid Program?
• HEP grid people see international collaboration as vital to their mission;
• CS Grid people are very enthusiastic about international collaborations;
• National funding agencies:
  • Welcome international collaboration;
  • Often need to show benefits for national competitiveness.