CMS and LHC Software and Computing: The Caltech Tier2 and Grid Enabled Analysis (GAE, Tier2, MonALISA)
Overview • CMS Computing & Software: Data Grid / Challenges • Grid Projects • Grid Analysis Environment (GAE) • Tier2 • Distributed Physics Production • Calorimetry/Muon Software • Conclusion
CMS Computing and Software: Data Grid Model • Organization • Scope • Requirements • Data Challenges
LHC Data Grid Hierarchy (Developed at Caltech) • Emerging Vision: A Richly Structured, Global Dynamic System • Online System (experiment): ~PByte/sec at the detector; ~100-1500 MBytes/sec to the Tier 0+1 CERN Center (PBs of disk; tape robot) • Tier 1 centres (FNAL, IN2P3, INFN, RAL) linked at ~10 Gbps • Tier 2 centres at ~2.5-10 Gbps • Tier 3 institute physics data caches at 0.1 to 10 Gbps • Tier 4 workstations • CERN/Outside Resource Ratio ~1:2; Tier0 / (all Tier1) / (all Tier2) ~1:1:1 • Tens of Petabytes by 2007-8; an Exabyte by ~2015
CPT Project • PRS (Physics Reconstruction and Selection): Tracker / b-tau; E-gamma / ECAL; HCAL / Jets, MEt; Muons; Physics Groups: Higgs, SUSY & Beyond SM, Standard Model, Heavy Ions • CCS (Core Computing & Software): Computing Centres; CMS Computing Services; Architecture, Frameworks / Toolkits; Software Users and Developers Environment; Production Processing & Data Management • TriDAS Online Software: Online (DAQ) Software Framework; Online Farms • Leading/significant Caltech activity: SCB Chair 1996-2001; CCS/CPT Management Board 2002-2003; USCMS Leadership
DC04 Data Challenge: 30 Million T0 events processed • Concentrated on the "Organized, Collaboration-Managed" Aspects of Data Flow and Access • T0 at CERN in DC04: 25 Hz input event rate (peak); reconstruct quasi-realtime; events filtered into streams; record raw data and DST; distribute raw data and DST to T1s • T1 centres in DC04 (FNAL Chicago, FZK Karlsruhe, RAL Oxford, CNAF Bologna, PIC Barcelona, IN2P3 Lyon): pull data from T0 and store; make data available to PRS; demonstrate quasi-realtime "fake" analysis of DSTs • T2 centres in DC04 (Tier2 @ Caltech, UFlorida, UCSD): pre-challenge production at > 30 sites; modest tests of DST analysis
CMS Computing and Core Software (CCS) Progress • DC04 (5% complexity): the challenge is complete but post-mortem write-ups are still in progress • Demonstrated that the system can work for well-controlled data flow and analysis, and a few expert users • The next challenge is to make this usable by average physicists and to demonstrate that the performance scales acceptably • CCS Technical Design Report (TDR): aligned with the LCG TDR submission (July 2005) • DC05 (10% complexity): challenge in Autumn 2005, to avoid "destructive interference" with the Physics TDR analysis
Grid Projects • Trillium: coordinates PPDG, GriPhyN and iVDGL. DOE and NSF working together: DOE (labs), NSF (universities). Strengthening outreach efforts • TeraGrid: initially Caltech, Argonne, NCSA, SDSC, now expanded. Massive Grid resources • CHEPREO: extending Grids to South America; FIU, Caltech CMS, Brazil • CAIGEE/GAE • UltraLight: next-generation Grid and hybrid optical network for data-intensive research • PPDG (PI) / SciDAC: Particle Physics Data Grid. Funded by DOE in 1999; new funding in 2004-6. Deployment of Grid computing in existing HEP experiments; mainly physicists • GriPhyN/iVDGL (Co-PI): Grid Physics Network, international Virtual Data Grid Laboratory. Funded by NSF in 1999. Grid middleware (VDT, "Virtual Data"), Tier2 deployment and Grid Operations. HENP, astronomy, gravity-wave physics • Open Science Grid: Caltech/FNAL/Brookhaven and others combine computing resources at several DOE labs and at dozens of universities to effectively become a single national computing infrastructure for science, the Open Science Grid • Others: EU DataGrid, CrossGrid, LHC Computing Grid (LCG), etc.
UltraLight Collaboration (http://ultralight.caltech.edu): Caltech, UF, FIU, UMich, SLAC, FNAL, MIT/Haystack, CERN, UERJ (Rio), NLR, CENIC, UCAID, TransLight, UKLight, NetherLight, UvA, UCLondon, KEK, Taiwan, Cisco, Level(3) • Integrated hybrid (packet-switched + dynamic optical paths) experimental network, leveraging transatlantic R&D network partnerships • 10 GbE across the US and the Atlantic: NLR, DataTAG, TransLight, NetherLight, UKLight, etc.; extensions to Japan, Taiwan, Brazil • End-to-end monitoring; realtime tracking and optimization; dynamic bandwidth provisioning • Agent-based services spanning all layers of the system, from the optical cross-connects to the applications
The Grid Analysis Environment (GAE): "Where the Physics Gets Done"
Grid Analysis Environment • The "Acid Test" for Grids; crucial for the LHC experiments • Large, diverse, distributed community of users • Support for 100s to 1000s of analysis tasks, shared among dozens of sites • Widely varying task requirements and priorities • Need for priority schemes, robust authentication and security • Operates in a resource-limited and policy-constrained global system • Dominated by collaboration policy and strategy (resource usage and priorities) • Requires real-time monitoring; task and workflow tracking; decisions often based on a global system view
GAE Architecture • Analysis Clients talk standard protocols (HTTP, SOAP, XML-RPC) to the "Grid Services Web Server", a.k.a. the Clarens data/services portal • A simple Web service API allows Analysis Clients (simple or complex) to operate in this architecture • Typical clients: ROOT, Web Browser, IGUANA, COJAC • The Clarens portal hides the complexity of the Grid • Key features: Global Scheduler, Catalogs, Monitoring, and Grid-wide Execution service • Architecture diagram components (support for the physics analysis and computing model): Analysis Clients (ROOT/Clarens, COJAC, IGUANA); Grid Services Web Server (Clarens); Sphinx scheduler; fully-abstract, partially-abstract and fully-concrete planners (MCRunjob, Chimera); catalogs (RefDB/MOPDB, POOL, metadata, virtual data, replica); data management; MonALISA monitoring; applications (ORCA, FAMOS, ROOT); grid execution layer (Priority Manager, BOSS, VDT-Server, Grid-Wide Execution Service)
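To make the "standard protocols" point concrete, here is a minimal Python sketch of an analysis client talking XML-RPC to a Clarens portal. It is an assumption-laden illustration, not the Clarens API: the portal URL and the catalog.lookup method name are placeholders, and only system.listMethods relies on the standard XML-RPC introspection convention (available when a server enables it).

import xmlrpc.client

# Connect to a (hypothetical) Clarens "Grid Services Web Server".
portal = xmlrpc.client.ServerProxy("https://tier2.example.edu:8443/clarens")

# Standard XML-RPC introspection: list the service methods the portal exposes.
print(portal.system.listMethods())

# Hypothetical catalog call: ask a data-management service which replicas
# exist for one logical file before the scheduler assigns the analysis task.
replicas = portal.catalog.lookup("lfn:/cms/dc04/dst/example_0001.root")
print(replicas)

The same request could equally be issued over SOAP or plain HTTP; the point of the architecture is that nothing heavier than a commodity web-services stack is needed on the client side.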
Structured Peer-to-Peer GAE Architecture • The GAE, based on Clarens and Web services, easily allows a "Peer-to-Peer" configuration to be built, with associated robustness and scalability features • The P2P configuration allows easy creation, use and management of complex VO structures
Grid-Enabled Analysis Prototypes • ROOT (via Clarens) • COJAC (via Web Services) • Collaboration Analysis Desktop • JASOnPDA (via Clarens)
GAE Integration with CMS and LHC Software • Clarens Servers: Python and Java versions available • RefDB (stores Job/Task parameters or "cards"): replica of DC04 production details available on the Tier2 • POOL (persistency framework for the LHC): a 60 GB POOL file catalog has been created on the Tier2, based on DC04 files • MCRunjob/MOP (CMS Job/Task workflow description for batch): integration into the Clarens framework, by FNAL/Caltech • BOSS (CMS Job/Task book-keeping system): INFN is working on the development of a web service interface to BOSS • SPHINX: distributed scheduler developed at UFL • Clarens/MonALISA Integration: facilitating user-level Job/Task monitoring; Caltech MURF summer student Paul Acciavatti • CAVES: analysis code-sharing environment developed at UFL • Core System: service auto-discovery, proxy, authentication, ... • R&D of middleware grid services for a distributed data analysis system: Clarens integrates across CMS/LCG/EGEE and US Grid software; CMS core software and grid middleware expertise
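As an illustration of how an existing CMS tool acquires a web-service interface in this kind of integration, the sketch below wraps a job-status lookup behind XML-RPC using only the Python standard library. It is not the real BOSS or Clarens code: the method name boss.jobStatus, the port, and the canned reply are invented for the example.

from xmlrpc.server import SimpleXMLRPCServer

def job_status(job_id):
    # A real integration would query the BOSS bookkeeping database here;
    # a canned record keeps the sketch self-contained.
    return {"job": job_id, "state": "Running", "site": "Caltech-Tier2"}

server = SimpleXMLRPCServer(("0.0.0.0", 8080), allow_none=True)
server.register_introspection_functions()
server.register_function(job_status, "boss.jobStatus")   # hypothetical name
server.serve_forever()

Once such a wrapper (or the equivalent Clarens service module) is exposed, any GAE client can query job state with the same kind of call it uses for every other service.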
GAE Deployment • 20 known Clarens deployments: Caltech, Florida (9 machines), Fermilab (3), CERN (3), Pakistan (2+2), INFN (1) • Installation of CMS (ORCA, COBRA, IGUANA, ...) and LCG (POOL, SEAL, ...) software on the Caltech GAE testbed for integration studies • Work with CERN to include the GAE components in the CMS software stack • GAE components being integrated in the US-CMS DPE distribution • Demonstrated a distributed multi-user GAE prototype at SC03 and elsewhere • Ultimate goal: GAE backbone (Clarens) deployed at all Tier-N facilities, with a rich variety of Clarens web servers offering GAE services interfaced with CMS and LCG software
GAE Summary • Clarens Services-Fabric and “Portal to the Grid” maturing. • Numerous servers and clients deployed in CMS • Integration of GAE with MonALISA progressing: • A scalable multi-user system • Joint GAE collaborations with UFlorida, FNAL and PPDG “CS11” very productive • Production work with FNAL • Mentoring PostDocs, CS Students, Undergraduates • Rationalising the new EGEE ARDA/gLite work with GAE • GAE project description and detailed information: http://ultralight.caltech.edu/gaeweb
Caltech Tier2 Background • The Tier2 Concept Originated at Caltech in Spring 1999 • The first Tier2 Prototype was proposed by Caltech together with UCSD in 2000 • It was designed, commissioned and brought into production in 2001 • It Quickly Became a Focal Point Supporting A Variety of Physics Studies and R&D Activities • The Proof of Concept of the Tier2 and the LHC Distributed Computing Model • Service to US CMS and CMS for Analysis + Grid R&D • Production: CMS Data Challenges, annually from Fall 2000; H and Calibration Studies with CACR, NCSA, etc. • Grid Testbeds: Development, Integration, Production • Cluster Hardware & Network Development
Tier2 – Example Use in 2001: "Bandwidth Greedy Grid-Enabled Object Collection Analysis for Particle Physics" (Julian Bunn, Ian Fisk, Koen Holtman, Harvey Newman, James Patton) • Setup: the Denver client holds a "Tag" database of ~140,000 small objects; full-event databases of ~40,000 large objects at the Caltech Tier2 and ~100,000 at the San Diego Tier2; requests served via parallel tuned GSI FTP • The object of the demo is to show grid-supported interactive physics analysis on a set of 144,000 physics events: initially 144,000 small Tag objects, one per event, on the Denver client, and 144,000 LARGE objects containing full event data divided over the two Tier2 servers • Workflow: using the local Tag event database, the user plots event parameters of interest • The user selects a subset of events to be fetched for further analysis • Lists of matching events are sent to Caltech and San Diego • The Tier2 servers sort through their databases, extracting the required events • For each required event, a new large virtual object containing all tracks in the event is materialized in the server-side cache • The database files containing the new objects are sent to the client using Globus FTP; the client adds them to its local cache of large objects • The user can now plot event parameters not available in the Tag • Future requests take advantage of previously cached large objects on the client
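The workflow above can be compressed into a short Python sketch. It is schematic: the selection cut, the file layout and the gsiftp URLs are invented, and the 2001 demo actually used Objectivity databases with parallel tuned GSI FTP rather than one globus-url-copy call per event.

import os
import subprocess

CACHE = os.path.expanduser("~/event_cache")
SERVERS = ["gsiftp://tier2-a.example.edu", "gsiftp://tier2-b.example.edu"]

def select_events(tags, cut=50.0):
    # Tag-level selection: cut on a quantity already in the small Tag objects.
    return [t["event_id"] for t in tags if t["missing_et"] > cut]

def fetch(event_id, server):
    # Server-side materialization and transfer of the large object's file;
    # a GridFTP client performs the copy, and the local cache is reused.
    dest = os.path.join(CACHE, "%d.db" % event_id)
    if not os.path.exists(dest):
        subprocess.run(["globus-url-copy",
                        "%s/events/%d.db" % (server, event_id),
                        "file://" + dest], check=True)
    return dest

os.makedirs(CACHE, exist_ok=True)
tags = [{"event_id": i, "missing_et": 40.0 + (i % 30)} for i in range(1000)]
for i, ev in enumerate(select_events(tags)):
    fetch(ev, SERVERS[i % len(SERVERS)])

The bandwidth-greedy character of the demo lives in the fetch step: only events passing the Tag-level cut travel over the network, and each travels only once.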
Current Tier2 (diagram): Newisys quad-Opteron network server; tier2c quad-Itanium2 (Windows 2003); Winchester RAID5 storage; APC UPS; Force10 E600 switch; Dell 5224 switch; Cisco 7606 switch/router; network management server; network & MonALISA servers; clusters Caltech-Tier2, Caltech-Grid3, Caltech-DGT, Caltech-PG
SuperComputing 2003 (Phoenix): Bandwidth Challenge, "Data Intensive Distributed Analysis" • Multiple files of ~800k simulated CMS events, stored on Clarens servers at the CENIC POP in LA and the TeraGrid node at Caltech • Transferred > 200 files at rates up to 400 MB/s to 2 disk servers at Phoenix • Files converted into ROOT format, then published via Clarens • Data analysis results displayed by the Clarens ROOT client • Sustained rate: 26.2 Gbps (10.0 + 8.2 + 8.0) • Note: this Bandwidth Challenge subsequently prompted the gift of a 10G wave to the Caltech campus (and thence the Tier2)
Caltech Tier2 Users Today • The Tier2 is supporting scientists in California and at other remote institutes • Physics studies (Caltech, Davis, Riverside, UCLA) • Physics productions (Caltech, CERN, FNAL, UFlorida) • Network developments and measurements (Caltech, CERN, Korea, SLAC) • MonALISA (Caltech, Romania) • CAIGEE collaboration (Caltech, Riverside, UCSD, Davis) • GAE work (Caltech, Florida, CERN)
US-CMS ASCB “Tier2 Retreat” http://pcbunn.cacr.caltech.edu/Tier2/Tier2-Meeting.html • Two day “Tier2 Retreat” hosted by Caltech • To Review Progress at the 3 Proto-Tier2s; to Update the Requirements, and the Concept • The start of the procedure by which the production US-CMS Tier2 Centres will be chosen • Outcomes: • Fruitful discussions on Tier2 operations, scaling, and role in the LHC Distributed Computing Model • Existing prototypes will become Production Tier2 centres; recognising their excellent progress and successes • Call for Proposal document being prepared by FNAL: Iowa, Wisconsin, MIT, expected to bid
Physics Channel Studies: Monte Carlo Production & CMS Calorimetry and Muon Endcap Software
CMS Events Produced by Caltech for Higgs and Other Analysis • Themes: H and Gravitons, Calibration; events for DC04 and the Physics TDR • Tier2 & Grid3 • Caltech: we have the largest NRAC award on the TeraGrid • NCSA clusters: IA64 TeraGrid; Platinum IA32; Xeon • SDSC: IA64 TeraGrid cluster • NERSC: AIX supercomputer • UNM: Los Lobos cluster • *Madison: Condor flock • PYTHIA, CMSIM, CERNLIB, GEANT3 ported to IA64 • Total: 2.6 M node-hours of CPU time • (**) Some datasets analyzed more than once • (***) To be done in second half of 2004
Calorimetry Core Software (R. Wilkinson, V. Litvin) • Major "Skeleton Transplant" underway: adopting the same "Common Detector" core framework as the Tracker & Muon packages • Use of the common detector framework facilitates tracking across subdetectors • The Common Detector Framework also allows: realistic grouping of DAQ readout; cell-by-cell simulation on demand; taking account of misalignments • Developers: R. Wilkinson (Caltech): lead design + HCAL; V. Litvin (Caltech): preshower; A. Holzner (Zurich-ETH): geometry + ECAL barrel; H. Neal (Yale): ECAL endcap
Endcap Muon Slice Test DAQ Software • Software developed by R. Wilkinson in 2003-2004 now adopted by the chamber community • Handles VME configuration and control for five kinds of boards • Builds events from multiple data & trigger readout PCs • Supports real-time data monitoring • Unpacks data • Packs simulated data into raw format • Ran successfully at the May/June 2004 Test Beam • Now supported by Rice U. (Padley, Tumanov, Geurts) • Challenges ahead for Sept. EMU/RPC/HCAL tests Builds on Expertise in EMU Software, Reconstruction and Trigger; and Large Body of Work in 2001-3
Summary & Conclusions (1) • Caltech Originated & Led CMS and US CMS Software & Computing in the Early Stages (1996-2001) • Carries Over Into Diverse Lead-Development Roles Today • Leverages strong Grid-related support from DOE/MICS and NSF/CISE • The Tiered “Data Grid Hierarchy” Originated at Caltech is the Basis of LHC Computing • Tier 0/1/2 roles, interactions and task/data movements increasingly well-understood, through data challenges on increasing scales • Caltech’s Tier2 is now a well-developed multi-faceted production and R&D facility for US CMS physicists and computer scientists collaborating on LHC physics analysis, Grid projects, and networking advances • Grid projects (PPDG, GriPhyN, iVDGL, Grid3, UltraLight etc.) are successfully providing infrastructure, expertise, and collaborative effort • CMS Core Computing and Software (CCS), Physics Reconstruction and Selection (PRS) Groups: Excellent Progress • Major confidence boost with DC04 – managed productions peaking at rates > 25Hz; Caltech Tier2 and MonALISA Took Part
Summary & Conclusions (2) • Caltech leading the Grid Analysis Environment work for CMS • Clarens Web-services fabric and "Portal to the Grid" maturing • Joint work with UFlorida, CERN, FNAL, PPDG/CS11 on system architecture, tools, and productions • Numerous demonstrators and hardened GAE tools and clients implemented, and serving US CMS and CMS needs • Integration with MonALISA to flesh out the Distributed System Services Architecture needed for a managed Grid on a global scale • Groundbreaking work on Core CMS Software and Simulation Studies • HCAL/ECAL/Preshower common software foundation, and EMU DAQ software for the CMS Slice Test: developments led by Caltech • Higgs channel background simulations at very large scale, leveraging global resources opportunistically • Enables US CMS Lead Role in H Analysis: Caltech + UCSD
GAE Recent Prototypes • Clarens Virtual Organization Management • Clarens BOSS Interface • Clarens POOL Interface • Clarens Remote File Access
Event Generation, Simulation and Reconstruction for CMS – Caltech Tasks [task table not reproduced] • (**) Some datasets were reanalyzed more than once • (***) To be done in second half of 2004
Grid Enabled Analysis: User View of a Collaborative Desktop • Physics analysis requires varying levels of interactivity, from "instantaneous response" to "background" to "batch mode" • Requires adapting the classical Grid "batch-oriented" view to a services-oriented view, with tasks monitored and tracked • Use Web Services, leveraging wide availability of commodity tools and protocols: adaptable to a variety of platforms • Implement the Clarens Web Services layer as the mediator between authenticated clients and services, as part of the CAIGEE architecture • Clarens presents a consistent analysis environment to users, based on WSDL/SOAP or XML-RPCs, with PKI-based authentication for security • Diagram: clients (Browser, IGUANA, ROOT, PDA) reach Clarens services (VO Management, File Access, Monitoring, Authentication, Key Escrow, Shell, Authorization, Logging), which connect to external services (Storage Resource Broker, CMS ORCA/COBRA, MonALISA, Cluster Schedulers, ATLAS DIAL, GriPhyN VDT)
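To make the PKI point concrete, the sketch below issues an XML-RPC call over HTTPS while presenting the user's grid proxy certificate. The proxy path, port and portal URL are placeholders; Clarens also supports signed-message authentication, which this sketch does not attempt to reproduce.

import ssl
import xmlrpc.client

# Load the user's proxy (certificate and key concatenated in one file; the
# path below is a placeholder) into the TLS context used for the connection.
ctx = ssl.create_default_context()
ctx.load_cert_chain(certfile="/tmp/x509up_u1000")

portal = xmlrpc.client.ServerProxy(
    "https://tier2.example.edu:8443/clarens", context=ctx)
print(portal.system.listMethods())

From the user's desktop the collaborative environment then looks like a single authenticated portal, whether the client is a web browser, ROOT, IGUANA or a PDA application.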