CMS Data Analysis: Current Status and Future Strategy
On behalf of the CMS Collaboration
Lassi A. Tuura, Northeastern University, Boston
Overview
• The Context — CMS Analysis Today
• Data Analysis Environment Architecture
  • Overview
  • COBRA
  • IGUANA
  • GRID/Production
• Tomorrow and Beyond
  • Leveraging current frameworks in the Grid-enriched analysis environment
  • Clarens client-server prototype
  • Other prototype activities
Context
Challenges:
• Complexity
• Geographic dispersion
• Direct access to data
• Migration from reconstruction to trigger
Environments:
• Real-time event filter, online monitoring
• Pre-emptive simulation, reconstruction, analysis
• Interactive statistical analysis
Current CMS Production
• Pythia → HEPEVT ntuples
• CMSIM (GEANT3) → Zebra files with HITS
• ORCA/COBRA ooHit formatter → Objectivity database
• OSCAR/COBRA (GEANT4) → Objectivity database
• ORCA/COBRA digitization (merging signal and pile-up) → Objectivity database
• ORCA user analysis → ntuples or ROOT files
• IGUANA interactive analysis
Complexity of Production 2002
Number of regional centers: 11
Number of computing centers: 21
Number of CPUs: ~1000
Largest local center: 176 CPUs
Production passes per dataset (including analysis-group processing done by production): 6-8
Number of files: ~11,000
Data size (not including fz files from simulation): 17 TB
File transfer by GDMP and by Perl scripts over scp/bbcp: 7 TB toward T1, 4 TB toward T2
Interactive Analysis
• Most analysis is done using ntuples in PAW, some in ROOT
• Emacs used to edit a CMS C++ plugin that creates and fills histograms
• OpenInventor-based display of a selected event
• ANAPHE histograms extended with pointers to CMS events
• Python shell with Lizard & CMS modules; Lizard Qt plotter
Behind the Scenes: Frameworks
[Diagram: a consistent user interface (data browser, generic analysis tools, detector/event display, analysis job wizards, federation wizards, Objy tools, CMS tools) built on the CMS frameworks ORCA, OSCAR, FAMOS and COBRA, with coherent basic tools and mechanisms, on top of the GRID distributed data store and computing infrastructure]
Frameworks Dissected
[Diagram: specific frameworks (reconstruction algorithms, event filter, physics analysis, data monitoring) producing (Grid-aware) data products and Grid-uploadable physics modules; a generic application framework holding event, calibration and configuration objects; adapters and extensions over basic services: ODBMS, GEANT 3/4, CLHEP, PAW replacement, C++ standard library + extension toolkits]
Framework Design Basis
• Several frameworks together provide the environment
• Open: no central framework with all the functionality
  • Frameworks are designed to be extensible
  • … and to collaborate with other software
• Coherent: the user sees a single, smooth interface
  • Achieved by integrating the frameworks together
  • … but the user does not do this work himself/herself!
• Design applied at both the framework and the object design level
• Successfully applied in many parts of CMS software
  • Applications, persistency, sub-frameworks, visualisation, …
  • No loss of usability, functionality or performance
  • Has made it easy to integrate directly with many existing tools
• This is nothing novel: it is part of the standard risk-mitigation strategy of any modern industrial solution
Frameworks: COBRA
[Same frameworks diagram, highlighting COBRA]
COBRA: Main Components
• Push- and pull-mode execution, and any mixture of the two
  • Reconstruction-on-demand is a key concept in COBRA
  • Detector-centric reconstruction: push data from the event
  • Reconstruction-unit-centric reconstruction: pull/create data as needed
• Event data and related structures
  • Basic support for commonly needed objects (hits, digis, containers, …)
• Application environments
  • Basic application frameworks, various semi-specialised applications
  • Extensive error-handling and recovery code (automatic recovery after a crash, …)
• Meta data: a key component
  • Data chunking, system and user collections, data streams, file management, job concepts, configuration and setup records, redirected navigation after reprocessing, …
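The pull-mode, reconstruction-on-demand idea can be illustrated with a minimal sketch (all names here are hypothetical, not COBRA's actual API): a product is computed by its registered producer only when first requested, then cached, so downstream algorithms pull data as needed.

```python
# Minimal sketch of reconstruction-on-demand: producers are registered per
# product name; a product is computed lazily on first access and cached.
class EventStore:
    def __init__(self):
        self._producers = {}   # product name -> producer callable
        self._cache = {}       # product name -> computed value

    def register(self, name, producer):
        self._producers[name] = producer

    def get(self, name):
        if name not in self._cache:                   # compute on demand
            self._cache[name] = self._producers[name](self)
        return self._cache[name]

store = EventStore()
store.register("hits", lambda ev: [1.2, 3.4])             # raw data producer
store.register("tracks", lambda ev: len(ev.get("hits")))  # pulls hits on demand

print(store.get("tracks"))  # -> 2
```

Asking for "tracks" transparently triggers production of "hits" first; nothing is computed for products nobody requests.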
COBRA: Main Strengths
• Algorithms in plug-ins
  • "Publish-yourself" plug-ins: self-describing data producers
• Strong meta-data facilities
  • Reconstruction-on-demand matches the data-product concept very well
  • The Grid virtual data products concept is really just an extension
  • Convenient mapping of data products to chunks: files, containers, …
• Scatter/gather: decompose jobs, gather data
  • One logical job can be chopped into many physical processes; we still know it is logically the same job no matter which process it runs in
• Adapts automatically to many environments without special configuration: interactive, batch, farm, stand-alone, trigger, …
  • Through appropriate use of enabling techniques (transactions, locking, refs)
  • No data post-processing required
• Well matched to the production tools (IMPALA)
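A hedged sketch of the scatter/gather point above (function and field names are illustrative, not COBRA's): one logical job is split into per-chunk physical tasks that all carry the same logical job identity, and their partial results are gathered afterwards.

```python
# Scatter: split the event sample into chunks, one per physical process.
def scatter(events, chunk_size):
    return [events[i:i + chunk_size] for i in range(0, len(events), chunk_size)]

# Each physical process records which logical job it belongs to.
def run_chunk(job_id, chunk):
    return {"job": job_id, "count": len(chunk)}

# Gather: check all pieces belong to one logical job, then combine results.
def gather(results):
    assert len({r["job"] for r in results}) == 1
    return sum(r["count"] for r in results)

chunks = scatter(list(range(10)), 4)
total = gather([run_chunk("job-42", c) for c in chunks])
print(total)  # -> 10
```

The key property is that decomposition is an implementation detail: the gathered result is the same as if the one logical job had run in a single process.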
Object Access and Meta Data
[Diagram series: COBRA's object-access, meta-data and DDL source-processing layers built over the Objectivity services (storage manager, transaction manager, MSS/Grid/farm interface, catalog manager, schema manager, C++ binding, lock server, file I/O, page server). Successive slides highlight: queries, refs & navigation, cache management; collections, configurations (data sets), object naming, run resume & crash recovery; file size control, system management, farm management]
Frameworks: IGUANA
[Same frameworks diagram, highlighting IGUANA]
User Interface and Visualisation
• IGUANA: a generic toolkit for user interfaces and visualisation
  • Builds on existing high-quality libraries (Qt, OpenInventor, Anaphe, …)
  • Used to implement specific visualisation applications in other projects
• Main technical focus: provide a platform that makes it easy to integrate GUIs as a coherent whole, to provide application services, and to visualise any application object
  • Many categories/layers: GUI gadgets & support, application environment, data visualisers, data representation methods, control panels, …
• Designed to integrate with and into other applications
• Virtually everything is in plug-ins (which can still be statically linked)
[Diagram: object factories with plug-in caches; attached and unattached plug-ins registered in a component database]
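The "virtually everything is in plug-ins" design can be sketched as a component factory (the class and category names below are invented for illustration): components register themselves under a category and name, and the application instantiates them by name without linking against them directly.

```python
# Illustrative plug-in factory: components self-register by (category, name),
# and the application creates them without compile-time knowledge of them.
class PluginFactory:
    _registry = {}

    @classmethod
    def register(cls, category, name, ctor):
        cls._registry[(category, name)] = ctor

    @classmethod
    def create(cls, category, name, *args):
        return cls._registry[(category, name)](*args)

# A component module registers itself on load ("publish yourself").
class TwigBrowser:
    def describe(self):
        return "3D twig browser"

PluginFactory.register("Browser", "Twig", TwigBrowser)

browser = PluginFactory.create("Browser", "Twig")
print(browser.describe())  # -> 3D twig browser
```

Because registration happens when the component's module is loaded, the same mechanism works whether the plug-in arrives as a dynamically loaded library or is statically linked in.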
Illustration: 3D Visualisation
[Screenshot: a QMainWindow browser site hosting QMDIShell browser sites with a 3D browser and a twig browser]
IGUANA GUI Integration
[Diagram: an integration action feeds results into the GUI to visualise results, modify objects and support further interaction]
Tomorrow and Beyond
• Leverage the current frameworks on the Grid
  • Many native COBRA concepts match the Grid well
  • (Virtual) data products ~ reconstruction-on-demand
  • Recording and matching configuration and setup information
  • Production interfaces: catalogs, redirection, MSS hooks
  • Scatter/gather job decomposition, production environment
• COBRA-based applications can be encapsulated for distributed analysis
• IGUANA already separates application objects, model and viewer
  • Many possibilities for introducing distributed links
• IGUANA+COBRA provides a platform for a coherent, well-integrated interface no matter where the code runs or where the data comes and goes
  • Both have plenty of knobs and hooks for integration
• Aim to adapt the existing software where possible
  • Adapt and work within CMS software (COBRA, ORCA, …) and existing analysis tools (ROOT, Lizard, …); don't replace them
Prototypes: Clarens Web Portals
• Grid-enabling the working environment for physicists' data analysis
• Communication with clients via the commodity XML-RPC protocol: implementation independence
• Server implemented in C++: access to the CMS OO analysis toolkit
• Server provides a remote API to Grid tools
  • The Virtual Data Toolkit: object collection access
  • Data movement between tier centres using GSI-FTP
  • CMS analysis software (ORCA/COBRA)
  • Security services provided by the Grid (GSI)
• No Globus needed on the client side, only a certificate
[Diagram: client ↔ http/https RPC ↔ Clarens web server ↔ services]
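The Clarens idea can be shown in miniature with Python's standard XML-RPC modules: a server exposes an analysis API over plain XML-RPC, and a client calls it with no Grid middleware installed locally. The method name `list_collections` is invented here for illustration; it is not part of the real Clarens API.

```python
# Sketch of an XML-RPC service and client on a local loopback socket.
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: register a function under a public RPC name.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda: ["tags", "aod"], "list_collections")
port = server.server_address[1]

# Serve exactly one request in a background thread.
threading.Thread(target=server.handle_request, daemon=True).start()

# Client side: an ordinary HTTP client, no special middleware required.
client = ServerProxy(f"http://127.0.0.1:{port}")
result = client.list_collections()
print(result)  # -> ['tags', 'aod']
```

Because XML-RPC rides on plain HTTP with a language-neutral encoding, the same call could equally come from a C++, Perl or browser-based client, which is exactly the implementation independence the slide claims.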
Prototypes: Clarens Web Portals…
[Diagram: the user's web browser and local analysis tool (Lizard/ROOT/… with a tool plug-in module and local disk) at Tier 3/4/5 send physics queries to query and data-extraction web services backed by RDBMS-based data warehouses at Tier 1/2; TAG and AOD extraction/conversion/transport services, ORCA analysis farms (or a distributed 'farm' using Grid queues) and PIAF/Proof-type analysis farms connect these to the Tier 0/1/2 production system and data repositories; production data flows inward, TAGs/AODs flow out to the user]
Other Prototypes
• Tag database optimisation
  • Fast sample selection is crucial
  • Various models already tried
  • Experimenting with an RDBMS
• MOP: distributed job submission system
  • Allows CMS production jobs to be submitted from a central location, run at remote locations, and return their results
  • Job specification: IMPALA
  • Replication: GDMP
  • Globus GRAM
  • Job scheduling: Condor-G and local systems