Data Analysis: Algorithms & Methods
Highlights
Vincenzo Innocente (CERN-CMS)
Ed Frank (Univ. of Pennsylvania - BaBar)
Contributions
• General Architecture: 12
• Foundation Libraries: 3
• Detector reconstruction (all but one: tracking!)
  • Focus on Program Structure: 7
  • Strictly Algorithms: 3
• Simulation: 8
• Detector description: 4
ORCA Software & Architecture
• When the project started, most people were worried about how to bring the physicists on board, develop the sub-detector software, etc.
  • Important, and the major emphasis of the last year, but actually less critical in the long term
• Engineering of the architecture, and crucially the data-handling issues, are really the critical items
  • Tracking algorithms can, and will, be rewritten many times. But having an architecture that allows and keeps track of plug-and-play is vital.
  • Even now we face very large datasets (multi-TB). Production, automation, mirroring and evolution are (some of) the hard issues.
Reconstruction is much more than the reconstruction code
Offline Architecture: New Requirements
• Bigger experiment, higher rate, more data
• Larger and dispersed user community performing non-trivial queries against a large event store
• Make best use of new IT technologies
• Increased demand for both flexibility and coherence
  • ability to plug in new algorithms
  • ability to run the same algorithms in multiple environments
  • guarantees of quality and reproducibility
  • high-performance user-friendliness
CMS (offline) Software
[Diagram: Quasi-online Reconstruction, Online Monitoring, Event Filter / Objectivity Formatter, Data Quality, Calibrations, Group Analysis, User Analysis (on demand), Simulation (G3 and/or G4), Environmental data and Slow Control, all connected through a Persistent Object Store Manager on top of an Object Database Management System. Arrow labels: "Request part of event", "Store rec-Obj", "Store rec-Obj and calibrations", "store".]
March 2000 HLT Production Plans
• 2M events reconstructed with ORCA, with high-luminosity pile-up
• 2-4 terabytes in Objectivity/DB
• 400 CPU-weeks
• ~6 Production Units
  • ~1-2 Production Units off the CERN site
• Copy of all data at CERN in HPSS; use of the IT/ASD AMS back-end to stage data to ~1 TB of disk pools
• Mirroring of data to a few off-site centers, including trans-Atlantic
Users want (need!) now what they were promised for 2005.
Offline Architecture: Solution
• One coherent architecture from online event filtering to final physics analysis
• Clear definition of Clients’ and Services’ interfaces and roles
• Framework which orchestrates instances of all these modules
• Set of common foundation libraries
Software Structure
• Applications: implementing the physics algorithms (Triggers, Reconstruction, Simulation, Analysis)
• Frameworks: one main framework, GAUDI; various specialised frameworks: visualisation, persistency, interactivity, simulation (Geant4), etc.
• Toolkits
• Foundation Libraries: basic libraries (STL, CLHEP, etc.), the "vocabulary"
DØ C++ Framework
• Set of well-established interfaces from which reconstruction and analysis algorithms are built.
• Propagates events through a set of algorithms in a well-defined and established manner.
• The set of algorithms and their configuration are determined at program execution time.
• The framework hides many system-related complexities from the user and the algorithm developer, and allows code to be shared for common or related tasks.
Offline Architecture: Enabling Technologies
• C++ & OO
• Run-Time Dynamic Loading
• Event-Driven Notification
• State Machines
• Persistent Object Store
• Database Technologies
• Networked Client-Server Architectures
• Layered Architecture to shield the user from the above!
CLEO III: Dynamic Loading vs. Static Linking
• Both equally well supported; can mix.
• Static linking required for reconstruction jobs
  • need a stable environment for long periods of time
• Dynamic linking/loading for rapid code development
  • Fast turn-around time needed
  • Cuts link times from hours/minutes to minutes/seconds
• Limit the number of libraries to link to:
  • Proper layering of code
  • Separation of data types from the algorithms that supply them: why would I have to link to a tracker to access tracks?
  • No direct links between objects reduces the number of libraries to link to
    • instead we use index-list objects (“Lattice”)
• Run-time cost of resolving symbols is low!
CMS Conclusions
• An “implicit invocation” architecture is a flexible software solution which can scale with the complexity of the CMS project.
• An ODBMS, integrated into the framework,
  • provides coherent management of persistent objects
  • coupled with run-time dynamic loading, allows an application to be configured automatically
• The framework can effectively shield physics modules from the underlying technology without penalizing performance
Component-based Architecture: NOVA
Offline Architecture: Commonalities and Differences
• Event Data Reduction
  • Externally: Pipes & Filters
  • Internally: Blackboard
  • CMS: Action on Demand
• External Services (geometry, run conditions, etc.)
  • Mainly procedural
  • CMS and DØ: “Event” Notification (implicit invocation)
[Diagram: the Emc Clustering and Track Associator modules, with collections labelled "Lots of EmcDigis", "Lots of EmcClusters", "Lots of RecoTracks" and "Lots of Associations"]
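The “implicit invocation” style named on this slide can be sketched as a small observer pattern: modules register an interest, and the arrival of an event invokes them without the caller naming them. This is an illustration only, not CMS or DØ code; `Event`, `Dispatcher`, `subscribe` and `notify` are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event type: only an identifier, for illustration.
struct Event { int id; };

// Modules register callbacks; the dispatcher never names a module explicitly.
class Dispatcher {
public:
    using Callback = std::function<void(const Event&)>;

    void subscribe(Callback cb) { observers_.push_back(std::move(cb)); }

    // A new event implicitly invokes every registered observer,
    // in registration order.
    void notify(const Event& ev) const {
        for (const auto& cb : observers_) cb(ev);
    }

private:
    std::vector<Callback> observers_;
};

// Registers two observers and returns the log they write for one event.
std::vector<std::string> demoImplicitInvocation() {
    Dispatcher dispatcher;
    std::vector<std::string> log;
    dispatcher.subscribe([&log](const Event& ev) {
        log.push_back("tracker saw event " + std::to_string(ev.id));
    });
    dispatcher.subscribe([&log](const Event& ev) {
        log.push_back("calorimeter saw event " + std::to_string(ev.id));
    });
    dispatcher.notify(Event{7});
    return log;
}
```

Adding a third module here means one more `subscribe` call; the code that calls `notify` does not change.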
Offline Architecture: Commonalities and Differences
• Distinction among data, detector and algorithms
  • Only BaBar makes no clear distinction
• Access to object collections by name
  • everybody uses named registries (flat or tree)
  • central component of Gaudi (LHCb) Services
• Persistency insulation layer:
  • Transient copy (managed by the framework)
  • direct smart pointer
Principal design choices
• Separation between “data” and “algorithms”
  • Data objects primarily carry data and have only basic methods
    • e.g. tracking hits
  • Algorithm objects primarily manipulate data
    • e.g. track fitter
• Three basic categories of data:
  • “event data” (obtained from particle collisions, real or simulated)
  • “detector data” (structure, geometry, calibration, alignment, ...)
  • “statistical data” (histograms, ...)
• Separation between “transient” and “persistent” data
  • Isolate user code from persistency technology.
  • Different optimisation criteria.
  • Transient as a bridge between independent representations.
Module, event and environment structure
• Modules provide the algorithms
  • Use existing information to create new objects
  • Styles range from procedural monoliths to OO castles
• Framework/AC++ provides control & config
  • Uses TCL scripting, command line
  • Production executables run 300 modules
• Objects have behaviors, not just values
  • “Networks of objects collaborate to provide semantics”
  • Internal form of our track objects is irrelevant
• Objects kept in event and environment
  • Named access in a flat space
    • event -> Ifd<EmcCluster>::get("MergedClusters")
  • Implemented via ProxyDict
  • Proxies provide complex access when needed
  • Ensures physical decoupling
[Diagram: the Emc Clustering and Track Associator modules, with collections labelled "Lots of EmcDigis", "Lots of EmcClusters", "Lots of RecoTracks" and "Lots of Associations"]
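Named access in a flat space, keyed by both type and string, can be sketched as follows. This is a minimal stand-in written for illustration and makes no claim about how ProxyDict or `Ifd<T>` are actually implemented; `NamedStore`, `put` and `get` are hypothetical names, and `EmcCluster` is reduced to a single field.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <utility>

// Stand-in for a reconstruction object; the real class is far richer.
struct EmcCluster { double energy; };

// Flat registry: an object is found by (type, name), so the client
// never links against the module that produced it.
class NamedStore {
public:
    template <typename T>
    void put(const std::string& name, std::shared_ptr<T> object) {
        Key key(std::type_index(typeid(T)), name);
        slots_[key] = std::move(object);
    }

    // Returns nullptr when nothing of type T is stored under `name`.
    template <typename T>
    std::shared_ptr<T> get(const std::string& name) const {
        Key key(std::type_index(typeid(T)), name);
        auto it = slots_.find(key);
        if (it == slots_.end()) return nullptr;
        return std::static_pointer_cast<T>(it->second);
    }

private:
    using Key = std::pair<std::type_index, std::string>;
    std::map<Key, std::shared_ptr<void>> slots_;
};
```

A client would then write `event.get<EmcCluster>("MergedClusters")` and check for a null result. Because the type is part of the key, asking for the same name with a different type finds nothing rather than returning a mis-cast object.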
Algorithms
[Diagram: logical and physical views of a parent algorithm running Algorithms A, B and C. A reads Data T1 and writes T2 and T3; B reads T2 and writes T4; C reads T3 and T4 and writes T5. All data pass through the transient data store.]
• An Algorithm knows only which data (type and name) it uses as input and produces as output.
• The only coupling between algorithms is via the data.
• The execution order of the sub-algorithms is the responsibility of the parent algorithm.
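The three rules on this slide can be sketched with a toy data store. The sketch assumes, for brevity, that every datum is a `double`; real frameworks store typed objects. `DataStore`, `Algorithm`, `runSequence` and the arithmetic inside each algorithm are hypothetical and are not taken from Gaudi.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy transient store: data name -> value.
using DataStore = std::map<std::string, double>;

// An algorithm declares only the names it reads and the names it writes.
struct Algorithm {
    std::string name;
    std::vector<std::string> inputs;
    std::vector<std::string> outputs;
    std::function<std::vector<double>(const std::vector<double>&)> compute;
};

// The parent algorithm owns the execution order. Returns false when a
// declared input is missing or an algorithm returns the wrong number of outputs.
bool runSequence(const std::vector<Algorithm>& sequence, DataStore& store) {
    for (const auto& alg : sequence) {
        std::vector<double> args;
        for (const auto& in : alg.inputs) {
            auto it = store.find(in);
            if (it == store.end()) return false;
            args.push_back(it->second);
        }
        const std::vector<double> results = alg.compute(args);
        if (results.size() != alg.outputs.size()) return false;
        for (std::size_t i = 0; i < results.size(); ++i) {
            store[alg.outputs[i]] = results[i];
        }
    }
    return true;
}

// Same data flow as the diagram: A: T1 -> T2, T3; B: T2 -> T4; C: T3, T4 -> T5.
std::vector<Algorithm> demoAlgorithms() {
    return {
        {"A", {"T1"}, {"T2", "T3"},
         [](const std::vector<double>& in) { return std::vector<double>{in[0] + 1.0, in[0] * 2.0}; }},
        {"B", {"T2"}, {"T4"},
         [](const std::vector<double>& in) { return std::vector<double>{2.0 * in[0]}; }},
        {"C", {"T3", "T4"}, {"T5"},
         [](const std::vector<double>& in) { return std::vector<double>{in[0] + in[1]}; }},
    };
}
```

None of A, B or C refers to another algorithm; swapping B for a different implementation that still writes `T4` leaves A and C untouched.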
Action on Demand
Compare the results of two different track reconstruction algorithms
[Diagram: an Event holding Hits, Rec Hits, tracks T1 and T2, and CaloCl; Detector Elements; reconstructors Rec T1, Rec T2 and Rec CaloCl; an Analysis that requests T1 and T2]
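One way to read “action on demand” is lazy, cached reconstruction: nothing runs until an analysis asks for a product, and shared inputs are built once. The sketch below assumes that reading and reduces every product to a `double`. `LazyEvent`, `declare`, `get` and the numbers in `declareDemoProducers` are hypothetical and are not taken from the CMS software.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

class LazyEvent {
public:
    using Producer = std::function<double(LazyEvent&)>;

    // Registers how to build a product; nothing is computed yet.
    void declare(const std::string& name, Producer producer) {
        producers_[name] = std::move(producer);
    }

    // Builds the product on first request and caches it afterwards.
    // Throws std::out_of_range if the name was never declared.
    double get(const std::string& name) {
        auto hit = cache_.find(name);
        if (hit != cache_.end()) return hit->second;
        ++runs_;
        const double value = producers_.at(name)(*this);
        cache_[name] = value;
        return value;
    }

    // Number of producers that have actually been started.
    int runs() const { return runs_; }

private:
    std::map<std::string, Producer> producers_;
    std::map<std::string, double> cache_;
    int runs_ = 0;
};

// Two track reconstructions share the same hits; the calorimeter
// clusters are declared but run only if somebody asks for them.
void declareDemoProducers(LazyEvent& ev) {
    ev.declare("RecHits", [](LazyEvent&) { return 10.0; });
    ev.declare("T1", [](LazyEvent& e) { return e.get("RecHits") * 0.5; });
    ev.declare("T2", [](LazyEvent& e) { return e.get("RecHits") * 0.25; });
    ev.declare("CaloCl", [](LazyEvent&) { return 99.0; });
}
```

An analysis comparing `T1` with `T2` starts three producers (`RecHits` once, then each tracker), and `CaloCl` is never run.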
StMaker: “regular” makers communication
[Diagram: StMaker instances, each with .maker, .data and .const branches, linked by AddData() and GetDataSet() calls. Call sequence: 1. Init(), 2. Make().]
ALICE's choice
• Migrate immediately to C++
  • Immediately abandon PAW
  • But accept GEANT3.21 (initially)
• Adopt the ROOT framework
  • Not worried about being dependent on ROOT
  • Much more worried about being dependent on G4, Objy, ...
• Allow use of FORTRAN and C++
  • Allow starting with wrapping and bad design
• Impose a single framework
  • Provide central support, documentation and distribution
  • Train users in the framework
Detector data stores
[Diagram: the Persistent Detector Store feeds the Transient Detector Store through the DetectorPersistency Service and the Detector DataService, with a Converter for each DetElement (DetElement1, DetElement2); the Geant4Service uses G4Converters to build the Geant4 Representation; an Algorithm reads the Detector Data Store]
The transient detector store contains a “snapshot” of the detector data valid for the currently processed event
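The “snapshot” idea can be sketched with a converter that turns persistent records into transient objects for one event. The sketch assumes that validity is expressed as an inclusive run range, which the slide does not state; `PersistentDetElement`, `TransientDetElement`, `convert`, `snapshotFor` and the `shiftMm` field are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical persistent record: one version of a detector element,
// valid for an inclusive range of runs.
struct PersistentDetElement {
    std::string name;
    int firstRun;
    int lastRun;
    double shiftMm;  // illustrative alignment constant
};

// Hypothetical transient form handed to algorithms: no validity bookkeeping.
struct TransientDetElement {
    std::string name;
    double shiftMm;
};

// Converter from the persistent to the transient representation.
TransientDetElement convert(const PersistentDetElement& record) {
    return TransientDetElement{record.name, record.shiftMm};
}

// Builds the transient snapshot valid for `run`. Elements without a valid
// record are omitted; if ranges overlap, the later record in `store` wins.
std::map<std::string, TransientDetElement> snapshotFor(
        const std::vector<PersistentDetElement>& store, int run) {
    std::map<std::string, TransientDetElement> snapshot;
    for (const auto& record : store) {
        if (run >= record.firstRun && run <= record.lastRun) {
            snapshot[record.name] = convert(record);
        }
    }
    return snapshot;
}
```

Algorithms that read the snapshot never see validity ranges or the storage technology, which is the insulation the converters are there to provide.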
Input: Why Use XML? (LCD, J. Bogart)
• For the first pass, LCD used an ad hoc file format and one-of-a-kind code for serial-only parsing of the detector geometry.
• XML is a standard meta-language for defining markup languages.
• Good free parsers exist; more tools are coming.
• XML languages are plain-text and self-documenting.
• The application interface to the data (the XML document) may be serial or random-access.
• Avoid growing private file formats or, worse, hard-coding parameters.
• Make it easy (well, easier) for several programs to use the same input.
Detector Description in XML (LCD, J. Bogart)
<lcdparm>
  <global file="largeParms2.xml" />
  <physical_detector topology="large" id="L2">
    <volume id="EM_BARREL">
      <tube>
        <barrel_dimensions inner_r="196.0" outer_z="322.0" />
        <layering n="40">
          <slice material="Pb" width="0.4" />
          <slice material="Tyvek" width="0.05" />
          <slice material="Polystyrene" width="0.1" sensitive="yes" />
        </layering>
        <segmentation cos_theta="300" phi="300" />
      </tube>
      <calorimeter type="em" />
    </volume>
    ...
Callouts on the slide: start of subdetector description; geometry, materials, function; end of subdetector description.
Track Reconstruction Framework: Motivation
• We cannot implement the optimal track reconstruction algorithm right away
  • There is probably no one optimal algorithm but several, each optimized for a specific task
  • We need a flexible framework for developing and evaluating algorithms
• The mathematical complexity of track finding/fitting often limits the number of developers
  • The algebra involved is often localized in a few places
  • If we could encapsulate that algebra in a few classes and separate it from the logic of the algorithm, it would make track finding easier for developers
Reconstruction Object Model (BaBar IFR)
• Objects encapsulate the behavior of:
  • reconstruction information (strip, hit, cluster, ...)
  • the detector model (sector, layer, ...)
  • algorithm strategies (clusterizer, ...)
  • etc.
[Diagram labels: strip; “hit”: 1D-cluster; pcluster; mcluster]
The BaBar Track Fit
• Written in OO C++
• Integrated with the BaBar software framework
• Exploits a novel formulation of the Kalman equations
  • Symmetric processing for both track directions
  • Processing in parameter and weight space
    • reduces the number of matrix inversions required
• Fit result is expressed as a piecewise helix
  • Joined helix segments describing the ‘most likely’ path through space
• Integrates support for other tracking operations
  • Pattern recognition
  • Alignment
• Used to fit >10^8 tracks in the commissioning run
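The slide does not give the formulation itself. As a rough illustration of why working in weight space can save inversions, the scalar textbook case is sketched below: two independent estimates of the same quantity (say, from a forward and a backward pass) combine by simple addition when stored as weights, and an inversion is needed only when a value is read out. This is not the BaBar code; `WeightedEstimate`, `fromValue`, `combine`, `value` and `variance` are hypothetical names, and in the real fit the scalars would be parameter vectors and weight matrices.

```cpp
#include <cassert>

// A scalar estimate stored in weight (information) form.
struct WeightedEstimate {
    double weight;         // 1 / variance
    double weightedValue;  // value / variance
};

// Converts a (value, variance) pair into weight form.
WeightedEstimate fromValue(double value, double variance) {
    return WeightedEstimate{1.0 / variance, value / variance};
}

// Independent estimates combine by plain addition in weight space;
// no inversion is needed at this step.
WeightedEstimate combine(const WeightedEstimate& a, const WeightedEstimate& b) {
    return WeightedEstimate{a.weight + b.weight, a.weightedValue + b.weightedValue};
}

// The only inversion happens when a value or variance is read out.
double value(const WeightedEstimate& e) { return e.weightedValue / e.weight; }
double variance(const WeightedEstimate& e) { return 1.0 / e.weight; }
```

For example, estimates of 1.0 and 3.0, each with variance 4.0, combine to 2.0 with variance 2.0.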
Effect Processing
Code Organization
KalStub: A Pattern Recognition Tool
Experience with software development (BaBar IFR)
• Inflexible design was spotted when problems repeatedly occurred in the same code areas when introducing changes
• Applying a more flexible design has usually improved software management
  • more effective development
  • problem isolation
• A concrete example: computation of the number of interaction lengths
  • Abstract base class for the cluster curve approximation
  • The path-length computation in the detector model was tested using a straight-line implementation of the curve approximation
  • A polynomial approximation from a fit in each view was implemented separately
  • The integration of the two pieces was immediately successful
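The concrete example above can be sketched as follows: the path-length code is written once against an abstract curve, tested with a straight line, and later handed a polynomial without change. This is an illustrative reconstruction, not BaBar IFR code. `CurveApprox`, `StraightLine`, `Polynomial` and `pathLength` are hypothetical names, and the chord-summing integration is an assumption made for the sketch.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Abstract curve approximation: transverse coordinate y at depth x.
class CurveApprox {
public:
    virtual ~CurveApprox() = default;
    virtual double y(double x) const = 0;
};

// Simplest implementation, enough to test the path-length code.
class StraightLine : public CurveApprox {
public:
    StraightLine(double intercept, double slope) : a_(intercept), b_(slope) {}
    double y(double x) const override { return a_ + b_ * x; }
private:
    double a_;
    double b_;
};

// Polynomial implementation developed separately; coeffs[k] multiplies x^k.
class Polynomial : public CurveApprox {
public:
    explicit Polynomial(std::vector<double> coeffs) : c_(std::move(coeffs)) {}
    double y(double x) const override {
        double sum = 0.0;
        // Horner evaluation, highest power first.
        for (auto it = c_.rbegin(); it != c_.rend(); ++it) sum = sum * x + *it;
        return sum;
    }
private:
    std::vector<double> c_;
};

// Path length between x0 and x1, approximated by summing `steps` chords.
// It depends only on the abstract interface.
double pathLength(const CurveApprox& curve, double x0, double x1, int steps) {
    const double dx = (x1 - x0) / steps;
    double length = 0.0;
    for (int i = 0; i < steps; ++i) {
        const double xa = x0 + i * dx;
        const double xb = xa + dx;
        length += std::hypot(xb - xa, curve.y(xb) - curve.y(xa));
    }
    return length;
}
```

A line of slope 0.75 from x = 0 to x = 4 has length 5, which gives a closed-form check for `pathLength` before any fitted polynomial is plugged in.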
Geant4 Capabilities
• Very powerful Geant4 kernel
  • tracking, stacks, geometry, hits, ...
• Extensive & transparent physics models
  • electromagnetic, hadronic, ...
  • extended energy range, new models
• Persistency, visualization, ...
• Surpasses Geant-3 in nearly every respect
X-Ray Surveys of Asteroids and Moons
• Induced X-ray line emission: indicator of target composition (~100 mm surface layer)
• C, N, O line emissions included
[Diagram: incident radiation (cosmic rays, Jovian electrons; solar X-rays, e, p) on the target; simulation codes Geant3.21, ITS3.0, EGS4 and Geant4. Image courtesy SOHO EIT.]
ESA Space Environment & Effects Analysis Section
Hadronic shower models in Geant4: a typical example of OO design
• Highly structured and layered object model (inheritance tree):
  • at each level, a given set of functionalities common to a given branch is made concrete
  • 1st level: calculation of cross-sections and final states for particles in flight and at rest in a medium
  • 5th level: implementation of the fragmentation function for string decay
• Results in a flexible framework for implementing new hadronic interaction models
Changing cuts
• Results very stable with variation of cuts
  • even track length
• Also see shower profiles for different cuts (next slide)
  • between 10mm and 50 microns
CMS Geometry Model using GEANT4
• Categories based on responsibilities
  • Geometry categories: CMS-specific, OSCAR (Geant4) & Persistent
  • Hits categories: CMS & OSCAR
  • User Interaction categories: User Actions, GUI
  • Utilities: Materials, Rotation Matrices
ATLAS Accordion Calorimeter
• G3: 0.5 megabytes, 10 seconds*SPECint95/GeV
• Static geometry
  • 110 megabytes of memory
  • CPU time is 9.5 seconds*SPECint95/GeV
• Parameterized geometry
  • 1500 seconds*SPECint95/GeV (1D voxelization)
• Tailored geometry (G4Accordeon)
  • 8 megabytes of memory
  • CPU time is 11.5 seconds*SPECint95/GeV
ATLAS Calorimeter
• The first results on EM shower simulations are close to test-beam and GEANT3 results, but more work is needed to understand the differences.
• GEANT4 performance comparable to that of GEANT3 can be achieved.
• The design of GEANT4 allows a user to extend GEANT4 functionality. This made it possible to implement the new idea of a “tailored” geometry description, which can be used for high-performance simulation of any calorimeter or other regular structure.
The Virtual MC
[Diagram: DetectorCode and AliRun talk to the abstract AliMC interface, which is implemented by TGeant3, TGeant4 and TFluka; the G3 geometry is turned into a G4 geometry through G3toG4]
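The idea behind the diagram, detector code written against one abstract Monte Carlo interface with the concrete engine chosen at run time, can be sketched as below. The sketch deliberately avoids the real class names: `VirtualEngine`, `Geant3Like`, `Geant4Like`, `describeRun` and `makeEngine` are hypothetical and do not reproduce the AliMC, TGeant3 or TGeant4 interfaces.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Abstract transport engine: the only type the detector code sees.
class VirtualEngine {
public:
    virtual ~VirtualEngine() = default;
    virtual std::string name() const = 0;
};

class Geant3Like : public VirtualEngine {
public:
    std::string name() const override { return "G3-like engine"; }
};

class Geant4Like : public VirtualEngine {
public:
    std::string name() const override { return "G4-like engine"; }
};

// Detector code: depends on the interface, not on any concrete engine.
std::string describeRun(const VirtualEngine& mc) {
    return "detector code stepping with " + mc.name();
}

// Run-time selection of the engine; returns nullptr for an unknown choice.
std::unique_ptr<VirtualEngine> makeEngine(const std::string& choice) {
    if (choice == "G3") return std::unique_ptr<VirtualEngine>(new Geant3Like());
    if (choice == "G4") return std::unique_ptr<VirtualEngine>(new Geant4Like());
    return nullptr;
}
```

Switching transport codes then comes down to the string passed to `makeEngine`; `describeRun` is compiled once and never changes.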
Tracking schema
[Diagram labels: FLUKA Step, GUSTEP and Geant4StepManager; AliRun::StepManager; Module Version StepManager; "Add the hit"; Disk I/O; Root; Inverse Framework plug-in]
StdHepC++
• There is a strong need for a standard C++ Monte Carlo generator interface.
• StdHepC++ is a natural object-oriented implementation of such an interface.
• At present we have working examples which integrate StdHepC++ with the Fortran versions of Herwig, Pythia and Isajet.
• On the other side, StdHepC++ provides event blocks readable by MCFast and Geant3, and will have an interface to Geant4.
LHC++: what it is (I)
• Modular replacement of the current CERNLIB for use in HEP experiments
  • memory management (C++)
  • persistency (“I/O”)
  • mathematical library
  • foundation classes
  • random number generators
  • histogramming
  • fitting
  • simulation
LHC++ Present configuration
• Object persistency
  • from the RD45 collaboration (Objectivity/DB)
• Foundation classes
  • HEP-specific foundation classes (CLHEP)
  • Random number generators (CLHEP)
• Mathematical library from NAG (NAG_C)
  • covers a broad range of functionality
  • extensions required by CERN will be added in the next release (Mark 6)
  • quality assurance
LHC++ Present configuration (cont.)
• Simulation: GEANT-4
  • worldwide collaboration
  • complete OO design
• Histogramming: HTL
• Fitting: Gemini, HepFitting packages
  • interface to any minimizer (at present: NAG, Minuit)
• Event generators
  • The Lund group has started Pythia-7 (C++)
  • StdHep++ is in the process of becoming part of CLHEP
LHC++ packages and dependencies