630 likes | 646 Views
Explore Anaphe components for interactive data analysis in C++ and Python with software quality control. Learn about the AIDA abstract interfaces for data analysis and how Anaphe aims to replace CERNLIB functionality for High Energy Physics experiments. Discover the architectural issues and architectural components used to maximize flexibility and re-use of existing packages. Stay informed about technology evolution and the challenges faced in handling massive data amounts in the LHC Computing Challenge.
E N D
AnapheOO Libraries for Data Analysis using C++ and Python Andreas Pfeiffer CERN IT/API andreas.pfeiffer@cern.ch Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Outline • Motivation • AIDA - Abstract Interfaces for Data Analysis • Anaphe Components • C++ • Lizard: Interactive Data Analysis • Python • Software quality control • Summary Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
LHC Computing Challenge • 4 experiments will create huge amount of data • >1 PetaByte/year for each experiment ! • 1015 Bytes • 1,000 TeraBytes • 20,000 Redwood tapes • 100,000 dual-sided DVD-RAM disks • 1,500,000 sets of the Encyclopaedia Britannica(w/o photos) • Need lots of CPU power to reconstruct/analyse • about 1000 PC boxes per experiment (2005 ones !) • 40.000 of today’s boxes (dual P-III 800 MHz) • complex data models • reconstruction s/w is also used for online filtering • needs high quality s/w in order not to waste beam time Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
SPS 1969 W and Z 1983 LEP 1989 LEP ends 2000 K&R C 1978 C++ 1985 Linux V 0.01 1991 Java 1995 Ethernet standard 1983 Intel Pentium 1992 Unix V6 first public version 1975 XML 1.0 1997 IBM PC 1981 WWW Lifetime of LHC software = 25 yrs Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Technology (R)Evolution • 10 yrs major cycle length (HW,SW,OS) • ~12 evolutionary changes in the market • 1 revolutionary change • towards greater diversity • don’t forget changes of requirements • Consequences • s/w written today most probably will be rewritten tomorrow • we must anticipate changes Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Anaphe: what it is • Modular (OO/C++) replacement of CERNLIB functionality for use in HEP experiments • memory management • I/O • foundation classes • histogramming • minimizing/fitting • visualization • interactive data analysis • Trying to use standards wherever possible • Trying to re-use existing class libraries • This talk will not cover detector simulation (GEANT-4) Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Anaphe Components Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
AIDA Abstract Interfaces for Data Analysis Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
The AIDA project • AIDA project (Abstract Interfaces for Data Analysis) was initiated at the HepVis’99 workshop in Orsay • Presently active mainly developers from existing packages • Tony Johnson (JAS) • Andreas Pfeiffer (Lizard/Anaphe) • Guy Barrand (OpenScientist ) • Mark Dönszelmann (Wired) • Developers from LHCb/Gaudi Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Abstract Interfaces • Abstract Interfaces • only pure virtual methods, inheritance only from other A.I. • components use other components onlythrough their A.I. • defines a kind of a “protocol” for a component • Maximize flexibility and re-use of packages • allow each component to develop independently • re-use of existing packages to implement components reduces start-up time significantly • De-couple implementation of a component from its use Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Architectural issue: Components (I) • Identify components by functionality • Define “protocol” using Abstract Interfaces • Emphasize separation of different aspects for each component • Example: Histogram • statistical entity (density distribution of a physics quantity) • view of a “collection of data points” (which can be a density distribution but also a detector efficiency curve) • command to manipulate/store/plot/fit/... • “User’s view” is different from “implementor’s (developer’s) view” • separate Abstract Interfaces for both aspects Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
User Code Histo-IF Fitter-IF Histo- Impl. 1 Fitter- Impl. X Fitter- Impl. Y Use of Components withAbstract Interfaces • User Code uses only Interface classes • IHistogram1D * hist = histoFactory-> create1D(‘track quality’, 100, 0., 10.) • Actual implementations are selected at run-time • loading of shared libraries • No change at all to user code but keep freedom to choose implementation Histo- Impl. 2 Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Across the languages • JAida : C++ access to Java libs • using C++ proxies implementing the C++ Abstract Interfaces to the Java interfaces C++UserCode AIDA-IF C++ JAida AIDA-IF Java Java Lib Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
XML standards • Started with 1D and 2D Histograms • aim: easy transfer between applications • Will extend to other data types • other histos, fits, ntuples, … • Comments/contributions welcome ! Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Anaphe components Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
‘Layered’ Approach • Basic functionalities (histograms, fitting, etc.) are available as individual C++ class libraries. • Easy replacing one part without throwing away everything • Objectivity/DB to provide persistence • HepODBMS library (“insulating layer”, “tags”) • Histogram library (HTL) • Fitting libraries (Gemini, HepFitting) • Graphics libraries (Qt, Qplotter) • Insulate components through Abstract Interfaces • “wrapper” layer to implement Interfaces in terms of existing libs • Apply s/w quality control tools • code checking, testing Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Anaphe Components: Overview Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Basic 3D Graphic Libraries • OpenGL(basic graphics) • De-facto industry standard for basic 3D graphics • Used in CAD/CAE, games, VR, medical imaging • OpenInventor(scene mgmt.) • OO 3D toolkit for graphics • Cubes, polygons, text, materials • Cameras, lights, picking • 3D viewers/editors,animation • Based on OpenGL/MesaGL Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
2D Graphics libraries • Qt • multi-platform C++ GUI toolkit • C++ class library, not wrapper around C libs • superset of Motif and MFC • available on Unix and MS Windows • no change for developer • commercial but with public domain version • www.troll.no • Qplotter • “add-on” functionality for HEP • “HIGZ/HPLOT” Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Mathematical Libraries • NAG (Numerical Algorithms Group) C Library • Covers a broad range of functionality • Linear algebra • differential equations • quadrature, etc. • Special functions of CERNLIB added to Mark-6 release • mostly for theory and accelerator • Quality assurance • extensive testing done by NAG • www.nag.com Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
CLHEP - foundation classes • HEP foundation class library • Random number generators • Physics vectors • 3- and 4- vectors • Geometry • Linear algebra • System of units • more packages recently added • will continue to evolve • wwwinfo.cern.ch/asd/lhc++/clhep/ Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Histograms: the HTL package • Histograms are the basic tool for physics analysis • Statistical information of density distributions • Histogram Template Library (HTL) • design based on C++ templates • Modular : separation between sampling and display • Extensible : open for user defined binning systems • Flexible: support transient/persistent at the same time • Open: large use of abstract interfaces • recent addition: 3D histograms Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Fitting and Minimization • Fittingand Minimization Library(FML) • common OO interface • NAG-C, MINUIT • based on Abstract Interfaces • IVector, IModelFunction, … • fitting as a special case of minimization • minimize “distance” between data and model • replacement for HepFitting (and Gemini) • Gemini • common interface to minimizer engine • very thin layer Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Opening bracket: Persistency Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Object persistencyTwo concepts: serial and page I/O • “Sequential access to objects” (streaming) • good in networking context or serial writes to file(s) • much like “good old Fortran” • often perceived to be “simpler” to implement (“<<“, “>>”) • “Navigational access to objects” (buffered) • I/O on demand for complex data models • location transparent (for user) access to object • typically by de-referencing of a smart pointer • optimized for (random) disk access (disks deliver pages) • sequential write to file(s) still ok • Both concepts need to take care about changes of the internal structure of the objects (schema evolution) Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Architectural Issue:Persistency (“Object-I/O”) • Brings a completely new quality into the design • Objects have now lifetime • don’t “delete” until you really are sure you want to • persistency is kind of “intended memory leak” • would like to see no difference between memory and disk • “Layout” of objects may change during (extended) life • “schema evolution” • additions/deletions of attributes • changes of inheritance relations Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Architectural Issue:Persistency (“Object-I/O”) (II) • Objects can be placed (“clustering”) • de-coupling of logical and physical view of data • Special care needed to ensure consistency in data set • avoid reading group of objects (tracks, events,...) for which writing/updating is not (yet) complete • clean up if only part of the objects are written • typically taken care of by using transactions • Complications possible in distributed computing • need to protect disk access now like memory access in past (“Segmentation violation”) Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Physical Model and Logical Model • Physical model may be changed to optimise performance • Existing applications continue to work transparently ! Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Object Model Thanks to Vincenzo Innocente (CMS) Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Physical clustering Thanks to Vincenzo Innocente (CMS) Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Closing bracket: Persistency Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
“Tags”, Ntuples and Events • Tags - a special kind of Ntuple • Always associated with an underlying persistent store • Tags may be used to store “ntuple-like” data • extracted from all over the event • minPt, maxEmiss, nJets, nMuon, trigger, … • Main use: speedup data selection for analysis … • Tag simplifies selection without loosing complexity • Events more complex than a tree structure (“CWN”) • lots of cross-references between classes, containers • Association from the Tag to the Event may be used to navigate to any other part of the Event • even from an interactive visualization program Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Anaphe components Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Anaphe Internals: (Abstract) Interfaces Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
AIDA compliance of Anaphe • Presently (Anaphe 3.x) only AIDA 1.0 compliant • Plan to implement AIDA 2.2 Interfaces by end 2001 (Anaphe 4.x) • initially as wrappers to existing interfaces/packages • Will maintain 3.x for some time • ensures stability for users • Development will concentrate on 4.x • while AIDA will evolve further • Similar timeschedule as JAS (Tony Johnson) • OpenScientist (Guy Barrand) already there Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Lizard: a tool for Interactive Data Analysis Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Interactive Data Analysis • Aim: “OO replacement for PAW” (at least) • analysis of “ntuple-like data” (“Tags”, “Ntuples”, …) • visualisation of data (Histograms, scatter-plot, “Vectors”) • fitting of histograms (and other data) • access to experiment specific data/code • Maximize flexibility and re-use • Foresee customization/integration • allow use from within experiment’s s/w • Plan for extensions • “code for now, design for the future” • Ensure maintainability • use of s/w quality control tools Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Scripting - why • Typical use of scripting is quite different from programming (reconstruction, analysis, ...) • history “go back to where I was before” • repetition/looping - with “modifiable parameters” • avoid “one size fits all” or “using power-tool as hammer” • rapid prototyping in “scripting language” • quick turn-around times • performance critical code in “core language” • exploit richer set of features/functionality (e.g. templates in C++) • scripting languages usually less susceptible to changes than “mainstream languages” • potentially longer lifes Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Python - why • Python - OO (scripting) language • no “strange $!%-variables” • sensitive to indentation • More easy for users • as Java • Lots of user supplied modules available and ready for use • scientific, numerics, graphics, GUI, network, OS, games, DBs, … • example: http://www.vex.net/parnassus/ • Parnassus Totals: 1173 items in 49 categories. • Also usable in Java (Jython) • used in JAS for scripting • minimize changes needed within AIDA compliant environments Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Python - how • SWIG to (semi-) automatically create connection to chosen scripting language • allows flexibility to choose amongst several scripting languages • Python, Perl, Tcl, Guile, Ruby, (Java) … • Very easy to use • swig -c++ -python -shadow -c myClass.h • create shared lib from myClass.cpp and myClass_wrap.c • start python and import myClass.h to use it • Very easy to extend • simply inherit from “swiggified” class in python • modifications can later be fed back into C++ • performance, type safety, special language features (templates), … Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
PAW -> Lizard translation • Ntuple projection Lizard • lizard --useHBook • :-) nt = ntm.findNtuple(“higgscand.hbk::cands”) • :-) nplot1D(nt, “mass”, “quality=5 && cut > 198”) • Ntuple projection PAW • pawX11 • paw> h/file 1 higgscand.hbk • paw> nt/pl 10.mass quality=5.and.cut>198 • Assuming file higgscand.hbk contains ntuple with number 10 and title cands Any valid C++ expression Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Lizard: History and Present Status • Started after CHEP-2000 • Full version out since June 2001 • “PAW like” analysis functionality plus: • on-demand loading of compiled code using shared libraries • gives full access to experiment’s analysis code and data • based on Abstract Interfaces • flexible and extensible • “License free” version since Sep. 2001 • HBook for RWNtuples and Histogram storage • Minuit as minimizer engine Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Users and Collaborations • AIDA spoken here! • IGUANA (CMS visualization) • GAUDI (LHCb/HARP) framework • ATHENA (Atlas) framework • Analyzer modules in Geant 4 • JAS • Open Scientist • …you? Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Software quality control Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Software quality control • Using tools for testing/checking has started • Insure++, CodeWizard • Package dependencies: Ignominy • Set of perl and shell scripts by Lassi Tuura (CMS) • Ignominy scans… • Make dependency data produced by the compilers (*.d files) • Source code for #includes (resolved against the ones actually seen) • Shared library dependencies (“ldd” output) • Defined and required symbols (“nm” output) • And maps… • Source code and binaries into packages • #include dependencies into package dependencies • Unresolved/defined symbols into package dependencies ignominy: dishonour, disgrace, shame; infamy; the condition of being in disgrace, etc.(Oxford English Dictionary) Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Ignominy Analysis of Anaphe • Distribution of tools and utilities for LHC era physics • Combination of commercial, free and HEP software • Claims to be a toolkit • Seems to live up to its toolkit claims • Good work on modularity • Clean design is evident in many places • Dependency diagrams often split naturally into functional units Thanks to Lassi Tuura (CMS) Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Package Metrics • Size = total amount of source code (not normalised across projects!) • ACD = average component dependency (~ libraries linked in) • CCD = sum of single-package component dependencies over whole release • Indicates testing/integration cost • NCCD = Measure of CCD compared to a balanced binary tree • A good toolkit’s NCCD will be close to 1.0 • < 1.0: structure is flatter than a binary tree (= independent packages) • > 1.0: structure is more strongly coupled (vertical or cyclic) • Aim: NCCD ~ 1 for given software/functionality Thanks to Lassi Tuura (CMS) Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Metrics: NCCD vs Cycles Includes Fortran ATLAS • NCCD (“spaghetti index”) 1.0: good toolkit < 1.0: indep. packages > 1.0: strongly-coupled ROOT ORCA G4 COBRA Anaphe IGUANA Toolkits & Frameworks Thanks to Lassi Tuura (CMS) Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Future enhancements • Access to otherimplementations of components • HBOOK CWNtuples • Reading of ROOT (> V3.0) files • similar to Tony Johnson’s (Java) RootIO package • AIDA Ntuple/Histo store • optimized for Ntuples, Histograms as (compressed) XML • Communication with Java tools/packages (JAS, Wired) • via AIDA • Adding other “scripting” languages • Perl , Tcl, cint ? Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch