Anaphe - OO Libraries for Data Analysis using C++ and Python
AIDA – Abstract Interfaces for Data Analysis
Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Outline
• Motivation
• Anaphe Components
• C++
• Lizard: Interactive Data Analysis
• Python
• Software quality control
• Summary
LHC Computing challenge
LHC & The Alps [figure: interaction points, ~100 m deep, 27 km circumference]
LHC Computing Challenge
• 4 experiments will create a huge amount of data
  • >1 PetaByte/year for each experiment!
    • 10^15 Bytes
    • 1,000 TeraBytes
    • 20,000 Redwood tapes
    • 100,000 dual-sided DVD-RAM disks
    • 1,500,000 sets of the Encyclopaedia Britannica (w/o photos)
• Need lots of CPU power to reconstruct/analyse
  • about 1000 PC boxes per experiment (2005 ones!)
  • 40,000 of today’s boxes (dual P-III 800 MHz)
• complex data models
• reconstruction s/w is also used for online filtering
  • needs high quality s/w in order not to waste beam time
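A quick check of the storage figures above. The per-medium capacities are only what the slide's own counts imply, assuming decimal units. They are not quoted from any vendor specification.

```python
# Sanity check of the slide's storage arithmetic (decimal units assumed).
PETABYTE = 10**15  # bytes
TERABYTE = 10**12
GIGABYTE = 10**9

yearly_volume = 1 * PETABYTE  # ">1 PetaByte/year", taken as exactly 1 PB

terabytes = yearly_volume // TERABYTE

# Capacity each medium would need for the quoted counts to hold 1 PB.
implied_tape_gb = yearly_volume / 20_000 / GIGABYTE
implied_dvd_gb = yearly_volume / 100_000 / GIGABYTE

print(f"{terabytes} TB per experiment per year")
print(f"implied tape capacity: {implied_tape_gb:.0f} GB")
print(f"implied DVD-RAM capacity: {implied_dvd_gb:.0f} GB")
```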
Lifetime of LHC software = 25 yrs [timeline figure]
• SPS 1969, W and Z 1983, LEP 1989, LEP ends 2000
• Unix V6 (first public version) 1975, K&R C 1978, IBM PC 1981, Ethernet standard 1983, C++ 1985, Linux V 0.01 1991, Intel Pentium 1992, Java 1995, XML 1.0 1997, WWW
Technology (R)Evolution
• 10 yrs major cycle length (HW, SW, OS)
  • ~12 evolutionary changes in the market
  • 1 revolutionary change
  • towards greater diversity
• don’t forget changes of requirements
• Consequences
  • s/w written today most probably will be rewritten tomorrow
  • we must anticipate changes
Anaphe: what it is
• Analysis for physics experiments
• Modular (OO/C++) replacement of CERNLIB functionality for use in HEP experiments
  • memory management
  • I/O
  • foundation classes
  • histogramming
  • minimizing/fitting
  • visualization
  • interactive data analysis
• Trying to use standards wherever possible
• Trying to re-use existing class libraries
Anaphe Components
AIDA: Abstract Interfaces for Data Analysis (next talk)
Anaphe components
‘Layered’ Approach
• Basic functionalities (histograms, fitting, etc.) are available as individual C++ class libraries.
  • Easy to replace one part without throwing away everything
  • Objectivity/DB to provide persistence
  • HepODBMS library (“insulating layer”, “tags”)
  • Histogram library (HTL)
  • Fitting libraries (Gemini, HepFitting)
  • Graphics libraries (Qt, Qplotter)
• Insulate components through Abstract Interfaces
  • “wrapper” layer to implement Interfaces in terms of existing libs
• Apply s/w quality control tools
  • code checking, testing
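A minimal sketch of the "wrapper layer" idea, written in Python for brevity although the libraries themselves are C++. The names `IHistogram1D`, `LegacyHisto` and `LegacyHistoWrapper` are hypothetical and do not reproduce the AIDA or HTL signatures.

```python
from abc import ABC, abstractmethod


class IHistogram1D(ABC):
    """Hypothetical abstract interface; not the actual AIDA signature."""

    @abstractmethod
    def fill(self, x, weight=1.0):
        ...

    @abstractmethod
    def entries(self):
        ...


class LegacyHisto:
    """Stand-in for an existing library with its own naming conventions."""

    def __init__(self):
        self.values = []

    def add_point(self, x, w):
        self.values.append((x, w))


class LegacyHistoWrapper(IHistogram1D):
    """Wrapper layer: implements the interface in terms of the existing lib."""

    def __init__(self):
        self._impl = LegacyHisto()

    def fill(self, x, weight=1.0):
        self._impl.add_point(x, weight)

    def entries(self):
        return len(self._impl.values)


def fill_all(histo, data):
    """Client code uses only the interface, so the implementation can be swapped."""
    for x in data:
        histo.fill(x)
    return histo.entries()


n_filled = fill_all(LegacyHistoWrapper(), [0.1, 0.5, 0.9])
print(n_filled)
```

Replacing `LegacyHisto` with another implementation would then require only a new wrapper, while `fill_all` stays unchanged.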
ANAPHE Components [architecture diagram]
• User Interface (using Abstract Types): Lizard, Interactive Commands, Python / SWIG
• Abstract types, AIDA (Abstract Interfaces for Data Analysis): Histograms, NTuples, Fitting, Plotting, VectorOfPoints, Functions, Analyzer
• Implementations (HEP-specific): HTL, Tags (HepODBMS), Gemini/HepFitting, Qplotter, VectorOfPoints
• non-HEP components: Objectivity/DB | HBook, NAG-C | Minuit, Qt (free edition)
• CLHEP: Class Libraries for HEP
Basic 3D Graphic Libraries
• OpenGL (basic graphics)
  • De-facto industry standard for basic 3D graphics
  • Used in CAD/CAE, games, VR, medical imaging
• OpenInventor (scene mgmt.)
  • OO 3D toolkit for graphics
  • Cubes, polygons, text, materials
  • Cameras, lights, picking
  • 3D viewers/editors, animation
  • Based on OpenGL/MesaGL
2D Graphics libraries
• Qt
  • multi-platform C++ GUI toolkit
  • C++ class library, not wrapper around C libs
  • superset of Motif and MFC
  • available on Unix and MS Windows
    • no change for developer
  • commercial but with public domain version
  • www.troll.no
• Qplotter
  • “add-on” functionality for HEP
  • “HIGZ/HPLOT”
Mathematical Libraries
• NAG (Numerical Algorithms Group) C Library
  • Covers a broad range of functionality
    • Linear algebra
    • differential equations
    • quadrature, etc.
  • Special functions of CERNLIB added to Mark-6 release
    • mostly for theory and accelerator
  • Quality assurance
    • extensive testing done by NAG
  • www.nag.com
CLHEP - foundation classes
• HEP foundation class library
  • Random number generators
  • Physics vectors
    • 3- and 4-vectors
  • Geometry
  • Linear algebra
  • System of units
  • more packages recently added
• will continue to evolve
• wwwinfo.cern.ch/asd/lhc++/clhep/
Histograms: the HTL package
• Histograms are the basic tool for physics analysis
  • Statistical information of density distributions
• Histogram Template Library (HTL)
  • design based on C++ templates
  • Modular: separation between sampling and display
  • Extensible: open for user defined binning systems
  • Flexible: support transient/persistent at the same time
  • Open: large use of abstract interfaces
  • recent addition: 3D histograms
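The "modular" and "extensible" points can be illustrated with a small sketch in which the binning is a separate, replaceable object and display is kept outside the histogram. This is an illustration in Python, not HTL's template-based C++ design, and all class names are hypothetical.

```python
import bisect


class EvenBinning:
    """Fixed-width bins on [low, high)."""

    def __init__(self, nbins, low, high):
        self.nbins, self.low, self.high = nbins, low, high

    def index(self, x):
        if not self.low <= x < self.high:
            return None
        i = int((x - self.low) / (self.high - self.low) * self.nbins)
        return min(i, self.nbins - 1)  # guard against rounding at the upper edge


class EdgeBinning:
    """User-defined binning from an increasing list of bin edges."""

    def __init__(self, edges):
        self.edges = list(edges)
        self.nbins = len(self.edges) - 1

    def index(self, x):
        if not self.edges[0] <= x < self.edges[-1]:
            return None
        return bisect.bisect_right(self.edges, x) - 1


class Histogram1D:
    """Sampling only: accumulates weights, knows nothing about display."""

    def __init__(self, binning):
        self.binning = binning
        self.counts = [0.0] * binning.nbins
        self.out_of_range = 0

    def fill(self, x, weight=1.0):
        i = self.binning.index(x)
        if i is None:
            self.out_of_range += 1
        else:
            self.counts[i] += weight


def render(histo):
    """Display lives outside the histogram class."""
    return [f"bin {i}: {'*' * int(c)}" for i, c in enumerate(histo.counts)]


even = Histogram1D(EvenBinning(4, 0.0, 1.0))
variable = Histogram1D(EdgeBinning([0, 1, 10, 100]))
for x in (0.1, 0.3, 0.6, 0.9, 1.5):
    even.fill(x)
for x in (0.5, 5, 50, 10):
    variable.fill(x)
print("\n".join(render(even)))
```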
Fitting and Minimization
• Fitting and Minimization Library (FML)
  • common OO interface
    • NAG-C, MINUIT
  • based on Abstract Interfaces
    • IVector, IModelFunction, …
  • fitting as a special case of minimization
    • minimize “distance” between data and model
  • replacement for HepFitting (and Gemini)
• Gemini
  • common interface to minimizer engine
  • very thin layer
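"Fitting as a special case of minimization" can be sketched as follows: a generic one-dimensional minimizer knows nothing about data, and the fit simply hands it a "distance" function. The golden-section search stands in for a real engine such as Minuit or NAG-C, and the data points are made up for illustration.

```python
import math


def golden_section_minimize(f, lo, hi, tol=1e-8):
    """Generic 1-D minimizer for a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b = d
        else:
            a = c
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2


def model(x, slope):
    """Model function: a straight line through the origin."""
    return slope * x


# Illustrative data, not taken from the talk.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]


def distance(slope):
    """Sum of squared residuals between data and model."""
    return sum((y - model(x, slope)) ** 2 for x, y in zip(xs, ys))


best_slope = golden_section_minimize(distance, 0.0, 5.0)
print(f"fitted slope: {best_slope:.4f}")
```

For this linear model the result can be cross-checked against the closed-form least-squares value, sum(x*y) / sum(x*x).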
Opening bracket: Persistency
Object persistency
Two concepts: serial and page I/O
• “Sequential access to objects” (streaming)
  • good in networking context or serial writes to file(s)
  • much like “good old Fortran”
  • often perceived to be “simpler” to implement (“<<”, “>>”)
• “Navigational access to objects” (buffered)
  • I/O on demand for complex data models
  • location transparent (for user) access to object
    • typically by de-referencing of a smart pointer
  • optimized for (random) disk access (disks deliver pages)
  • sequential write to file(s) still ok
• Both concepts need to take care of changes of the internal structure of the objects (schema evolution)
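The two access styles can be contrasted in a short sketch. `ObjectStore` and `Ref` are hypothetical stand-ins: an in-memory dictionary plays the role of the paged database, and `Ref.deref()` plays the role of dereferencing a smart pointer.

```python
import io
import pickle

# Sequential access: objects are streamed one after another and read back in order.
buffer = io.BytesIO()
for track in ({"id": 1, "pt": 12.5}, {"id": 2, "pt": 40.0}):
    pickle.dump(track, buffer)

buffer.seek(0)
streamed = []
while True:
    try:
        streamed.append(pickle.load(buffer))
    except EOFError:  # end of stream reached
        break


# Navigational access: an object is fetched only when a reference is followed.
class ObjectStore:
    """Hypothetical store; a dict stands in for pages on disk."""

    def __init__(self):
        self._pages = {}
        self.reads = 0

    def put(self, oid, obj):
        self._pages[oid] = pickle.dumps(obj)

    def get(self, oid):
        self.reads += 1
        return pickle.loads(self._pages[oid])


class Ref:
    """Smart-pointer stand-in: loads its target on first dereference."""

    def __init__(self, store, oid):
        self._store, self._oid, self._obj = store, oid, None

    def deref(self):
        if self._obj is None:
            self._obj = self._store.get(self._oid)
        return self._obj


store = ObjectStore()
store.put("track/2", {"id": 2, "pt": 40.0})
ref = Ref(store, "track/2")
reads_before = store.reads   # nothing has been read yet
pt = ref.deref()["pt"]       # first dereference triggers the I/O
ref.deref()                  # second dereference is served from memory
print(len(streamed), reads_before, store.reads, pt)
```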
Architectural Issue: Persistency (“Object-I/O”)
• Brings a completely new quality into the design
• Objects now have a lifetime
  • don’t “delete” until you really are sure you want to
  • persistency is a kind of “intended memory leak”
  • would like to see no difference between memory and disk
• “Layout” of objects may change during (extended) life
  • “schema evolution”
  • additions/deletions of attributes
  • changes of inheritance relations
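One common way to cope with an added attribute is to store a version number with each object and upgrade old records on read. The sketch below assumes a hypothetical `charge` attribute introduced in version 2, with an assumed default of 0. It uses JSON purely for illustration and does not reflect how Objectivity/DB handles schema evolution.

```python
import json

CURRENT_VERSION = 2  # version 2 adds the hypothetical "charge" attribute


def save_track(track):
    """Store the object together with the schema version it was written with."""
    return json.dumps({"_version": CURRENT_VERSION, **track})


def load_track(text):
    """Read an object, upgrading records written with an older layout."""
    state = json.loads(text)
    version = state.pop("_version", 1)  # records without a version are treated as v1
    if version < 2:
        state["charge"] = 0  # assumed default for the attribute added in v2
    return state


old_record = json.dumps({"pt": 12.5})  # written before "charge" existed
upgraded = load_track(old_record)
round_trip = load_track(save_track({"pt": 3.0, "charge": -1}))
print(upgraded, round_trip)
```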
Architectural Issue: Persistency (“Object-I/O”) (II)
• Objects can be placed (“clustering”)
  • de-coupling of logical and physical view of data
• Special care needed to ensure consistency in data set
  • avoid reading group of objects (tracks, events,...) for which writing/updating is not (yet) complete
  • clean up if only part of the objects are written
  • typically taken care of by using transactions
• Complications possible in distributed computing
  • need to protect disk access now like memory access in past (“Segmentation violation”)
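The role of transactions can be shown with the standard-library `sqlite3` module standing in for the object database. The table layout and the simulated failure are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (event_id INTEGER, pt REAL)")
conn.commit()


def write_event(conn, event_id, pts, fail_after=None):
    """Write all tracks of one event, or none of them."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for i, pt in enumerate(pts):
                if fail_after is not None and i == fail_after:
                    raise IOError("simulated crash while writing")
                conn.execute("INSERT INTO tracks VALUES (?, ?)", (event_id, pt))
    except IOError:
        return False
    return True


ok_first = write_event(conn, 1, [10.0, 20.0])
ok_second = write_event(conn, 2, [5.0, 6.0, 7.0], fail_after=2)

rows = conn.execute(
    "SELECT event_id, COUNT(*) FROM tracks GROUP BY event_id"
).fetchall()
print(ok_first, ok_second, rows)
```

The partially written second event leaves no trace, so a reader never sees an incomplete group of objects.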
Physical Model and Logical Model
• Physical model may be changed to optimise performance
• Existing applications continue to work transparently!
Object Model [figure]. Thanks to Vincenzo Innocente (CMS)
Physical clustering [figure]. Thanks to Vincenzo Innocente (CMS)
Closing bracket: Persistency
“Tags”, Ntuples and Events
• Tags - a special kind of Ntuple
  • Always associated with an underlying persistent store
• Tags may be used to store “ntuple-like” data
  • extracted from all over the event
  • minPt, maxEmiss, nJets, nMuon, trigger, …
• Main use: speedup data selection for analysis …
• Tag simplifies selection without losing complexity
  • Events more complex than a tree structure (“CWN”)
  • lots of cross-references between classes, containers
• Association from the Tag to the Event may be used to navigate to any other part of the Event
  • even from an interactive visualization program
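A sketch of the tag idea: selection runs over small flat records, and only the selected tags are followed back to the full events. The `Tag` and `Event` classes, their fields and the cut values are hypothetical and not the HepODBMS API.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Full event: potentially large, with many cross-references."""
    event_id: int
    tracks: list


@dataclass
class Tag:
    """Small ntuple-like summary with an association to its event."""
    n_jets: int
    max_pt: float
    event_id: int


# Illustrative data only.
events = {
    1: Event(1, [12.0, 33.5]),
    2: Event(2, [80.2, 61.0, 15.5]),
    3: Event(3, [55.0]),
}
tags = [
    Tag(n_jets=1, max_pt=33.5, event_id=1),
    Tag(n_jets=3, max_pt=80.2, event_id=2),
    Tag(n_jets=1, max_pt=55.0, event_id=3),
]

# Fast selection on the tags alone; no event is touched here.
selected = [t for t in tags if t.n_jets >= 2 and t.max_pt > 50.0]

# Navigate from each selected tag to its full event.
selected_events = [events[t.event_id] for t in selected]
print([e.event_id for e in selected_events])
```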
Anaphe components
Anaphe Internals: (Abstract) Interfaces
AIDA compliance of Anaphe
• Presently (Anaphe 3.x) only AIDA 1.0 compliant
• Plan to implement AIDA 2.2 Interfaces by end 2001 (Anaphe 4.x)
  • initially as wrappers to existing interfaces/packages
• Will maintain 3.x for some time
  • ensures stability for users
• Development will concentrate on 4.x
  • while AIDA will evolve further
• Similar time schedule as JAS (Tony Johnson)
  • OpenScientist (Guy Barrand) already there
Lizard: a tool for Interactive Data Analysis
Interactive Data Analysis
• Aim: “OO replacement for PAW” (at least)
  • analysis of “ntuple-like data” (“Tags”, “Ntuples”, …)
  • visualisation of data (Histograms, scatter-plot, “Vectors”)
  • fitting of histograms (and other data)
  • access to experiment specific data/code
• Maximize flexibility and re-use
• Foresee customization/integration
  • allow use from within experiment’s s/w
• Plan for extensions
  • “code for now, design for the future”
• Ensure maintainability
  • use of s/w quality control tools
Scripting - why
• Typical use of scripting is quite different from programming (reconstruction, analysis, ...)
  • history: “go back to where I was before”
  • repetition/looping - with “modifiable parameters”
• avoid “one size fits all” or “using power-tool as hammer”
  • rapid prototyping in “scripting language”
    • quick turn-around times
  • performance critical code in “core language”
    • exploit richer set of features/functionality (e.g. templates in C++)
• scripting languages usually less susceptible to changes than “mainstream languages”
  • potentially longer lifetimes
Python - why
• Python - OO (scripting) language
  • no “strange $!%-variables”
  • sensitive to indentation
• Easier for users
  • as Java
• Lots of user supplied modules available and ready for use
  • scientific, numerics, graphics, GUI, network, OS, games, DBs, …
  • example: http://www.vex.net/parnassus/
  • Parnassus Totals: 1173 items in 49 categories.
• Also usable in Java (Jython)
  • used in JAS for scripting
  • minimize changes needed within AIDA compliant environments
Python - how
• SWIG to (semi-) automatically create connection to chosen scripting language
  • allows flexibility to choose amongst several scripting languages
  • Python, Perl, Tcl, Guile, Ruby, (Java) …
• Very easy to use
  • swig -c++ -python -shadow -c myClass.h
  • create shared lib from myClass.cpp and myClass_wrap.c
  • start python and import the myClass module to use it
• Very easy to extend
  • simply inherit from “swiggified” class in python
  • modifications can later be fed back into C++
    • performance, type safety, special language features (templates), …
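The "inherit from a swiggified class" step might look like the sketch below. No compiled module is available here, so `MyClass` is a pure-Python stand-in for the shadow class that SWIG would generate. Its `scale` attribute and `transform` method are hypothetical.

```python
# Stand-in for the shadow class that `import myClass` would provide.
# In real use this class would forward its calls to the compiled C++ code.
class MyClass:
    def __init__(self, scale):
        self.scale = scale

    def transform(self, x):
        return self.scale * x


class MyOffsetClass(MyClass):
    """Prototype of an extension in Python before feeding it back into C++."""

    def __init__(self, scale, offset):
        super().__init__(scale)
        self.offset = offset

    def transform(self, x):
        # Re-use the wrapped behaviour and add the new feature on top.
        return super().transform(x) + self.offset


result = MyOffsetClass(2.0, 1.0).transform(3.0)
print(result)
```

Once the prototype has settled, the same logic can be moved into the C++ class for performance and type safety.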
PAW -> Lizard translation
• Ntuple projection, Lizard
  • lizard --useHBook
  • :-) nt = ntm.findNtuple("higgscand.hbk::cands")
  • :-) nplot1D(nt, "mass", "quality=5 && cut > 198")
  • the cut string may be any valid C++ expression
• Ntuple projection, PAW
  • pawX11
  • paw> h/file 1 higgscand.hbk
  • paw> nt/pl 10.mass quality=5.and.cut>198
• Assuming file higgscand.hbk contains ntuple with number 10 and title cands
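What an ntuple projection with a cut does can be sketched in plain Python. The rows are invented, `project` is a hypothetical helper rather than a Lizard command, and the cut is read as an equality test on `quality`, as in the PAW line.

```python
# Illustrative ntuple: one dict per row, columns as in the example above.
ntuple = [
    {"mass": 114.2, "quality": 5, "cut": 201.0},
    {"mass": 91.0, "quality": 5, "cut": 150.0},
    {"mass": 115.8, "quality": 3, "cut": 205.0},
    {"mass": 113.5, "quality": 5, "cut": 199.5},
]


def project(rows, column, predicate):
    """Return the values of one column for all rows passing the cut."""
    return [row[column] for row in rows if predicate(row)]


masses = project(
    ntuple, "mass", lambda row: row["quality"] == 5 and row["cut"] > 198
)
print(masses)
```

The resulting list is what would be filled into a 1D histogram for plotting.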
Tutorials and Examples available
Users and Collaborations
• AIDA spoken here!
  • IGUANA (CMS visualization)
  • GAUDI (LHCb/HARP) framework
  • ATHENA (Atlas) framework
  • Analyzer modules in Geant 4
  • JAS
  • Open Scientist
  • …you?
Software quality control
Software quality control
• Using tools for testing/checking has started
  • Insure++, CodeWizard
• Package dependencies: Ignominy
  • Set of perl and shell scripts by Lassi Tuura (CMS)
• Ignominy scans…
  • Make dependency data produced by the compilers (*.d files)
  • Source code for #includes (resolved against the ones actually seen)
  • Shared library dependencies (“ldd” output)
  • Defined and required symbols (“nm” output)
• And maps…
  • Source code and binaries into packages
  • #include dependencies into package dependencies
  • Unresolved/defined symbols into package dependencies
ignominy: dishonour, disgrace, shame; infamy; the condition of being in disgrace, etc. (Oxford English Dictionary)
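One of the steps listed above, mapping #include dependencies into package dependencies, can be sketched as follows. This is not Ignominy's code (which is perl and shell). The file names, the convention that the first path component is the package name, and the in-memory sources are all assumptions.

```python
import re
from collections import defaultdict

INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)

# Illustrative sources: path -> file contents.
sources = {
    "Histo/Histo1D.cpp": '#include "Histo/Histo1D.h"\n#include "Utils/Math.h"\n#include <vector>\n',
    "Fit/Fitter.cpp": '#include "Fit/Fitter.h"\n#include "Histo/Histo1D.h"\n',
    "Utils/Math.cpp": '#include "Utils/Math.h"\n',
}


def package_of(path):
    """Assumed convention: the first path component names the package."""
    return path.split("/")[0] if "/" in path else None


def package_dependencies(sources):
    """Map file-level #include edges onto package-level dependencies."""
    known = {package_of(path) for path in sources}
    deps = defaultdict(set)
    for path, text in sources.items():
        pkg = package_of(path)
        deps[pkg]  # make sure packages without dependencies still appear
        for header in INCLUDE_RE.findall(text):
            target = package_of(header)
            if target in known and target != pkg:
                deps[pkg].add(target)
    return {pkg: sorted(targets) for pkg, targets in deps.items()}


dependencies = package_dependencies(sources)
print(dependencies)
```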
Ignominy Analysis of Anaphe
• Distribution of tools and utilities for LHC era physics
  • Combination of commercial, free and HEP software
• Claims to be a toolkit
  • Seems to live up to its toolkit claims
• Good work on modularity
  • Clean design is evident in many places
  • Dependency diagrams often split naturally into functional units
Thanks to Lassi Tuura (CMS)
Package Metrics
• Size = total amount of source code (not normalised across projects!)
• ACD = average component dependency (~ libraries linked in)
• CCD = sum of single-package component dependencies over whole release
  • Indicates testing/integration cost
• NCCD = Measure of CCD compared to a balanced binary tree
  • A good toolkit’s NCCD will be close to 1.0
  • < 1.0: structure is flatter than a binary tree (= independent packages)
  • > 1.0: structure is more strongly coupled (vertical or cyclic)
• Aim: NCCD ~ 1 for given software/functionality
Thanks to Lassi Tuura (CMS)
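The metrics can be computed on a toy dependency graph as sketched below. The sketch assumes that each package's dependency count is its transitive closure including itself, and that the reference CCD of a balanced binary tree with N packages is (N+1)*log2(N+1) - N. Whether Ignominy uses exactly these conventions is not stated here.

```python
import math


def reachable(deps, start):
    """Packages needed to use `start`, including itself (transitive closure)."""
    seen, stack = set(), [start]
    while stack:
        pkg = stack.pop()
        if pkg not in seen:
            seen.add(pkg)
            stack.extend(deps.get(pkg, ()))
    return seen


def ccd(deps):
    """Cumulative component dependency: sum of per-package dependency counts."""
    return sum(len(reachable(deps, pkg)) for pkg in deps)


def acd(deps):
    """Average component dependency."""
    return ccd(deps) / len(deps)


def nccd(deps):
    """CCD normalised to that of a balanced binary tree of the same size."""
    n = len(deps)
    balanced_tree_ccd = (n + 1) * math.log2(n + 1) - n
    return ccd(deps) / balanced_tree_ccd


# Toy graphs with three packages each: package -> direct dependencies.
tree = {"A": ["B", "C"], "B": [], "C": []}
chain = {"A": ["B"], "B": ["C"], "C": []}
flat = {"A": [], "B": [], "C": []}
cycle = {"A": ["B"], "B": ["C"], "C": ["A"]}

for name, graph in [("tree", tree), ("chain", chain), ("flat", flat), ("cycle", cycle)]:
    print(f"{name}: CCD={ccd(graph)} ACD={acd(graph):.2f} NCCD={nccd(graph):.2f}")
```

The flat graph lands below 1.0, while the vertical chain and the cycle land above it, matching the reading of NCCD given on the slide.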
Metrics: NCCD vs Cycles
[scatter plot, Toolkits & Frameworks; point labels: ATLAS, ROOT, ORCA, G4, COBRA, Anaphe, IGUANA; annotation: Includes Fortran]
• NCCD (“spaghetti index”)
  • 1.0: good toolkit
  • < 1.0: indep. packages
  • > 1.0: strongly-coupled
Thanks to Lassi Tuura (CMS)
History
• Started after CHEP-2000
• Full version out since June 2001
• Established functionality exceeding PAW
  • Analyzer component giving direct access to data and libraries of the experiment framework
  • Based on Abstract Interfaces
    • Flexible and extensible
• Established parallel development of “license free” version while re-using existing libraries
  • Direct reading/writing of HBook files as an alternative to Objectivity/DB based persistency
  • Use of Minuit as a replacement for the minimizer of NAG-C
Ongoing activities
• Persistency
  • De-emphasize Objectivity/DB (in coordination with experiments, IT/DB and LCG)
    • Use of HBook ntuples
    • Text files (using AIDA defined XML format)
  • Planning to use LCG persistency (POOL)
  • Investigating direct reading of ROOT files
• Fitting
  • Implementing minimizer from GSL
• Discussing with the IGUANA team (CMS) to integrate their GUI components
• Looking forward to confirmation and/or re-direction of our efforts following the SC2 (RTAGs)
Future enhancements
• Access to other implementations of components
  • HBOOK CWNtuples
• Communication with Java tools/packages (JAS, Wired)
  • via AIDA
• Reading of ROOT (> V3.0) files
  • similar to Tony Johnson’s (Java) RootIO package
  • depends on “stability” of Root file format
• AIDA Ntuple/Histo store
  • optimized for Ntuples, Histograms as (compressed) XML
• Adding other “scripting” languages
  • Perl, Tcl, cint ?
Challenge: Distributed Computing
• Motivation
  • move code to data
  • parallel analysis
• Techniques
  • services via Abstract Interfaces (AI)
  • late binding
  • plug-in architecture
• End-user (Lizard)
  • look-and-feel of local analysis
• R&D started and first prototype available soon
  • CORBA based
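Late binding in a plug-in architecture can be sketched with Python's `importlib`: the implementation behind a service name is looked up only when the service is requested. The registry and the service names are hypothetical, the standard-library `statistics` functions are placeholders for real analysis services, and nothing here involves CORBA.

```python
import importlib

# Hypothetical registry: service name -> "module:attribute".
# The standard-library targets are placeholders for real plug-ins.
PLUGINS = {
    "mean": "statistics:mean",
    "median": "statistics:median",
}


def load_service(name):
    """Resolve a service at run time; nothing is imported until it is requested."""
    module_name, attribute = PLUGINS[name].split(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


sample = [1.0, 2.0, 3.0, 6.0]
mean_of_sample = load_service("mean")(sample)
median_of_sample = load_service("median")(sample)
print(mean_of_sample, median_of_sample)
```

Changing an entry in the registry swaps the implementation without touching the calling code.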
Summary
• The architecture of Anaphe shows some important items for flexible and modular data analysis:
  • weak coupling between components through use of Abstract Interfaces
  • basic functionality is covered by individual C++ class libraries
  • emphasis on usability and maintainability
• Major criteria are flexibility, extensibility and interoperability
  • Recent example: GEANT-4 examples (based on AIDA)
• Lizard is an Interactive Data Analysis Tool based on Anaphe components and the Python scripting language (through SWIG)
• Lizard is young but has a very solid base in mature Anaphe libraries
  • real plug-in structure
• Software quality control is important
  • tools help to optimize dependencies / minimize maintenance effort