
Anaphe OO Libraries for Data Analysis using C++ and Python



Presentation Transcript


  1. Anaphe: OO Libraries for Data Analysis using C++ and Python • AIDA: Abstract Interfaces for Data Analysis • Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch

  2. Anaphe: OO Libraries for Data Analysis using C++ and Python • Andreas Pfeiffer, CERN IT/API, andreas.pfeiffer@cern.ch

  3. Outline • Motivation • Anaphe Components • C++ • Lizard: Interactive Data Analysis • Python • Software quality control • Summary

  4. LHC Computing Challenge

  5. LHC & The Alps • Interaction points ~100 m deep • 27 km circumference

  6. LHC Computing Challenge • 4 experiments will create a huge amount of data • >1 PetaByte/year for each experiment! • 10^15 Bytes • 1,000 TeraBytes • 20,000 Redwood tapes • 100,000 dual-sided DVD-RAM disks • 1,500,000 sets of the Encyclopaedia Britannica (w/o photos) • Need lots of CPU power to reconstruct/analyse • about 1000 PC boxes per experiment (2005 ones!) • 40,000 of today’s boxes (dual P-III 800 MHz) • complex data models • reconstruction s/w is also used for online filtering • needs high quality s/w in order not to waste beam time

  7. Timeline • SPS 1969 • W and Z 1983 • LEP 1989 • LEP ends 2000 • K&R C 1978 • C++ 1985 • Linux V 0.01 1991 • Java 1995 • Ethernet standard 1983 • Intel Pentium 1992 • Unix V6 (first public version) 1975 • XML 1.0 1997 • IBM PC 1981 • WWW • Lifetime of LHC software = 25 yrs

  8. Technology (R)Evolution • 10 yrs major cycle length (HW, SW, OS) • ~12 evolutionary changes in the market • 1 revolutionary change • towards greater diversity • don’t forget changes of requirements • Consequences • s/w written today most probably will be rewritten tomorrow • we must anticipate changes

  9. Anaphe: what it is • Analysis for physics experiments • Modular (OO/C++) replacement of CERNLIB functionality for use in HEP experiments • memory management • I/O • foundation classes • histogramming • minimizing/fitting • visualization • interactive data analysis • Trying to use standards wherever possible • Trying to re-use existing class libraries

  10. Anaphe Components

  11. AIDA: Abstract Interfaces for Data Analysis (→ next talk)

  12. Anaphe components

  13. ‘Layered’ Approach • Basic functionalities (histograms, fitting, etc.) are available as individual C++ class libraries. • Easy to replace one part without throwing away everything • Objectivity/DB to provide persistence • HepODBMS library (“insulating layer”, “tags”) • Histogram library (HTL) • Fitting libraries (Gemini, HepFitting) • Graphics libraries (Qt, Qplotter) • Insulate components through Abstract Interfaces • “wrapper” layer to implement Interfaces in terms of existing libs • Apply s/w quality control tools • code checking, testing
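
The "wrapper" layer idea on this slide can be illustrated with a minimal Python sketch. All names here (`IHistogram1D`, `LegacyHisto`, `LegacyHistoWrapper`, `fill_from`) are hypothetical and are not the AIDA or Anaphe API; the point is only that client code depends on the abstract interface, while the wrapper maps it onto an existing class.

```python
from abc import ABC, abstractmethod


class IHistogram1D(ABC):
    """Hypothetical abstract interface; client code depends only on this."""

    @abstractmethod
    def fill(self, x, weight=1.0):
        ...

    @abstractmethod
    def entries(self):
        ...


class LegacyHisto:
    """Stand-in for an existing library class with its own method names."""

    def __init__(self):
        self.values = []

    def accumulate(self, value):
        self.values.append(value)


class LegacyHistoWrapper(IHistogram1D):
    """Wrapper layer: implements the interface in terms of the existing class."""

    def __init__(self, legacy):
        self._legacy = legacy

    def fill(self, x, weight=1.0):
        self._legacy.accumulate((x, weight))

    def entries(self):
        return len(self._legacy.values)


def fill_from(histogram, data):
    """Client code: sees only the interface, never the implementation."""
    for x in data:
        histogram.fill(x)
    return histogram.entries()
```

Replacing `LegacyHisto` with another implementation would then only require a new wrapper; `fill_from` stays untouched.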

  14. Layered architecture (diagram) • User interface, using abstract types: Lizard (interactive commands), Python / SWIG • Abstract types, AIDA (Abstract Interfaces for Data Analysis): Histograms, NTuples, Fitting, Plotting, VectorOfPoints, Functions, Analyzer • HEP-specific implementations (Anaphe components): HTL, Tags (HepODBMS), Gemini/HepFitting, Qplotter, VectorOfPoints • CLHEP: Class Libraries for HEP • Non-HEP components: Objectivity/DB | HBook, NAG-C | Minuit, Qt (free edition)

  15. Basic 3D Graphic Libraries • OpenGL (basic graphics) • De-facto industry standard for basic 3D graphics • Used in CAD/CAE, games, VR, medical imaging • OpenInventor (scene mgmt.) • OO 3D toolkit for graphics • Cubes, polygons, text, materials • Cameras, lights, picking • 3D viewers/editors, animation • Based on OpenGL/MesaGL

  16. 2D Graphics libraries • Qt • multi-platform C++ GUI toolkit • C++ class library, not wrapper around C libs • superset of Motif and MFC • available on Unix and MS Windows • no change for developer • commercial but with public domain version • www.troll.no • Qplotter • “add-on” functionality for HEP • “HIGZ/HPLOT”

  17. Mathematical Libraries • NAG (Numerical Algorithms Group) C Library • Covers a broad range of functionality • Linear algebra • differential equations • quadrature, etc. • Special functions of CERNLIB added to Mark-6 release • mostly for theory and accelerator • Quality assurance • extensive testing done by NAG • www.nag.com

  18. CLHEP - foundation classes • HEP foundation class library • Random number generators • Physics vectors • 3- and 4- vectors • Geometry • Linear algebra • System of units • more packages recently added • will continue to evolve • wwwinfo.cern.ch/asd/lhc++/clhep/

  19. Histograms: the HTL package • Histograms are the basic tool for physics analysis • Statistical information of density distributions • Histogram Template Library (HTL) • design based on C++ templates • Modular: separation between sampling and display • Extensible: open for user defined binning systems • Flexible: support transient/persistent at the same time • Open: large use of abstract interfaces • recent addition: 3D histograms
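
The "modular" and "extensible" points can be sketched in Python as a histogram that only samples, takes its binning as a pluggable object, and leaves display to a separate function. HTL itself is a C++ template library; the classes below (`EvenBinning`, `LogBinning`, `Histogram`, `render_text`) are hypothetical illustrations, not its API.

```python
import math


class EvenBinning:
    """Equal-width bins on [low, high)."""

    def __init__(self, nbins, low, high):
        self.nbins, self.low, self.high = nbins, low, high

    def index(self, x):
        if not (self.low <= x < self.high):
            return None  # out of range: ignored in this sketch
        i = int((x - self.low) / (self.high - self.low) * self.nbins)
        return min(i, self.nbins - 1)  # guard against rounding at the upper edge


class LogBinning:
    """A user-defined binning system: equal-width bins in log10(x)."""

    def __init__(self, nbins, low, high):
        self.nbins = nbins
        self._even = EvenBinning(nbins, math.log10(low), math.log10(high))

    def index(self, x):
        return self._even.index(math.log10(x)) if x > 0 else None


class Histogram:
    """Sampling only: accumulates bin contents and knows nothing about display."""

    def __init__(self, binning):
        self.binning = binning
        self.counts = [0.0] * binning.nbins

    def fill(self, x, weight=1.0):
        i = self.binning.index(x)
        if i is not None:
            self.counts[i] += weight


def render_text(histogram):
    """Display lives outside the histogram and only reads the bin contents."""
    return "\n".join("#" * int(c) for c in histogram.counts)
```

Any object with `nbins` and `index(x)` can serve as a binning, which is the Python analogue of passing a binning type as a template parameter.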

  20. Fitting and Minimization • Fitting and Minimization Library (FML) • common OO interface • NAG-C, MINUIT • based on Abstract Interfaces • IVector, IModelFunction, … • fitting as a special case of minimization • minimize “distance” between data and model • replacement for HepFitting (and Gemini) • Gemini • common interface to minimizer engine • very thin layer
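
"Fitting as a special case of minimization" can be made concrete with a short Python sketch: the fit builds a "distance" (here a sum of squared residuals) and hands it to any minimizer with the same call signature. The `compass_search` minimizer below is a toy stand-in, not NAG-C or MINUIT, and `fit` is a hypothetical helper rather than the FML interface.

```python
def compass_search(objective, start, step=1.0, tolerance=1e-8, max_iter=10000):
    """Toy derivative-free minimizer standing in for a real engine."""
    point = list(start)
    best = objective(point)
    for _ in range(max_iter):
        if step < tolerance:
            break
        improved = False
        for i in range(len(point)):
            for delta in (step, -step):
                trial = list(point)
                trial[i] += delta
                value = objective(trial)
                if value < best:
                    point, best, improved = trial, value, True
        if not improved:
            step /= 2.0  # no better neighbour: refine the step size
    return point, best


def fit(model, xs, ys, start, minimizer=compass_search):
    """Fitting = minimizing the distance between data and model."""

    def distance(params):
        return sum((y - model(x, params)) ** 2 for x, y in zip(xs, ys))

    return minimizer(distance, start)


def straight_line(x, params):
    slope, offset = params
    return slope * x + offset


# Made-up data lying exactly on y = 2x + 1.
params, distance_at_minimum = fit(straight_line, [0, 1, 2, 3], [1, 3, 5, 7], [0.0, 0.0])
```

Swapping the engine means passing a different `minimizer`; the fitting code does not change, which is the role the slide assigns to the common interface.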

  21. Opening bracket: Persistency

  22. Object persistency: two concepts, serial and page I/O • “Sequential access to objects” (streaming) • good in networking context or serial writes to file(s) • much like “good old Fortran” • often perceived to be “simpler” to implement (“<<”, “>>”) • “Navigational access to objects” (buffered) • I/O on demand for complex data models • location transparent (for user) access to object • typically by de-referencing of a smart pointer • optimized for (random) disk access (disks deliver pages) • sequential write to file(s) still ok • Both concepts need to take care of changes of the internal structure of the objects (schema evolution)
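
Navigational access, where I/O happens only when a smart pointer is de-referenced, can be sketched in a few lines of Python. `ObjectStore` and `Ref` are hypothetical toy classes (a dictionary of pickled objects stands in for disk pages); they are not the Objectivity/DB or HepODBMS API.

```python
import pickle


class ObjectStore:
    """Toy store: object id -> serialized bytes (stands in for pages on disk)."""

    def __init__(self):
        self._pages = {}
        self.reads = 0  # counts how often the "disk" was actually read

    def write(self, oid, obj):
        self._pages[oid] = pickle.dumps(obj)

    def read(self, oid):
        self.reads += 1
        return pickle.loads(self._pages[oid])


class Ref:
    """Smart-pointer-like handle: loads its target only when de-referenced."""

    def __init__(self, store, oid):
        self._store, self._oid, self._target = store, oid, None

    def deref(self):
        if self._target is None:
            self._target = self._store.read(self._oid)  # I/O on demand
        return self._target


store = ObjectStore()
store.write("track-1", {"pt": 42.0})
store.write("track-2", {"pt": 7.5})

tracks = [Ref(store, "track-1"), Ref(store, "track-2")]
reads_before = store.reads            # holding references costs no I/O
leading_pt = tracks[0].deref()["pt"]  # first access triggers one read
tracks[0].deref()                     # second access is served from memory
reads_after = store.reads
```

The user of `Ref` never says where the object lives, which is the "location transparent" property listed on the slide.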

  23. Architectural Issue: Persistency (“Object-I/O”) • Brings a completely new quality into the design • Objects now have a lifetime • don’t “delete” until you really are sure you want to • persistency is a kind of “intended memory leak” • would like to see no difference between memory and disk • “Layout” of objects may change during (extended) life • “schema evolution” • additions/deletions of attributes • changes of inheritance relations
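
Schema evolution by addition of an attribute can be illustrated with a small Python sketch in which each stored record carries a version number and is upgraded on read. The record layout, the `charge` attribute, and the JSON encoding are assumptions made for the example only.

```python
import json

CURRENT_VERSION = 2


def upgrade(record):
    """Bring a stored record up to the current layout, one version at a time."""
    record = dict(record)
    if record.get("version", 1) == 1:
        # Hypothetical change: version 2 added the attribute `charge`,
        # so objects written with the old layout receive a default value.
        record["charge"] = 0
        record["version"] = 2
    return record


def load_track(text):
    """Read a record written with any known layout version."""
    return upgrade(json.loads(text))
```

Objects written long ago remain readable by new code, which is what the extended lifetime of persistent objects requires.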

  24. Architectural Issue: Persistency (“Object-I/O”) (II) • Objects can be placed (“clustering”) • de-coupling of logical and physical view of data • Special care needed to ensure consistency in data set • avoid reading group of objects (tracks, events, ...) for which writing/updating is not (yet) complete • clean up if only part of the objects are written • typically taken care of by using transactions • Complications possible in distributed computing • need to protect disk access now like memory access in past (“Segmentation violation”)
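
How a transaction cleans up a partially written group of objects can be shown with Python's standard `sqlite3` module. SQLite is used here only as a convenient stand-in for an object database; the table layout and the `write_event` helper are assumptions for the sketch.

```python
import sqlite3


def write_event(connection, event_id, track_pts, fail_after=None):
    """Write all tracks of one event inside a single transaction."""
    with connection:  # commits on success, rolls back if an exception escapes
        for n, pt in enumerate(track_pts):
            if fail_after is not None and n == fail_after:
                raise RuntimeError("simulated crash while writing")
            connection.execute(
                "INSERT INTO tracks (event_id, pt) VALUES (?, ?)", (event_id, pt)
            )


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE tracks (event_id INTEGER, pt REAL)")
connection.commit()

write_event(connection, 1, [10.0, 20.0])  # complete event: committed
try:
    write_event(connection, 2, [30.0, 40.0], fail_after=1)  # fails half-way
except RuntimeError:
    pass

# Readers only ever see complete events; the half-written one was rolled back.
rows = connection.execute(
    "SELECT event_id, pt FROM tracks ORDER BY pt"
).fetchall()
```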

  25. Physical Model and Logical Model • Physical model may be changed to optimise performance • Existing applications continue to work transparently!

  26. Object Model • Thanks to Vincenzo Innocente (CMS)

  27. Physical clustering • Thanks to Vincenzo Innocente (CMS)

  28. Closing bracket: Persistency

  29. “Tags”, Ntuples and Events • Tags - a special kind of Ntuple • Always associated with an underlying persistent store • Tags may be used to store “ntuple-like” data • extracted from all over the event • minPt, maxEmiss, nJets, nMuon, trigger, … • Main use: speedup data selection for analysis … • Tag simplifies selection without losing complexity • Events more complex than a tree structure (“CWN”) • lots of cross-references between classes, containers • Association from the Tag to the Event may be used to navigate to any other part of the Event • even from an interactive visualization program
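
The tag idea (select on a few flat quantities, then navigate to the full event only for the survivors) can be sketched in Python as follows. The event contents are made up, and the dictionary layout is an assumption; only the attribute names `nJets` and `nMuon` come from the slide.

```python
# Made-up events with nested structure; in practice these live in the event store.
events = {
    101: {"jets": [{"pt": 55.0}, {"pt": 31.0}], "muons": [{"pt": 12.0}]},
    102: {"jets": [{"pt": 20.0}], "muons": []},
    103: {"jets": [{"pt": 80.0}, {"pt": 64.0}, {"pt": 22.0}],
          "muons": [{"pt": 40.0}, {"pt": 9.0}]},
}


def make_tag(event_id, event):
    """A tag: a few flat quantities plus the association back to its event."""
    return {
        "event_id": event_id,
        "nJets": len(event["jets"]),
        "nMuon": len(event["muons"]),
    }


tags = [make_tag(event_id, event) for event_id, event in events.items()]

# The selection runs on the small, flat tags only.
selected = [tag for tag in tags if tag["nJets"] >= 2 and tag["nMuon"] >= 1]

# Navigate from each selected tag to the full event for detailed quantities.
leading_muon_pt = [
    max(muon["pt"] for muon in events[tag["event_id"]]["muons"])
    for tag in selected
]
```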

  30. Anaphe components

  31. Anaphe Internals: (Abstract) Interfaces

  32. AIDA compliance of Anaphe • Presently (Anaphe 3.x) only AIDA 1.0 compliant • Plan to implement AIDA 2.2 Interfaces by end 2001 (Anaphe 4.x) • initially as wrappers to existing interfaces/packages • Will maintain 3.x for some time • ensures stability for users • Development will concentrate on 4.x • while AIDA will evolve further • Similar time schedule to JAS (Tony Johnson) • OpenScientist (Guy Barrand) already there

  33. Lizard: a tool for Interactive Data Analysis

  34. Interactive Data Analysis • Aim: “OO replacement for PAW” (at least) • analysis of “ntuple-like data” (“Tags”, “Ntuples”, …) • visualisation of data (Histograms, scatter-plot, “Vectors”) • fitting of histograms (and other data) • access to experiment specific data/code • Maximize flexibility and re-use • Foresee customization/integration • allow use from within experiment’s s/w • Plan for extensions • “code for now, design for the future” • Ensure maintainability • use of s/w quality control tools

  35. Scripting - why • Typical use of scripting is quite different from programming (reconstruction, analysis, ...) • history: “go back to where I was before” • repetition/looping - with “modifiable parameters” • avoid “one size fits all” or “using power-tool as hammer” • rapid prototyping in “scripting language” • quick turn-around times • performance critical code in “core language” • exploit richer set of features/functionality (e.g. templates in C++) • scripting languages usually less susceptible to changes than “mainstream languages” • potentially longer lifetimes

  36. Python - why • Python - OO (scripting) language • no “strange $!%-variables” • sensitive to indentation • Easier for users • as Java • Lots of user supplied modules available and ready for use • scientific, numerics, graphics, GUI, network, OS, games, DBs, … • example: http://www.vex.net/parnassus/ • Parnassus Totals: 1173 items in 49 categories. • Also usable in Java (Jython) • used in JAS for scripting • minimize changes needed within AIDA compliant environments

  37. Python - how • SWIG to (semi-) automatically create connection to chosen scripting language • allows flexibility to choose amongst several scripting languages • Python, Perl, Tcl, Guile, Ruby, (Java) … • Very easy to use • swig -c++ -python -shadow -c myClass.h • create shared lib from myClass.cpp and myClass_wrap.c • start python and import the myClass module to use it • Very easy to extend • simply inherit from “swiggified” class in python • modifications can later be fed back into C++ • performance, type safety, special language features (templates), …

  38. PAW -> Lizard translation • Ntuple projection in Lizard • lizard --useHBook • :-) nt = ntm.findNtuple("higgscand.hbk::cands") • :-) nplot1D(nt, "mass", "quality=5 && cut > 198") • the cut string can be any valid C++ expression • Ntuple projection in PAW • pawX11 • paw> h/file 1 higgscand.hbk • paw> nt/pl 10.mass quality=5.and.cut>198 • Assuming file higgscand.hbk contains an ntuple with number 10 and title cands
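
What an ntuple projection with a cut does can be spelled out in plain Python: loop over the rows, keep those passing the cut, and histogram one column. This is only a sketch of the concept, not how Lizard implements `nplot1D`; the `project_1d` helper, the rows, and the binning are made up, and the cut is written as a Python function rather than a C++ expression string.

```python
def project_1d(rows, column, cut, nbins, low, high):
    """Histogram `column` over all rows that pass `cut`."""
    counts = [0] * nbins
    width = (high - low) / nbins
    for row in rows:
        if not cut(row):
            continue
        x = row[column]
        if low <= x < high:
            counts[min(int((x - low) / width), nbins - 1)] += 1
    return counts


# Made-up candidate rows using the column names from the slide.
candidates = [
    {"mass": 114.2, "quality": 5, "cut": 205.0},
    {"mass": 91.0, "quality": 5, "cut": 150.0},
    {"mass": 115.8, "quality": 5, "cut": 201.0},
    {"mass": 113.1, "quality": 3, "cut": 210.0},
    {"mass": 118.5, "quality": 5, "cut": 199.0},
]

mass_histogram = project_1d(
    candidates,
    "mass",
    lambda row: row["quality"] == 5 and row["cut"] > 198,
    nbins=4,
    low=110.0,
    high=120.0,
)
```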

  39. Tutorials and Examples available

  40. Users and Collaborations • AIDA spoken here! • IGUANA (CMS visualization) • GAUDI (LHCb/HARP) framework • ATHENA (Atlas) framework • Analyzer modules in Geant 4 • JAS • Open Scientist • …you?

  41. Software quality control

  42. Software quality control • Using tools for testing/checking has started • Insure++, CodeWizard • Package dependencies: Ignominy • Set of perl and shell scripts by Lassi Tuura (CMS) • Ignominy scans… • Make dependency data produced by the compilers (*.d files) • Source code for #includes (resolved against the ones actually seen) • Shared library dependencies (“ldd” output) • Defined and required symbols (“nm” output) • And maps… • Source code and binaries into packages • #include dependencies into package dependencies • Unresolved/defined symbols into package dependencies • ignominy: dishonour, disgrace, shame; infamy; the condition of being in disgrace, etc. (Oxford English Dictionary)
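
One of the steps listed above, turning `#include` lines into package dependencies, can be sketched in Python. Ignominy itself is a set of perl and shell scripts; the code below is not taken from it, and the rule that the first path component names the package is an assumption made for the example.

```python
import re
from collections import defaultdict

# Matches lines such as: #include "Histo/Histo1D.h"  or  #include <vector>
INCLUDE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)


def package_of(path):
    """Assumption: the first path component is the package name."""
    return path.split("/", 1)[0]


def package_dependencies(sources):
    """Map {file path: file text} to {package: set of packages it includes from}."""
    known = set(sources)
    dependencies = defaultdict(set)
    for path, text in sources.items():
        for included in INCLUDE.findall(text):
            # Only includes that resolve to a known file are counted.
            if included in known and package_of(included) != package_of(path):
                dependencies[package_of(path)].add(package_of(included))
    return dict(dependencies)
```

System headers such as `<vector>` drop out because they do not resolve against the scanned source tree.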

  43. Ignominy Analysis of Anaphe • Distribution of tools and utilities for LHC era physics • Combination of commercial, free and HEP software • Claims to be a toolkit • Seems to live up to its toolkit claims • Good work on modularity • Clean design is evident in many places • Dependency diagrams often split naturally into functional units • Thanks to Lassi Tuura (CMS)

  44. Package Metrics • Size = total amount of source code (not normalised across projects!) • ACD = average component dependency (~ libraries linked in) • CCD = sum of single-package component dependencies over whole release • Indicates testing/integration cost • NCCD = Measure of CCD compared to a balanced binary tree • A good toolkit’s NCCD will be close to 1.0 • < 1.0: structure is flatter than a binary tree (= independent packages) • > 1.0: structure is more strongly coupled (vertical or cyclic) • Aim: NCCD ~ 1 for given software/functionality • Thanks to Lassi Tuura (CMS)
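
The three dependency metrics can be computed from a package dependency graph with a few lines of Python. The sketch assumes the usual definitions from Lakos' "Large-Scale C++ Software Design": each package's component dependency counts itself plus everything it depends on transitively, and a balanced binary tree of N packages has CCD = (N+1)·log2(N+1) − N. The graph format (every package appears as a key) is an assumption of the example.

```python
import math


def reachable(graph, start):
    """The package itself plus all of its transitive dependencies."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def metrics(graph):
    """graph maps each package to the packages it depends on directly."""
    n = len(graph)
    ccd = sum(len(reachable(graph, package)) for package in graph)
    balanced_tree_ccd = (n + 1) * math.log2(n + 1) - n
    return {"CCD": ccd, "ACD": ccd / n, "NCCD": ccd / balanced_tree_ccd}


# Three made-up structures with seven packages each.
tree = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e", "f"],
        "c": [], "d": [], "e": [], "f": []}
flat = {name: [] for name in "abcdefg"}
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"],
         "e": ["f"], "f": ["g"], "g": []}
```

For these inputs the balanced tree gives NCCD = 1, the flat set of independent packages falls below 1, and the vertical chain rises above 1, matching the interpretation on the slide.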

  45. Metrics: NCCD vs Cycles (plot of toolkits & frameworks: ATLAS (includes Fortran), ROOT, ORCA, G4, COBRA, Anaphe, IGUANA) • NCCD (“spaghetti index”) • ~ 1.0: good toolkit • < 1.0: indep. packages • > 1.0: strongly-coupled • Thanks to Lassi Tuura (CMS)

  46. History • Started after CHEP-2000 • Full version out since June 2001 • Established functionality exceeding PAW • Analyzer component giving direct access to data and libraries of the experiment framework • Based on Abstract Interfaces • Flexible and extensible • Established parallel development of “license free” version while re-using existing libraries • Direct reading/writing of HBook files as an alternative to Objectivity/DB based persistency • Use of Minuit as a replacement for the minimizer of NAG-C

  47. Ongoing activities • Persistency • De-emphasize Objectivity/DB (in coordination with experiments, IT/DB and LCG) • Use of HBook ntuples • Text files (using AIDA defined XML format) • Planning to use LCG persistency (POOL) • Investigating direct reading of ROOT files • Fitting • Implementing minimizer from GSL • Discussing with the IGUANA team (CMS) to integrate their GUI components • Looking forward to confirmation and/or re-direction of our efforts following the SC2 (RTAGs)

  48. Future enhancements • Access to other implementations of components • HBOOK CWNtuples • Communication with Java tools/packages (JAS, Wired) • via AIDA • Reading of ROOT (> V3.0) files • similar to Tony Johnson’s (Java) RootIO package • depends on “stability” of Root file format • AIDA Ntuple/Histo store • optimized for Ntuples, Histograms as (compressed) XML • Adding other “scripting” languages • Perl, Tcl, cint?

  49. Challenge: Distributed Computing • Motivation • move code to data • parallel analysis • Techniques • services via Abstract Interfaces (AI) • late binding • plug-in architecture • End-user (Lizard) • look-and-feel of local analysis • R&D started and first prototype available soon • CORBA based
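
Late binding through a plug-in registry can be sketched in Python as below. Nothing here is CORBA or Lizard code; the registry, the service names, and the "remote" stub (which simply forwards to a local object) are all hypothetical, and only show how the end-user code can stay the same whether the service runs locally or elsewhere.

```python
_PLUGINS = {}


def register(kind, name):
    """Class decorator that records a plug-in under (kind, name)."""
    def decorator(factory):
        _PLUGINS[(kind, name)] = factory
        return factory
    return decorator


def create(kind, name, *args, **kwargs):
    """Late binding: the implementation is chosen by name at run time."""
    try:
        factory = _PLUGINS[(kind, name)]
    except KeyError:
        raise LookupError(f"no {kind} plug-in named {name!r}") from None
    return factory(*args, **kwargs)


@register("statistics_service", "local")
class LocalStatistics:
    def mean(self, values):
        return sum(values) / len(values)


@register("statistics_service", "remote")
class RemoteStatisticsStub:
    """Stand-in for a proxy; a real one would forward the call over the network."""

    def __init__(self, backend=None):
        self._backend = backend or LocalStatistics()

    def mean(self, values):
        return self._backend.mean(values)


def analyse(service_name, values):
    """End-user code: same look-and-feel regardless of where the service runs."""
    return create("statistics_service", service_name).mean(values)
```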

  50. Summary • The architecture of Anaphe shows some important items for flexible and modular data analysis: • weak coupling between components through use of Abstract Interfaces • basic functionality is covered by individual C++ class libraries • emphasis on usability and maintainability • Major criteria are flexibility, extensibility and interoperability • Recent example: GEANT-4 examples (based on AIDA) • Lizard is an Interactive Data Analysis Tool based on Anaphe components and the Python scripting language (through SWIG) • Lizard is young but has a very solid base in mature Anaphe libraries • real plug-in structure • Software quality control is important • tools help to optimize dependencies / minimize maintenance effort
