This talk discusses the evolution of data analysis in high energy physics computing, exploring new techniques for interacting with data and the challenges of experiment-independent analysis.
Perspective on Future Data Analysis in HENP
Computing in High Energy Physics 2003, La Jolla, 24 March
René Brun, CERN
Data Analysis ??
• Data analysis has traditionally been associated with the final stages of data processing, i.e. physics analysis.
• In this talk, I will cover a more general aspect of data analysis (in the true sense).
• How do we interact with data at all stages of data processing (batch or interactive modes)?
• Can we imagine an experiment-independent way to achieve this?
Evolution
• To understand the possible directions, we must understand some messages from the past, the solid recipes!
• One important message is “Make it simple”.
• Heavy experiment frameworks are often perceived as a serious obstacle and push users towards more basic but universal frameworks.
Once upon a time (seventies)
• With the first electronic (as opposed to bubble chamber) experiments, data analysis was experiment-specific, an activity that came after data taking.
• The only common software was the histogramming package (e.g. Hbook), the fitting package (e.g. Minuit), some plotting packages and independent routines in cernlib (linear algebra and small utilities).
• Data structures = Fortran common blocks
Early Eighties
• With the growing complexity of the experiments and the corresponding software, we see the development of data structure management systems (hydra, zbook-->zebra, bos).
• These systems are able to write/read complex bank collections. Zebra had a self-describing bank format with built-in support for bank evolution.
• Most data were processed in batch, but many prototypes of interactive systems started to appear (htv, gep, then paw...).
PAW
• Designed in 1985. Stable since 1993.
• Row-Wise-Ntuples: OK for small data sets, interactive histogramming with cuts.
• Column-Wise-Ntuples: a major step illustrating the advantage of structured data sets.
• PAW: a success
  • not so much because of its technical merits
  • but because it is perceived as a widely available tool
  • stable for many years: an important element
1993-->2000 (1)
• Move from Fortran to OO
• Took far more time than expected
  • new language(s)
  • new programming techniques
  • basic infrastructure not available to compete with existing libraries and tools
  • conflicts between projects
  • ad-hoc software in experiments
1993-->2000 (2)
• False hopes with OODBMS (or too early?)
  • OODBMS --> Objectivity
  • OO models designed for Objy
  • batch oriented
  • interactive use via conversion to PAW ntuples
  • central database does not fit well with GRID concepts
  • licensing problems and more
Data Analysis Models
From the desktop to the GRID
[Diagram labels: Online/Offline Farms, Local/remote Storage, Desktop, GRID]
New data analysis tools must be able to use remote CPUs, storage elements and networks in parallel, in a way that is transparent for a user at a desktop.
My laptop in 200X
Using a naïve extrapolation of Moore’s law for a state-of-the-art laptop:
Year   CPU (GHz)   RAM (GB)   Disk (GB)
2003   2.4         0.5        60
2005   5           1          150
2007   10          2          300
2009   20          4          600
2011   40          8          1000
Nice! But less than 1/1000 of what I need.
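The CPU and RAM columns above are consistent with a doubling roughly every two years. A minimal sketch of that arithmetic, assuming a strict two-year doubling period (the function name and the period are illustrative assumptions; the slide's figures are rounded, and its disk column does not follow the rule exactly):

```python
def extrapolate(base_value, year, base_year=2003, doubling_years=2.0):
    """Naive Moore's-law extrapolation: double every `doubling_years` years."""
    return base_value * 2 ** ((year - base_year) / doubling_years)


# Reproduce the CPU and RAM columns, starting from the 2003 values on the slide.
for year in (2003, 2005, 2007, 2009, 2011):
    cpu = extrapolate(2.4, year)
    ram = extrapolate(0.5, year)
    print(f"{year}: CPU ~{cpu:.1f} GHz, RAM ~{ram:.1f} GB")
```

Under this assumption the 2011 laptop ends up 16 times larger than the 2003 one, which is the point of the slide: a linear-in-time doubling does not close a gap of three orders of magnitude.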
Batch-mode Local analysis
• Conventional model: the user has full control over the event loop.
• The program produces histograms, ntuples or trees.
• The selection is done in the user's private code.
• Histograms are then added (with a tool or in the interactive session).
• Ntuples/trees are combined into a chain and analyzed interactively.
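The conventional model can be sketched in a few lines. This is only an illustration under stated assumptions: events are plain dictionaries, the cut values in `passes` are hypothetical, and `itertools.chain` stands in for the idea of a chain of ntuples/trees; none of this is the API of PAW or ROOT.

```python
from itertools import chain


def passes(event):
    """The user's private selection code (hypothetical cuts)."""
    return event["pt"] > 20.0 and abs(event["eta"]) < 2.5


def event_loop(events):
    """The user controls the loop and decides what to keep for each event."""
    selected_pt = []
    for event in events:
        if passes(event):
            selected_pt.append(event["pt"])
    return selected_pt


# Two "files" combined into one chain and processed as a single data set.
file_a = [{"pt": 25.0, "eta": 0.3}, {"pt": 10.0, "eta": 1.0}]
file_b = [{"pt": 40.0, "eta": 3.0}, {"pt": 31.0, "eta": -2.0}]
selected = event_loop(chain(file_a, file_b))
```

The loop never needs to know where one file ends and the next begins, which is what makes the chain convenient for interactive analysis.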
Batch Analysis on the GRID
• From a user viewpoint, a simple extrapolation of the local batch analysis.
• In practice, it must involve all the GRID machinery: authentication, resource brokers, sandboxes.
• Viewing the current status (histograms) must be possible.
• Advantage: stateless, can process large data volumes.
Advanced systems already exist (see talk by Andreas Wagner).
AliEnFS & Distributed Analysis
[Diagram residue. Recoverable labels: Kernel Space, Linux File System Kernel, VFS, LUFS, /alien/; User Space, AliEn API, AliEnFS; directories alice/, atlas/, prod/, data/, mc/, a/, b/; Query for Input Data, Analysis Macro; CE and MSS nodes reached through soap://, root://, castor://, https://; merged Trees + Histograms.]
[Screenshot of a ROOT session: ROOT Version 3.03/09, 3 December 2002, http://root.cern.ch, compiled for linux with thread support, CINT/ROOT C/C++ Interpreter version 5.15.61, Oct 6 2002. Command shown:]
root [0] newanalysis->Submit();
Interactive Local Analysis
• On a public cluster, or on the user’s laptop.
• Tools like PAW or its successors are used for visualization and ntuple/tree analysis.
GRID: Interactive Analysis, Case 1
• Data transfer to user’s laptop
• Optional Run/File catalog
• Optional GRID software
[Diagram: an optional Run/File catalog; Trees served by a remote file server (e.g. rootd) and transferred to the local machine.]
Analysis scripts are interpreted or compiled on the local machine.
GRID: Interactive Analysis, Case 2
• Remote data processing
• Optional Run/File catalog
• Optional GRID software
[Diagram: an optional Run/File catalog; commands and scripts are sent to a remote data analyzer (e.g. proofd) that reads the Trees; histograms and trees come back.]
Analysis scripts are interpreted or compiled on the remote machine.
GRID: Interactive Analysis, Case 3
• Remote data processing
• Run/File catalog
• Full GRID software
[Diagram: a Run/File catalog; commands and scripts are sent to a remote data analyzer (e.g. proofd) acting as master for several slaves, each reading its own Trees; histograms and trees come back.]
Analysis scripts are interpreted or compiled on the remote master(s).
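The master/slave pattern of Case 3 can be illustrated with a small sketch. It is not the proofd protocol: local threads stand in for remote slaves, each "tree" is a plain list of numbers, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_tree(tree, cut):
    """Slave step: count and sum the entries of one tree that pass the cut."""
    selected = [x for x in tree if x > cut]
    return {"n": len(selected), "sum": sum(selected)}


def merge(partials):
    """Master step: combine the partial results returned by the slaves."""
    total = {"n": 0, "sum": 0.0}
    for partial in partials:
        total["n"] += partial["n"]
        total["sum"] += partial["sum"]
    return total


def run_parallel(trees, cut, n_workers=3):
    """Distribute the trees over the workers and merge what they return."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda tree: process_tree(tree, cut), trees))
    return merge(partials)


trees = [[1.0, 5.0, 9.0], [2.0, 8.0], [7.0, 3.0, 6.0]]
result = run_parallel(trees, cut=4.0)
```

Only the small partial results travel back to the master; the trees stay where they are, which is the reason for processing the data remotely in the first place.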
Data Analysis Projects
Tools for data analysis
• PAW: started in 1985, no major developments since 1994
• HippoDraw: started in 1991
• ROOT: started in 1995, continuous developments
• JAS: started in 1995, continuous developments
• Open Scientist: ?
• LHC++/Anaphe: 1996-->2002
• PI: new project in the LHC Computing Grid, just starting now
PAW
• The reference for 18 years (since 1985), used by most collaborations
• ported to many platforms, small (3 to 15 MB)
• many criticisms during the development phase
• applauded since it is stable
• maintained by Olivier Couet (ROOT team)
Usage still growing
0.1 FTE
HippoDraw
• Author: Paul Kunz
• Showed the way in 1991/1992
• Usage: Paul + “a 50 year-old CERN physicist”
• Seems to be in a constant prototyping phase
• Good to have this type of prototype to illustrate possible new interactive techniques.
1 FTE ?
ROOT
• In constant development since 1995
• Used by many collaborations and outside HEP
More than 10000 distributions of binary tar files in February
6 +2+.. FTE
JAS
• Started in 1995 (Tony Johnson)
• Current version 2. JAS3 presented at this CHEP
• For the Java world
• How to cooperate with C++ frameworks?
3 FTE ?
In AIDA you believe ?
• The Abstract Interfaces for Data Analysis project was started by the defunct LHC++ and continued by Anaphe (now stopped).
• Supported by JAS and Open Scientist
• Goal: define abstract interfaces to facilitate cooperation between developers and the migration of users to new products
• Versions 1, 2 and 3 (version 4 for PI ?)
In AIDA I don’t believe
• Abstract interfaces are fundamental in modern systems to make a system more modular and adaptable.
• But common abstract interfaces are not a good idea.
  • They force a lowest common denominator.
  • They require international agreements.
  • Users will be confused (what is common and what is not).
  • You become the slave of a deal, which works against creativity.
• It is more important to agree on object interchange formats and database access.
• You can easily change a few hundred lines of code. You cannot copy Terabytes of data.
The LCG PI project
• Fresh from the oven
• One of the projects recently launched by the Applications Area of the LCG project
• Ideas:
  • promote the use of AIDA (version 4)
  • Python for scripting
  • interface to ROOT & CINT
• in gestation
• see Vincenzo
User & Developer views
• Users' requests
  • very rarely requests for grandiose new features
  • zillions of tiny new features
  • zillions of tiny improvements
  • want consolidation & stability
• Developers' view
  • want to implement the sexy features
  • target modularity (more complex installation?)
  • maintenance & helpdesk: a problem or a chance?
Lessons from the past
• It takes time to develop a general tool
  • more than 7 years for PAW, ROOT and JAS
• User feedback is essential in the development phase
• People like stable systems
• Efficient access to data sets is a prerequisite
• 24h x 7 days x 12 months x N years online support is vital
Develop/Debug/maintain
In an interactive system with N basic functions, the number of combinations may be unlimited (not NxN, but N!).
10% of the time goes into developing the first 90% of the code.
90% of the time goes into developing the remaining 10%.
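To give a feeling for the difference between NxN and N!, here is a small sketch. It reads "N!" as the number of orderings in which N basic functions can be chained, which is an interpretation of the slide, not something it states; the function names are illustrative.

```python
import math


def pairwise(n):
    """Number of ordered pairs of functions: N x N."""
    return n * n


def orderings(n):
    """Number of ways to chain all N functions in sequence: N!."""
    return math.factorial(n)


for n in (5, 10, 20):
    print(f"N={n}: NxN={pairwise(n)}, N!={orderings(n)}")
```

With only 10 basic functions there are 100 pairs but 3628800 orderings, which is why exhaustive testing of an interactive system is out of reach.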
Time to develop
[Figure not recovered; the only surviving label is "LCG".]
Technical aspects
Desktop
• Plug-in Manager and Dictionary
• GUI
• Graphics 2-d, 3-d
• Event Displays
• Histogramming & Fitting
• Statistics tools
• Scripting
• Data/Program organization
Plug-in Manager
[Diagram labels: Exp shared libs (three boxes), User shared lib, Basic Services (GUI, Math, ...), General Utility shared lib, Plug-in manager, I/O manager (two boxes), Interpreter, Object Dictionary]
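The idea behind a plug-in manager is that a service is registered by name and its library is only loaded when somebody asks for it. A minimal sketch of that idea, assuming Python modules play the role of shared libraries; the class and method names are hypothetical and this is not the interface of any existing framework.

```python
import importlib


class PluginManager:
    """Maps service names to (module, attribute) and loads them on demand."""

    def __init__(self):
        self._registry = {}  # service name -> (module name, attribute name)
        self._loaded = {}    # service name -> resolved object

    def register(self, service, module_name, attribute):
        # Registration is cheap: nothing is imported at this point.
        self._registry[service] = (module_name, attribute)

    def get(self, service):
        # The module is imported only the first time the service is requested.
        if service not in self._loaded:
            module_name, attribute = self._registry[service]
            module = importlib.import_module(module_name)
            self._loaded[service] = getattr(module, attribute)
        return self._loaded[service]


manager = PluginManager()
manager.register("serialize", "json", "dumps")
manager.register("sqrt", "math", "sqrt")
```

A call such as `manager.get("sqrt")(9.0)` triggers the import of `math` at that moment; services that are never requested cost nothing beyond their registry entry.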
The Object Dictionary
[Diagram: the Object Dictionary (data dictionary, functions dictionary) at the centre, connected to Inspectors, Browsers, I/O, Interpreted scripts, GUI, Command line and Compiled code]
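An object dictionary lets generic tools (inspectors, browsers, I/O) work on any class without class-specific code. A minimal sketch of the data-dictionary half, assuming Python dataclasses as the source of the class description; `Track` and all function names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class Track:
    pt: float
    charge: int


def describe(cls):
    """Data dictionary entry: member names and type names of a class."""
    return {f.name: getattr(f.type, "__name__", str(f.type)) for f in fields(cls)}


def inspect_object(obj):
    """Generic inspector: driven by the dictionary, no per-class code."""
    return ", ".join(f"{name}={getattr(obj, name)}" for name in describe(type(obj)))


def write(obj):
    """Generic output: a self-describing record built from the dictionary."""
    record = {"class": type(obj).__name__}
    for name in describe(type(obj)):
        record[name] = getattr(obj, name)
    return record


def read(record, known_classes):
    """Generic input: rebuild the object from its record and class name."""
    cls = known_classes[record["class"]]
    return cls(**{name: record[name] for name in describe(cls)})
```

Adding a new class to the system then means adding its description, not writing a new inspector or a new I/O routine.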
Scripting for data analysis
• After the KUIP and Tk/Tcl era
• Command line interface required
• Scripts
  • interpreted and/or byte-code interpreted
  • automatic compilation and linking
  • call compiled or interpreted code
  • compiled code must be able to call interpreted code (GUI and configuration scripts)
• Big bonus if the compiled and interpreted languages are the same
• Scripting and object dictionary symbiosis
• Remote execution of scripts (in parallel)
Languages & scripting
[Diagram labels: C++ compiled code, C++ interpreted scripts, Python/Perl scripts, GUI with signal/slots, Batch User, Interactive User]
Comparing scripts
Very interesting project from Subir Sarkar: cooperation between Java and a C++ framework based on the Object Dictionary.
http://sarkar.home.cern.ch/sarkar/jroot/main.html
GUI(s)
• Constant evolution
  • 1983: SMS, Vax/VMS, VT100
  • 1985: GKS, Tektronix
  • 1989: MOTIF, Unix workstations
  • 1997: Java/Swing, the Web
  • 2001: Qt, Linux/laptops
  • + Microsoft MFC, Win32 API
• Signals/Slots principle: very nice. It helps in designing large and modular GUI systems.
• Interpreters help GUI builders/editors
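The signals/slots principle decouples the widget that emits an event from the code that reacts to it. A minimal sketch of the mechanism in plain Python (this is not the Qt or ROOT implementation; `Signal`, `Slider` and the slot bodies are hypothetical):

```python
class Signal:
    """A signal keeps a list of slots and calls each of them when emitted."""

    def __init__(self):
        self._slots = []

    def connect(self, slot):
        self._slots.append(slot)

    def emit(self, *args):
        for slot in self._slots:
            slot(*args)


class Slider:
    """A widget that knows nothing about who listens to it."""

    def __init__(self):
        self.value_changed = Signal()

    def set_value(self, value):
        self.value_changed.emit(value)


log = []
slider = Slider()
# Two independent components react to the same signal.
slider.value_changed.connect(lambda v: log.append(("histogram rebinned", v)))
slider.value_changed.connect(lambda v: log.append(("label updated", v)))
slider.set_value(50)
```

The slider has no reference to the histogram or the label, so either can be added or removed without touching the widget code, which is what makes large GUI systems modular.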
2-D graphics
• An area where constant improvements are required.
• Better plotters, better fonts, ...
• Better drivers: PostScript, SVG, XML, etc.
Publication quality is a must. This requirement alone explains why many proposed data analysis systems do not penetrate experiments.
3-D graphics
• Data structures: Objects <--> scene
• Scene renderers: OpenGL, Open Inventor
• Most difficult is detector geometry graphics
• z-buffer algorithms are OK for fast, real-time, fancy graphics, but not for good debugging (the shape outline is important on top of z-buffer views).
• Vector PostScript (or PDF/SVG) must be available (not PostScript from OpenGL triangles)
• see talks about GraXML and Persint
Example with PERSINT/ATLAS
Event Displays
• The most successful event displays so far were 2-D projections (see Aleph, Atlas/Atlantis)
• A lot of work with 3-d graphics in many experiments (see talks about Iguana)
• Client-server model
• Access to framework objects, browsers
• One could have expected a bigger role for Java!
  • Mismatch with experiment C++ frameworks?
• Possible directions
  • standardize object exchange (SOAP/XML/Root I/O)
  • standardize low level graphics exchange (HEPREP)
Histogramming
• This should be a stable area
• Thread safety
• Binning on parallel systems
• Merging on batch/parallel systems
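Merging on batch or parallel systems works when every job fills a histogram with the same binning, so that the partial histograms can be added bin by bin. A minimal sketch under that assumption; `Hist1D` is a hypothetical class, not Hbook or ROOT code.

```python
class Hist1D:
    """Fixed-binning 1-D histogram with underflow and overflow counters."""

    def __init__(self, nbins, low, high):
        self.nbins, self.low, self.high = nbins, low, high
        self.bins = [0] * nbins
        self.underflow = 0
        self.overflow = 0

    def fill(self, x):
        if x < self.low:
            self.underflow += 1
        elif x >= self.high:
            self.overflow += 1
        else:
            width = (self.high - self.low) / self.nbins
            # min() guards against rounding pushing x into a non-existent bin.
            index = min(int((x - self.low) / width), self.nbins - 1)
            self.bins[index] += 1

    def add(self, other):
        """Merge another histogram into this one; binning must be identical."""
        if (self.nbins, self.low, self.high) != (other.nbins, other.low, other.high):
            raise ValueError("cannot merge histograms with different binning")
        self.bins = [a + b for a, b in zip(self.bins, other.bins)]
        self.underflow += other.underflow
        self.overflow += other.overflow


# Two jobs fill their own histograms; the results are merged afterwards.
job1, job2 = Hist1D(5, 0.0, 10.0), Hist1D(5, 0.0, 10.0)
for x in (0.5, 2.5, 2.7):
    job1.fill(x)
for x in (9.9, 12.0, -1.0):
    job2.fill(x)
job1.add(job2)
```

Because addition is associative, the order in which partial histograms arrive does not matter, which is why the binning has to be agreed before the jobs start.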
Fitting
• Minuit: the standard
• Fumili: was nice and fast
• Upgrade of Minuit with new algorithms, including Fumili, in the pipeline
• several GUIs on top
• a very powerful package developed by BaBar
  • see talk on RooFit by D. Kirkby
Statistics & Math
• Many tools and algorithms exist
  • GSL ?
  • GNU R-Math project
  • TerraFerma Initiative
• Subject of discussions at many workshops
  • confidence limits workshops
  • ACAT FermiLab and Moscow
  • Durham
• Need to be federated in a coherent framework
Lost with Complexity?
• In large collaborations, users are often lost when confronted with the complexity of big simulation and reconstruction programs:
  • What is the data organization?
  • How are algorithms organized? The hierarchy?
• The problem is amplified by the use of dynamically configurable systems, dynamic linking and polymorphism
• Browsing data and algorithms is a must
Folders / white boards
• Folders help in understanding complex hierarchical structures
• Language independent
• Could be GRID-aware
Why Folders ?
This diagram shows a system without folders. The objects have pointers to each other to access each other's data. Pointers are an efficient way to share data between classes. However, a direct pointer creates a direct coupling between classes. This design can become a very tangled web of dependencies in a system with a large number of classes.
Why Folders ? (continued)
A naming and search service provides an alternative. In the diagram below, a reference to the data is in the folder, and the consumers refer to the folder rather than to each other to access the data. This loosely couples the classes and greatly enhances I/O operations. In this way, folders separate the data from the algorithms and greatly improve the modularity of an application by minimizing the class dependencies.
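The decoupling described above can be sketched as a small white board: producers post objects under a name, consumers look them up by name, and neither holds a pointer to the other. The `Folder` class, the paths and the two algorithms are hypothetical illustrations, not the interface of any framework.

```python
class Folder:
    """A naming and search service: objects are posted and found by path."""

    def __init__(self):
        self._objects = {}

    def add(self, path, obj):
        self._objects[path] = obj

    def find(self, path):
        return self._objects[path]

    def ls(self, prefix=""):
        """Browse the hierarchy: all paths that start with the given prefix."""
        return sorted(p for p in self._objects if p.startswith(prefix))


def make_clusters(board):
    """Producer: posts its output without knowing who will read it."""
    board.add("/event/calo/clusters", [12.0, 3.5, 20.0])


def find_jets(board, threshold=10.0):
    """Consumer: asks the folder for its input, not the producer."""
    clusters = board.find("/event/calo/clusters")
    board.add("/event/jets", [e for e in clusters if e > threshold])


board = Folder()
make_clusters(board)
find_jets(board)
```

Replacing `make_clusters` by another producer that posts to the same path requires no change in `find_jets`, and `board.ls("/event")` gives the browsing view of the data that the previous slides ask for.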