Data Analysis in Experimental Particle Physics. Lectures at the CERN-CLAF School, 13-14 May 2001, Itacuruça, Brazil. Prof. Manuel Delfino, CERN Information Technology Division*. * Permanent address: Departamento de Física, Universidad Autónoma de Barcelona, España.
Data Analysis in Experimental Particle Physics
Lectures at the CERN-CLAF School, 13-14 May 2001, Itacuruça, Brazil
Prof. Manuel Delfino
CERN Information Technology Division*
* Permanent address: Departamento de Física, Universidad Autónoma de Barcelona, España
Data Analysis / M. Delfino / CERN IT Division
Data Analysis in Particle Physics: Outline of Lecture 1
• Characteristics of data from particle experiments
• From DAQ data to Event Records: Event Building
• From hits to tracks and clusters
• From tracks and clusters to “particles”: Correlating sub-detector information
• Uncertainties and resolution
• Data reconstruction and “production”: Data Summary “Tapes”
• Personal data analysis: n-tuples
Data Analysis in Particle Physics: Outline of Lecture 2
• Monte Carlo simulation
• Statistics and error analysis
• Hypothesis testing
• Simulation of particle production and interactions with the detector
• Digital representations of event data
• Monitoring and Calibration
• Why physicists don’t (yet) use Excel and Oracle for their daily analysis
• The challenge of analysis for the LHC experiments
• The challenge of computing for the LHC
• Solving the LHC computing challenge
Characteristics of data from particle experiments
• Most data comes from digitized information from sensors activated by particles crossing them.
• We call the data resulting from the observation of a particle collision an event.
• Over hours, days, weeks, months, years or even decades, we observe many events. We group them into runs according to the time-varying experimental conditions.
• Calibration and environmental information is also stored, usually periodically.
• For practical reasons, this data is stored in data files of many events.
• Almost always, events are independent of each other.
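
The event, run and data-file vocabulary above can be sketched in a few lines of Python. This is only an illustration: the `Event` class, its fields and the sample stream are hypothetical and do not come from any experiment's software.

```python
from dataclasses import dataclass, field
from itertools import groupby


@dataclass
class Event:
    """One observed collision (hypothetical minimal record)."""
    run: int                       # run number: block of stable conditions
    number: int                    # event number within the run
    hits: list = field(default_factory=list)  # digitized sensor data


def group_by_run(events):
    """Group a time-ordered stream of events into runs.

    Assumes all events of a run are contiguous in the stream.
    """
    return {run: list(evts) for run, evts in groupby(events, key=lambda e: e.run)}


# Hypothetical time-ordered stream, as it might sit in one data file.
stream = [Event(137, 1), Event(137, 2), Event(138, 1), Event(139, 1), Event(139, 2)]
runs = group_by_run(stream)
events_per_run = {run: len(evts) for run, evts in runs.items()}
```

Because events are almost always independent, each entry of `runs` could be processed separately and in any order.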
Characteristics of data from particle experiments
[Figure: “The Experimental Particle Physics Data Worm”: a stream of events (e.g. event number 31896) grouped into runs (137 to 140), stored in data files (418, 419), together with calibration records.]
From DAQ data to Event Records: “Event Building”
From hits to tracks and clusters
• Occupancy and point resolution are related to ambiguities in track finding.
• Calibration, monitoring and software are needed to resolve these ambiguities.
• What you see is not always what there was! [Figure: event display of a nuclear interaction.]
Monitoring and Calibration
• Particles deposit energy in sensors.
• Sensors give voltages, currents, charges.
• The space position of each sensor is known.
• On-detector Analog-to-Digital Converters (ADCs) change these into numbers representing these or other quantities (for example, clock ticks between voltage pulses).
• Calibration establishes the relationship between the ADC units and the physical units (eV, {x,y,z}, ns):
   • In the laboratory, using controlled conditions
   • In the field, using known physical processes
• The calibration can depend on the environment or drift due to uncontrolled parameters, hence the need for monitoring.
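
As a rough illustration of what "establishing the relationship between ADC units and physical units" can mean, the sketch below fits a straight line to a few calibration points. The linear response model, the function names and the numbers are all assumptions made for the example; real detectors often need non-linear or per-channel calibrations.

```python
def fit_linear_calibration(adc_counts, known_values):
    """Least-squares straight line: value = gain * adc + offset."""
    n = len(adc_counts)
    mean_x = sum(adc_counts) / n
    mean_y = sum(known_values) / n
    sxx = sum((x - mean_x) ** 2 for x in adc_counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(adc_counts, known_values))
    gain = sxy / sxx
    offset = mean_y - gain * mean_x
    return gain, offset


# Hypothetical calibration run: ADC counts recorded for known energies (MeV).
adc = [100, 300, 500, 700]
energy_mev = [0.0, 10.0, 20.0, 30.0]
gain, offset = fit_linear_calibration(adc, energy_mev)


def adc_to_energy(count):
    """Convert a raw ADC count to MeV using the fitted constants."""
    return gain * count + offset


# Monitoring idea: refit periodically and watch for drifts in gain and offset.
print(gain, offset, adc_to_energy(400))
```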
From tracks and clusters to “particles”: Correlating sub-detector information
[Figure: event displays with particles labelled μ and e.]
Uncertainties and resolution
• Each measurement or hit has some uncertainty, due to alignment and the characteristics of the sensor.
• These uncertainties get propagated, often in a non-linear manner, to resolution functions for the physics quantities used in analysis.
• Resolution has various consequences:
   • Direct effect on measurements
   • Signal-background confusion
   • Combinatorics
[Figure: particle identification with dE/dx in the TPC. Note the different scales.]
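
A small numerical experiment can make the "non-linear propagation" point concrete. The sketch below assumes that a tracker measures curvature with a Gaussian uncertainty and converts it to transverse momentum with the usual approximation p_T [GeV/c] ≈ 0.3 · B [T] / κ [1/m]. The field value, the true momentum and the 10% resolution are made-up numbers for illustration.

```python
import random
import statistics

B_TESLA = 1.5        # assumed magnetic field
TRUE_PT = 10.0       # assumed true transverse momentum in GeV/c
REL_SIGMA = 0.10     # assumed 10% Gaussian resolution on the curvature


def pt_from_curvature(kappa, b_field=B_TESLA):
    """p_T [GeV/c] ~ 0.3 * B [T] / kappa [1/m]."""
    return 0.3 * b_field / kappa


true_kappa = 0.3 * B_TESLA / TRUE_PT
rng = random.Random(2001)  # fixed seed so the experiment is repeatable

# Smear the curvature (Gaussian), then convert each value to momentum.
measured_pt = [
    pt_from_curvature(rng.gauss(true_kappa, REL_SIGMA * true_kappa))
    for _ in range(100_000)
]

mean_pt = statistics.fmean(measured_pt)
median_pt = statistics.median(measured_pt)
linear_sigma_pt = TRUE_PT * REL_SIGMA  # first-order error propagation

print(mean_pt, median_pt, statistics.stdev(measured_pt), linear_sigma_pt)
```

The curvature errors are symmetric, but the momentum distribution is skewed towards high values: its mean sits above the true momentum while its median stays close to it. A resolution function for p_T is therefore not simply a Gaussian with the first-order width.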
Data reconstruction and “production”: Data Summary “Tapes”
• Reconstruction turns hits + calibration + geometry into particle hypotheses.
• Reconstruction is time consuming and must be done coherently, hence centrally organized production.
• Output is one or more levels of so-called Data Summary Tapes (DST), which are used as input to personal analysis.
• In practice, there is a lot of utility software to organize these data for easy analysis (bookkeeping).
• Programming of complicated event structures:
   • Old: FORTRAN with home-made memory managers
   • Today: Object-Oriented design using C++ or Java
Personal data analysis
• Most modern detectors can address multiple physics topics.
• Hundreds or thousands of professors and students are distributed around the world.
• Modern experimental collaborations are an early example of virtual communities.
• Historical enablers for virtual communities:
   • Fellowship and exchange programmes
   • Telegraph, telex, telephone and telefax
   • National and international laboratories
   • Reasonably priced airline tickets
   • Computer inter-networking, e-mail and ftp
   • The World Wide Web
   • Multi-media applications on the Internet
Personal data analysis
• Today, physics analysis topics are increasingly tackled by virtual teams within these virtual communities.
• Coherency of data and algorithms must be maintained within the virtual team.
• “Production” for a modern detector is very complex and consumes many resources.
• DSTs contain all imagined reconstruction objects for all foreseen analyses, so they are big.
• Handling a DST often requires installation of special software libraries and writing code in “reconstruction dialect”.
Personal data analysis
• Solution: Each virtual team develops code to extract a common analysis dataset for a given topic, which is written and manipulated using a “lingua franca”: n-tuples and the Physics Analysis Workstation (PAW).
• This is the physicist’s version of business data mining with Excel.
• Iterative process (time-scale of weeks or months):
   1. Team agrees on complex algorithms to be coded in the extraction program.
   2. Algorithms are coded and tested; extraction from DST.
   3. n-tuple file is rapidly distributed via computer network.
   4. n-tuple is analyzed using non-compiled, platform-independent code (PAW macros today, Java in future?) that is easily modified and shared by e-mail.
   5. Eventually limitations are reached; go back to step 1.
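
The extraction step can be pictured as flattening nested event records into one fixed-width row per event, which is then cheap to share and to cut on. The sketch below uses plain Python dictionaries; the record layout, column names and cut values are hypothetical and unrelated to the actual PAW n-tuple format.

```python
# Hypothetical "DST-like" records: nested, with a variable number of tracks.
events = [
    {"run": 137, "event": 1, "tracks": [{"pt": 12.0, "charge": 1}, {"pt": 3.5, "charge": -1}]},
    {"run": 137, "event": 2, "tracks": [{"pt": 1.2, "charge": 1}]},
    {"run": 138, "event": 1, "tracks": [{"pt": 25.0, "charge": -1}, {"pt": 22.0, "charge": 1},
                                        {"pt": 0.8, "charge": 1}]},
]


def extract_row(event):
    """Agreed 'extraction program': one flat row of derived quantities per event."""
    tracks = event["tracks"]
    return {
        "run": event["run"],
        "event": event["event"],
        "n_tracks": len(tracks),
        "max_pt": max((t["pt"] for t in tracks), default=0.0),
        "net_charge": sum(t["charge"] for t in tracks),
    }


# The n-tuple: same columns for every event, easy to distribute.
ntuple = [extract_row(e) for e in events]

# Personal analysis: quick, easily modified cuts on the flat columns.
selected = [row for row in ntuple if row["n_tracks"] >= 2 and row["max_pt"] > 10.0]
print(selected)
```

When an analysis needs a quantity that is not among the columns, the team has to return to the extraction program, which is the loop back to step 1.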
Personal data analysis
• PAW was the “killer application” for physics in the 90s:
   • Interactive, just as powerful workstations became available
   • Platform independent, in a very diverse workstation world
   • Graphical, just as X-windows gave graphics over the network
   • Simple to write analysis macros, just as the complexity of FORTRAN programming required in experiments decoupled most of the collaborators from the experiment’s code
• In summary, PAW was like going from DOS to Macintosh.
• One major limitation of PAW is the lack of variable-length structures or, more generally, data objects.
• ROOT overcomes these limitations while keeping a philosophy similar to PAW’s.
• Java Analysis Studio tries to go further with “agents”.
Personal data analysis
• Which will be the “killer application” for LHC analysis?
• Is a Mac Classic on AppleTalk enough, or do we need the conceptual leap equivalent of Web + Java-enabled browser?
• Will the personal n-tuple model work for the LHC?
• Do we need, and can we afford, to support our own interactive data analysis tool?
• Will one of the newer tools, such as Java Analysis Studio, go exponential in the open source world?
• Many questions, one simple answer: it will be young people like you who will make the next step happen.
Monte Carlo simulation
• Monte Carlo simulation uses random numbers (see mathematics textbooks).
• Try the following:
   • Find a source of random numbers in the interval [0,1] (calculator, Excel, etc.).
   • Take a function that you want to simulate (e.g. y = x^2) and normalize it to fit in the interval [0,1] for both x and y.
   • Find graph paper to histogram values of x.
   • Repeat this at least 20 times:
      • Throw two random numbers. Use the first as the value for x.
      • Evaluate the function y at x and compare its value to the second random number.
      • If the second random number is less than the function value, add a count to the histogram in the correct bin for x.
      • If the second random number is greater than the function value, forget it.
   • Compare your histogram to the shape of the function.
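
For readers who prefer code to graph paper, here is a minimal sketch of this hit-or-miss recipe for y = x^2 on [0,1]. A trial is accepted when the second random number falls below the function value, so that the accepted x values follow the shape of the function. The function name `hit_or_miss`, the bin count and the seed are arbitrary choices for the example.

```python
import random


def f(x):
    """Function to simulate, already normalized to [0,1] in both x and y."""
    return x * x


def hit_or_miss(n_trials, n_bins=10, seed=1):
    """Histogram of accepted x values using the hit-or-miss method."""
    rng = random.Random(seed)
    histogram = [0] * n_bins
    for _ in range(n_trials):
        x = rng.random()   # first random number: candidate x
        r = rng.random()   # second random number: compared with f(x)
        if r < f(x):       # accept: the point lies under the curve
            histogram[min(int(x * n_bins), n_bins - 1)] += 1
        # otherwise: forget this trial
    return histogram


N_TRIALS = 100_000
hist = hit_or_miss(N_TRIALS)
efficiency = sum(hist) / N_TRIALS
print(hist, efficiency)
```

With many trials the bin contents rise like x^2, and the fraction of accepted trials approaches the area under the curve, which is 1/3 for this function.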
Monte Carlo simulation
• If you don’t know how to program, you can pick up an Excel file from http://cern.ch/Manuel.Delfino/Brazil
• Here is the result for 100 trials: [Figure: histogram of accepted x values.]
• Note there are 30 entries, so the “efficiency” is 30%.
• Note the statistical fluctuations.
• Homework: How is the normalization done?
Statistics and error analysis
• Analysis involves selecting, counting and normalizing.
• Things are easier when you actually have a signal:
   • Understand the underlying statistics: Poisson, binomial, multinomial, etc.
   • If measuring a differential distribution, understand the relation between the normalization of binned counts and the total count.
   • Understand selection biases and their impact on observed distributions.
• Things are a lot harder when you place limits.
• Two observations:
   • If you cannot make an analytical estimate of the uncertainties, I won’t believe your result.
   • The expression “n-sigma effect” should be banned.
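
One place where binned counts and total counts interact is the uncertainty on a normalized bin fraction. The sketch below compares the multinomial (equivalently, per-bin binomial) error on n_i/N with a naive error that treats n_i as Poisson and the total N as an exactly known constant. The counts are invented for illustration.

```python
import math


def bin_fraction_errors(counts):
    """For each bin return (fraction, multinomial_error, naive_error).

    multinomial_error: sqrt(f * (1 - f) / N), accounts for n_i being part of N.
    naive_error:       sqrt(n_i) / N, ignores the correlation between n_i and N.
    """
    total = sum(counts)
    rows = []
    for n in counts:
        frac = n / total
        multinomial_error = math.sqrt(frac * (1.0 - frac) / total)
        naive_error = math.sqrt(n) / total
        rows.append((frac, multinomial_error, naive_error))
    return rows


# Hypothetical three-bin distribution with 100 selected events in total.
counts = [50, 30, 20]
rows = bin_fraction_errors(counts)
for frac, multi, naive in rows:
    print(f"fraction={frac:.2f}  multinomial={multi:.4f}  naive={naive:.4f}")
```

The naive error is always the larger of the two, and the difference is biggest for bins holding a large share of the events.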
Hypothesis testing
• You must understand Bayes’ theorem. And every time you think you understand it, you must make a big effort to understand it better!
• Compare differential distributions of data with predictions of “theory” or “model”:
   • Different theories
   • Different parameters for the same model
• Setting up the statistical test is often straightforward, which is why it is surprising that most people do it wrong.
• Taking account of resolution and systematic uncertainties is hard:
   • Make the simulation look like the data to get your answers
   • Even if the graphics look better the other way around!
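
As a reminder of what Bayes' theorem does in practice, the sketch below computes the posterior probability of two particle hypotheses from priors and likelihoods. The hypothesis names and all numbers are invented for the example; in a real analysis the likelihoods would come from measured or simulated detector response.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem: P(H | data) = P(data | H) * P(H) / sum over all hypotheses."""
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


# Hypothetical numbers: pions are nine times more abundant than kaons,
# but the observed detector response is four times more likely for a kaon.
priors = {"pion": 0.9, "kaon": 0.1}
likelihoods = {"pion": 0.2, "kaon": 0.8}

post = posterior(priors, likelihoods)
print(post)
```

Even though the measurement favours the kaon hypothesis, the pion remains the more probable identification here because of its much larger prior abundance.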
Simulation of particle production and interactions with the detector
• For particle production, combine Monte Carlo with:
   • Detailed particle properties
   • Detailed cross-sections predicted by theory or phenomenology
   • Computation of phase-space
• Output consists of event records containing simulated particles (often called 4-vectors by experimentalists).
• For simulating the detector, combine Monte Carlo with:
   • Detailed description of the detector
   • Detailed cross-sections for interaction with detector materials
   • Detailed phenomenology of the mechanism producing the signal
   • Transport (ray-tracing) algorithms including B fields
   • Digitization model mapping {x,y,z} to read-out channel
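
To show what an event record of "4-vectors" might contain, here is a toy generator for a two-body decay of a particle at rest, followed by an invariant-mass calculation. It assumes massless daughters, an isotropic decay and natural units (c = 1); the function names and the parent mass of 91.2 GeV are illustrative choices, and real generators also model cross-sections, spin and much more.

```python
import math
import random


def two_body_decay_at_rest(mass, rng):
    """Return two (E, px, py, pz) 4-vectors for massless, back-to-back daughters."""
    cos_theta = rng.uniform(-1.0, 1.0)        # isotropic: flat in cos(theta)
    phi = rng.uniform(0.0, 2.0 * math.pi)     # and flat in azimuth
    sin_theta = math.sqrt(1.0 - cos_theta ** 2)
    energy = mass / 2.0                       # each daughter takes half the mass
    px = energy * sin_theta * math.cos(phi)
    py = energy * sin_theta * math.sin(phi)
    pz = energy * cos_theta
    return (energy, px, py, pz), (energy, -px, -py, -pz)


def invariant_mass(*four_vectors):
    """m^2 = E^2 - |p|^2 for the summed 4-vector."""
    e = sum(v[0] for v in four_vectors)
    px = sum(v[1] for v in four_vectors)
    py = sum(v[2] for v in four_vectors)
    pz = sum(v[3] for v in four_vectors)
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


rng = random.Random(42)
PARENT_MASS = 91.2  # GeV, illustrative value
daughters = two_body_decay_at_rest(PARENT_MASS, rng)
reconstructed_mass = invariant_mass(*daughters)
print(daughters, reconstructed_mass)
```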
Simulation of particle production and interactions with the detector
Example: a small part of the design of GEANT4. [Figure: GEANT4 design diagram.]
Reference to Jackson’s textbook in the documentation!
Digital representations of event data
• In principle, representing event data digitally should be very simple, except:
   • everything comes in variable numbers: hits, tracks, clusters
   • ambiguities lead to multiple relations
   • particle identification may depend on the analysis hypothesis
   • etc.
• In simple terms, events don’t look like bank account data; they look like collections of objects.
• You can do a reasonable representation using relational tables, but actually using the data structures from Fortran programs is still cumbersome.
• Object-Oriented Programming is a better match, but C++ does not resolve all problems, so frameworks are needed.
Why physicists don’t (yet) use Excel and Oracle for their daily analysis
• Spreadsheets like Excel and relational databases like Oracle have a very “square” view of data. This is not a good match to the Data Worm.
• “Normal” people (banks and insurance companies) can define a priori the quantities that they will select on (the keys of the database). We usually derive selection criteria a posteriori using quantities calculated from the stored data.
• We like (need?) to express queries as individualistic, detailed, low-level computer codes. This is difficult to support in a database.
• But this is changing very rapidly due to data mining: businesses are interested in analyzing their raw data in unpredictable ways. Example: using cash register tickets to choose sale items.
• Support for this requires a more “organic” view of data, for example object-relational databases.
Why physicists don’t (yet) use Excel and Oracle for their daily analysis
[Figure, shown in two versions: relations between the objects of an event.]
• Tracker hit: Position, Response
• Calorimeter hit: Position, Response
• Track: Origin, Curvature, Extrapolation, Number of hits
• Cluster: Position, Width, Depth, Energy, Number of hits
• Particle hypothesis: Mass, Charge, Momentum, Origin
• Idealized: the objects are linked by one-to-many relations (a simple relation).
• Reality: the objects are linked by many-to-many relations (a complicated algorithmic relation).
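
The "reality" case can be sketched with a few objects: once a hit may belong to more than one track candidate, the relation no longer fits a single foreign-key column and has to be navigated in both directions. The classes and the sample data below are hypothetical and far simpler than any real event model.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TrackerHit:
    hit_id: int
    position: tuple  # (x, y, z)


@dataclass
class Track:
    track_id: int
    hits: list = field(default_factory=list)  # variable number of hits per track


# Hypothetical event: hit 2 is ambiguous and is used by both track candidates.
hits = [TrackerHit(i, (float(i), 0.0, 0.0)) for i in range(5)]
tracks = [
    Track(0, [hits[0], hits[1], hits[2]]),
    Track(1, [hits[2], hits[3], hits[4]]),
]


def tracks_per_hit(tracks):
    """Invert the track -> hits relation into hit_id -> [track_id, ...]."""
    index = defaultdict(list)
    for track in tracks:
        for hit in track.hits:
            index[hit.hit_id].append(track.track_id)
    return dict(index)


# Hits claimed by more than one track: the many-to-many part of the relation.
shared = {hit_id: ids for hit_id, ids in tracks_per_hit(tracks).items() if len(ids) > 1}
print(shared)
```

In relational tables the same information needs a separate link table between hits and tracks, and similar link tables at every level up to the particle hypothesis.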
The challenge of analysis for the LHC experiments
[Figure: selection factors. Online: 1:10^7. Analysis: 1:10^5. Overall: 1:10^12.]
The challenge of analysis for the LHC experiments
[Figure: data flow for one experiment. Elements shown: Detector; Event Filter (selection & reconstruction); Raw data; Event Reconstruction; Event Summary Data; Batch Physics Analysis; analysis objects; Event Simulation; thousands of scientists distributed around the planet. Quoted figures: 35K, 250K and 350K SI95 of CPU capacity; data rates of 0.1 to 1 GB/sec, ~100 MB/sec, ~200 MB/sec and 64 GB/sec; data volumes of 1 PB/year and 500 TB.]
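
A quick back-of-the-envelope check relates a sustained data rate to a yearly volume. The sketch assumes about 10^7 seconds of data taking per year, a figure that is not stated in the slides, and uses decimal units (1 PB = 10^15 bytes).

```python
# Assumption, not from the slides: effective data-taking time per year.
SECONDS_OF_DATA_TAKING_PER_YEAR = 1e7


def yearly_volume_pb(rate_mb_per_s, live_seconds=SECONDS_OF_DATA_TAKING_PER_YEAR):
    """Data volume in petabytes accumulated at a constant rate in MB/s."""
    total_bytes = rate_mb_per_s * 1e6 * live_seconds
    return total_bytes / 1e15


raw_pb = yearly_volume_pb(100.0)   # a rate of ~100 MB/sec
print(raw_pb)
```

Under that assumption, a sustained rate of about 100 MB/sec accumulates roughly 1 PB in a year, the same order as the 1 PB/year quoted in the figure.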
The challenge of computing for the LHC
[Figure: long-term tape storage estimates in terabytes (vertical axis 0 to 14,000), by year from 1995 to 2006, for current experiments, COMPASS and the LHC.]
• Accumulation: 10 PB/year
• Signal/Background up to 1:10^12
The challenge of computing for the LHC
[Figure: estimated CPU capacity required at CERN in K SI95 (vertical axis 0 to 5,000), by year from 1998 to 2010, with the LHC requirement compared to a Moore’s law curve. Jan 2000: 3.5K SI95.]
• Moore’s law: some measure of the capacity that technology advances provide for a constant number of processors or a constant investment.
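
For orientation, the sketch below extrapolates the January 2000 capacity at constant investment. It assumes a doubling time of 18 months, which is a common reading of Moore's law but is not stated in the slides, so the resulting numbers are only indicative.

```python
# Assumption, not from the slides: capacity per unit cost doubles every 18 months.
DOUBLING_TIME_YEARS = 1.5
CAPACITY_JAN_2000_KSI95 = 3.5   # from the figure: Jan 2000, 3.5K SI95


def capacity_at_constant_investment(year, start_year=2000.0,
                                    start_capacity=CAPACITY_JAN_2000_KSI95):
    """Capacity in K SI95 if the same investment is simply renewed over time."""
    return start_capacity * 2.0 ** ((year - start_year) / DOUBLING_TIME_YEARS)


capacity_2006 = capacity_at_constant_investment(2006.0)
print(capacity_2006)
```

Under this assumption, six years give four doublings, a factor of 16. That falls well short of the 35K + 250K + 350K SI95 quoted earlier for one experiment, and of the thousands of K SI95 on the axis of the figure above.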
The challenge of computing for the LHC
• Continued innovation
Solving the LHC Computing Challenge: Technology Development Domains
[Figure: three technology development domains (Application, Grid, Fabric), shown from the user view and from the developer view.]
Solving the LHC Computing Challenge
[Figure: computing fabric at CERN (2006). Components shown: a farm network of multi-Gigabit Ethernet switches; thousands of dual-CPU boxes; a storage network with thousands of disk units and hundreds of tape drives; real-time detector data; LAN-WAN routers; a grid interface. The links are labelled with data rates in Gbps.]
Solving the LHC Computing Challenge: Data-Intensive Grid Research
Grid Protocol Architecture (top to bottom):
• Application
• User: “Specialized services”: user- or application-specific distributed services
• Collective: “Managing multiple resources”: ubiquitous infrastructure services
• Resource: “Sharing single resources”: negotiating access, controlling use
• Connectivity: “Talking to things”: communication (Internet protocols) & security
• Fabric: “Controlling things locally”: access to, & control of, resources
Internet Protocol Architecture (shown alongside for comparison): Application, Transport, Internet, Link
Acknowledgements
• Many of the figures in this talk are from the Web sites of ATLAS, CMS, Aleph and Delphi.
• Thanks to Markus Elsing for Delphi displays of tracking and nuclear interaction.
• GEANT4 design diagram from the documentation.
• Thanks to Les Robertson for LHC Computing diagrams.
• Grid architecture diagram adapted from Ian Foster.