CDF Software: A “Complex” Case (Il software di CDF: un caso “complesso”)
stefano.belforte@ts.infn.it
Lesson Plan
• Hour 1: Introduction and overview; CDF and its software problem
• Hour 2: CDF data and analysis software tools
• Hour 3: Farms & Grids
• Hour 4: One physics analysis, top to bottom
Hour 1: Introducing the lessons and the CDF software problem
Your Teacher
• Researcher at INFN (first in Pisa, in Trieste for the past 5 years)
• In CDF (Collider Detector at Fermilab) since 1984
• Researcher in experimental high energy physics:
  • Silicon detectors, drift chambers
  • Ultrahigh vacuum, digital electronics, software on/off-line
  • Analysis (ppbar@1800GeV: σtot, σela, σdif, top, B mixing, B decays)
  • Software (Fortran IV, Fortran 90, C, C++, Python, sh)
• Computing: since 1998 responsible for the computing of CDF-Italy
  • Which kind of computers are needed? How many? When? Where? To do what?
• NOT A PROGRAMMING EXPERT
  • CODE SNIPPETS SHOWN HERE WILL NOT COMPILE/RUN
  • NO C++ QUESTIONS WILL BE ANSWERED !!!
Our Goal
• To learn something about
  • What is useful
  • What way of doing software optimizes the scientific output of your experiment
  • What features physicists need most
• Will do this via one specific example (CDF)
  • It is not the only example
  • Quite possibly it is not the best
  • It worked and still works, but nobody is happy with it: a very common situation
• You will have to match it to your specific needs: figure out where you can build upon our experience, where upon our mistakes, and where you have to ignore CDF and take a different road.
Overview
• Not a code/language lesson
• Not a programming course
• A survey of “machinery”
• A survey of usage of software in a complex system
  • O(1000) people
  • O(10) years
  • CDF = Colossal Data Factory
• An exercise in chaos management
• A critical (thanks to you) review of our experience
• A series of lessons/guidelines
• An exercise in common sense
Content
• What is CDF
• Online vs. Offline
• Data size and structure
• Code organization
  • Code examples
• Farms, software/hardware interaction
• Grid
• One analysis: from data to publication
  • Data flow and code examples
Collider Detector at Fermilab: A Data Factory
CDF: where
[Figure: Fermi National Accelerator Laboratory (Fermilab, FNAL); scale bar 1 km]
CDF: what
[Figure: CDF; scale bar 1 m]
CDF: why
• From this [figure] to this [figure]: SOFTWARE
• Physics first! Goal-oriented software
Data Factory vs. Goods Factory (Firm)
• Goods factory: products (cell phones, cars, washers…), time to market
• Data factory: products (papers, conference talks…), time to publication
• We have a clear goal: more physics, faster
• Goal-oriented software
  • Be pragmatic
  • Keep your eye on your goal
• Beware the quick & dirty fix
• Long-term vision is needed: confusion = failure
No failure allowed
• Most modern High Energy Physics experiments (and many INFN projects) are large:
  • Hundreds of people
  • Tens of years
  • Tens of KEuros
  • Highly visible to the scientific and political community
• Yes, they are experiments, and are researching the unknown, but…
• Failure is not an option, there must be knowledge output
  • Not finding SUSY? ~OK
  • Detector or software not working? NO
• Constant bottom line: Physics first!
Two flavors of software: on-line and off-line
Online vs. Offline
• Online: whatever it takes to take data
  • DAQ
  • Trigger
  • Control
  • Monitor
  • Calibrations
  • …
• Offline: whatever you do once you have the data
  • Reconstruction
  • Storage
  • Cataloguing
  • Handling (move data around, hand them to programs…)
  • Analysis
  • Simulation
  • Plot, fit
Online software: DAQ
• We will not talk about online software
• Note: DAQ = Data AcQuisition
• Detector = analog device: particles are detected as charge on capacitors or voltage pulses
• CDF = giant digital camera
• DAQ: analog is converted to digital and logged to tape
  • Snapshot = event
  • Only a change of representation
  • Information content is not changed
  • “All information is there” but needs to be “extracted”
• Offline data analysis always starts from a collection of pairs
  • (value, address): ~100K such pairs per event
Analysis = Data Reduction
• 10^6 electronic channels
• 10^5 data-address pairs for each event
• 10^2 reconstructed 4-vectors per event
• 10 high level “physics objects” per event
  • Jets, electrons, mu, D (B) candidates …
• 10^2 events per second
• 10 to 10^6 events in the final plot
• 10^1 fit parameters
• 10^2 fits (MC, systematics)
• 1 measurement (3 numbers: value ± stat ± syst)
From digitized data to 4-vectors
• P-Pbar collide, O(100) particles emerge
• From event snapshot to a list of 4-vectors
  • Momentum, Energy
Then to Physics Objects
• One top event at CDF
  • p pbar → t tbar
  • t → W b → W + b-jet
  • W → e nu (nu = MET)
  • t → W b → W + b-jet
  • W → q q → j j
• Electron
• Missing Energy
• 4 jets
• 2 b-jets
  • Long lived b quarks, displaced vertex
Then to Physics Measurements
• Combine objects in one event to compute the value of a kinematical quantity (e.g. the invariant mass of W+4jets) and plot its distribution for all events
• Use it to measure a physics observable (e.g. the mass of the top quark)
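The step from a list of 4-vectors to one kinematical quantity can be sketched in a few lines of C++. This is only an illustration, not CDF code: the `FourVector` struct and the `invariantMass` functions are hypothetical names, and energies and momenta are assumed to be in GeV with c = 1.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical minimal 4-vector (not a CDF class): energy and momentum in GeV, c = 1.
struct FourVector {
    double e, px, py, pz;
};

// Component-wise sum of two 4-vectors.
FourVector operator+(const FourVector& a, const FourVector& b) {
    return {a.e + b.e, a.px + b.px, a.py + b.py, a.pz + b.pz};
}

// Invariant mass m = sqrt(E^2 - |p|^2); a negative m^2 from rounding is clamped to 0.
double invariantMass(const FourVector& v) {
    const double m2 = v.e * v.e - (v.px * v.px + v.py * v.py + v.pz * v.pz);
    return m2 > 0.0 ? std::sqrt(m2) : 0.0;
}

// Invariant mass of a set of objects (e.g. a W candidate plus jets):
// sum the 4-vectors first, then take the mass of the sum.
double invariantMass(const std::vector<FourVector>& objects) {
    assert(!objects.empty());
    FourVector sum{0.0, 0.0, 0.0, 0.0};
    for (const FourVector& v : objects) sum = sum + v;
    return invariantMass(sum);
}
```

Calling the second function once per event and filling a histogram with the result gives the distribution that is later fitted.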
The Complexity of CDF Software
The numbers (rounded)
• 800 CDF collaborators
• 700 “that run jobs” (users that asked for a queue on the analysis farm)
• 10 years of code development
• 6 years of data taking
• 10 years of data analysis
• 1M lines of offline code
• 1TB data logged on tape every day
• 500TB data to analyze every year
• 10^9 events per year to analyze
• 1M files to handle every year
• 1000 GHz CPU power in the analysis farm
• 200 TB disk space for analysis
• 10 remote farms for analysis and MC
• Uncounted/uncountable local clusters (50 institutions)
• 100 “librarians”
• 760 “CVS authors” (people who could write in CVS at one time or another)
An exercise in chaos management
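A few averages follow directly from the rounded yearly figures above. The sketch below only does that arithmetic; the function and constant names are hypothetical, and decimal units are assumed (1 TB = 10^12 bytes).

```cpp
#include <cassert>

// Average of a yearly total over a yearly count; the count must be positive.
double perUnit(double total, double count) {
    assert(count > 0.0);
    return total / count;
}

// Rounded yearly figures from the list above (assumption: 1 TB = 10^12 bytes).
const double kBytesPerYear  = 500e12;  // 500 TB of data to analyze per year
const double kEventsPerYear = 1e9;     // 10^9 events per year
const double kFilesPerYear  = 1e6;     // 1M files per year

// Derived averages.
double bytesPerEvent() { return perUnit(kBytesPerYear, kEventsPerYear); }
double bytesPerFile()  { return perUnit(kBytesPerYear, kFilesPerYear); }
double eventsPerFile() { return perUnit(kEventsPerYear, kFilesPerYear); }
```

With these inputs the averages are about 500 kB per event, 500 MB per file, and 1000 events per file.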
Chaos Management
• Not possible to “outsource software to professionals”
  • We do not know the requirements until the analysis is done. It is a fact, not a fault. We do research!
• 4 kinds of programmers (B, C, D are physicists, often A too)
• A: Professionals
  • Write the core
  • Data format, I/O, framework, DB …
  • CDF: 3 to 4 people
• B: Semi-Pro
  • Write e.g. reconstruction code
  • CDF: ~20 people
• C: Experts
  • Understand code written by B
  • Augment that with small pieces
  • Teach D what to do
  • CDF: ~50 people
• D: Amateurs
  • Do not contribute code to CVS
  • Work by editing examples
  • Write simple analysis algorithms and fill ntuples or histograms
SW developers community
• A: define architecture
• B: define data, classes, complex algorithms
• C: hand B's work to the masses, add features
• D: work by “monkey see, monkey do”, often write in “Fortran with semicolons”
  • The perfect language for a procedural algorithm
• Of course boundaries are fuzzy (esp. C/D)
• We will only talk about D
  • It is 90% of users, anyhow
  • We look at what A and B do only for the effect of their actions on D, and will not look at their code
  • Understanding D's needs is vital for the work of A, B, C to be fruitful
Guidelines for software development: a collection of common sense
RULES OF THE GAME
• How to know “what to do”?
  • You will not find it in the book, the manual, the examples
  • We do research: every problem we tackle is tackled for the first time
• Learn from experience
  • If it is lacking, you have to gain it the hard way
[Figure: software development cycle, with steps labeled design, think, code, test]
The Open Source Community Rules
• Release early
• Release often
• Listen to users
  • User = you, double role: developer/user
• First thing: walk through the whole circle!
  • True for physics difficulties as well as for software ones
  • Touch all aspects
  • See where troubles/hurdles/bottlenecks lie
  • THEN optimize
How to build a complex system
• “A complex system that works is found to have evolved from a simple system that worked… A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system”
  • G. Booch, OO Analysis & Design, 2nd ed., p. 13
• Was written thinking about large (“industrial strength”) SW projects that outlive their creators and are so large that no one can understand them fully
• But it is also true for HW projects (farms etc.)
• And also for your code
  • Since it is O(1 man-year) of work, in theory it can be redone from scratch
  • But you will need to re-use/extend/improve it faster than that
Document
• 6 months after you wrote it, you will have forgotten it
• Think of the person who will have to pick up your code, update, maintain, expand, debug it!
  • That person may be you!!
• Your code will outlive you:
  • Never treat it like “it will be thrown away”
  • You have to trash it yourself
  • One day your main program will be turned into a function and those missing destructors will be sorely missed. With luck, the program will crash, elsewhere.
• The worst pieces of CDF software are always found in the code whose author has left
Hour 1 Summary
• Goal of these lectures
• What is CDF
• What are its goals
• What are the numbers of the CDF data
• What are the tasks of the CDF offline software
• How complex is the CDF software development environment
• What are the basic guidelines to write useful software in a complex environment
Hour 2: CDF software tools
CDF analysis code: how we tamed complexity
CDF
• How to make it “human” to use and maintain 1 million lines of code to handle 1000 million events
• Data Access: OO approach (this means more than C++)
  • abstraction
  • encapsulation
  • information hiding
• Data Structure (Event Data Model: EDM) + framework (encapsulation of data access)
• Files (data containers) management: Data Handling
A Framework provides utilities and services.
A framework for CDF
• The CDF framework's goal is to enable users to navigate the interesting objects in the event
• Access to all levels is allowed, from raw data to physics objects
• Most of the time the user will want to look at not-refined physics objects
• Useful to think in terms of a user wanting to access
  • muons
  • electrons
  • jets
  • tracks
• The framework is the tool to make that easy
CDF software structure
[Diagram labels: Framework; many independent “modules”; the data in a disk file; I/O module; the DATA (some transient representation); EDM; “My software goes in here”; “Your software goes in here”; “Her software goes in here”]
CDF SW Components
• Event Data Model: manages reading/writing of events from/to data files, and manages access to objects in an event
• Data Handling System: manages the file repository, delivers files
• Framework: directs execution/configuration of software modules
EDM
• All data passed from module to module is passed via an instance of the class EventRecord
  • No singletons or globals may be used to pass event data
• Objects derived from class StorableObject can be stored in the event record
• The event is an STL-based container of StorableObject pointers
• Storable objects must be allocated on the heap and referred to using instances of the Handle class
  • A handle is a smart pointer, used to avoid data copies
• Objects are assigned an object id by the event when stored, and then become read-only
• Users cannot delete objects in the event record; they can only classify an object as desirable or undesirable for output.
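The rules above (heap allocation, ids assigned on storage, read-only access through handles, no deletion) can be illustrated with a toy container. This is a sketch under stated assumptions, not the CDF EDM: every name prefixed with `Toy` is hypothetical, and `std::shared_ptr` to const stands in for the real Handle class.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for a storable base class (hypothetical, not the CDF StorableObject).
struct ToyStorable {
    virtual ~ToyStorable() = default;
};

// Example payload: a track with a transverse momentum in GeV.
struct ToyTrack : ToyStorable {
    explicit ToyTrack(double ptGeV) : pt(ptGeV) {}
    double pt;
};

// Read-only handle: a shared pointer to const, so copying it never copies the data.
template <typename T>
using ToyConstHandle = std::shared_ptr<const T>;

class ToyEventRecord {
public:
    using ObjectId = std::size_t;

    // Take ownership of a heap-allocated object; from now on it is read-only.
    ObjectId append(std::unique_ptr<ToyStorable> object) {
        assert(object != nullptr);
        entries_.push_back({std::shared_ptr<const ToyStorable>(std::move(object)), true});
        return entries_.size() - 1;
    }

    // Typed read-only access; returns an empty handle if the id or the type does not match.
    template <typename T>
    ToyConstHandle<T> find(ObjectId id) const {
        if (id >= entries_.size()) return nullptr;
        return std::dynamic_pointer_cast<const T>(entries_[id].object);
    }

    // Objects cannot be deleted, only flagged as wanted or unwanted for output.
    void setDesiredForOutput(ObjectId id, bool desired) {
        assert(id < entries_.size());
        entries_[id].writeOut = desired;
    }

    // Number of objects currently flagged for output.
    std::size_t countForOutput() const {
        std::size_t n = 0;
        for (const Entry& e : entries_) if (e.writeOut) ++n;
        return n;
    }

private:
    struct Entry {
        std::shared_ptr<const ToyStorable> object;
        bool writeOut;
    };
    std::vector<Entry> entries_;
};
```

A module would call `append` once to publish its result and later modules would call `find<ToyTrack>(id)`; since the handle points to const data, no module can modify what another module stored.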
The CDF framework: AC++
• Common to CDF and BaBar
• From the user's point of view, provides an easy way to ask “the computer” to:
  • Read all events in a collection of files (dataset)
  • On each event execute the following “standard” actions, using this set of parameters
    • select events based on some criteria (trigger type, result of following actions)
    • fit tracks
    • cluster energy in jets
    • …
  • On each event then run my analysis code (module)
    • AC++ gives easy/safe access to event data
  • Write a subset of selected events to file(s)
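The idea of “run this list of modules on every event, and let a module veto the event” can be sketched as follows. This is a toy model, not AC++: `ToyEvent`, `ToyModule`, `ToyPath` and `makePtFilter` are hypothetical names, and an event is reduced to a list of track transverse momenta.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy event: just the transverse momenta (GeV) of its tracks. Hypothetical, not a CDF type.
struct ToyEvent {
    std::vector<double> trackPt;
};

// A module is a named action on one event; it returns false to veto the event.
struct ToyModule {
    std::string name;
    std::function<bool(const ToyEvent&)> processEvent;
};

// A path is an ordered list of modules, run on every event.
class ToyPath {
public:
    void add(ToyModule module) {
        // A module must provide its per-event action.
        assert(static_cast<bool>(module.processEvent));
        modules_.push_back(std::move(module));
    }

    // Run the modules in order on each event, stopping at the first veto.
    // Returns the number of events that passed every module.
    std::size_t run(const std::vector<ToyEvent>& events) const {
        std::size_t passed = 0;
        for (const ToyEvent& event : events) {
            bool keep = true;
            for (const ToyModule& module : modules_) {
                if (!module.processEvent(event)) { keep = false; break; }
            }
            if (keep) ++passed;
        }
        return passed;
    }

private:
    std::vector<ToyModule> modules_;
};

// Example filter module: keep events with at least one track at or above ptCut.
ToyModule makePtFilter(double ptCut) {
    return {"PtFilter", [ptCut](const ToyEvent& event) -> bool {
        for (double pt : event.trackPt) if (pt >= ptCut) return true;
        return false;
    }};
}
```

In this picture the user only writes the body of one module; the event loop, the ordering of modules and the veto logic belong to the framework.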
In practice: ./myExe my.tcl

    ## my.tcl ###
    # Specify the histogram output file.
    talk HepRootManager
      histfile set trackplots.root
    exit
    # Enable a path, and verify it.
    module enable MyModule
    path enable AllPath
    show path
    # Specify an input file name
    module talk DHInput
      include file $env(SRT_DIST)/valfiles/data/jpsi_skim.dst
      include dataset xpmm0d
    exit
    # talk to the module, set a minimum pT for filtering, and turn filtering on.
    talk ExampleTrackAnalysis
      ptCut set 5.0
      show
    exit
    filter ExampleTrackAnalysis on
    # Optional -- if you wish, create an output file.
    # (If filtering is enabled, then the output will satisfy the filter.)
    talk FileOutput
      output create mystream track_skim.dst
      output path mystream AllPath
      show
    exit
    # Process the file, then exit the program
    begin -nev 10
    continue
    exit

Slide annotations, in script order:
• Define global
• Select which of the modules you linked will run
• Select input data
• Set parameters inside the code
• Use a module to decide whether or not to run other modules in the path (e.g. the output module)
• Write output data
• Control execution
How to include my module in AC++
There is a “standard” C++ file to edit:

    AppUserBuild::AppUserBuild( AppFramework* theFramework )
      : AppBuild( theFramework )
    {
      addCDFrequiredModules( this );

      // This is needed because the ExampleTrackAnalysis module wants to
      // look at the output of ProductionExe
      addAllStorableObjects( );

      AppModule* aModule;

      // This is a utility module whose only purpose is to select a histogram
      // manager type:
      // aModule = new HepHbookManager( );
      // comment out the next line and uncomment the above line if you want
      // HBOOK instead.
      aModule = new HepRootManager( );
      add( aModule );

      aModule = new MyModule( );
      add( aModule );

      aModule = new ExampleTrackAnalysis( );
      add( aModule );

      // Add any other modules you want to link here...
    }

Slide annotations:
• “Standard” crap: don’t touch it
• You have to change this
MyModule.cc (part 1)

    //--------------------------------------------------------------------------
    // Class MyModule. This is a simple example of a user module. It
    // books a histogram, fills it.
    //--------------------------------------------------------------------------

    //-----------------------
    // This Class's Header --
    //-----------------------
    #include "ExampleMyModule/MyModule.hh"

    //-------------------------------
    // Collaborating Class Headers --
    //-------------------------------
    #include "HepTuple/HepHist1D.h"
    #include "Edm/EventRecord.hh"
    #include "Edm/ConstHandle.hh"
    #include "MuonObjects/CdfMuonView.hh"

    //----------------
    // Constructors --
    //----------------
    MyModule::MyModule( const char* const theName,
                        const char* const theDescription )
      : HepHistModule( theName, theDescription )
    {
    }

    //--------------
    // Destructor --
    //--------------
    MyModule::~MyModule( )
    {
    }

    //--------------
    // Operations --
    //--------------
    AppResult MyModule::beginJob( AbsEvent* aJob )
    {
      HepFileManager* manager = fileManager( );
      assert( 0 != manager );
      _MassHisto = &manager->hist1D( "Pair Mass", 50, 2.6, 3.6, 101 );
      assert( 0 != _MassHisto );
      return AppResult::OK;
    }

Slide annotations:
• Many more histograms …
• You want to look at reconstructed muons in your module
MyModule.cc (part 2)

    AppResult MyModule::event( AbsEvent* anEvent )
    {
      // Get the muons out of the event record
      CdfMuonView_h muonView_hndl;
      bool muonStatus = CdfMuonView::allMuons(muonView_hndl);
      if (muonStatus == true) {
        // Do double loop over pairs of muons.
        for (CdfMuonView::const_iterator muIter1 = muonView_hndl->contents().begin();
             muIter1 != muonView_hndl->contents().end(); muIter1++) {
          const CdfMuon& muon1 = **muIter1;
          // Kinematic and calorimeter cuts on the first muon
          if (muon1.fourMomentum().perp() > 1.5 &&
              fabs(muon1.bestTrack()->pseudoRapidity()) < 1.1 &&
              muon1.emEnergy() < 2.0) {
            // Require good track stub matching
            if ((muon1.cmu().drphi() < 15.0) || (muon1.cmp().drphi() < 15.0)) {
              // Loop for second muon
              for (CdfMuonView::const_iterator muIter2 = muIter1+1;
                   muIter2 != muonView_hndl->contents().end(); muIter2++) {
                const CdfMuon& muon2 = **muIter2;
                // Same cuts on the second muon
                if (muon2.fourMomentum().perp() > 1.5 &&
                    fabs(muon2.bestTrack()->pseudoRapidity()) < 1.1 &&
                    muon2.emEnergy() < 2.0) {
                  if ((muon2.cmu().drphi() < 15.0) || (muon2.cmp().drphi() < 15.0)) {
                    // Cut on dimuon charge
                    if (muon1.bestTrack()->charge() * muon2.bestTrack()->charge() == -1) {
                      double mass = (muon1.fourMomentum() + muon2.fourMomentum()).m();
                      _MassHisto->accumulate( mass );
                    }
                  }
                }
              }
            }
          }
        }
      }
      return AppResult::OK;
    }

Slide annotations:
• “event” entry point, here goes the custom stuff
• AC++ gets you the muons
• This is the code you really have to write
Another example: tracks

    AppResult ExampleTrackAnalysis::event( AbsEvent* anEvent )
    {
      // By default, this event does not pass the filter.
      bool filter_pass = false;

      // Access the "default" set of tracks in the event
      CdfTrackView_h hView;   // This is the handle for the "view."
      if (CdfTrackView::defTracks(hView) == CdfTrackView::OK) {
        // The view is now filled with the default tracks, so extract contents.
        const CdfTrackView::CollType & tracks = hView->contents();

        // Now loop over the tracks, doing a double-dereference to get at each.
        for (CdfTrackView::const_iterator it = tracks.begin();
             it != tracks.end(); ++it) {
          const CdfTrack & trk = **it;

          // Extract the helix parameters (or equivalent) for each track
          float pt   = trk.pt();
          float phi0 = trk.phi0();
          float d0   = trk.d0();
          float z0   = trk.z0();
          float eta  = trk.pseudoRapidity();
          int   alg  = trk.algorithm().value();

          // Fill ntuple.
          _ntuple->capture("run", AbsEnv::instance()->runNumber());
          _ntuple->capture("event", AbsEnv::instance()->trigNumber());
          _ntuple->capture("trknum", (int)trk.id().value());
          _ntuple->capture("pT", pt);
          _ntuple->capture("phi0", phi0);
          _ntuple->capture("d0", d0);
          _ntuple->capture("z0", z0);
          _ntuple->capture("eta", eta);
          _ntuple->capture("alg", alg);
          _ntuple->storeCapturedData();
          _ntuple->clearData();

          // Fill histograms.
          _ptHisto->accumulate(pt);
          _phi0Histo->accumulate(phi0*180.0/M_PI);
          _d0Histo->accumulate(d0);
          _z0Histo->accumulate(z0);
          _etaHisto->accumulate(eta);
          _algHisto->accumulate(alg);
          // Note: paralg is not defined anywhere in this excerpt.
          _paralgHisto->accumulate(paralg);

          // Make pT cut for filter.
          filter_pass |= (pt >= _ptCut.value());
        }
      }
      this->setPassed(filter_pass);
      return AppResult::OK;
    }

Slide annotations:
• Framework stuff
• Standard stuff
• Your code
More “hiding”
• Use shared libs
  • The user never recompiles, nor relinks, the framework
  • The user's module is loaded as a shared lib
  • Only the user's code needs to be compiled each time
• If you want a “static” executable (e.g. to run “anywhere”)
  • export USESHLIBS=0

This is the GNUmakefile for compile and link. You will never need to understand it: just replace ExampleMyModule with your module's name (or keep the name and change the code inside, as 80% of your colleagues do), then type “make”.

    # example Makefile for CDF packages
    # include file products
    INC =
    # library product
    LIB = libExampleMyModule.a
    # library contents
    LIBCCFILES = $(filter-out $(skip_files), $(wildcard *.cc))
    LIBFFILES = $(wildcard *.F)
    LIBCFILES = $(wildcard *.c)
    override LINK_ExampleMyModule += ExampleMyModule
    override LINK_FrameMods += ExampleMyModule
    override LINK_TrackingMods += ExampleMyModule
    override LINK_FrameMods_root += ExampleMyModule
    override LINK_FrameMods_dump += ExampleMyModule
    LOADLIBES += -lExampleMyModule
    -include PackageList/link_all.mk
    # binary products
    BINS = ExampleMyModule_test
    ##############################
    include SoftRelTools/arch_spec_STL.mk
    include SoftRelTools/standard.mk
Benefit of shared libs
• Need a fast code/test cycle
• Symbolic debugging is vital, but once the bug is found and fixed, you have to re-run to find the next one
• Much, much, much more time goes into debugging and testing than into writing the code the first time (and often even much more than into eventually running on data)
• Never undervalue the debugging step
• Design to make debugging easy: it will pay!
More hiding: the Dataset
• More hiding and encapsulation
• 500TB/year = 10^6 files/year
  • Cannot run on a “list of files”
  • Which is on disk? Which is on tape? Which disk? Which tape?
• Metadata: Data File Catalog
• Friendly interface in the AC++ Framework

    module talk DHInput
      include dataset xpmm0d
    exit

• Data files already on disk are processed before those that are still on tape. Reuse the disk cache! Users do not care about processing events in order
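The “disk first, tape later” delivery policy can be sketched as a simple reordering of catalog entries. This is an illustration only: `CatalogEntry`, `Location` and `deliveryOrder` are hypothetical names, not the CDF Data File Catalog schema, and a real system would also have to stage the tape files.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Where a file of the dataset currently lives (hypothetical catalog field).
enum class Location { Disk, Tape };

// One entry of a toy file catalog.
struct CatalogEntry {
    std::string fileName;
    Location location;
};

// True if the file can be read right away from the disk cache.
bool isOnDisk(const CatalogEntry& entry) {
    return entry.location == Location::Disk;
}

// Return the delivery order: files already on disk first, files on tape last.
// std::stable_partition keeps the catalog order within each group.
std::vector<CatalogEntry> deliveryOrder(std::vector<CatalogEntry> files) {
    std::stable_partition(files.begin(), files.end(), isOnDisk);
    // Postcondition: no disk-resident file comes after a tape-resident one.
    assert(std::is_partitioned(files.begin(), files.end(), isOnDisk));
    return files;
}
```

Because the user names a dataset rather than a list of files, the framework is free to pick this order; the analysis result does not depend on it as long as events can be processed in any order.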
Datasets on the web
Your friend the Dataset
• A Dataset is much more than a collection of files
• By being a common name used by all, it has:
  • History: trigger, selection
  • Properties: number of files, luminosity
  • Value added by other analyses: trigger efficiency, purity, background
  • An easy way to move it around and to know if it is available in a certain place
• The Dataset concept abstracts the common features of file collections, encapsulates them in a name, and hides the list of files
  • Abstraction
  • Encapsulation
  • Information Hiding
Dataset outside Fermilab
• You can imagine (and we have) a tool that imports a dataset from Fermilab to your home computers given simply its name, handling network errors etc.