170 likes | 188 Views
Explore the development and coordination of Physics Analysis Tools for efficient data access, schema evolution, and collaborative event management in ATLAS. Learn about T/P separation, schema evolution, event TAG databases, and more. Join the next workshop in Bergen, Norway!
E N D
Physics Analysis Tools Kétévi A. Assamagan BNL
Introduction • Discussion list: • hn-atlas-patdevelopment@cern.ch • hn-atlas-pathelp@cern.ch • hn-atlas-eventview@cern.ch • hn-atlas-edm@cern.ch • Web page and documentation: • https://uimon.cern.ch/twiki/bin/view/Atlas/PhysicsAnalysisTools • Coordination: Kétévi A. Assamagan • Analysis Model: Amir Farbin • Connection to Distributed Analysis: Tadashi Maeno • Representation on the Event Management Board (EMB): Kyle Cranmer • Monte Carlo Truth: Sebastien Binet • Report to Physics Coordinator and SPMB • Meetings • Working group meetings a every 2 weeks • Working meetings at software weeks • Physics Analysis Tools workshops – see the workshop summary reports on the PAT page • PAT workshop at UCL – 2004 First workshop where the AOD was defined • PAT workshop at Tucson, AZ – 2005 • PAT workshop in Tokyo, Japan – 2006 • NEXT PAT WORKSHOP, Bergen, Norway, April 23-27, 2007
PAT Mandate • Objectives: • Propose a unified, baseline framework for Analysis – extend the RTF work to the analysis domain • Common classes for analysis • Tools to build these objects • General/common tools for analysis • Navigations and Associations • TAG based event selection • Interactive Athena tools, Event Display & Visualization • Analysis examples • Common look and feel to the user • Analysis Model • Interface to distributed analysis efforts • Interactions with combined performance groups & physics group • Interaction with the user • User support
Event Data ESD/AOD – the model • The ESD is just a fatter a AOD with more detailed event data • Data can be moved easily from ESD to AOD or removed from AOD • Common tools – User codes run transparently on ESD or ADO • Partial reading for faster data access Information outsourcing DepositInCalo (CaloEnergy) Analysis::Muon CaloCluster Rec::SegmentParticle/ Muon::MuonSegment Rec::TrackParticle AOD Muon::MuonSegment Trk::Track CaloCell Trk::PrepRawData ESD
Faster Data Access egPID IParticle • Information outsourcing … ElementLink<TrackParticle> ElementLink<CaloCluster> egamma egDetail double parameter(string) = 0; double parameter(int) = 0; ElementLinkVector Electron Photon EMShower EMTrackMatch EMBremFit And so on (split store) Still in AOD but sourced from Electron or Photon read on demand
Faster Data Access • Transient/Persistent (T/P) separation • You may have heard things like pAOD for persistent AOD User Analysis Transient Data AOD Persistent Data pAOD Data Conversion Transient Data used during analysis – The transient version is equipped with all the tools/methods needed to analysis the persistent data Optimized for storage space and faster data access during read back in analysis- Persistent data not useful for analysis until converted to its transient version When you save the AOD When you read the AOD The T/P separation has led to significant gain the data storage space and also a significant gain the data access speed during analysis
Schema Evolution • You want to do analysis, combining data taken over the last 6 month-period • How to be able to still read the old data even though, e.g., the Electron class had substantially changed in the meantime? • We are talking about schema evolution for changes that ROOT does not automatically support pAOD Conversion of old data into current transient Electron User Analysis always sees the transient Electron Electron_p1 6 months ago Electron 4 months ago Electron_p2 Last good run Electron_p3 Transient Electron always evolving as needed today Electron_pN Current data stored as the latest xxxx_pN User easily reads old data/new data, or a combination of both in the same analysis But can be painful for developer: Electron_pN and its converters must be implemented to follow the evolution of transient Electron
Event TAG & TAG Databases • ATLAS TAG database is intended to be an event-level metadata system • TAG is intended to support efficient identification and selection of events of interest for a given analysis • Expectation is that, in practice, first-level cuts will be possible (and could greatly reduce candidate sample sizes) via queries that are supportable in relational databases • Nominal space budget in Computing TDR is 1 Kb/event • TAG are produced from AOD, though TAG databases contain or are linked to sufficient navigational information to allow retrieval of event data at all production processing stages, i.e., AOD, ESD or RDO
Event TAG Production • Current Tier 0 model is that TAG will be written into local ROOT files as AOD are written (or as AOD files are merged), and later loaded into a master database at CERN • TAG databases are exported to all Tier 1 centers, and eventually to all Tier 2s • TAG and TAG infrastructure are strongly related to the streaming model: when events are written exactly once, for example, the several groups that may have an interest in the event either add a reference to the event to their group-specific event lists, along with associated metadata (TAG), or flag the event as one of theirs in a master event TAG • Note: Model allows for the possibility that different physics working groups may define their own custom TAG in addition to an experiment-wide common TAG • The Athena-Aware NTuple addresses this use-case • The content of the TAG. for steady state operation, was defined by the TAG working group – for details, see this wiki: https://twiki.cern.ch/twiki/bin/view/Atlas/TagForEventSelection
TAG Based Event Selection - example • To count the number of events in a given category: For example, in some trigger studies, the SQL query: select count(*) from rome_003038_merge_J5_Pt_280_560_AOD_tags where (PJetPt1 > 400000 and PJetPt2 > 350000); gives the number of events in the J5 sample that overlap both the 400 GeV single jet and 350 GeV dijet thresholds. This takes about a second and a half to run directly with the database • To select events for Athena analysis: For example, adding the following to the JobOptions file EventSelector.Connection = 'mysql://reader@lxfs6021.cern.ch/CollectionsRome' EventSelector.CollectionType = 'ExplicitMySQLlt' EventSelector.Query = "(EventNumber BETWEEN 21801 AND 21850) AND (EventNumber<>21814) AND (PJetPt1>60000)“ selects only events that have a 60 GeV jet in them, in a particular event range. The EventNumber <> 21814 bypasses a "killer event" that crashes the code, and this provides a simple way to bypass it. • Other queries • "highest ET jet in the sample", • "most missing ET in the sample", • "poorest balance between events with exactly two jets" (or the five or ten events with the worst balance) and so on. • The time it takes to run these queries: • Less than the Athena initialization times. • The time to run over the sample is of course much faster when one is picking just a few events out of 10000 than when one reads all 10000. T. LeCompte
Athena-Aware NTuple (AAN) • Athena-Aware in the sense that you can use this NTuple as an input to Athena jobs (note that you cannot do this with an ordinary NTuple) • To pre-select events • To navigate back to the AOD and ESD to access the full event • To pass the selected events to the Event Display for visual inspection • The AAN acts as the user’s customized NTuple and also as the user’s own TAG for event selection • Users not yet taking the advantage of the Athena-Awareness of the AAN • The AAN can be created directly from the AOD, the ESD, or in the EventView Analysis
AAN - Pre-selection • Use the AAN as input to Athena job and specify your selection query – exactly as in the case of the TAG: EventSelector.InputCollections=[‘AAN.root’] EventSelector.CollectionType=“ExplicitROOT” EventSelector.Query=EF_e25i > 0 && NElectrons > 0” • Then you can do the following: • Write out the AOD/ESD/RDO of the selected events for further analysis • Run some reconstruction or analysis code on the selected events • Pass the selected events to the Event Display • Same use-case as for the TAG but query made on user-defined variables
EventView • Very popular analysis framework nowadays – Kyle Cranmer, Amir Farbin & Akira Shibata • Overlaps: for example, a given TrackParticle may have been used in: • The electon/photon reconstruction, cluster-Track association • Combined muon reconstruction, muon spectrometer track – Inner Detector Track association/combination • Hadronic tauJet reconstruction: one / 3-prong • Some overlaps removed during AOD making but not all • EventView • Overlap removal – analysis dependent / configurable • UserData, book-keeping of analysis flow/results and more • TopView, SUSYView, HighPtView • Making customized AAN for large groups such physics groups • For details, look at this wiki: https://twiki.cern.ch/twiki/bin/view/Atlas/EventView
Structured Athena-Aware NTuple (SAN) • I do not want to open my NTuple, or any common NTuple to see 700 variables jumping in to my face • I do not want to have to go to some other documentation to understand the meanings of the Tuple variables • If I have “ElectronPt” in my Tuple and someone else has “PtElectron”, I would like to exchange pieces of codes without headaches • We may not use the same overlap removal tools nor the same selections, but I should be able to read and understand your codes, or even use pieces of them • If I develop some tools in ROOT, tools that could of a general interest to a great many people, I would like be able to port that back to Athena with minimal modifications, without triggering a re-validation • I would like to maintain same “look and feel” between Athena based and ROOT based analysis: • electroneta() is electron->eta() whether I do analysis in ROOT or in Athena • I do not want to replicate Athena in ROOT • If my analysis on the Tuple requires b-tagging, vertexing for example, I would do that in Athena, not in ROOT • I want all the above still with the ROOT data access speed of the Tuple The SAN is a generalization of the AAN where some of the TBranches are flat as in in the AAN and some other TBranches have structures as in the AOD. The central Production will produce SAN for you in 12.0.6.x Documentation:https://twiki.cern.ch/twiki/bin/view/Atlas/SAN
ROOT Accessibility of the Analysis EDM • If you open the AOD file in ROOT, what would see is the pAOD since the pAOD is the persistified version of the AOD and there are no automatic converters in the ROOT to transform the pAOD into AOD, as we do in Athena. • Today, the pAOD in ROOT is useless for analysis in ROOT. It does not have the methods. As we said before, pAOD is just the raw data with storage optimization for fast read back • However, the SAN gives you the accessibility of the AOD in ROOT with all the AOD methods. Please check it out for yourself. In fact, try it and give feedback: https://twiki.cern.ch/twiki/bin/view/Atlas/SAN • The AOD format task force recommended that we do not want to have pAOD and SAN as separate data objects: they should be merged; that is, the SAN becomes the pAOD and thus provides the ROOT access to the AOD. In the meantime, the user should start using the SAN – it is like using the AOD directly in ROOT. It is available in release 12.0.6, complete with Trigger stuff. SAN can be produced in the EventView as well • The merged SAN/pAOD will also served as our Derived Physics Data (DPD)
PAT Workshop in Norway • The next PAT work is at the University of Bergen, Norway, April 23-27. Registration is open. Go to the PAT wiki page:https://twiki.cern.ch/twiki/bin/view/Atlas/PhysicsAnalysisTools • The following topic will be on the agenda at the Bergen PAT workshop: • How to pass metadata (luminosity block, run conditions, etc) to the user’s NTuple for calculation of cross-sections – this is not an event related information • ROOT accessibility of the Analysis EDM: a prototype where the SAN is the pAOD • The Trigger content of the TAG, the data quality information in the TAG • Addition of CaloCells and PrepRawData to the AOD impact of the AOD size/access time and on analysis tools • Have we finally converged to a stable analysis EDM that will not be modified and re-modified again? • Jet/ParticleJet integration, T/P separation: this is still lagging behind the rest • Redesign of the TrackParticle for needs of Tracking and Particle aspects of the Track-Particle: implications for Analysis Tools. Do we need a muon version of the TrackParticle? • Distributed analysis – common user interface Ganga/Panda. TAG based event selection in distributed analysis
Conclusions • A lot of progress in PAT but a number of key issues remain to be addressed: • ROOT accessibility of the analysis EDM • Metadata in the user’s NTuple or available to the user somehow without every user having to interrogate the conditions databases • AOD contents (with cells and hits) versus AOD size/speed optimization • Trigger and data quality content of the TAG • Distributed analysis tools