
OEP infrastructure issues

OEP infrastructure issues. Gregory Dubois-Felsmann Trigger & Online Workshop Caltech 2 December 2004. Obligatory caveat. I’m available for advice and to provide continuity, but… I won’t be able to undertake any more non-trivial, non-emergency OEP development. What OEP is.


Presentation Transcript


  1. OEP infrastructure issues
     Gregory Dubois-Felsmann
     Trigger & Online Workshop
     Caltech
     2 December 2004

  2. Obligatory caveat
     • I’m available for advice and to provide continuity, but…
     • I won’t be able to undertake any more non-trivial, non-emergency OEP development

  3. What OEP is
     • A conceptual unit of the online system
       • The framework for processing all complete-event data in the online system
     • An implementation; a set of code that:
       • Defines and navigates the raw event structure (both the data from ODF and the persistence of data from Level 3)
       • Makes this data available to applications in the standard BaBar Framework
         • Level 3
         • Fast Monitoring
         • Other monitoring applications: event displays, beam spot monitoring, etc.
       • Provides distributed histogramming services
       • Controls the lifetimes of all the processes that do this work
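As an illustration of what "distributed histogramming" involves at its simplest, the sketch below sums histograms filled on separate nodes into one aggregate. It is not the DHP implementation: the `Histogram` struct and `mergeHistograms` function are hypothetical names, and the code assumes every node fills histograms with identical binning.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-binning histogram; not an actual DHP class.
struct Histogram {
    std::vector<long> bins;
};

// Sum per-node histograms bin by bin into one aggregate.
// Assumption: all inputs share the same binning.
Histogram mergeHistograms(const std::vector<Histogram>& perNode) {
    Histogram total;
    if (perNode.empty()) return total;
    total.bins.assign(perNode.front().bins.size(), 0);
    for (const Histogram& h : perNode) {
        assert(h.bins.size() == total.bins.size());
        for (std::size_t i = 0; i < h.bins.size(); ++i) {
            total.bins[i] += h.bins[i];
        }
    }
    return total;
}
```

Calling `mergeHistograms` on the histograms collected from each farm node would give the farm-wide distribution that a monitoring client displays.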

  4. Current status
     (Up-to-date performance metrics are not available because we, unexpectedly, are not running.)
     • The conceptual design and the event data format have turned out to work well, and I don’t see a need to revise them
     • The performance of the system as implemented has been very satisfactory for several years
       • On the old Solaris farm, we were CPU-constrained, but that time was dominated by the performance of the Level 3 algorithms themselves
       • On the Linux farm we have had lots of headroom even at 1 L3/node…
       • Until quite recently: Rainer reports that since we started running Fast Monitoring on Linux (i.e., faster) and running the second monitoring farm instance for beam spot measurement, the trickle stream service has become CPU-intensive

  5. There have been some upgrades
     • Several iterations of improvement in process lifetime control tools (OepDaemon/OepManager – many thanks to Jim H.)…
       • … which enabled running more Fast Monitoring processes and additional sets of them
     • Rewrite from scratch of low-level DHP infrastructure, much fine-tuning
     • Improvements in logging performance (see Jim’s talk)

  6. There is more that can be done
     • Framework overhead, and interface-to-Framework overhead
       • This was found typically to be about 25% in the old Solaris days
       • Can address several things:
         • Framework overhead – Level 3 runs a large number of modules, so this can add up
           • There may be some effort invested in this motivated by speeding up the physics executables, which have enormous numbers of modules
         • Interface-to-Framework overhead – there’s some unnecessary copying of data that could be eliminated by trickier coding – probably trivial benefit
         • Event navigation overhead – probably a 10% speedup in Level 3 from the long-planned “fast module scanning” project
       • This is a fairly straightforward non-multi-threaded programming problem and doesn’t need anything other than a good C++ programmer
     • One related project, for the record
       • Making input modules work for non-event data

  7. Still more that can be done
     • CPU utilization
       • We have two CPUs on each farm node
       • The load from (ODF event level + OEP framework + Level 3 code) is concentrated in a single thread that runs the Level 3 algorithms
       • Could run two parallel streams of Level 3 processing
         • Requires a (much) more sophisticated version of the interface-to-Framework OEP code
         • This was in the original design but was sacrificed to 1999-era schedule triage; the need hasn’t been acute enough since then (it only became relevant after the Linux upgrade)
         • This is a straightforward design but needs to be implemented by someone with a good understanding of multiprocessing
         • There are some technical questions about DHP and logging, basically: are the multiple L3 instances to be treated as independent sources, or will they be re-aggregated per node?
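To make the "two parallel streams" idea concrete, here is a minimal sketch that feeds one event sequence to several worker threads. It is an illustration under stated assumptions, not the OEP design: `Event` and `runParallelStreams` are hypothetical names, the streams are modelled as threads in one process (the slide leaves open whether they would be threads or separate processes), and the `process` callback is assumed to be thread-safe.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a raw event; not the OEP event format.
struct Event { int id; };

// Run `process` over all events using `nStreams` worker threads.
// Each worker claims the next unprocessed event through an atomic counter,
// so every event is handled exactly once. Returns events handled per stream.
std::vector<std::size_t> runParallelStreams(
    const std::vector<Event>& events,
    std::size_t nStreams,
    const std::function<void(const Event&)>& process) {
    std::atomic<std::size_t> next{0};
    std::vector<std::size_t> handled(nStreams, 0);
    std::vector<std::thread> workers;
    for (std::size_t s = 0; s < nStreams; ++s) {
        workers.emplace_back([&, s] {
            for (;;) {
                std::size_t i = next.fetch_add(1);
                if (i >= events.size()) break;
                process(events[i]);
                ++handled[s];  // each stream writes only its own slot
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return handled;
}
```

The per-stream counts returned here mirror the open question on the slide: they can be reported as independent sources, or summed to present one aggregate per node.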

  8. Yet more that can be done
     • Trickle stream
       • The Fast Monitoring architecture depends on transferring events over the network from the Level 3 processes, on a sampling basis, to other machines running the monitoring code
       • Apparently the server side of the existing system is expensive
         • The long-pending “advanced trickle stream” is being commissioned now. It shares no code with the old protocol, so we’ll have to re-measure this
         • It doesn’t seem likely to be an intrinsic problem – we receive a higher volume of data on the network from the event builder, very inexpensively
       • The more sophisticated event distribution system mentioned above would be able to take this load out of the Level 3 process
       • But one could consider a model in which (some) Fast Monitoring code runs on the same machines that run Level 3
         • There are concerns about further eroding the “deadtime firewall”
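The slide says events reach Fast Monitoring "on a sampling basis" but does not say how they are chosen. One simple possibility, shown purely as an assumption, is a fixed prescale that forwards one event out of every N. `PrescaleSampler` is a hypothetical name and is not taken from either trickle stream protocol.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical prescale sampler: selects one event out of every `prescale`.
class PrescaleSampler {
public:
    explicit PrescaleSampler(std::size_t prescale)
        : prescale_(prescale), seen_(0) {
        assert(prescale_ > 0);
    }

    // Returns true if the current event should be forwarded to monitoring.
    bool accept() {
        bool take = (seen_ % prescale_ == 0);
        ++seen_;
        return take;
    }

private:
    std::size_t prescale_;  // forward 1 event in every `prescale_`
    std::size_t seen_;      // events seen so far
};
```

With a scheme like this, the cost on the Level 3 side scales with the forwarded fraction, so the prescale is the obvious knob if serving the stream becomes CPU-intensive.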

  9. Scaling
     • We run on 30 nodes now. We know we can run on 60 (from experience in the Sun era).
     • We don’t quite understand the implications of running two (or more) instances of Level 3 per node for scaling of DHP and logging
       • So the scaling of a (more nodes) x (more processes/node) system is not fully understood
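The (more nodes) x (more processes/node) concern can be put in numbers with a small helper. This is only bookkeeping under an assumption: that the load on the DHP and logging collectors is driven by how many distinct sources report to them. `sourceCount` is a hypothetical function, and the 60-node, two-instance configuration in the note below is an example, not a plan from the talk.

```cpp
#include <cassert>
#include <cstddef>

// Number of histogram/log sources the collectors would see.
// If instances are re-aggregated per node, each node counts as one source.
std::size_t sourceCount(std::size_t nodes,
                        std::size_t instancesPerNode,
                        bool aggregatePerNode) {
    assert(instancesPerNode > 0);
    return aggregatePerNode ? nodes : nodes * instancesPerNode;
}
```

Today's configuration is `sourceCount(30, 1, false)`, i.e. 30 sources. A hypothetical 60 nodes with two Level 3 instances each gives 120 independent sources, or 60 if the instances are re-aggregated per node, which matches the largest configuration known to work.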

  10. Conclusions
      • We will probably need to use one or more of these tools in order to get to 2007
      • The development work will require someone with a solid understanding of C++ and multiprocessing
