What has BaBar learned? • Background • Experiences & Tools • Key Issues • Summary • Xnnn Related talks
Project Statistics • CP B physics requires a 30 fb-1 yearly sample for > 5 years • One year is 30M B events, 120M hadronic, 1.2B Bhabhas seen • “Factory mode” running for greater than 80% of real time • 100Hz of accepted L3 triggers to be read, 30Hz to fully process • Roughly 3MB/sec of raw data, all day, every day • Similar size downstream processing and analysis streams • High capability detector • 5 layer Silicon Vertex tracker • 40 layer low mass drift chamber • Novel “DIRC” particle ID • Crystal calorimeter • Highly segmented instrumented flux return • But life is never easy • Severe machine backgrounds • Significant compromises in geometry & regularity
Offline Computing Goal: "Factory Running” • Do physics with a time lag that is as small as possible • Data to submission of paper in 6 months (asymptotically) • Strategy: • Volume processing of data, MC to a high level • Including physics algorithms and selections in first-pass processing • Requires high-quality results from the start • Remainder of work on specific samples • Need efficient access to subsets from large to small • Reprocessing in detail
Does this strategy work? • Too early to tell for sure • First you have to get it running, then see if it improves ability to do physics • Still bringing the system up • By next summer, expect to know more • All is not sweetness and light • Can the system keep up with billions of events and hundreds of physicists? • Our event store is not yet transparent • Throughput problems • Data distribution problems • Still trying to get granularity right • “Can senior people with good intuition contribute?”
Running Experience • First collision data was May 26, 1999 • First events processed that same shift • In subsequent 230 days: • 194M colliding-beam events were recorded in 3224 runs • 250M events were reconstructed (some more than once) • 34TB of data were stored on 900 HPSS tapes • Several high visibility problems • Not keeping up with data • At all levels • Lots of algorithm work to do • Calibrations slow to converge
Overview • Have to refer you to previous explanations of the BaBar approach • Obstacles as of 1995 • Changes in transition • We stress cyclic involvement, evolvability • The Martin Uncertainty Principle: • "The problem cannot truly be understood until the solution exists.” • We pushed out new technologies fast • Used flexibility of system to adapt to improved understanding • You store up trouble this way • Not everything gets completely updated • Start to have trouble remembering/understanding our reasons • Are we capturing what we know? • Was running on day one, continues to improve • Capability comparable or better than past experiments • Coping with a large processing load • Flexibility is still important
Obstacles as of 1995 • Nothing to build on • Lots to do • Too many possibilities for the first pieces • Inverted schedule • Analysis and design are high-skill activities • But have to come at the beginning, before skills are developed • No clear agreement on product • Everybody knows what a track finder does, but few agree • Real product is flexibility • Waiting for that “smart idea” in 2003 • Not much expertise & effort available • Much existing expertise of doubtful applicability • C++ advocates had limited design experience • Mismatch between enthusiasm and effectiveness • I expect these are common issues beyond BaBar CHEP97 slide
What changes in the transition to OO/C++? Better/worse? • FORTRAN77 (perhaps with VAX extensions): Better • “Extensions” to add functionality and control (ZEBRA/BOS/...: some native; ADAMO: some missing) • Code management tools (HISTORIAN, homegrown tools): easier, but needed development • Standards, practices, policies: • Common lore of the HEP programming community: missing • Design idioms and normal practices: missing • Locally developed, customized, documented: missing • Programmer skill, commitment and ingenuity: Much Worse • “If you expect a language to solve all your problems, you don’t have interesting problems” - A. Koenig CHEP97 slide
Multi-prong attack • Architecture • Solving the “too many first choices” problem • OO design expressing “traditional” concepts • Iterative design & implementation process • Applying “evolutionary pressure” • Design and implementation intertwined • Strike multiple balances • Learn by doing, do while learning • “Getting something in place” vs. design work • In spite of imposed waterfall-model schedule • Gain control of the process by controlling the product • Code management • Quality control and assurance • Team-building • Getting the people • Formal training balancing experience and exposure CHEP97 slide
Our “design process” • [cycle diagram: Design it → Code it → Release & use it → What next?] • We use an evolutionary approach • People enter coding • Eventually, they start to draw clouds and blobs • Many of them become good designers • Evolution improves the system • Relevant code is used • Comments are not always gentle • Release system controls the pace • Biweekly timescale • New designs, redesigns are ongoing • Driven by perceived needs • Policy: “Get them engaged, then work with them” CHEP97 slide
C++ & OO • People will write C++ • Structure varies a lot • FORTRAN with ; and #include • C with abstract datatypes • "The True Style” (whatever that means) • data hiding • reuse by inheritance • abstract interfaces • generic programming • Flexibility is both a strength and a weakness • We’ve had some very significant successes • Calibration model • Track model • Physics analysis tools C106 B112 A328
“C++ is harder to learn than FORTRAN” • Unfortunate, but true • Perhaps you can justify it • Can lead to mistakes • Need efforts to limit impact • Training • Mentoring • C++ and especially OO puts off a number of senior, experienced people • Even with specific efforts to couple in, this has cost us • PIs less likely than postdocs to contribute to reconstruction and simulation • Will it extend to analysis? For how long?
“Our code is slow” • Strategy was structure & function first, then worry about time • Lessons: • Make sure you understand which are the most severe problems • If it gets too far out of hand, you strain working relations • You can never catch up with a bad impression
Natural trend is upward • Updated algorithms always get more complex, esp. when real data arrives • New algorithms run in parallel with existing ones • Generality costs speed, but still too early to sacrifice • But a lot of code is just inefficient • Ongoing attention to detail recovers large performance increments
“Still hard to find bugs” • New places for them to hide • Harder to even know they exist • "C++ is a pig of a language from a memory leak point of view” • Need unfamiliar tools • Purify, Great Circle, etc • Need to be routinely run • Which means centrally • We do per release • But nobody wants that job
“But easier to scale/modify/adapt” • Examples • Algorithm flux at first data • New pattern recognition & fitting • Without trashing interconnections • Layered physics tools • Adding another persistency mechanism • Usually due to abstract data types & information hiding • Physicists connect with these well • Rather than more advanced techniques • "Where we have experienced problems, could it have been that we weren't OO enough?” • Quote is not from a computing specialist!
How do people learn these skills? • Some fraction of people will only read a C++ book and generalize • Often not interested in seeking out BaBar-specific information • This has implications for system design • People haven’t even seen the recommended solutions! • Many will seek out information • Interested in learning design principles, vocabulary • We use commercial courses, as we did not find anything better • HEP & BaBar specifics were taught informally
Battle-testing the architecture • Now solving new and harder problems • Real data and real use are not quite as expected • Despite useful results from Mock Data Challenges • Current work made possible by design choices a long time ago • Can architecture simultaneously flex and support weight? • Next slides discuss experience with key aspects: • Module/Event/Environment structure • Transient/persistent split • Rolling calibrations • Low-latency processing • Objectivity persistency
Module, event and environment structure - reminder • [data-flow diagram: lots of EmcDigis → Emc Clustering → lots of EmcClusters; lots of RecoTracks → Track Associator → lots of Associations] • Modules provide the algorithms • Use existing information to create new objects • Styles range from procedural monoliths to OO castles • Framework/AC++ provides control & config • Uses TCL scripting, command line • Production executables run 300 modules • Objects have behaviors, not just values • “Networks of objects collaborate to provide semantics” • Internal form of our track objects is irrelevant • Objects kept in event and environment • Named access in a flat space • event -> Ifd<EmcCluster>::get(“MergedClusters”) • Implemented via ProxyDict • Proxies provide complex access when needed • Ensures physical decoupling
A success! • Linear processing model well suited to production work • Command-line configuration vital for development • Configuration issues at this size • Largest executables are becoming rigid due to amount of configuration • TCL does work at this scale, but we have to invest in cleaning up • Ad-hoc application setup hard to maintain • Tools to deal with this have not been a high priority • Configuration dump/restore • Configuration tool with knowledge of prerequisites and large-scale options? • Event/Environment model works well • Average collaborator completely shielded from underlying access • Have been able to add deferred I/O, caching, context control
Persistent - transient split • Good points: • ”Pointers like you read in a C++ book" are valuable • That's all many people will want to know • Performance gain at reference-time • BaBar objects are small, with lots of pointer interconnections • Allows complexity in the transient model, independent of persistent model • Necessary for reprocessing and replication • Fine control • Partial read, incremental read possible • Handling semantics (schema) changes • Proven effective in Kanga project • Built by non-experts • Bad points: • Performance cost at I/O time • Adds 1-2 msec + 15% to access • Significant for lightweight, high-speed processing • Creation of smartest scribes requires some effort • Now believe structure robust enough to move to proxied access for some data
Prompt reconstruction • Rolling calibration • Technically difficult • Scatter/gather over entire farm • Requires automation • Difficult organizationally • Crosses many lines • Necessary for BaBar • Low-latency processing • Causes much entropy during experiment startup • Reprocessing needed to understand initial data • But farms and people busy processing newly arrived data • How to balance new and old? • [timeline diagram: Run 1, Run 2, Run 3]
Objectivity persistence C103 • Objectivity, OO databases, event store are three different things • Objectivity • Commercial product, strengths and weaknesses • BaBar-provided licenses in use at 30 institutions in 6 countries • OO database • BaBar has 508 persistent classes, developed by about 60 people • Works well for online, conditions, config information • Event store • Currently holding 33TB of data in 28000 collections on 12 servers • Typically 90 simultaneous users at SLAC • 430 people have used it • Provides a number of significant possibilities & questions • Can it be matched to the sequential access demands? • Is drill-down analysis worthwhile? • Not yet a proven concept
Prompt reconstruction production • Steady-state rate pretty good • But still problems • Startup/shutdown time • BaBar takes short runs • [plot: running fraction] A288
Analysis running is more complicated C103 • [plot: scheduled and unscheduled outages]
Data distribution inside and outside of SLAC • Local swapping via HPSS is a success • Regional centers have invested in making this work • Problem at remote universities • size • skill and effort • esp. for MC production • "Here's a tape, you deal with it” • Remote MC production • Very hard to set up remote production • Too much SLAC-specific context has crept into the system • Support for local infrastructure not generally available • Large-scale production is resource intensive C372
Collections and tag bits • A collection is a set of events • Gives direct access to all the parts of the event • Created during processing/scanning/reprocessing • A collection gathers together the results of partial event reprocessing • Novel concept for physics analysis • Usage is rapidly ramping up, with 28k collections now • Requires organization to use these collaboratively • Collection maintains “Tag” quantities • 400+ tag quantities now - bits and values • Logical operations allow faster scans • Relentless pressure to increase size of Tag data
Roles for ROOT • “Kanga” project: use ROOT I/O to store micro-DST • Added late, as plug-in to existing system • Limited-function copy of some conditions values • Allows ROOT/Objectivity hybrid • Middle-road solution suffers by comparison • Interactive use in several ways in analysis • Ntuple and histogram tool • Pico-Analysis-Framework (PAF) • Separate classes from rest of system • Access via replicated interfaces to underlying analysis data • How to get access to the functionality of the objects? • ROOT interactive analysis for non-ROOT event store? • Requires deep understanding of object relationships
CORBA/Java/XML D290 D161 • Highest-tech lives in the online system • Much more literate, homogeneous group • Offline use limited to event display and browsers • WIRED display • CORBA servers • Access to central event store F118 B374
The concept of software project management • Something new in this generation of experiments • Bigger systems are possible/necessary now, and they sop up all the technical gains • Example: BaBar analysis tools run in production • Still learning how to do this • Similar, yet different from hardware projects • BaBar’s matrix organization by system and computing area • Which way will people sign up in the beginning? • Which is more stable in the long run? • "Data handling" as respected subject • Collaborations paying attention in advance • Bookkeeping critical to success • Robots allow access to raw data, instead of waiting for yearly bulk reprocessing • Big issue for off-site work - is this really getting better? • "In art, intentions are not enough. What counts is what one does, not what one intends to do.” - Pablo Picasso
Code Management & Release Management • CVS - SoftRelTools approach • Collaboration-wide read/write access to code in CVS • Organized as 630 packages • We don’t attempt to keep the HEAD production quality • Package coordinators • One per package • Tags and announces when new version ready for use • Build periodic releases from these tagged versions • Integration and testing to production now takes two weeks • “100KLOC is easy; we know how to do 1MLOC; 10MLOC is hard” • Examples of what we've had trouble with • Transition to "use & production" instead of development • Introduced a more reliable (rigid) one month cycle • Imposing a freeze on processing code now to create summer CP violation sample • Cannot imagine getting this "right” • New issue - runtime environment management
Distributed multi-platform development • Use native compilers & tools on Sun/Solaris, DEC/Compaq and Linux • Code to a common subset, empirically enforced • We're still waiting for the compiler promised land • Recent migration to Linux was interesting • Still need to think about issues beyond C++ semantics/syntax • E.g. template instantiation, inline tricks • Ongoing problems with STL, bool • Complete builds take days • Especially with optimization • Poor interactions with templates • People keep saying compilers are getting better. • It is not happening fast. E309
Collaborations² • We collaborate with a number of others • GEANT4, RD45 • CDF • CLHEP / ZOOM • JAS, ROOT • Generally most successful as intellectual collaborations • Work on areas of common concern • BaBar timescale often forces different approaches • Limited common code development • Truly unfortunate how hard it is to share code • “You can’t avoid choosing base classes” • Net result is continuing reimplementation of common tasks • Are we missing some simple technology for this?
Connection to GEANT4 • GEANT4 simulation is a critical part of our strategy • Need integration with rest of system • GEANT3 is not a good neighbor • “BOGUS”, BaBar's G4 sim, works at several levels • “Detailed BOGUS” is replacement for “bbsim”, our G3 simulation • “Fast Bogus” is replacement for ASLUND, our smeared simulation • “Very Fast Bogus” fills a new niche • All three are in use, but not yet default • We consider GEANT4 a success • Good interactions! • Able to build real products with it! • It has been a long, major effort, and it's not done yet • Still not as reliable as GEANT3 • (G3 had a head start) • Concerned about continued evolution
Connection to RD45 • What RD45 thinks is important: • Single federation image across the collaboration • Direct access to objects, removing need for event catalog, etc. • What BaBar thinks is important: • Robustness • Ability to evolve • Timing • We encountered issues they didn’t appreciate • How do people test, including making mistakes? • What if somebody leaves a lock on a DAQ container? • AMS limited to 1024 open files • Only 64,000 files/databases of 2GB/10GB per federation • Single threaded AMS, 400 collaborators • We find it useful to cooperate, but hard to use each other’s code
Is analysis different? • What people are used to: “I read the beamspot from CCC. I get the track parameters, then displace the origin to the observed beamspot with the XXX subroutine, then use the new d0 as my miss distance, giving it a sign using John’s version of YYY” • Physicists quite comfortable with this detail, and will ask for it if they don’t get it • “I called signedDoca(evt->foundSpot())” • Generates lots of questions with unpopular answers • Physicists think with "models" => system of equations & constants/values which doesn't do anything. You use them by plugging in numbers and calculating. • This leads to a deep misunderstanding of "objects”, resulting in a procedures & structures approach. • Invoking member functions with unknown implementation feels very different from passing formulae via email/paper, then implementing them • Will invoking member functions ever replace passing formulae around?
The physicist's desktop in BaBar • Production code writes histograms, ntuples using the HepTuple interface • In theory, allows replacement of downstream analysis package • We have PAW (HBOOK), ROOT, JAS implementations • But keeping enough functionality makes HepTuple a moving target • Java Analysis Studio • Used for some aspects of online presenters • Some partisans are using it in offline • ROOT • Used for some aspects of online presenters • Some partisans are using it offline • Some people write FORTRAN to manipulate ntuples • Mostly, people use PAW
The real issues: • How to give downstream tools the complete set of capabilities? • How can we access full power of offline for analysis? • How to put tools in the analysts' hands? • (Partial) reconstruction & drill-down analysis • Visualization of calculations - code, results, processing • Access to TeraEvents both fast and in detail • Very hard to do the entire phase space!
But the bottom line is: It works • B → J/ψ K± • B → J/ψ Ks