BaBar Status for the RD45 Workshop • Jacek Becla, David Quarrie • Stanford Linear Accelerator Center
Performance Requirements (1) • Online Prompt Reconstruction • Baseline of 200 processing nodes • 100 Hz total (physics plus backgrounds) • 30 Hz of Hadronic Physics • Fully reconstructed • 70 Hz of backgrounds, calibration physics • Not necessarily fully reconstructed • most recent goal: 100 Hz on average • by mid-March
Performance Requirements (2) • Physics Analysis • DST Creation • 2 users at 10⁹ events in 10⁶ secs (1 month) • DST Analysis • 20 users at 10⁸ events in 10⁶ secs • Interactive Analysis • 100 users at 100 events/sec
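To make these figures concrete, a back-of-the-envelope check (illustrative sketch only, not BaBar code) of the aggregate event rates each use case implies:

```cpp
// Aggregate event rates implied by the analysis requirements above.
#include <cstdio>

int main() {
    const double dstCreation = 2.0   * 1e9 / 1e6;  // 2 users, 10^9 events in 10^6 s
    const double dstAnalysis = 20.0  * 1e8 / 1e6;  // 20 users, 10^8 events in 10^6 s
    const double interactive = 100.0 * 100.0;      // 100 users at 100 events/s
    std::printf("DST creation : %.0f events/s aggregate\n", dstCreation);  // 2000
    std::printf("DST analysis : %.0f events/s aggregate\n", dstAnalysis);  // 2000
    std::printf("Interactive  : %.0f events/s aggregate\n", interactive);  // 10000
    return 0;
}
```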
Production Federations (1) • Developer Test • Dedicated server with 500GB disk & two lockserver machines • Saturation of transaction table with a single lock server • Test federations typically correspond to BABAR software releases • 5 federation ids assigned per developer • Space at a premium – separate journal file area • Shared Test • Developer communities (e.g. reconstruction) • Share hardware with developer test federations • Space becoming a problem – dedicated servers being set up • Online (IR2) • Used for online calibrations, slow controls information, configurations • Servers physically located in experiment hall
Production Federations (2) • Online Prompt Reconstruction (OPR) • Pseudo real-time reconstruction of raw data • Designed to share IR2 federation as 2nd autonomous partition • Intermittent DAQ run startup interference caused split • Still planned to recombine • Input from files on spool disk • Decoupling to prevent possible deadtime • These files also written to tape • 100-200 processing nodes • Design is 200 with 100Hz input event rate • Output to several database servers with 3TB of disk • Automatic migration to hierarchical mass store tape
Production Federations (3) • Reprocessing • Clone of OPR for bulk reprocessing with improved algorithms etc. • Reprocessing from raw data tapes • Being configured now – first reprocessing scheduled for March 2000 • Physics Analysis • Main physics analysis activities • Decoupled from OPR to prevent interference • Simulation Production • Bulk production of simulated data • Small farm of ~30 machines • Augmented by farm at LLNL writing to same federation • Other production site databases imported to SLAC
Production Federations (4) • Simulation Analysis • Shares same servers as physics analysis federation • Separate federations to allow more database ids • Not possible to access physics and simulation data simultaneously • Production Releases • Used during the software release build process • One per release and platform architecture • Share hardware with developer test federations • Testbed federations (up to 6) • Dedicated servers with up to 240 clients • Performance scaling as a function of the number of servers, filesystems per server, CPUs per server, and other configuration parameters
Integration With Mass Storage • Different regions on disk • Staged: Databases managed by staging/migration/purging service • Resident: Databases are never staged or purged • Dynamic: Neither staged, migrated, nor purged • Metadata such as federation catalog, management databases • Frequently modified so would be written to tape frequently • Only a single slot in the namespace, but repeated writes would consume new space on tape • Explicit backups taken during scheduled outages • Test: Not managed • Test new applications and database configurations • Analysis federation staging split into two servers • User: Explicit staging requests based on input event collections • Kept: Centrally managed access to particular physics runs
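A minimal sketch of the disk-region policy above, with hypothetical names; whether the Resident region is still migrated to tape is an assumption, the slide only states it is never staged or purged:

```cpp
#include <stdexcept>

// Which automatic actions the staging/migration/purging service applies
// to a database, depending on the disk region it lives in.
enum class DiskRegion { Staged, Resident, Dynamic, Test };
struct Policy { bool autoStaged; bool autoMigrated; bool autoPurged; };

Policy policyFor(DiskRegion r) {
    switch (r) {
        case DiskRegion::Staged:   return {true,  true,  true };  // fully managed
        case DiskRegion::Resident: return {false, true,  false};  // assumed migrated; never staged/purged
        case DiskRegion::Dynamic:  return {false, false, false};  // explicit backups only
        case DiskRegion::Test:     return {false, false, false};  // not managed
    }
    throw std::logic_error("unknown region");
}
```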
Movement of data between FDs • Several federations form coupled sets • Physics(Online, OPR, Analysis) • Simulation (Production, Analysis) • Data Distribution strategy to move dbs between fds • Allocation of id ranges avoids clashes between source & destination • Use of HPSS namespace to avoid physical copying of databases • Once a database has been migrated from source, the catalog of the destination is updated and the staging procedures will read the database on demand • Transfer causes some interference • Still working to understand and minimize • Two scheduled outages per week (10% downtime) • Other administrative activities • Backups, schema updates, configuration updates
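The id-range allocation mentioned above could look roughly like the sketch below (hypothetical class and example ranges, not the actual Data Distribution tools): each federation owns a disjoint range of database ids, so imported databases never clash with locally created ones.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

struct IdRange { uint32_t first; uint32_t last; };   // inclusive

class FederationIdMap {
public:
    // Register a federation's range, refusing any overlap with existing ones.
    void assign(const std::string& fd, IdRange r) {
        for (const auto& kv : ranges_)
            if (r.first <= kv.second.last && kv.second.first <= r.last)
                throw std::runtime_error("id range clashes with " + kv.first);
        ranges_[fd] = r;
    }
    bool owns(const std::string& fd, uint32_t dbId) const {
        auto it = ranges_.find(fd);
        return it != ranges_.end() &&
               dbId >= it->second.first && dbId <= it->second.last;
    }
private:
    std::map<std::string, IdRange> ranges_;
};

int main() {
    FederationIdMap m;
    m.assign("OPR",      {1000, 19999});    // example ranges, not the real ones
    m.assign("Analysis", {20000, 39999});
    return m.owns("OPR", 1500) ? 0 : 1;
}
```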
Physicist Access to Data • Access via event collections • Mapping from event collection to databases • Data from any event spread across 8 databases • Improved performance for frequently accessed information • Reduction in disk space • Scanning of collections became a bottleneck • Needed for explicit staging of databases from tape • Mapping known by OPR but not saved • Decided to use Oracle database for staging requests • Single scan records the collection-to-databases mapping • Also used for production bookkeeping • Data distribution bookkeeping • On-demand staging feasible using Objy 5.2 • Has been demonstrated • Prefer explicit staging for production until access better understood
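A sketch of the bookkeeping idea described above (hypothetical names and collection; in production the mapping is kept in an Oracle table): record once which databases each event collection touches, so staging requests no longer need to rescan the collection.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// In production this mapping lives in an Oracle table; an in-memory map
// stands in for it here.
using CollectionToDbs = std::map<std::string, std::set<uint32_t>>;

// Resolve a staging request: which databases must be brought back from tape
// to read this collection.
std::vector<uint32_t> databasesToStage(const CollectionToDbs& table,
                                       const std::string& collection) {
    std::vector<uint32_t> dbs;
    auto it = table.find(collection);
    if (it != table.end()) dbs.assign(it->second.begin(), it->second.end());
    return dbs;
}

int main() {
    CollectionToDbs table;
    table["golden-muons"] = {101, 102, 103, 104, 105, 106, 107, 108};  // 8 dbs per event
    return databasesToStage(table, "golden-muons").size() == 8 ? 0 : 1;
}
```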
Conditions DB - splits • Conditions/Ambient Split • big win (conditions frequently imported to all federations) • data sample to import/export: 26 GB -> 2 GB • time to import full snapshot: >1 hour -> <25 min • ambient data (in its own db) imported less frequently • Another conditions split pending • to preserve rolling calibrations generated by OPR
CondDB - moving data @ SLAC • Copying frequency • 1/day to OPR, shortly will do 4/day • 2/week to analysis • 6 hour latency demonstrated • OPR -> IR2: ~ 1/week • during machine physics outages • driven by demand - expected to increase in future • Planned improvement • copy files (under different names) - no outage necessary • rename files - ~ 1 min “outage”
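The planned improvement in the last bullets might look like the sketch below (hypothetical paths; std::filesystem used purely for illustration): copy the new conditions files in under temporary names with no outage, then swap them into place with cheap renames during a roughly one-minute pause.

```cpp
#include <filesystem>
#include <utility>
#include <vector>
namespace fs = std::filesystem;

void installConditions(const std::vector<fs::path>& newFiles,
                       const fs::path& targetDir) {
    std::vector<std::pair<fs::path, fs::path>> pending;
    for (const auto& src : newFiles) {                     // no outage needed here
        fs::path tmp = targetDir / (src.filename().string() + ".incoming");
        fs::copy_file(src, tmp, fs::copy_options::overwrite_existing);
        pending.emplace_back(tmp, targetDir / src.filename());
    }
    // beginOutage();  // ~1 min pause while files are swapped into place
    for (const auto& p : pending) fs::rename(p.first, p.second);
    // endOutage();
}
```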
Data Distribution Tools • Current state • tools based on perl / tcl / shell / C++ • original author left; effectively impossible to maintain • will be rewritten in Java • Improvements in transfer rate • secure fast copy developed • ~2x faster than ftp • next version will use asynchronous transfer • stay tuned!
CondDB - new features • “link database(s)” containing symbolic link(s) to index db(s) • supports potentially multiple index databases for the same component • “high water mark” • sets a limit on the validity time to prevent a newly stored condition from covering the rest (up to +∞) of the conditions history • needed to achieve reproducibility of conditions/calibrations • used for rolling calibrations • both in production
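A minimal sketch of the high-water-mark rule, assuming a simple validity-interval representation (not the actual CondDB types): the end of a newly stored condition's validity is clamped so it cannot silently cover history beyond the mark.

```cpp
#include <algorithm>
#include <cstdint>

using Time = int64_t;                 // seconds since some epoch (assumed)

struct Validity { Time begin; Time end; };

// Never let a rolling calibration's validity extend past the high water mark.
Validity clampToHighWaterMark(Validity requested, Time highWaterMark) {
    Validity v = requested;
    v.end = std::min(v.end, highWaterMark);   // cannot cover beyond the mark
    if (v.begin > v.end) v.begin = v.end;     // degenerate interval if fully past it
    return v;
}
```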
Platforms & Software Builds • Currently supported platforms • Sun Solaris, OSF • just started: Linux • Production builds every month • instead of twice a week • serious bug-fix builds allowed in between • Nightly builds (of libraries)
Production Schema Management • Crucial that new software releases are compatible with the schema in production federations • New software release is a true superset • Reference schema used to preload release build federation • If release build is successful, the output schema forms the new reference • Following some QA tests • Offline and online builds can overlap in principle • Token passing scheme ensures sequential schema updates to reference • Production federations updated to reference during scheduled outages • Explicit schema evolution scheme described elsewhere
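The "true superset" requirement can be illustrated with the sketch below (hypothetical schema representation, not the Objectivity schema API): every persistent class in the reference schema must still be present, unchanged, in the release schema; new classes may be added.

```cpp
#include <map>
#include <string>

using Schema = std::map<std::string, int>;   // class name -> shape/version id

// A release schema is acceptable only if it contains every reference class
// with an identical shape; it may add classes, but never drop or change one.
bool isTrueSuperset(const Schema& release, const Schema& reference) {
    for (const auto& cls : reference) {
        auto it = release.find(cls.first);
        if (it == release.end() || it->second != cls.second) return false;
    }
    return true;
}
```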
BaBar Software Review (Aug99) “BaBar’s choice of Objectivity was undoubtedly a very progressive decision and represents a real pioneering effort, which cannot realistically be expected to be implemented without problems. […] Recommendations: • Studies to understand and fix the current limitations on storing events from OPR and reading them during analysis should be allocated sufficient hardware and manpower resources • BaBar must organize an effort to provide a limited-function, short-to-medium term solution for micro-DST analysis using the standard BaBar analysis framework, but without Objectivity” • Next review: • 28 Feb 2000
Hardware in OPR • Servers (Sun E4500, 4 CPU) 1) two, each with one A3500 (500 GB each), LS, JNL 2) two, each with two A3500 (500 GB each), LS, JNL, catalog server • 4 AMSes per server (still true for multi-threaded AMSes) 3) three, each with two A3500 (800 GB each) • Clients 1) Sun Ultra5, 330 MHz, 256 MB • 50, then 100 2) 15 Feb 2000: replaced with Sun T1 (440 MHz, 256 MB, generally faster) • 30% improvement seen on simple benchmarks • performance point of view: reduce lock traffic :-)
Hardware in Test-bed • Initially • 2 data servers (Sun 450), LS, JNL, 100 clients (as in OPR) • Later • 230 nodes (scheduled access with OPR prod) • up to 6 data servers, catalog server
Current Limitation • LS saturation seen with 200 nodes • LS: not multi-threaded • #CPUs, clock speed of little help (tried 330 MHz -> 440 MHz) • Autonomous partitions - problems in Objy 5.2 • lock usage: already reduced to minimum • still could do better with a lot of effort (pre-creation) • Objy response • next release (early summer 2000) • reduced traffic to lock server • next major release: LS rewritten
Clustering in OPR • clustering data • Per component • redirect data per domain/authLevel/authName/component • never even: 4 physics streams, 6 FSs • changing ratio between physics streams • Per client • each 1/6 of nodes has dedicated FS • Independent “database clusters” • do not let 200 nodes write to a single db
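A rough sketch of the placement rules above (hypothetical function and parameters; the real code also routes by domain/authLevel/authName and by physics stream): spread clients over filesystems and independent database clusters so that no single database is written by all 200 nodes.

```cpp
#include <cstdint>
#include <string>

struct Placement { int cluster; int filesystem; };

// Assign each processing node to one database cluster and one filesystem.
Placement place(uint32_t clientId, const std::string& /*component*/,
                int nClusters = 4, int nFilesystems = 6) {
    Placement p;
    p.cluster    = clientId % nClusters;      // independent "database clusters"
    p.filesystem = clientId % nFilesystems;   // each 1/6 of nodes -> dedicated FS
    return p;
}
```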
# databases in OPR • Total accumulated ~ 14 K • serialization on db • db = file -> vnode • page table updates • “active” dbs seen by one client • 1 event -> 8 active dbs • raw, rec, aod, esd, tag, rawhdr, col, evshdr • 4 physics streams -> 8 * 4 = 32 • 4 database clusters -> 32 * 4 = 128 • or 6 clusters -> 32 * 6 = 192
Achieving 100 Hz in OPR • Requirement: 100 Hz on average by mid-March • close to not feasible • current peak rate 128 Hz (testbed), but • average rate only ~ 20 Hz • Major issues/inefficiencies • Startup time • operational delays • “finalize” • rolling calibrations, writing histograms, etc
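An illustrative model (invented numbers, apart from the quoted 128 Hz peak) of why the average rate sits so far below the peak: fixed per-run startup and finalize costs dilute the time actually spent processing events.

```cpp
#include <cstdio>

int main() {
    const double peakHz   = 128.0;     // observed peak in the testbed
    const double events   = 200000.0;  // hypothetical events in one run
    const double startup  = 1800.0;    // hypothetical startup / operational delays (s)
    const double finalize = 3600.0;    // hypothetical rolling calibs, histograms (s)
    const double busy     = events / peakHz;
    const double average  = events / (busy + startup + finalize);
    std::printf("average rate = %.1f Hz\n", average);   // well below the peak
    return 0;
}
```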
Objy 5.2 issues • AMS multi-threaded, but • much higher CPU usage: understood (thanks Andy) • serialization observed: possibly connected to CPU usage • staging works • can stage databases while other clients are using the AMS • migration @ SLAC • in phases (per fd) • no problem with 5.1 / 5.2 interoperability
Future Improvements • Bronco farm -> T1 • more file systems • improved Objy code • next release, LS rewritten • autonomous partitions • reducing payload per event • event size still too big • raw ~80 kB instead of 32 kB • rec 150 kB instead of 120 kB
Hardware for Analysis • Analysis fd • LS, JNL server, catalog server • 3 data servers • 6 file systems • SP2 analysis • LS, JNL server, soon catalog server • 3 data servers • 6 file systems
Optimizing Analysis • Placement • optimized: data rebalanced (several iterations) • added more data servers, catalog server • optimized access to tag & micro data • 35 Hz -> 2 kHz (iterating over tag) • expected x25 bandwidth relative to August ‘99 • more fileservers & filesystems • Dirk’s visit of significant help, next visit just after CHEP • Found performance problems in transient code • proxyDict, RogueWave, …
Miscellaneous (1) • Conditions spread across all file systems (OPR) • automatic load balance (AMS + fs cache) • automatic purging from disc cache • Working with Objy on LS monitoring, profiling... • new db using Unix cp + ooattachdb • only 1 catalog operation, much faster • dual lock servers for developers’ federations • Objy transaction table limit too low
Miscellaneous (2) • autonomous partitions still not used in production • Large file handling issues • Problems handling 10GB db files for external distribution • Running out of dbids quickly • working on reliable cleanup mechanism • extending containers is an expensive operation • presizing containers
Miscellaneous (3) • no “urgent” problems/bugs since last RD45 workshop • DRO tried recently • unusable in production • Supporting external sites • tuning their systems, solving problems, advising • starting to become a significant burden on the db group • Weekly meetings with Objy developer
Event Store Browser • Added multi-threading • thread-per-connection, or thread pool • in conjunction with Objectivity contexts • improve scalability • Collections may be 'scanned' to obtain lists of associated databases. Establishes a link between the event collection browse tree and the database browse tree, very useful for distribution. • New EventID now includes run number and timestamp info. Cloned events can now be identified and traced back to parent collections. • Several GUI facelifts
Objy - Known Problems (1) • conversion functions • reading tags written on Sun from OSF or Linux • 2 kHz (Sun), 0.5 kHz (OSF or Linux) • CPU utilization: 30% (Sun), 90% (Linux) • problem for external sites • e.g. background events written on Sun @ SLAC • fixing is not trivial; a one-time conversion tool could be provided shortly
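An illustrative sketch of the kind of work those conversion functions imply (not the Objectivity code itself): tag data written on big-endian Suns has to be byte-swapped when read on little-endian OSF/Linux hosts, which accounts for the extra CPU cost there.

```cpp
#include <cstddef>
#include <cstring>

// Reverse the on-disk (big-endian) byte order of a double, as a little-endian
// reader would have to do for each numeric attribute in a tag.
double readBigEndianDouble(const unsigned char* buf) {
    unsigned char tmp[sizeof(double)];
    for (std::size_t i = 0; i < sizeof(double); ++i)
        tmp[i] = buf[sizeof(double) - 1 - i];
    double value;
    std::memcpy(&value, tmp, sizeof(double));
    return value;
}
```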
Objy - Known Problems (2) • Objy 5.2 AMS - high CPU usage • understood at SLAC, pending fix • waitlock ignored • when APs used, under heavy load • introduced in 5.2 • bug in Active Schema • practically cannot use with BaBar schema • fixed, coming soon with next release
Future Improvements • next release (summer 2000) • performance improvements • threadSpec, LS traffic, cache management, reduced fsync • discussed • large page size • painful upgrade, probably feasible <1 PB • long refs • read-only databases • pre-creating containers
Support Personnel • Two database administrators • Daily database operations • Develop management scripts and procedures • Data distribution between SLAC production federations • First level of user support • Two data distribution support people • Data distribution to/from regional centers and external sites • Two HPSS (mass store) support people • Also responsible for AMS back-end software • Staging/migration/purging • Extensions (security, timeout deferral, etc.) • Five database developers provide second tier support • Augmented by physicists, visitors
Database Statistics (Jan 2000) • Servers • 12 primary (database servers) • 15 secondary (lock servers, catalog & journal servers) • Schema • 513 persistent classes • Data • ~33TB accumulated data • ~14000 databases • ~28000 collections • Disk Space • ~10TB • Total for data, split into staged, resident, etc.
Database Statistics (2) • Sites • >30 sites using Objectivity (USA, UK, France, Italy, Germany, Russia) • Users • ~655 licensees (signed the license agreement) • ~430 users (created a test federation) • ~90 simultaneous users at SLAC • Monitoring distributed oolockmon statistics • ~60 developers • Have created or modified a persistent class • A wide range of expertise: 10-15 experts
Summary • Database performance • no longer seen as a major problem • OPR: satisfactory now, still improving • analysis: getting better (limited by lack of manpower) • Still learning how to manage a large system • multiple federations, multiple servers, multiple sites • large user community, large developer community • Still more to automate - too many manual procedures • We seem to have the largest database in the world • and it is growing fast • has anybody else entered the multi-TB region yet?