The PHysics Analysis SERver Project (PHASER)
M. Bowen, G. Landsberg, and R. Partridge*
Brown University
CHEP 2000, Padova, Italy, February 7-11, 2000
What is the PHASER project?
• Effort to substantially increase productivity of physicists analyzing multi-TB summary data sets
• Our immediate focus is on the DØ experiment
  • 600 million data events/year starting in early 2001
  • Summary data set expected to grow at a rate of 3 TB/year
• Concentrate on event selection and “ntuple” creation stage
  • Transition in data handling from monolithic reconstruction processing to the much more chaotic processing of summary data by many physicists
  • IO and CPU intensive due to need to apply latest calibration, particle ID, and event selection algorithms to several hundred million events
Richard Partridge
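As a rough check on these rates, the two quoted figures imply an average summary-data size per event. This is a back-of-the-envelope sketch that assumes decimal terabytes and that all of the 3 TB/year growth comes from the 600 million new events; neither assumption is stated on the slide.

```python
# Figures quoted on the slide.
EVENTS_PER_YEAR = 600_000_000
# Assumption: "3 TB" means decimal terabytes (10**12 bytes).
GROWTH_BYTES_PER_YEAR = 3 * 10**12


def bytes_per_event(growth_bytes, events):
    """Average number of summary-data bytes added per recorded event."""
    return growth_bytes / events


avg_bytes = bytes_per_event(GROWTH_BYTES_PER_YEAR, EVENTS_PER_YEAR)
print(f"about {avg_bytes / 1000:.0f} kB of summary data per event")
```

Under these assumptions the summary data amounts to roughly 5 kB per event.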
PHASER Architecture
• Physics Object Database (POD) stores meta-data used by most physics analyses for their initial event selection
• Physics Object and Particle ID tables in POD store calibrated 4-vectors, object quality variables, and results of particle ID algorithms
• DVD storage of full summary (mDST) data set and useful subsets of larger DST and STA data sets
PHASER is PHast
• New calibrations and particle ID algorithms can be quickly incorporated
  • Only the changes need to be imported
  • Regenerating the large mDST data set will only be done infrequently
• Storage of up-to-date calibrations and particle ID algorithms avoids the need to re-apply these algorithms for each event selection pass
• Particle ID tables are small, making it possible to quickly eliminate events not having the desired set of physics objects
• Direct access to full mDST sample on DVD allows an mDST subset to be quickly generated for advanced analyses developing new algorithms not yet in the database
The Physics Object Database (POD)
• Stores fully calibrated meta-data associated with the various physics objects
  • leptons, photons, jets, missing ET, secondary vertices, triggers, etc.
  • for example, an electron object would have the energy, direction, and various quantities used in the electron ID algorithms stored
• Each physics object is associated with a table in a relational database
• Primary key uniquely identifies each physics object and provides information needed to correlate physics objects from a single event
  • Currently use Run, Event, Instance (where appropriate) and row number from ntuple used to load database
  • Alternative: data source index, sequence number, and instance
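The table-per-object layout and the Run/Event/Instance key described above can be sketched as follows. This is a minimal illustration, not the actual POD schema: it uses Python's built-in `sqlite3` as a stand-in for the relational database, and the table names, column names, and sample values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table per physics object type. The composite
# primary key identifies each object; (run, event) ties objects to one event.
conn.executescript("""
CREATE TABLE electron (
    run      INTEGER NOT NULL,
    event    INTEGER NOT NULL,
    instance INTEGER NOT NULL,   -- which electron within the event
    et       REAL,               -- transverse energy
    eta      REAL,
    phi      REAL,
    PRIMARY KEY (run, event, instance)
);
CREATE TABLE missing_et (
    run   INTEGER NOT NULL,
    event INTEGER NOT NULL,      -- one entry per event, so no instance
    met   REAL,
    PRIMARY KEY (run, event)
);
""")

# Made-up sample rows, for illustration only.
conn.execute("INSERT INTO electron VALUES (1, 100, 1, 42.0, 0.3, 1.2)")
conn.execute("INSERT INTO electron VALUES (1, 101, 1, 12.0, -1.1, 2.9)")
conn.execute("INSERT INTO missing_et VALUES (1, 100, 35.0)")
conn.execute("INSERT INTO missing_et VALUES (1, 101, 4.0)")

# Correlate objects from the same event by joining on the shared key columns.
rows = conn.execute("""
    SELECT e.run, e.event, e.et, m.met
    FROM electron AS e
    JOIN missing_et AS m ON e.run = m.run AND e.event = m.event
    ORDER BY e.run, e.event
""").fetchall()
print(rows)
```

Because the key columns are shared across tables, any combination of object tables can be joined per event without a central event table.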
Why use a Relational Database?
• Physics objects typically have a fixed set of attributes used for event selection and analysis
• Independence of tables aids loading, updating database
• Data can be “bulk loaded” as long as primary key is provided in input data stream
• Several vendors with quite capable products, large commercial market
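The bulk-loading point can be illustrated with a short sketch: when every input row already carries its primary key, a table can be filled in one batch, independently of the other tables. This uses `sqlite3` as a stand-in for the database; the `jet` table, its columns, and the helper `bulk_load` are hypothetical, and a production system would more likely use the vendor's own bulk-load utility.

```python
import sqlite3


def bulk_load(conn, table, rows):
    """Insert all rows in a single transaction and return the row count.

    Each row must already contain its primary key columns. The table name
    is interpolated into the SQL, so it must come from trusted code.
    """
    rows = list(rows)
    if not rows:
        return 0
    placeholders = ", ".join("?" * len(rows[0]))
    with conn:  # commits on success, rolls back if any insert fails
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jet (
        run INTEGER, event INTEGER, instance INTEGER, et REAL,
        PRIMARY KEY (run, event, instance)
    )
""")

# Made-up input stream: (run, event, instance, et) per jet.
input_stream = [(1, 100, 1, 55.0), (1, 100, 2, 31.0), (1, 101, 1, 20.0)]
loaded = bulk_load(conn, "jet", input_stream)
count = conn.execute("SELECT COUNT(*) FROM jet").fetchone()[0]
print(loaded, count)
```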
Prototype POD
• Use DØ Run 1 data (1992-1996 running period)
• 62 million events loaded into the database
  • Entire “All-Stream” data set loaded
  • Data set used by almost all DØ physics analyses
  • Only files with special processing or trigger conditions excluded
• Column-wise ntuple format used for importing/exporting data
DØ Run 1 POD
• Including indexes, Run 1 POD occupies ~100 GB
  • 58% physics object data
  • 18% indexes on object ET
  • 12% primary keys
  • 12% database overhead
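The breakdown above translates into approximate absolute sizes, and, combined with the 62 million events quoted for the prototype, into an average database footprint per event. The sketch below treats "~100 GB" as exactly 100 decimal gigabytes, which is an assumption, so the results are only indicative.

```python
# Assumption: "~100 GB" is taken as exactly 100 decimal gigabytes.
TOTAL_GB = 100
# Number of Run 1 events loaded into the prototype POD.
EVENTS_LOADED = 62_000_000

# Percentage shares quoted on the slide.
SHARES_PERCENT = {
    "physics object data": 58,
    "indexes on object ET": 18,
    "primary keys": 12,
    "database overhead": 12,
}

sizes_gb = {name: TOTAL_GB * pct / 100 for name, pct in SHARES_PERCENT.items()}
bytes_per_event = TOTAL_GB * 10**9 / EVENTS_LOADED

for name, size in sizes_gb.items():
    print(f"{name}: ~{size:.0f} GB")
print(f"~{bytes_per_event / 1000:.1f} kB of database storage per event")
```

Under these assumptions the POD uses on the order of 1.6 kB per event, of which a little under 1 kB is physics object data.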
POD Benchmarks
• Z → e+e- candidate event selection:
  • 7 seconds to identify ~6k events
• W → eν candidate event selection:
  • 18 seconds to identify ~86k events
• Both benchmark times make use of particle ID tables
• Event selection times compare very favorably with ~1000 CPU hours required to generate ntuples used in this study
Benchmark Hardware/Software
• 450 MHz dual-processor Pentium II with 256 MB RAM
• Database stored on six 36 GB disks in a RAID 0 stripe set
• MS SQL Server running on Windows NT 4.0
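A dielectron selection of the kind benchmarked here can be expressed as a self-join on the electron table. The sketch below is illustrative only: it runs on `sqlite3` rather than MS SQL Server, the schema, the 25 GeV threshold, and the sample rows are hypothetical, and it does not model the particle ID tables that the quoted timings rely on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE electron (
        run INTEGER, event INTEGER, instance INTEGER, et REAL,
        PRIMARY KEY (run, event, instance)
    )
""")
# An index on object ET, as in the Run 1 POD storage breakdown.
conn.execute("CREATE INDEX electron_et ON electron (et)")

# Made-up sample data.
conn.executemany("INSERT INTO electron VALUES (?, ?, ?, ?)", [
    (1, 10, 1, 45.0), (1, 10, 2, 38.0),  # two high-ET electrons
    (1, 11, 1, 50.0),                    # only one electron
    (1, 12, 1, 40.0), (1, 12, 2, 8.0),   # second electron too soft
])


def select_dielectron_events(conn, min_et):
    """Return (run, event) pairs with at least two electrons above min_et."""
    return conn.execute("""
        SELECT DISTINCT a.run, a.event
        FROM electron AS a
        JOIN electron AS b
          ON a.run = b.run AND a.event = b.event AND a.instance < b.instance
        WHERE a.et > ? AND b.et > ?
        ORDER BY a.run, a.event
    """, (min_et, min_et)).fetchall()


candidates = select_dielectron_events(conn, 25.0)
print(candidates)
```

The `a.instance < b.instance` condition pairs each two distinct electrons once, so an event is never matched against itself.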
DVD Storage
• Provide access to additional event information not included in POD
• DVD-RAM has a number of unique capabilities
  • Less expensive than disk storage, doesn’t require backup
  • Access to individual events is much faster than tape storage
  • Current disk capacity is 2.6 GB, 4.7 GB expected soon
• Commercial DVD libraries hold up to 600 DVD disks
  • 2.8 TB capacity using 4.7 GB DVD-RAM disks
  • Average disk load time of 4.5 s, <1 hour to cycle through 600 disks
  • Up to 6 DVD-RAM drives give ~10 MB/s IO rate
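The library figures above can be cross-checked with simple arithmetic. The capacity and cycle-time calculations use only numbers from the slide; the time to read the entire library is an extrapolation that assumes decimal units and a sustained 10 MB/s, and is not a figure from the talk.

```python
# Figures quoted on the slide.
DISKS = 600
GB_PER_DISK = 4.7
LOAD_SECONDS_PER_DISK = 4.5
IO_MB_PER_SECOND = 10  # approximate aggregate rate with 6 drives

# Total capacity, assuming decimal units (1 TB = 1000 GB).
capacity_tb = DISKS * GB_PER_DISK / 1000

# Time spent only on loading each disk once, ignoring read time.
cycle_minutes = DISKS * LOAD_SECONDS_PER_DISK / 60

# Extrapolation: time to read every byte at the quoted IO rate.
full_read_hours = capacity_tb * 1_000_000 / IO_MB_PER_SECOND / 3600

print(f"capacity: {capacity_tb:.2f} TB")
print(f"load-only cycle: {cycle_minutes:.0f} minutes")
print(f"full read at {IO_MB_PER_SECOND} MB/s: {full_read_hours:.0f} hours")
```

The capacity comes out at about 2.8 TB and the load-only cycle at 45 minutes, consistent with the quoted values. A full sequential read would take on the order of days, which illustrates why access to selected events matters more than bulk throughput.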
Web Interface
• Plan to develop web-based user interface
• Interface modelled on “3-tier” architecture widely used in commercial applications
  • Physicist will enter event selection requirements using a Java applet
  • Applet communicates request to “Physics Intelligence” middleware running on PHASER system (via CORBA)
    • Translate request to SQL for event selection
    • Verify that request can be accommodated within resource constraints
    • Produce the requested output files
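The "translate request to SQL" step might look something like the following sketch. The slides do not say how the middleware represents a request, so the request format (a list of column/operator/value cuts), the table and column names, and the function `build_selection_sql` are all hypothetical.

```python
# Hypothetical whitelist of tables and columns; the real POD schema is not given.
ALLOWED_COLUMNS = {
    "electron": {"et", "eta", "phi"},
    "jet": {"et", "eta", "phi"},
}
ALLOWED_OPERATORS = {"<", "<=", ">", ">=", "="}


def build_selection_sql(table, cuts):
    """Turn a list of (column, operator, value) cuts into parameterized SQL.

    Returns (sql, params). Table, column, and operator names are checked
    against whitelists because identifiers cannot be bound as parameters.
    """
    if table not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown table: {table}")
    clauses, params = [], []
    for column, operator, value in cuts:
        if column not in ALLOWED_COLUMNS[table]:
            raise ValueError(f"unknown column: {column}")
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator}")
        clauses.append(f"{column} {operator} ?")
        params.append(value)
    sql = f"SELECT DISTINCT run, event FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY run, event"
    return sql, params


sql, params = build_selection_sql(
    "electron", [("et", ">", 25.0), ("eta", "<", 1.1)]
)
print(sql)
print(params)
```

Keeping the cut values as bound parameters rather than pasting them into the SQL string means a malformed request is rejected or treated as data, never executed as SQL.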
PHASER Output
• Several output options:
  • List of run and event numbers satisfying the request
  • Ntuple created from POD information
  • mDST stream containing requested events from DVD library
• Output files will generally be small enough to transfer over the network
• Larger output files can be written to DVD and physically sent to physicist for further analysis
Conclusions
• PHASER offers a way for experts, novices, and “dinosaurs” alike to quickly extract information about a particular class of events
• Feasibility of loading “Run 1” size physics object info into a relational database has been demonstrated
• Significant improvements in event selection time have been observed for W/Z benchmarks
• Expect these results will scale up to Run 2 data load
• Database technology is also potentially useful for helping manage complex analyses and storing intermediate results