Nuclear Physics Data Management Needs Bruce G. Gibbard. SLAC DMW2004 Workshop 16-18 March 2004. Overview. Addressing a class of Nuclear Physics (NP) experiments utilizing large particle detector systems to study accelerator produced reactions Examples at: BNL (RHIC), JLab, CERN (LHC)
Nuclear Physics Data Management Needs
Bruce G. Gibbard
SLAC DMW2004 Workshop, 16-18 March 2004
Overview
• Addressing a class of Nuclear Physics (NP) experiments utilizing large particle detector systems to study accelerator-produced reactions
  • Examples at: BNL (RHIC), JLab, CERN (LHC)
• Technologies & data management needs of this branch of NP are quite similar to those of HEP
• Integrating across its four experiments, the Relativistic Heavy Ion Collider (RHIC) at BNL is currently the most prolific producer of data
  • Study of very high energy collisions of heavy ions (up to Au on Au)
  • High nucleon count, high energy => high multiplicity
  • High multiplicity, high luminosity and fine detector granularity => very high data rates
  • Raw data recording at up to ~250 MBytes/sec
[Figure: Digitized Event in STAR at RHIC]
IT Activities of Such NP Experiments
• Support the basic computing infrastructure for the experimental collaboration
  • Typically large (hundreds of physicists) and internationally distributed
  • Manage & distribute code, design, cost, & schedule databases
  • Facilitate communication, documentation and decision making
• Store, process, support analysis of, and serve data
  • Online recording of Raw data
  • Generation and recording of Simulated data
  • Construction of Summary data from Raw and Simulated data
  • Iterative generation of Distilled Data Subsets from Summary data
  • Serve Distilled Data Subsets and analysis capability to widely distributed individual physicists
[Slide label: Data Intensive Activities]
[Figure: Data Handling Limited]
Data Volumes in Current RHIC Run
• Raw Data (PHENIX)
  • Peak rates to 120 MBytes/sec
  • First 2 months of ’04 (Jan & Feb)
    • 10^9 Events
    • 160 TBytes
  • Project ~225 TBytes of Raw data for Current Run
• Derived Data (PHENIX)
  • Construction of Summary Data from Raw Data, then production of distilled subsets from that Summary Data
  • Project ~270 TBytes of Derived data
• Total (all of RHIC) = 1.2 PBytes for Current Run
  • STAR = PHENIX
  • BRAHMS + PHOBOS = ~40% of PHENIX
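The totals on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 PByte = 1000 TBytes), takes the two-month event count as 10^9, and assumes recording was spread evenly over January and February 2004; the variable names are illustrative only.

```python
# Figures quoted on the slide (PHENIX, current run).
RAW_TB = 225.0       # projected Raw data
DERIVED_TB = 270.0   # projected Derived data
phenix_tb = RAW_TB + DERIVED_TB

# Slide's scaling: STAR = PHENIX; BRAHMS + PHOBOS = ~40% of PHENIX.
star_tb = phenix_tb
brahms_phobos_tb = 0.40 * phenix_tb
total_pb = (phenix_tb + star_tb + brahms_phobos_tb) / 1000.0  # decimal units assumed

# First two months of 2004: 1e9 events recorded in 160 TBytes.
EVENTS = 1e9
RECORDED_TB = 160.0
kb_per_event = RECORDED_TB * 1e9 / EVENTS  # 1 TByte = 1e9 KBytes (decimal)

# Average rate if recording were continuous over Jan + Feb 2004 (31 + 29 days).
seconds = (31 + 29) * 86400
avg_mb_per_s = RECORDED_TB * 1e6 / seconds  # 1 TByte = 1e6 MBytes (decimal)

print(f"All-RHIC total: {total_pb:.2f} PBytes")
print(f"Mean raw event size: {kb_per_event:.0f} KBytes")
print(f"Two-month average rate: {avg_mb_per_s:.1f} MBytes/sec")
```

Under these assumptions the sum comes to about 1.19 PBytes, consistent with the quoted 1.2 PBytes, and the two-month average rate is roughly a quarter of the 120 MBytes/sec peak.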
[Figure: RHIC Raw Data Recording Rate; panels for PHENIX and STAR, each marked at 120 MBytes/sec]
Current RHIC Technology
• Tertiary Storage
  • StorageTek / HPSS
  • 4 Silos: 4.5 PBytes (1.5 PBytes currently filled)
  • 1000 MBytes/sec theoretical native I/O bandwidth
• Online Storage
  • Central NFS served disk
    • ~170 TBytes of FibreChannel-connected RAID 5
    • ~1200 MBytes/sec served by 32 Sun SMPs
  • Distributed disk
    • ~300 TBytes of SCSI/IDE
    • Locally mounted on Intel/Linux farm nodes
• Compute
  • ~1300 Dual Processor Red Hat Linux / Intel Nodes
  • ~2600 CPUs => ~1,400 kSPECint2K (3-4 TFLOPS)
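A few derived ratios help put these facility numbers in proportion. The sketch below uses only the figures on the slide, assumes decimal units (1 PByte = 1e9 MBytes), and treats the quoted tape bandwidth as fully sustainable, which is an idealization.

```python
# Figures quoted on the slide.
TAPE_CAPACITY_PB = 4.5
TAPE_FILLED_PB = 1.5
TAPE_BANDWIDTH_MB_S = 1000.0  # theoretical native I/O bandwidth
NFS_BANDWIDTH_MB_S = 1200.0
NFS_SERVERS = 32
NODES = 1300                  # dual-processor farm nodes
TOTAL_KSPECINT2K = 1400.0

MB_PER_PB = 1e9  # assumption: decimal units

# Fraction of the tape silos already in use.
tape_fill_fraction = TAPE_FILLED_PB / TAPE_CAPACITY_PB

# Days needed to read everything on tape once at the theoretical bandwidth.
full_read_days = TAPE_FILLED_PB * MB_PER_PB / TAPE_BANDWIDTH_MB_S / 86400

# Average throughput per NFS server, and average rating per CPU.
nfs_mb_s_per_server = NFS_BANDWIDTH_MB_S / NFS_SERVERS
cpus = 2 * NODES
specint2k_per_cpu = TOTAL_KSPECINT2K * 1000 / cpus

print(f"Tape filled: {tape_fill_fraction:.0%}")
print(f"One full pass over tape: {full_read_days:.1f} days")
print(f"NFS per server: {nfs_mb_s_per_server:.1f} MBytes/sec")
print(f"Per CPU: {specint2k_per_cpu:.0f} SPECint2K")
```

Even at the theoretical tape bandwidth, a single pass over the stored data takes more than two weeks, which illustrates why repeated access to the full data set is costly.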
Projected Growth in Capacity Scale
• Moore’s Law effect of component replacement in experiment DAQs & in computing facilities => ~6x increase in 5 years
• Not-yet-fully-specified requirements of the RHIC II and eRHIC upgrades are likely to accelerate growth
[Figure: Disk Volume at RHIC]
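The "~6x in 5 years" figure can be translated into an annual growth factor and a doubling time. The sketch below assumes smooth exponential growth between the two end points, which is a simplification; the function name `capacity_factor` is hypothetical.

```python
import math

GROWTH_FACTOR = 6.0  # quoted increase
PERIOD_YEARS = 5.0   # over this many years

# Equivalent annual multiplier and doubling time under exponential growth.
annual_factor = GROWTH_FACTOR ** (1.0 / PERIOD_YEARS)
doubling_years = PERIOD_YEARS * math.log(2.0) / math.log(GROWTH_FACTOR)

def capacity_factor(years: float) -> float:
    """Multiplier on today's capacity after `years`, assuming smooth growth."""
    return GROWTH_FACTOR ** (years / PERIOD_YEARS)

print(f"Annual factor: {annual_factor:.2f}")
print(f"Doubling time: {doubling_years * 12:.0f} months")
print(f"Factor after 3 years: {capacity_factor(3.0):.1f}")
```

Under this assumption capacity grows by about 43% per year and doubles roughly every 23 months.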
NP Analysis Limitations (1)
• Underlying the Data Management issue
  • Events (interactions) of interest are rare relative to minimum bias events
    • Threshold / phase space effect for each new energy domain
  • Combinatorics of large multiplicity events of all kinds confound selection of interesting events
  • Combinatorics also create backgrounds to signals of interest
• Two analysis approaches
  • Topological: typically with
    • Many qualitative &/or quantitative constraints on data sample
    • Relatively low background to signal
    • Modest number of events in final analysis data sample
  • Statistical: frequently with
    • More poorly constrained sample
    • Large background (signal is small difference between large numbers)
    • Large number of events in final analysis data sample
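The contrast between the two approaches can be made concrete with a simple counting model. The sketch below is an illustration, not the method of any RHIC experiment: it assumes Poisson counting, a perfectly known background, and the common rough estimator significance ≈ S / sqrt(S + B). The function name `events_needed` and the example signal-to-background ratios are hypothetical.

```python
import math

def events_needed(signal_to_background: float, z: float = 5.0) -> int:
    """Total events (S + B) needed for significance z, given the ratio S/B.

    With f = S/B and N = S + B, S = N * f / (1 + f), so
    z = S / sqrt(N) = sqrt(N) * f / (1 + f), giving N = (z * (1 + f) / f) ** 2.
    """
    f = signal_to_background
    return math.ceil((z * (1.0 + f) / f) ** 2)

# Topological-style sample: tight constraints, background comparable to signal.
clean_sample = events_needed(1.0)

# Statistical-style sample: signal is a 1% excess over a large background.
noisy_sample = events_needed(0.01)

print(f"S/B = 1:    {clean_sample} events")
print(f"S/B = 0.01: {noisy_sample} events")
```

In this model the required sample grows roughly as 1/(S/B)^2 when the signal fraction is small, so a hundredfold worse signal-to-background ratio costs thousands of times more events in the final analysis sample.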
NP Analysis Limitations (2)
• Topological Analyses seem to be possible less frequently in NP than in HEP, so Statistical Analyses are more often required
  • Evidence for this is rather anecdotal; not all would agree
  • To the extent that it is true, final analysis data sets tend to be large
  • These are the data sets accessed very frequently by large numbers of users, thus exacerbating the data management problem
• In any case, the extraction and delivery of distilled data subsets to physicists for analysis is what currently most limits NP analyses
Grid / Data Management Issues
• Major RHIC experiments are moving (have moved) complete copies of Summary Data to regional analysis centers
  • STAR: to LBNL via Grid Tools
  • PHENIX: to RIKEN via Tape/Airfreight
• Evolution toward more sites and full dependence on Grid
• RHIC, JLab, and NP at the LHC are all very interested and active in Grid development
  • Including high performance reliable Wide Area data movement / replication / access services
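To get a feel for what wide-area replication at this scale demands, one can compute the sustained network rate needed to copy a data set within a given time. The sketch below is purely illustrative: the 270 TByte volume is borrowed from the projected PHENIX Derived data figure earlier in the talk, the 30-day window is an assumption, and `required_rate_mb_s` is a hypothetical helper, not part of any Grid toolkit.

```python
def required_rate_mb_s(volume_tb: float, days: float) -> float:
    """Sustained rate (MBytes/sec, decimal units) to move volume_tb in `days`."""
    return volume_tb * 1e6 / (days * 86400)

# Assumed example: replicate ~270 TBytes within 30 days.
rate_mb_s = required_rate_mb_s(270.0, 30.0)
rate_mbit_s = rate_mb_s * 8  # same rate expressed in Mbit/sec

print(f"Required sustained rate: {rate_mb_s:.0f} MBytes/sec "
      f"(~{rate_mbit_s:.0f} Mbit/sec)")
```

With these assumed inputs the copy needs a little over 100 MBytes/sec held continuously for a month, comparable to the peak raw recording rate of a single experiment.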
Conclusions
• NP and HEP accelerator/detector experiments have very similar Data Management requirements
• NP analyses of this type currently tend to be more Data limited than CPU limited
• “Mining” of Summary Data and affording end users adequate access (both Local and Wide Area) to the resulting distillate is what currently most limits NP analysis
• It is expected that this will remain the case for the next 4-6 years, through
  • Upgrades of RHIC and JLab
  • Start-up of LHC
  with Wide Area access growing in importance relative to Local access