High-Level Data Access With Ease. Grace.
Greg Landsberg
Prague Software Workshop
Run I Lessons
Tower of Babylon!
• Plethora of Ntuple formats: QCD WZQCD Top LQ SUSY …
• Proprietary physics analysis code
• Number of nearly identical particle ID criteria
• Lack of standard documented ways to start a new physics analysis
Result: a living hell for remote collaborators!
How to avoid this in Run II?
• Enforce standard physics analysis packages
• Strongly enforce standard object ID
• Create and support standard heptuples
• Develop a number of WWW-friendly high-end analysis tools
The ultimate product of this collaboration is a flow of high-quality papers, and we have to make it as easy as possible for remote physicists to contribute efficiently to physics analyses.
Standard Object ID
The two key components are:
• Strong ID groups
• Strong management
Strong ID groups:
• Develop out-of-the-box particle objects and ID criteria
• Provide enough versatility to satisfy different physics needs (e.g. low-energy and high-energy electrons)
• Provide well-optimized selection tools
• Provide efficiencies and fake probabilities for standard ID cuts
Strong management:
• No new analysis is allowed to start with a proprietary object ID
• If a new analysis convincingly proves that a new object ID is essential, that ID should be standardized immediately: either added to the list of accepted standards or substituted for one of the existing standards
• New efficiencies and fake probabilities are then calculated for the new ID
• Standard objects should be used in standard heptuples
Standard HEPtuple Proposal
• To become a standard, these heptuples need to be introduced early on, before people go off and do MC-based physics analyses. The time is now!
• They should be enforced (management) and supported (ID groups)
• Versatile enough to meet most of the physics goals
• Expandable to accommodate new analyses
• RCP/WWW-controlled user interface
• Sufficiently smaller than mDST, i.e. 2 KB/event
• The proposed format is based on my experience in analyzing Run I data; it’s just a starting point!
Necessary components:
• Event tags
• Triggers
• Accelerator conditions
• Global quantities
• Basic physics quantities
• Electrons/photons
• Muons
• Jets
• τ’s, b- and c-jets
• High-pT tracks
• …
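As a rough illustration of how the listed components could sit in one event record, here is a minimal sketch. Every field name, the five-word object layout, the six-word header, and the 4-byte word size are assumptions made for the example; only the 2 KB/event target comes from the proposal.

```python
from dataclasses import dataclass, field
from typing import List

BUDGET_BYTES = 2048     # "2 KB/event" target from the proposal
BYTES_PER_WORD = 4      # assumption: 32-bit words
WORDS_PER_OBJECT = 5    # assumption: kind, ET, eta, phi, ID flags
HEADER_WORDS = 6        # assumption: run, event, tags, triggers, luminosity, missing ET


@dataclass
class PhysicsObject:
    kind: str           # e.g. "electron", "photon", "muon", "jet" (hypothetical labels)
    et: float
    eta: float
    phi: float
    id_flags: int = 0   # packed results of standard ID cuts (hypothetical layout)


@dataclass
class Event:
    run: int
    event: int
    tags: int = 0            # event tags as a bit mask
    triggers: int = 0        # trigger bits
    luminosity: float = 0.0  # accelerator conditions
    missing_et: float = 0.0  # a global quantity
    objects: List[PhysicsObject] = field(default_factory=list)

    def size_bytes(self) -> int:
        """Estimated packed size of the record under the assumptions above."""
        return BYTES_PER_WORD * (HEADER_WORDS + WORDS_PER_OBJECT * len(self.objects))


# A toy event with 20 identical objects, just to compare against the budget.
ev = Event(run=1, event=1,
           objects=[PhysicsObject("jet", 25.0, 0.1, 1.0) for _ in range(20)])
print(ev.size_bytes(), "bytes of a", BUDGET_BYTES, "byte budget")
```

Under these assumptions a 20-object event uses well under the 2 KB budget, which leaves room for the additional per-object quantities a real format would carry.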
Standard HEPtuples details: Triggers, Accelerator, Event Tags
Standard HEPtuples details (cont’d): Physics Quantities, Photons, Global Quantities
Standard HEPtuples details (cont’d): Electrons, Muons, Jets
Standard HEPtuples details (cont’d): b/c/τ Jets, High-pT tracks
Physics Object Database
Slides from Richard Partridge (GCM talk, Seattle workshop)
What is POD?
• R&D project to investigate the use of database technology for physics analysis of large data samples
• POD uses a commercial relational database program to store:
  • Calibrated physics objects (leptons, jets, missing ET)
  • Results of particle ID algorithms
  • Global quantities (triggers, vertices, etc.)
• Database queries perform event selection
  • Example: select top eμ events by requiring 1 e, 1 μ, and 2 jets with |η| cuts and ET / missing-ET thresholds
• Query output is physics analysis input
  • Ntuple with database info for selected events
  • List of run/event numbers allows selected mDSTs to be quickly fetched for advanced analyses
• Current goal is to demonstrate feasibility, develop necessary tools, and establish performance benchmarks using a database loaded with the Run 1 data sample
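The eμ selection example can be sketched as a single SQL query. This is only an illustration: an in-memory SQLite database stands in for the commercial server, the `objects` and `globals` tables, their columns, the toy rows, and all thresholds are invented, and "1 e, 1 μ, and 2 jets" is read here as "at least".

```python
import sqlite3

# In-memory SQLite stands in for the commercial database used by POD.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (run INTEGER, event INTEGER, kind TEXT, et REAL, eta REAL)"
)
conn.execute("CREATE TABLE globals (run INTEGER, event INTEGER, missing_et REAL)")

# Toy rows, invented purely to exercise the query.
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "e", 32.0, 0.4), (1, 10, "mu", 25.0, -1.1),
        (1, 10, "jet", 40.0, 1.5), (1, 10, "jet", 22.0, -0.3),
        (1, 11, "e", 45.0, 0.9), (1, 11, "jet", 30.0, 0.2),
        (1, 12, "e", 28.0, 2.9), (1, 12, "mu", 30.0, 0.1),
        (1, 12, "jet", 35.0, 0.5), (1, 12, "jet", 25.0, 1.0),
    ],
)
conn.executemany(
    "INSERT INTO globals VALUES (?, ?, ?)",
    [(1, 10, 35.0), (1, 11, 50.0), (1, 12, 40.0)],
)

# Count the objects passing the kinematic cuts per event, then keep events
# with at least 1 e, at least 1 mu, and at least 2 jets.
# In SQLite a comparison evaluates to 0 or 1, so SUM() counts passing rows.
QUERY = """
SELECT o.run, o.event
FROM objects AS o
JOIN globals AS g ON g.run = o.run AND g.event = o.event
WHERE g.missing_et > :met_min
GROUP BY o.run, o.event
HAVING SUM(o.kind = 'e'   AND o.et > :lep_et AND ABS(o.eta) < :eta_max) >= 1
   AND SUM(o.kind = 'mu'  AND o.et > :lep_et AND ABS(o.eta) < :eta_max) >= 1
   AND SUM(o.kind = 'jet' AND o.et > :jet_et AND ABS(o.eta) < :eta_max) >= 2
ORDER BY o.run, o.event
"""

# Hypothetical thresholds, chosen only for the example.
cuts = {"met_min": 20.0, "lep_et": 20.0, "jet_et": 15.0, "eta_max": 2.5}
selected = conn.execute(QUERY, cuts).fetchall()
print(selected)  # list of (run, event) pairs to fetch from the mDST
```

The result is the run/event list described above: it can be written out as an ntuple or used to fetch the matching mDST events.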
Why Should One Use a Database?
• Designed to store, retrieve, update, and manage complex data samples
• Large number of data types
  • Bits, integers, floats, characters, binary objects, etc.
• Many ways of organizing data
  • Physics object, event, file, stream, run, etc.
• Architecture allows fast access to data
  • Avoid reading/unpacking entire event to look at 1 bit
  • Separating algorithm results from physics object data eliminates need to look at all 600M events
• Flexible access to data
  • Data, columns, tables, etc. can be added, updated, or deleted without recreating the database
  • Example: new calibrations/algorithms can be added to the database and compared to the old ones
• Central location for latest calibrations, corrections, algorithms, etc.
• Local processing, minimal network IO
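The "flexible access" point, adding a new calibration next to the old one without recreating the database, might look like the following sketch. SQLite again stands in for the real server; the `electrons` table, the `et_v2` column, and the 2% scale factor are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE electrons (run INTEGER, event INTEGER, et REAL)")
conn.executemany(
    "INSERT INTO electrons VALUES (?, ?, ?)",
    [(1, 1, 20.0), (1, 2, 40.0), (1, 3, 60.0)],  # toy data
)

# Add a column for the new calibration in place; existing rows are untouched.
conn.execute("ALTER TABLE electrons ADD COLUMN et_v2 REAL")

# Hypothetical recalibration: a flat 2% energy-scale correction.
conn.execute("UPDATE electrons SET et_v2 = et * 1.02")

# Old and new calibrations can now be compared row by row.
rows = conn.execute(
    "SELECT run, event, et, et_v2, et_v2 - et AS shift "
    "FROM electrons ORDER BY run, event"
).fetchall()
for row in rows:
    print(row)
```

Because the old column is kept, any query written against `et` keeps working while analyses migrate to the new calibration.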
POD Status
• Database server running at Brown
  • Dual P-II/450 with ~40 GB available for testing
  • Using SQL Server for present tests
• Preliminary studies using pseudo-data
  • 30M “electron” 4-vectors generated with flat ET, η, φ distributions
  • ~1 minute to select 100K events satisfying restrictive ET, |η| cuts
  • 1.5 - 3.5 minutes to select ~16K events with 2 electrons satisfying loose ET, |η| cuts
  • For comparison, expect 2M produced W → eν per fb⁻¹
  • While rather crude, these results suggest that the POD approach can increase the speed for event selection by several orders of magnitude
• C++ program is being developed to load ntuples into the database
  • Heptuple used to read ntuples
  • ADO API used to write to database
• Database is being loaded with Run 1 data (ALL stream), LQ-based ntuples
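A scaled-down version of the pseudo-data study can be sketched as follows. It is not the benchmark from the slides: the sample is 100K rows instead of 30M, SQLite replaces SQL Server, and the ET and η ranges, the cut values, and the index are all assumptions, so the timing it prints says nothing about the numbers quoted above.

```python
import math
import random
import sqlite3
import time

N = 100_000                  # scaled down from the 30M 4-vectors in the slides
rng = random.Random(12345)   # fixed seed so the toy sample is reproducible

# Flat ET, eta, phi distributions; the ranges are assumptions.
rows = [
    (i, rng.uniform(0.0, 100.0), rng.uniform(-3.0, 3.0), rng.uniform(0.0, 2.0 * math.pi))
    for i in range(N)
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE electrons (event INTEGER, et REAL, eta REAL, phi REAL)")
conn.executemany("INSERT INTO electrons VALUES (?, ?, ?, ?)", rows)
conn.execute("CREATE INDEX idx_electrons_et ON electrons (et)")

# Hypothetical "restrictive" cuts.
ET_MIN, ETA_MAX = 90.0, 1.0

start = time.perf_counter()
n_selected = conn.execute(
    "SELECT COUNT(*) FROM electrons WHERE et > ? AND ABS(eta) < ?",
    (ET_MIN, ETA_MAX),
).fetchone()[0]
elapsed = time.perf_counter() - start

print(f"selected {n_selected} of {N} rows in {elapsed:.4f} s")
```

The index on `et` lets the database skip most rows for a tight ET cut, which is the kind of gain the "avoid reading the entire event" argument relies on.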
POD Tools Planned
• Program to load heptuples into the database
• Web interface to construct database queries that perform event selection
  • Provide web form for selecting desired physics objects, algorithms, kinematic cuts (ET, |η|, etc.), triggers, runs, etc.
  • Translate selection criteria into an SQL command
  • Save resulting event list, output ntuple
• Ntuple generator to create heptuple of database variables for selected events
  • Web interface to define correspondence between heptuple and database columns
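The "translate selection criteria into an SQL command" step could be sketched like this. The function name `build_query`, the table and column whitelists, and the cut operators are all hypothetical; the point of the sketch is that form values go into query parameters and only whitelisted names are ever placed in the SQL text.

```python
from typing import List, Sequence, Tuple

# Hypothetical schema: which tables and columns a web form may refer to.
ALLOWED_TABLES = {"electrons", "muons", "jets"}
ALLOWED_COLUMNS = {"et", "eta", "run"}

# Supported cut operators and the SQL fragment each one expands to.
OPERATORS = {
    ">": "{col} > ?",
    "<": "{col} < ?",
    "abs<": "ABS({col}) < ?",
}


def build_query(table: str, cuts: Sequence[Tuple[str, str, float]]) -> Tuple[str, List[float]]:
    """Turn (column, operator, value) cuts into a parameterized SELECT."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table!r}")
    clauses: List[str] = []
    params: List[float] = []
    for column, op, value in cuts:
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown column: {column!r}")
        if op not in OPERATORS:
            raise ValueError(f"unknown operator: {op!r}")
        clauses.append(OPERATORS[op].format(col=column))
        params.append(float(value))
    sql = f"SELECT DISTINCT run, event FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY run, event"
    return sql, params


# Example: what a form asking for electrons with ET > 20 and |eta| < 1.1 might produce.
sql, params = build_query("electrons", [("et", ">", 20.0), ("eta", "abs<", 1.1)])
print(sql)
print(params)
```

The returned pair can be passed directly to a database cursor's `execute` call, and the same cut list can be stored with the output ntuple to document how the event list was made.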
NT as POD Server OS
• NT has proven ability to handle large databases
  • Supported by all leading database vendors
  • NT has best price/performance in standard database benchmarks
  • Good scalability in multi-processor systems (up to 8 P-III processors with forthcoming Profusion chip set)
• NT supports ADO (ActiveX Data Objects)
  • Provides high-level API that greatly simplifies programming the database interface
  • ADO interfaces to all leading database products
  • Brown is using ADO to develop software for loading ntuples into a database for our prototype studies
  • Brown also plans to provide a web-based query capability based on ADO
• NT makes setting up and managing a high-performance / high-reliability database remarkably easy
  • CD support for NT project servers not required
POD Server Requirements
• Disk subsystem
  • Disk capacity determines how much info is stored
  • ~1 TB would allow ~1K of info/event
  • ~20 objects/event in Run 1, ~10 words/object
  • More info can be stored by adding disk space
• Multiprocessor Server(s)
  • Large database queries are CPU (and disk) intensive
  • Queries execute in parallel on multiple CPUs
  • Goal would be to have a typical query selecting a small sub-sample in ~1 minute
  • Multiple servers can be clustered if needed
• Optional DVD-RAM jukebox
  • Expect to be able to store ~2.8 TB in a single jukebox at 1/4 cost of disk space
  • Allows retrieval of full mDST event information for events selected by the database query
POD Server is well matched to Project Server specs
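The disk-sizing estimate can be checked with a few lines of arithmetic. The 4-byte word size is an assumption, as is treating the 600M events quoted on the "Why Should One Use a Database?" slide as the sample to be stored; the object and word counts are the ones listed above.

```python
OBJECTS_PER_EVENT = 20       # "~20 objects/event in Run 1"
WORDS_PER_OBJECT = 10        # "~10 words/object"
BYTES_PER_WORD = 4           # assumption: 32-bit words
N_EVENTS = 600_000_000       # assumption: the 600M events quoted earlier
DISK_BYTES = 1_000_000_000_000  # ~1 TB, decimal

# Raw payload per event and for the whole sample.
bytes_per_event = OBJECTS_PER_EVENT * WORDS_PER_OBJECT * BYTES_PER_WORD
total_gb = bytes_per_event * N_EVENTS / 1e9

# Space available per event if the whole disk is spread over the sample.
available_per_event = DISK_BYTES / N_EVENTS

print(f"payload per event:       {bytes_per_event} bytes")
print(f"payload for all events:  {total_gb:.0f} GB")
print(f"disk available per event: {available_per_event:.0f} bytes")
```

Under these assumptions the object payload is a little under 1 KB per event, and 1 TB leaves headroom beyond that for global quantities, indexes, and database overhead.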
DVD Jukebox Storage
• DVD-RAM: 2.6 GB/side, 5.2 GB total, $15-25/DVD; 4.8 GB/side were just announced!
• DVD libraries: 600 DVDs, 3 TB of storage for about $45K, or $15/GB!
• 10-40 MB/s throughput
• Very promising technology, potential capacity up to 17 GB/DVD
• Fast price drop, wide availability
• Brown group has purchased a single 1X DVD-RAM recorder for performance tests
• Excellent tool for remote collaborators to have local copies of mDST data set (or selected STA/DST streams)
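The capacity and cost figures above follow from simple arithmetic, reproduced here as a check. All inputs are the numbers quoted on the slide; decimal units are assumed.

```python
DISCS = 600                   # discs per library
GB_PER_DISC = 5.2             # double-sided DVD-RAM, 2.6 GB per side
LIBRARY_COST = 45_000.0       # dollars, "about $45K"
DISC_PRICE_RANGE = (15.0, 25.0)  # dollars per disc

# Total library capacity and cost per GB of the library as a whole.
capacity_gb = DISCS * GB_PER_DISC
cost_per_gb = LIBRARY_COST / capacity_gb

# Media-only cost per GB at the low and high disc prices.
media_cost_per_gb = tuple(price / GB_PER_DISC for price in DISC_PRICE_RANGE)

print(f"library capacity: {capacity_gb:.0f} GB")
print(f"library cost:     ${cost_per_gb:.2f}/GB")
print(f"media cost:       ${media_cost_per_gb[0]:.2f} to ${media_cost_per_gb[1]:.2f} per GB")
```

The library works out to roughly 3.1 TB at a little over $14/GB, consistent with the rounded "3 TB" and "$15/GB" on the slide.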
Possible POD/DVD Server (diagram; labels as recovered)
• Remote physicist: Web-based GUI
• Central Analysis Server (Fermilab): POD Server with eight P-III or Merced CPUs and 1 GB memory; DB; 3 TB fast RAID storage; tape robot, 20-30 TB; SCSI bus; fast 1 Gbit/s Ethernet; one or two (shown) fast server(s); T-3 or faster network; mDST, DST, STA
• DVD-RAM library: 15 TB storage, ≈$1M
• POD/DVD Server at YOUR INSTITUTION: DVD Server with eight P-III or Merced CPUs and 1 GB memory; SCSI bus; cache disk; 600-DVD multi-drive changers; 10-40 MB/sec; T-1/T-3 network
• Cost labels: $30K per 5000 MIP server; $60/GB; $15/GB
• A solution to the public data access provision H.R.4328 (or Quarknet)?