STAR grid activities and São Paulo experience
Alexandre A. P. Suaide
VI DOSAR workshop, São Paulo, 2005

NERSC-PDSF
• ~500 CPU
• ~150 TB
• SGE batch
São Paulo
• Test cluster
• 10 CPU, 3 TB
• SGE batch
BNL (2 sites)
• ~1100 CPU
• ~400 TB
• LSF batch
Upgrade project: ~50 CPU and ~40 TB

The size of the raw data
• STAR Au+Au event statistics (raw)
  • ~2-3 MB/event
  • ~20-40 events/s
• Total 2004 Au+Au
  • 20-30 M events
  • ~65 TB
• Cu+Cu run
  • ~70 M events @ 200 GeV
  • ~40 M events @ 62 GeV
  • ~4 M events @ 22 GeV
• Plus all the p+p, d+Au and previous runs
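
As a rough cross-check of the Au+Au numbers above, the sketch below multiplies the event count by the event size. The function name and the use of decimal units (1 TB = 1e6 MB) are assumptions made for illustration; the Cu+Cu volume is not computed because the slide gives no Cu+Cu event size.

```python
def raw_volume_tb(n_events, mb_per_event):
    """Raw data volume in TB, assuming decimal units (1 TB = 1e6 MB)."""
    return n_events * mb_per_event / 1e6


# Ranges quoted on the slide for the 2004 Au+Au run.
low = raw_volume_tb(20e6, 2.0)    # fewest events, smallest event size
mid = raw_volume_tb(25e6, 2.5)    # midpoint of both ranges
high = raw_volume_tb(30e6, 3.0)   # most events, largest event size

print(f"Au+Au 2004 raw volume: {low:.1f} to {high:.1f} TB, midpoint {mid:.1f} TB")
```

The midpoint comes out at 62.5 TB, in line with the ~65 TB quoted on the slide.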

The reconstruction, simulation, etc.
• Reconstruction
  • Basically done at BNL
  • Au+Au is estimated to take 18 months (only 60% is complete)
  • Compare with 1 new run every year
  • A physics-ready production needs ~2 production rounds (calibrations, improvements, etc.)
• Simulation and embedding
  • Done at PDSF
  • Simulation is transferred to BNL
• STAR takes more data than it can currently make available for analysis

Analysis
• Real data analysis is done at RCF
• Simulation and embedding analysis is done at PDSF
• Small fractions of datasets are scattered over many institutions, mainly for analysis development @ PDSF

Why do we need grid?
• If STAR wants to keep production and analysis running at a speed compatible with data taking, other institutions need to share computing power
  • Next run STAR will take at least one order of magnitude more events than last year
  • The RCF/PDSF farm does not grow at the same rate
• The user point of view
  • More time available for physics
    • Data will be available earlier
  • More computing power for analysis
    • Analysis will run faster
  • Submit the jobs from your home institution and get the output there
    • No need to know where the data is
    • No need to log on to RCF or PDSF
    • You manage your disk space

STAR grid
• Three-level structure
  • Tier0 sites (BNL)
    • Dedicated to reconstruction, simulation and analysis
  • Tier1 sites (PDSF)
    • Runs reconstruction on demand
    • Receives all the reconstructed files for analysis
    • Simulations and embedding
  • Tier2 sites (all other facilities, including São Paulo)
    • Receives a fraction of files for analysis
    • May run reconstruction, depending on demand
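
The tier structure above can be written down as a small data model. This is only an illustrative sketch: the `Site` class, the `holds_file` helper and the "one file in ten" Tier2 fraction are hypothetical and are not taken from the STAR software or from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    tier: int
    roles: tuple


# Sites and roles as listed on the slide.
SITES = [
    Site("BNL", 0, ("reconstruction", "simulation", "analysis")),
    Site("PDSF", 1, ("reconstruction on demand", "simulation", "embedding", "analysis")),
    Site("São Paulo", 2, ("analysis", "reconstruction on demand")),
]


def holds_file(site, file_index, tier2_every=10):
    """Tier0 and Tier1 hold every reconstructed file; Tier2 holds a fraction.

    The fraction (one file in `tier2_every`) is an assumed placeholder policy.
    """
    if site.tier <= 1:
        return True
    return file_index % tier2_every == 0


for site in SITES:
    held = sum(holds_file(site, i) for i in range(100))
    print(f"Tier{site.tier} {site.name}: holds {held} of 100 files")
```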

Needs
• Reconstruction and file distribution
  • Tier0 production
    • ALL EVENT files get copied to HPSS at the end of a job
    • Strategy implies IMMEDIATE dataset replication
    • As soon as a file is registered, it becomes available for “distribution”
  • 2 levels of data distribution – Local and Global
    • Local
      • All analysis files are on disks
      • Notions of distributed disk – cost-effective solution
    • Global
      • Tier1 (all) and Tier2 (partial) sites
• Cataloging is fundamental
  • Must know where the files are
  • The only central connection between users and files
  • Central and local catalogs
  • Database should be updated right after file transfer
• Customized scheduler
  • Find out where data is upon user request
  • Redirect jobs to the cluster where the data is saved
  • Job submission should not be random but highly coordinated with other users' requests
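
The catalog and scheduler requirements above can be sketched as follows. This is a toy model, not the STAR file catalog or scheduler: `FileCatalog`, `route_jobs`, the file names and the load figures are all hypothetical, and "coordination with other users' requests" is reduced here to picking the least-loaded site that holds a replica.

```python
from collections import defaultdict


class FileCatalog:
    """Toy catalog mapping a logical file name to the sites holding a replica."""

    def __init__(self):
        self._replicas = defaultdict(set)

    def register(self, lfn, site):
        # Should be called right after a file transfer completes.
        self._replicas[lfn].add(site)

    def sites_for(self, lfn):
        return set(self._replicas.get(lfn, ()))


def route_jobs(catalog, requested_files, site_load):
    """Send each file's job to the least-loaded site that already holds the file."""
    plan = defaultdict(list)
    load = dict(site_load)  # copy, so the caller's bookkeeping is not modified
    for lfn in requested_files:
        candidates = catalog.sites_for(lfn)
        if not candidates:
            raise LookupError(f"no replica registered for {lfn}")
        # sorted() makes tie-breaking deterministic (alphabetical by site name).
        site = min(sorted(candidates), key=lambda s: load.get(s, 0))
        plan[site].append(lfn)
        load[site] = load.get(site, 0) + 1
    return dict(plan)


catalog = FileCatalog()
catalog.register("a.root", "BNL")
catalog.register("a.root", "PDSF")
catalog.register("b.root", "BNL")
catalog.register("c.root", "BNL")
catalog.register("c.root", "SP")

plan = route_jobs(catalog, ["a.root", "b.root", "c.root"],
                  site_load={"BNL": 5, "PDSF": 0, "SP": 0})
print(plan)
```

In this example the busy BNL queue only receives the job whose file exists nowhere else; the other two jobs go to the sites that hold a replica and have free capacity.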

What is STAR doing on grid?
• For STAR, grid computing is used EVERY DAY in production
  • Data transfer using SRM, RRS, ...
  • We run simulation production on the Grid (easy)
• Resources reserved for DATA production (still done traditionally)
  • No real technical difficulties
  • Mostly fears related to uncoordinated access and massive transfers
• User analysis
  • Chaotic in nature, requires accounting, quota, privilege, etc.
  • Increasing interest from some institutions
  • Already successful under controlled conditions

STAR jobs in the grid

Accomplishments in the last few months
• Full database mirrors over many institutions
  • Hold detector conditions, calibrations, status, etc.
  • Heavily used during user analysis
• File catalog and scheduler available outside BNL
  • User can query files and submit jobs using grid
  • Still some pitfalls for general user analysis
• Integration between sites
  • Tools to keep grid certificates, batch systems and local catalogs updated
  • Library distribution automatically done using AFS or local copy (updated on a daily basis)
• Full integration of the 3 sites (BNL, PDSF and SP) with OSG

User analysis in the grid
• STAR analysis schema
  • 99% based on ROOT applications
  • User develops personal analysis code that processes the data
• Steps to properly submit analysis jobs in the grid
  • Select the proper cluster in the grid
  • Transfer the analysis code to that cluster and compile it there
  • Use the file catalog to select the files
  • Run the jobs (as many as necessary)
    • The node on which a job runs and the number of jobs are defined by the scheduler and depend on the cluster size, the number of events and the time to process each event. All this information is managed by the file catalog
  • Transfer the output to the local site
• Many of these steps are not yet fully functional but are progressing fast
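
The dependence of the number of jobs on the number of events, the time per event and the cluster size can be illustrated with a simple greedy split. This is a minimal sketch under stated assumptions, not the STAR scheduler: the function names, the per-job time budget and the example numbers are all hypothetical.

```python
import math


def split_into_jobs(files, sec_per_event, max_job_seconds):
    """Greedily pack (name, n_events) pairs into jobs under a time budget.

    A single file that exceeds the budget on its own still gets its own job.
    """
    jobs, current, current_time = [], [], 0.0
    for name, n_events in files:
        t = n_events * sec_per_event
        if current and current_time + t > max_job_seconds:
            jobs.append(current)
            current, current_time = [], 0.0
        current.append(name)
        current_time += t
    if current:
        jobs.append(current)
    return jobs


def waves_needed(n_jobs, n_cpus):
    """Number of rounds needed to run n_jobs on a cluster with n_cpus slots."""
    return math.ceil(n_jobs / n_cpus)


# Hypothetical file list as it might come back from a catalog query.
files = [("f1", 1000), ("f2", 1000), ("f3", 1000), ("f4", 500)]
jobs = split_into_jobs(files, sec_per_event=2.0, max_job_seconds=4000)
print(jobs)
print(waves_needed(len(jobs), n_cpus=10))
```

A shorter time budget produces more, smaller jobs; a smaller cluster then needs more rounds to run them all.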

Current status and to do list
• The GRID between PDSF and RCF works quite well
  • Mainly used for simulation jobs
• São Paulo, BNL and LBL are fully integrated
  • Libraries, file catalog, scheduler, OSG, etc.
  • Being used to test user analysis under the grid
• Activities for the next few months
  • Integrate the SGE batch system in the grid framework
    • Still some problems reporting the right numbers to gridCat
    • Problems keeping jobs alive after a few hours
  • Development of authentication tools
    • RCF (BNL) and PDSF (LBL) are part of DOE labs
  • User analysis