ALICE and the LCG Service Data Challenge 3
P. Cerello (INFN – Torino)
LCG-SC Workshop, Bari, May 27th, 2005
ALICE 2005 Physics Data Challenge
• Number of events (preliminary, they will increase)
• Simulation
  • 30,000 Pb-Pb (80% of central) - 24,000 equiv.
  • 100,000 Pb-Pb (50% central) - 60,000 equiv.
  • 100,000 p-p - 1,000 equiv.
  • 6-12 h x 85,000 equiv. -> 0.5-1 Mh
• Reconstruction
  • much quicker -> 15 Kh
• Assume 1,000 CPUs:
  • Reconstruction: 1 day
  • Simulation: 0.5-1 Kh per CPU (25-50 days of occupancy; see the arithmetic check below)
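As a quick sanity check of these numbers, a minimal Python sketch (all figures taken from the bullets above; the 6-12 h per central-equivalent event is the quoted range):

```python
# Back-of-the-envelope check of the PDC2005 simulation CPU budget.
equiv_central = 24_000 + 60_000 + 1_000   # central-equivalent events: 85,000
for h in (6, 12):                         # quoted hours per equivalent event
    total_mh = equiv_central * h / 1e6            # total CPU time in Mh
    days_1k = equiv_central * h / 1_000 / 24      # occupancy of 1,000 CPUs
    print(f"{h:>2} h/event -> {total_mh:.2f} Mh, ~{days_1k:.0f} days on 1,000 CPUs")
# 6 h/event -> 0.51 Mh, ~21 days; 12 h/event -> 1.02 Mh, ~43 days
```

The computed 21-43 days is what the slide rounds to 25-50 days of occupancy.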
ALICE 2005 Physics Data Challenge
• Physics Data Challenge
  • until July 2005, simulate MC events on available resources
  • register them in the ALICE FC and store them at CERN (for SC3) and at other SEs (for analysis)
  • coordinate with SC3 to run our Physics Data Challenge in the SC3 framework
ALICE & LCG Service Challenge 3
• Goals:
  • test of data transfer and storage services (SC3)
  • test of distributed reconstruction and data model (ALICE)
• Use Case 1: RECONSTRUCTION
  • get "RAW" events stored at T0 from our Catalogue
  • reconstruct at T0 (at least partially)
  • ship from T0 to T1s (goal: 500 MB/s out of T0)
  • reconstruct at T1 with calibration data
  • store/catalogue the output
ALICE & LCG Service Challenge 3
• Use Case 2: SIMULATION
  • simulate events at T2s
  • same chain as Use Case 1, with T0 replaced by T2 (both use cases are sketched below)
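A loose, hypothetical sketch of the two use cases as a single parameterised pipeline. Every function here is a stub standing in for an AliEn/LCG service; none of the names are real APIs:

```python
# Hypothetical sketch of Use Cases 1 and 2 as one chain; stubs only.

def catalogue_lookup(lfn):            # ALICE File Catalogue: LFN -> PFN
    return f"srm://t0.example/{lfn}"

def fetch(pfn):                       # stage the input event locally
    return {"pfn": pfn, "pass": None, "site": None}

def reconstruct(event, partial=False):
    event["pass"] = "partial" if partial else "full"
    return event

def transfer(event, dest):            # T0 -> T1 shipping (goal: 500 MB/s aggregate)
    event["site"] = dest
    return event

def register(event, se):              # store the output and register it in the FC
    print(f"registered output of {event['pfn']} on {se}")

def process_event(lfn, source_tier, target_t1):
    event = fetch(catalogue_lookup(lfn))
    if source_tier == "T0":           # Use Case 1: raw data, partial pass at T0
        event = reconstruct(event, partial=True)
    event = transfer(event, dest=target_t1)
    event = reconstruct(event)        # full pass with calibration data at the T1
    register(event, se=target_t1)

process_event("raw/PbPb/event001.root", "T0", "CNAF")   # Use Case 1
process_event("sim/PbPb/event001.root", "T2", "CNAF")   # Use Case 2
```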
ALICE & LCG SC3: Possible Data Flows
• A) Reconstruction: input at T0, run at T0 + push to T1s
  • central Pb-Pb events, 300 CPUs at T0
  • 1 job = 1 input event
  • input: 0.8 GB on the T0 SE
  • output: 22 MB on the T0 SE
  • job duration: 10 min -> 1,800 jobs/h -> 1.4 TB/h (400 MB/s) from T0 to T1s
• B) Reconstruction: input at T0, push to T1s + run at T1s
  • central Pb-Pb events, 600 CPUs at T1s
  • 1 job = 1 input event
  • input: 0.8 GB on the T0 SE
  • output: 22 MB on the T1 SE
  • job duration: 10 min -> 3,600 jobs/h -> 2.8 TB/h (800 MB/s) from T0 to T1s
  • (both rates are checked in the sketch below)
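The aggregate T0 export rates follow directly from the CPU count, job duration, and event size; a minimal check in Python:

```python
# Aggregate T0 -> T1 rate implied by the reconstruction scenarios above.
def t0_export_rate(n_cpus, job_min=10, event_gb=0.8):
    jobs_per_h = n_cpus * 60 / job_min        # each CPU runs one event per job
    tb_per_h = jobs_per_h * event_gb / 1_000  # every input event shipped to a T1
    mb_per_s = tb_per_h * 1e6 / 3600
    return jobs_per_h, tb_per_h, mb_per_s

print(t0_export_rate(300))   # A: (1800.0, 1.44, 400.0) -> ~1.4 TB/h, 400 MB/s
print(t0_export_rate(600))   # B: (3600.0, 2.88, 800.0) -> ~2.8 TB/h, 800 MB/s
```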
ALICE & LCG SC3: Possible Data Flows
• Simulation
  • assume 1,000 CPUs available at T2s and central Pb-Pb events
  • 1 job = 1 event
  • input: a few KB on the ALICE TQ
  • output: 0.8 GB on the T1 SE
  • job duration: 6 h -> 4,000 jobs/day -> 3.2 TB/day (40 MB/s) from T2s to T1s (checked below)
• Remark
  • should T2 resources prove insufficient, we could obviously simulate at T1s as well
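The same arithmetic applied to the simulation flow:

```python
# Aggregate T2s -> T1s rate for the simulation flow described above.
jobs_per_day = 1_000 * 24 / 6                # 1,000 CPUs, 6 h/job -> 4,000 jobs/day
tb_per_day = jobs_per_day * 0.8 / 1_000      # 0.8 GB/event -> 3.2 TB/day
mb_per_s = tb_per_day * 1e6 / 86_400         # ~37 MB/s, quoted above as ~40 MB/s
print(jobs_per_day, tb_per_day, round(mb_per_s, 1))
```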
ALICE and LCG Service Challenges
• In other words:
  • mimic our data flow (raw-like + simulated)
  • test the reconstruction
  • measure the performance of the SC3 services/components:
    • data transfer efficiency
    • Storage Element efficiency
  • as new sites keep coming in, increase the scale of the exercise
  • as new middleware comes in, add more functionality
Computing framework
[Diagram: the ALICE user works through ROOT/AliRoot; ALICE agents & daemons connect the ALICE TQ and the ALICE & Grid services to the resources of LCG, OSG, and NorduGrid.]
Baseline services (courtesy of I. Bird, LCG GDB, May 2005)
• Storage management services
  • based on SRM as the interface
• gridftp
• Reliable file transfer service
  • file placement service - perhaps later
• Grid catalogue services
• Workload management
  • CE and batch systems seen as essential baseline services, the WMS not necessarily by all
• Grid monitoring tools and services
  • focussed on job monitoring - basic level in common, plus a WLM-dependent part
• VO management services
  • clear need for VOMS - limited set of roles, subgroups
• Applications software installation service
• Added from the discussions:
  • POSIX-like I/O service for local files, including links to catalogues
  • VO agent framework
FTS summary (courtesy of I. Bird, LCG GDB, May 2005)
• ALICE:
  • sees the FTS layer as a service that underlies data placement; used aiod for this in DC04
  • expects the gLite FTS to be tested with the other data management services in SC3 - ALICE will participate
  • expects the implementation to allow for experiment-specific choices of higher-level components such as file catalogues
Catalogues (courtesy of I. Bird, LCG GDB, May 2005)
• Generally:
  • all experiments have different views of catalogue models
  • experiment-dependent information is kept in the catalogues
  • all have some form of collection (datasets, ...)
  • all have role-based security
  • catalogues may be used for more than just data files
• Interfaces:
  • WMS (e.g. Data Location Interface/Storage Index)
  • gLite-I/O or another POSIX-like I/O service
• ALICE:
  • distributed & not replicated (AliEn) file catalogue (see the sketch below)
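A toy illustration of the "distributed & not replicated" idea: each namespace subtree lives on exactly one catalogue server, and an LFN resolves to the PFN(s) of its physical replicas. The hostnames, paths, and layout are purely illustrative, not the AliEn schema:

```python
# Toy distributed (not replicated) file catalogue: one authoritative
# server per namespace subtree, N physical copies per logical file.
subtree_servers = {
    "/alice/sim/":  "catalogue.to.infn.it",   # simulation subtree (illustrative)
    "/alice/data/": "catalogue.cern.ch",      # raw-data subtree (illustrative)
}

replicas = {   # what the responsible catalogue server would return
    "/alice/sim/2005/pbpb/event001.root": [
        "srm://se.to.infn.it/alice/pbpb/event001.root",
        "srm://castor.cern.ch/alice/pbpb/event001.root",
    ],
}

def resolve(lfn):
    server = next(s for p, s in subtree_servers.items() if lfn.startswith(p))
    return server, replicas[lfn]   # one authoritative server, several replicas

print(resolve("/alice/sim/2005/pbpb/event001.root"))
```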
VO "Agents & Daemons" (courtesy of I. Bird, LCG GDB, May 2005)
• VO-specific services/agents
  • appeared in the discussions of FTS, catalogues, etc.
  • the subject of several long discussions - all experiments need the ability to run "long-lived agents" on a site, e.g. the LHCb DIRAC agents or the ALICE synchronous catalogue-update agent (sketched below)
  • at Tier-1s and at Tier-2s
  • open questions: how do the experiments get machines for this, who runs them, and can we make a generic service framework?
  • GD will test with LHCb a CE without a batch queue as a potential solution
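A hypothetical sketch of what such a long-lived agent looks like, using the ALICE synchronous catalogue-update agent as the example. Only the loop structure is meant; no real AliEn API is used:

```python
# Hypothetical site-local VO agent: a daemon that synchronously pushes
# local registrations to the central catalogue for the life of the site.
import queue

updates = queue.Queue()   # filled by local jobs as they register output

def push_to_central_catalogue(lfn, pfn):
    print(f"central catalogue <- {lfn} @ {pfn}")

def catalogue_agent():
    while True:                       # "long-lived": never exits on its own
        try:
            lfn, pfn = updates.get(timeout=60)
        except queue.Empty:
            continue                  # idle poll; a real agent would also heartbeat
        push_to_central_catalogue(lfn, pfn)   # synchronous update, then ack
        updates.task_done()

# One cycle driven by hand, for illustration:
updates.put(("/alice/sim/evt1.root", "srm://se.to.infn.it/evt1.root"))
lfn, pfn = updates.get()
push_to_central_catalogue(lfn, pfn)
```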
ALICE SC3 layout
[Diagram: the User (Production Manager) feeds the ALICE TaskQ and the ALICE File Catalogue; jobs reach the AliEn, ARC, LCG, and OSG sites through the respective UI/RBs; each site runs its own CE and WNs plus an ALICE VO Box, with Data Registration for the output.]
ALICE SC3 layout
[Diagram: LCG-only view - the User (Production Manager) feeds the ALICE TaskQ and the ALICE File Catalogue; job submission goes through the LCG UI/RB to the LCG CE and WN, with an ALICE VO Box at the site and Data Registration for the output.]
ALICE SC3 layout / Italy
[Diagram: the same LCG-only chain as above (UI/RB, CE, WN, ALICE VO Box, Data Registration), with the open question "How many?" for the Italian sites.]
ALICE SC layout - seen from the WN
[Diagram: an LCG site with its CE, RB, and WN, plus the ALICE TaskQ, the ALICE File Catalogue, and the AliEn SE. Steps: 1: pull the job configuration from the Task Queue; 2: get the input file PFN(s) from the File Catalogue; 3: retrieve the input file(s) from the SE; 4: run the job; 5: register the output data. A sketch of this cycle follows.]
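A hypothetical sketch of the worker-node cycle in the diagram above (pull model: the WN asks the Task Queue for work). All function names are stubs, not AliEn or LCG calls:

```python
# Hypothetical worker-node cycle for the pull-model layout above.

def pull_job_configuration():                 # step 1: from the ALICE TaskQ
    return {"lfns": ["/alice/data/raw/evt42.root"], "task": "reco"}

def get_pfns(lfn):                            # step 2: ALICE File Catalogue
    return [f"srm://se.cern.ch{lfn}"]

def retrieve(pfn):                            # step 3: from the AliEn SE
    return f"local copy of {pfn}"

def run(task, inputs):                        # step 4: execute on the WN
    return f"output of {task} on {len(inputs)} file(s)"

def register_output(output):                  # step 5: data registration
    print(f"registered: {output}")

job = pull_job_configuration()
inputs = [retrieve(pfn) for lfn in job["lfns"] for pfn in get_pfns(lfn)]
register_output(run(job["task"], inputs))
```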
ALICE & LCG Service Challenge 3
• What would we need for SC3?
  • AliRoot/ROOT etc. deployed on the SC3 sites - ALICE
  • AliEn top-level services (ongoing) - ALICE
  • UI(s) for submission to LCG/SC3 - ALICE/LCG
  • WMS (n RBs) + CE/SE services on SC3 - LCG
  • computing/storage resources pledged for ALICE - LCG
  • appropriate JDL files for the different tasks - ALICE
ALICE & LCG Service Challenge 3
• Sites/Resources
  • T1s: CC-IN2P3, CERN, CNAF, FZK, NIKHEF, RAL
  • T2s: Torino + GSI, LNL, …
ALICE & LCG Service Challenge 3
• PDC2005 aims to evaluate the performance of the SC3 services in realistic conditions
• We need support from LCG, and we are putting some of our own manpower into SC3
• We are willing to start as soon as possible, so as to be ready for…
• …PDC2006 & SC4, where the MW components of the "LCG Production Service" will be tested
First gLite tests in Bari & Torino
• Setup of a test gLite RB & CE in Torino
  • tests job submission, interaction with the ALICE file catalogue and, gradually, the other pieces of the framework
  • thanks to S. Bagnasco, R. Brunetti, F. Nebiolo
• Tests of the storage and data management components in Bari
  • dCache+SRM, FTS
  • to be integrated with the Torino setup to build a full testbed
  • thanks to G. Donvito, N. Fioretti, F. Minafra
First gLite production (S. Bagnasco, E. Bruna, F. Prino)
Small MC production (initially with no ALICE-specific components except AliRoot) to gain confidence with the RB: 20k events, ~6 TB, to be performed over the next weeks on the INFN infrastructure.
[Diagram: testbed layout - gLite 1.1 UI and gLite 1.1 RB+LB on grid007.to.infn.it, connected to the INFNGRID production BDII, the LFC catalogue, and the clients.]
First gLite tests in Bari & Torino
• Problems:
  • the gLite UI commands do not interact correctly with the VOMS
    • a known problem, fixed, but the fix did not get through to the release (not even 1.1)
  • submission to the gLite RB fails with certificates mapped to an SGM (software manager) account (Savannah bug #8616)
  • the gLite RB interacts correctly with LCG 2.4.0 CEs, but:
    • 650 jobs submitted to INFNGRID just after the upgrade to 2.4.0 showed the same teething problems as last year, e.g.:
      • hanging NFS mounts make the software area inaccessible (this is a nasty one - remember the "Black Hole Effect"!)
      • problems with the environment configuration on the WNs
    • success rate: 423 (65%), not uniform across sites
• The support responsiveness has definitely improved
  • problems are generally solved within an hour of submitting the ticket
First gLite production
Jobs sent to the gLite RB (grid007.to.infn.it): 1,000, to LCG 2.4.0 CEs on INFN-Grid:
• Scheduled: 26
• Running: 11
• Completed: 661 (68%)
• Aborted: 72 (8%)
• Error: 230 (24%)
(percentages are relative to the 963 jobs that had reached a final state)
• Location of errors: SNS 33, Le 70, unknown 35, CNAF 43, Pd 25 (206)
Error breakdown:
• AliRoot crash: 28
• NFS crash: 143
• WN disk space < 4 GB: 55
• other (not yet understood): 4
• RB problem: 100 jobs (one bunch) lost their destination
Service Challenge 3 at Tier2-Torino (courtesy of L. Gaido)
• Reference experiment: ALICE
• Available resources:
  • Staff:
    • 1 FTE provided by the Computing Service (S. Lusso)
    • 1 FTE provided by the experiment (M. Sitta)
    • support from other people in the grid service and in the experiment
  • Hardware:
    • about 30 CPUs (Xeon 2.4 and 3.06 GHz) for the first phase
    • 2 TB of disk
    • up to about 80 CPUs during 2006
Service Challenge 3 at Tier2-Torino (courtesy of L. Gaido)
• Local infrastructure:
  • a computing room capable of hosting a Tier-2 prototype
  • electric power: upgraded in 2004
  • air conditioning: upgraded in 2004
  • network:
    • 1 Gb/s link to the GARR network
    • Gigabit Ethernet connection to the GARR-G Point of Presence (hosted by INFN, same building)
Service Challenge 3 at Tier2-Torino (courtesy of L. Gaido)
[Diagram: network path - the Tier-2 farm connects at 1 Gb/s to an Extreme Summit 400 switch, then at 1 Gb/s to a Cisco 7304 router and on to the GARR network at 1 Gb/s, reaching CERN and the Tier-1 at CNAF.]
Service Challenge 3 at Tier2-Torino (courtesy of L. Gaido)
• Other resources:
  • strong involvement in the EGEE SA1 activities
  • participation in the INFNGRID/LCG/EGEE production grid infrastructure since the beginning
  • expertise in grid middleware & management