PD2P, Caching etc.
Kaushik De, Univ. of Texas at Arlington
ADC Retreat, Naples, Feb 4, 2011
Introduction
• Caching at T2 using PD2P and Victor works well
  • We have 6 months of experience (>3 months with all clouds)
  • Almost zero complaints from users
• Few operational headaches
  • Some cases of full disks, disappearing datasets…
  • Most issues addressed with incremental improvements such as space checking, rebrokering, storage cleanup and consolidation
  • What I propose today should solve the remaining issues
• Many positives
  • No exponential growth in storage use
  • Better use of Tier 2 sites for analysis
• Next step – PD2P for Tier 1
  • This is not a choice but a necessity (see Kors' slides)
  • We should treat part of Tier 1 storage as a dynamic cache
Life Without ESD
• New plan – see the document and Ueda's slides
  • Reduction in storage requirement from 27 PB to ~10 PB for 2011 data @ 400 Hz (but could be as much as 13 PB)
  • Reduction of 2010 data from 13 PB to ~6 PB
• But we should go further
  • We are still planning to fill almost all T1 disks with pre-placed data
  • 2010 + 2011 + MC = 6 + 10 + 8 = 24 PB = available space
  • Based on past experience, reality will be tougher and disk crises will hit us sooner – we should do things differently this time
  • We must trust the caching model
What can we do?
• Make some room for dynamic caches
  • For the discussion below, do not count the T0 copy
• Use DQ2 tags – custodial/primary/secondary – rigorously
  • Custodial = LHC data = tape only (1 copy)
  • Primary = minimal set kept on disk at T1, so we have room for PD2P caching
    • LHC data primary == RAW (1 copy), AOD, DESD, NTUP (2 copies)
    • MC primary == Evgen, AOD, NTUP (2 copies only)
  • Secondary = copies made by ProdSys (ESD, HITS, RDO), PD2P (all types except RAW, RDO, HITS) and DaTri only
• Lifetimes – strictly required for all secondary copies (i.e. consider secondary == cached == temporary)
• Locations – custodial ≠ primary; primary ≠ secondary
• Deletions – any secondary copy can be deleted by Victor
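A minimal sketch of how this tagging policy could be encoded as a lookup table; the names, structure and helper function are illustrative placeholders, not actual DQ2 or PanDA code:

```python
# Proposed tag/copy policy as a table (sketch only, per the bullets above).
PRIMARY_DISK_COPIES = {
    "data": {"RAW": 1, "AOD": 2, "DESD": 2, "NTUP": 2},   # LHC data kept on T1 disk
    "mc":   {"Evgen": 2, "AOD": 2, "NTUP": 2},            # MC kept on T1 disk
}
CUSTODIAL_TAPE_COPIES = {"data": {"RAW": 1}}              # LHC data, tape only

def proposed_tags(sample, datatype):
    """Return the (tag, n_copies) pairs a dataset type gets under this proposal."""
    tags = []
    if datatype in CUSTODIAL_TAPE_COPIES.get(sample, {}):
        tags.append(("custodial", CUSTODIAL_TAPE_COPIES[sample][datatype]))   # tape
    if datatype in PRIMARY_DISK_COPIES.get(sample, {}):
        tags.append(("primary", PRIMARY_DISK_COPIES[sample][datatype]))       # T1 disk
    if not tags:
        # Everything else (ESD, HITS, RDO, extra PD2P/DaTri copies) is secondary:
        # cached, temporary, must carry a lifetime, deletable by Victor.
        tags.append(("secondary", 1))
    return tags
```

For example, `proposed_tags("data", "RAW")` yields one custodial tape copy plus one primary disk copy, while `proposed_tags("data", "ESD")` falls through to a temporary secondary copy.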
Reality Check
• Primary copies (according to slide 4)
  • 2010 data ~ 4 PB
  • 2011 data ~ 4.5 PB
  • MC ~ 5 PB
  • Total primary = ~14 PB
• Available space for secondaries > ~10 PB at the Tier 1's
  • Can accommodate additional copies, but only if 'hot'
  • Can accommodate some ESDs (expired gracefully after n months)
  • Can accommodate large buffers during reprocessing (new release)
  • Can accommodate better-than-expected LHC running
  • Can accommodate new physics-driven requests
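A quick check of the arithmetic behind these numbers, using the 24 PB of available space quoted on the previous slide (values in PB, illustrative only):

```python
# Rough Tier-1 disk budget under the numbers quoted above.
primary = {"data2010": 4.0, "data2011": 4.5, "mc": 5.0}
total_primary = sum(primary.values())             # 13.5 PB, quoted as ~14 PB
available_t1_disk = 24.0                          # from the "Life Without ESD" slide
cache_space = available_t1_disk - total_primary   # > ~10 PB left for secondaries
print(f"primary ~{total_primary:.1f} PB, dynamic cache ~{cache_space:.1f} PB")
```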
Who Makes Replicas?
• RAW – managed by Santa Claus (no change)
  • 1 copy to TAPE (custodial), 1 copy to DISK (primary) at a different T1
• First-pass processed data – by Santa Claus (no change)
  • Tagged primary/secondary according to slide 4
  • Secondary copies will have a lifetime (n months)
• Reprocessed data – by PanDA
  • Tagged primary/secondary according to slide 4, with the lifetime set
  • Additional copies made to a different T1 disk, according to MoU share, automatically based on slide 4 (no longer by AKTR)
• Additional copies at the Tier 1's – only by PD2P and DaTri
  • Must always set a lifetime
• Note – only PD2P makes copies to Tier 2's
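As a sketch of the common rule in all of these cases, that every secondary copy carries a mandatory lifetime, something like the following could sit at replica-registration time; the function name, metadata fields and the 3-month value ("n months" above) are placeholders, not real DQ2/PanDA calls:

```python
from datetime import datetime, timedelta, timezone

SECONDARY_LIFETIME_MONTHS = 3   # placeholder for "n months"

def register_replica(dataset, site, tag):
    """Attach the proposed metadata when a replica is created (sketch only)."""
    meta = {"dataset": dataset, "site": site, "tag": tag}
    if tag == "secondary":
        # secondary == cached == temporary: it must expire and be Victor-deletable
        meta["expires"] = datetime.now(timezone.utc) + timedelta(days=30 * SECONDARY_LIFETIME_MONTHS)
    return meta
```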
Additional Copies by PD2P
• Additional copies at the Tier 1's – always tagged secondary
  • Made if a dataset is 'hot' (defined on the next slide)
  • Use MoU share to decide which Tier 1 gets the extra copy
• Copies at the Tier 2's – always tagged secondary
  • No change for the first copy – keep the current algorithm (brokerage); use the age requirement if we run into space shortage (see Graeme's talk)
  • If a dataset is 'hot' (see next slide), make an extra copy
• Reminder – additional replicas are secondary = temporary by definition, and may/will be removed by Victor
What is 'Hot'?
• 'Hot' decides when to make a secondary replica
• The algorithm is based on additive weights
  • If w1 + w2 + w3 + wN… > N (tunable threshold), make an extra copy
• w1 – based on the number of waiting jobs
  • nwait / (2 × nrunning), averaged over all sites
  • Currently disabled due to DB issues – needs to be re-enabled
  • Don't base it on the number of reuses – that did not work well
• w2 – inversely based on age
  • Either Graeme's table, or continuous, normalized to 1 (newest data)
• w3 – inversely based on the number of copies
• wN – other factors based on experience
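A minimal sketch of this additive score, assuming a continuous age weight and a simple 1/n copy weight; the exact functional forms, the age normalization and the threshold N are tunable assumptions, not the production PD2P code:

```python
def hot_score(nwait, nrunning, age_days, n_copies, max_age_days=365.0):
    """Additive 'hot' score: w1 + w2 + w3 (sketch of the weights above)."""
    w1 = nwait / (2.0 * max(nrunning, 1))           # waiting-job pressure, site-averaged upstream
    w2 = max(0.0, 1.0 - age_days / max_age_days)    # newer data weighs more, normalized to 1
    w3 = 1.0 / max(n_copies, 1)                     # fewer existing copies weigh more
    return w1 + w2 + w3                             # ... + wN for future factors

def is_hot(nwait, nrunning, age_days, n_copies, threshold=2.0):
    """Make an extra secondary copy if the score exceeds the tunable threshold N."""
    return hot_score(nwait, nrunning, age_days, n_copies) > threshold
```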
Where to Send 'Hot' Data?
• Tier 1 site selection
  • Based on MoU share
  • Exclude a site if the dataset size is > 5% (as proposed by Graeme)
  • Exclude a site if it has too many active subscriptions
  • Other tuning based on experience
• Tier 2 site selection
  • Based on brokerage, as currently
  • Negative weight based on the number of active subscriptions
  • Other tuning based on experience
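A sketch of the Tier 1 choice: MoU-share-weighted selection after applying the two exclusion cuts. The input structure and the cut values are assumptions (in particular, the 5% cut is interpreted here as 5% of the site's disk), not the actual PD2P implementation:

```python
import random

def pick_tier1(sites, dataset_tb, max_frac=0.05, max_subs=50):
    """Choose a Tier 1 for an extra 'hot' replica (illustrative sketch)."""
    candidates = [
        s for s in sites
        if dataset_tb <= max_frac * s["disk_tb"]       # exclude if dataset too large for the site
        and s["active_subscriptions"] < max_subs       # exclude if too many active subscriptions
    ]
    if not candidates:
        return None
    weights = [s["mou_share"] for s in candidates]     # MoU share drives the choice
    return random.choices(candidates, weights=weights, k=1)[0]
```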
What About Broken Subscriptions?
• This is becoming an issue (see Graeme's talk)
  • PD2P already sends datasets within a container to different sites, to reduce the wait time for users
  • But what about datasets which take more than a few hours to transfer?
• Simplest solution
  • ProdSys imposes a maximum limit on dataset size
• Possible alternative
  • A cron/PanDA process breaks up datasets and rebuilds the container
• Difficult but also possible solution
  • Use _dis datasets in PD2P
  • Search DQ2 for _dis datasets in brokerage (there will be a performance penalty if we go this route)
  • But this is perhaps the most robust solution?
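A sketch of the "simplest solution" and its alternative: capping dataset size by splitting a large output into fixed-size child datasets that all go into one container. The cap value and the splitting helper are hypothetical, not an existing ProdSys feature:

```python
MAX_DATASET_TB = 1.0   # placeholder cap on dataset size

def split_into_datasets(files, sizes_tb, max_tb=MAX_DATASET_TB):
    """Group files into chunks no larger than max_tb; each chunk becomes a dataset."""
    datasets, current, current_size = [], [], 0.0
    for f, size in zip(files, sizes_tb):
        if current and current_size + size > max_tb:
            datasets.append(current)
            current, current_size = [], 0.0
        current.append(f)
        current_size += size
    if current:
        datasets.append(current)
    return datasets   # all chunks are then collected back into one container
```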
Data Deletions will be Very Important
• Since we are caching everywhere (T1+T2), Victor plays an equally important role as PD2P
• Asynchronously clean up all caches
  • Trigger based on a disk-fullness threshold
  • Algorithm based on (age + popularity) & secondary
• Also automatic deletion of n-2 – by AKTR/Victor
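A sketch of that cleanup selection: trigger when disk usage crosses a threshold, then delete only secondary (cached) replicas, oldest and least popular first, until usage falls back below a target. The thresholds and replica fields are placeholders, not the actual Victor algorithm:

```python
def select_victims(replicas, used_frac, trigger=0.90, target=0.80):
    """Pick secondary replicas to delete once the disk is too full (sketch only)."""
    if used_frac < trigger:
        return []                                   # nothing to do below the trigger
    eligible = [r for r in replicas if r["tag"] == "secondary"]   # only cached copies
    eligible.sort(key=lambda r: (-r["age_days"], r["n_accesses_90d"]))  # old, unpopular first
    victims, freed = [], 0.0
    for r in eligible:
        if used_frac - freed <= target:
            break
        victims.append(r)
        freed += r["size_frac"]                     # replica size as a fraction of the disk
    return victims
```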
How Soon Can we Implement?
• Before LHC startup!
• Big initial load on ADC operations to clean up 2010 data and to migrate tokens
• Need some testing/tuning of PD2P before LHC starts
• So we need a decision on this proposal quickly