D0 Grid Data Production Initiative: Coordination Mtg 5
Version 1.0 (pre-meeting draft), 16 October 2008
Rob Kennedy and Adam Lyon
Attending: …
Outline
• News (RDK has none this week)
• Action Items from last week
• Past Week/This Week Status Check
• Initiative Schedule v0.8 (closer to baseline)
• Deployment “Feature List”
News Items
• Any?…
Open Action Items
• Resource load schedule, include vacations and other leave (RDK)
  • Mostly done. Need to replace group with individual assignments… this week. Need to add status into schedule; please do not panic when you see “nothing” done!
  • Have not assigned fractions of people since “leveling” was already done by hand.
• Set up JIRA status reporting for critical path tasks, review staffing (AL)
  • Done, against schedule v0.6.
• Investigate hardware for a FWD5 to be used at least until virtualization on FWD4 is deployed (AL)
  • Done. Will use “15” (d0srv015) as a FWD node at least until virtualization is deployed. Detailed in schedule v0.8. Backfill to be considered.
• Plan to deploy Condor 7 earlier… make it happen (RDK, GG, AL)
  • Done. Will deploy patched “old” v1.10.1, then replace it with “new” v1.10.1 from UWisc at the soonest deployment after it is ready and tested.
• Integrate “Slow Job Transitions” experience into Plan (RDK, AL)
  • Defer “add alarm” as non-critical, but highly desirable for a later phase.
  • Add a task “increase FILEMAX to 16k on FWD#” to schedule. In schedule v0.6+.
Past/This Week Tasks (1 of 2)
(Red means a critical task chain, Green means effectively done, … Who is doing work?)
• 1.1.1 Fwd4 Platform AL
  • 1.1.1.1 INPUT: Fwd4 Server Hardware On-site AL FEF Wed 10/1/08 Wed 10/1/08 0d (DONE)
  • 1.1.1.2 Fwd4 Server Hardware OS Install AL FEF Wed 10/1/08 Fri 10/10/08 8d
  • 1.1.1.3 Fwd4 Server Hardware Burn-in AL FEF Mon 10/13/08 Fri 10/17/08 5d
• 1.1.2 Fwd4 Grid Service Software and Configuration AL
  • 1.1.2.1 Receive Fwd4 AL REX Thu 10/16/08 Fri 10/17/08 2d
  • 1.1.2.2 Prepare Deployment 1 Fwd4 Configuration AL REX Thu 10/16/08 Fri 10/17/08 2d
• 1.1.3 Fwd5 Platform AL
  • 1.1.3.1 Identify FWD5 hardware: repurpose Test Fwd Node AL AL Tue 10/14/08 Tue 10/14/08 1d
  • 1.1.3.2 Re-install OS on d0srv015, rename d0samgfwd5 AL FEF Wed 10/15/08 Fri 10/17/08 3d
• 1.1.4 Fwd5 Grid Service Software and Configuration AL
  • 1.1.4.1 Receive Fwd5 AL REX Thu 10/16/08 Fri 10/17/08 2d
  • 1.1.4.2 Prepare Deployment 1 Fwd5 Configuration AL REX Thu 10/16/08 Fri 10/17/08 2d
• 1.1.5 Que2 Platform AL
  • 1.1.5.1 INPUT: Que2 Server Hardware On-site AL FEF Wed 10/1/08 Wed 10/1/08 0d (DONE)
  • 1.1.5.2 Que2 Server Hardware OS Install AL FEF Wed 10/1/08 Fri 10/10/08 8d
  • 1.1.5.3 Que2 Server Hardware Burn-in AL FEF Mon 10/13/08 Fri 10/17/08 5d
• 1.1.6 Que2 Grid Service and Client Software and Configuration AL
  • 1.1.6.1 Receive Que2 AL REX Thu 10/16/08 Fri 10/17/08 2d
Past/This Week Tasks (2 of 2)
(Continued from previous page)
• 1.1.7 New Sam Station Platform AL
  • 1.1.7.1 Identify Hardware For Role AL FEF Wed 10/1/08 Fri 10/17/08 13d
• 1.3 Small Quick Wins
  • 1.3.1 SAM-Grid Job Status Info GG
    • 1.3.1.1 Unlimited-time Proxy for Gridftps GG PM Mon 10/13/08 Wed 10/15/08 3d
    • 1.3.1.2 New Job Status at QUE Node GG PM Thu 10/16/08 Fri 10/31/08 12d
  • 1.3.3 Improved H/w Uptime AL Mon 10/13/08 Mon 10/13/08 1d
    • 1.3.3.1 Consider FWD5: Full decoupling w/o virtualization, improved robustness to FWD node failures AL AL Mon 10/13/08 Mon 10/13/08 1d (DONE)
      • May need to backfill the old test FWD node… or perhaps even run all “spares” in production for hot fall-back?
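The durations in the task list appear to count working days (Monday to Friday), inclusive of the start and finish dates. A minimal sketch to cross-check a few rows under that assumption follows; the helper name `working_days` and the choice of rows are illustrative, not part of the schedule tooling, and the 0d INPUT milestones are left out because they are not counted as work days.

```python
from datetime import date, timedelta

def working_days(start: date, finish: date) -> int:
    """Count Monday-Friday days from start to finish, inclusive (assumed convention)."""
    days = 0
    current = start
    while current <= finish:
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            days += 1
        current += timedelta(days=1)
    return days

# (WBS id, start, finish, listed duration in days), copied from the task list
tasks = [
    ("1.1.1.2", date(2008, 10, 1), date(2008, 10, 10), 8),
    ("1.1.1.3", date(2008, 10, 13), date(2008, 10, 17), 5),
    ("1.1.7.1", date(2008, 10, 1), date(2008, 10, 17), 13),
    ("1.3.1.2", date(2008, 10, 16), date(2008, 10, 31), 12),
]

# Collect any task whose listed duration disagrees with the working-day count
mismatches = [wbs for wbs, start, finish, listed in tasks
              if working_days(start, finish) != listed]
print(mismatches)
```

An empty list means the listed durations agree with the working-day count for the sampled rows.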
Schedule v0.8 (Phase 1)
[Schedule chart. Markers: Today, Th-Day, Holiday. Task bars: Fwd 4 Prep, Fwd 5 Prep, Que 2 Prep, SAM’ Prep, Deploy 1, VDT “new”, Deploy 2, Job Status Dev, Filemax, Metrics, Summaries]
Current Deployment “Feature” Lists
• Deployment 1: Split Data/MC Production Services
  • Time frame: Nov 13-17, with 1 week+ observation before holidays
  • 1. Config: Basic splitting of Fwd and Que services between Data and MC Production, with 2 Fwd nodes assigned to each, plus 1 Fwd to all Merging
    • 1. Fwd4 deployed w/o virtualization
    • 2. Fwd5 deployed
    • 3. Que2 deployed, with client software to enable parallel use of 2 QUE nodes
    • 4. New SAM Station (moved off of FWD1)
    • 5. Condor 7 via “old” 1.10.1 + patches
      • New 1.10.1 official release from UWisc if available and tested in time
• Deployment 2: Optimize Data and MC Production Configurations
  • Time frame: Dec 8-10, with 1 week+ observation before holidays
  • 1. Config: Optimize configurations separately for Data and MC Production, especially to increase Data Production “queue” length
  • 2. Condor 7 via “new” 1.10.1 official release from UWisc
Meeting Discussion Summary
• …