D0 Grid Data Production Initiative: Coordination Mtg 7
Version 1.0 (pre-meeting edition)
23 October 2008
Rob Kennedy and Adam Lyon
Attending: …
Unable to Attend: EB, KC, ST, CB
Outline
• Summary and News
• Open Action Items
• Deployment “Feature List”: drives what is critical
• Overview of Draft Baseline Schedule
• Task Status (4 slides)
• Metrics Summary Work
Summary and News
• Summary:
  • (details to follow… just to set the tone)
  • Roughly on time, on “budget”
• News:
  • Alain Roy (VDT Team) delivered an official release EARLY with just our needed fix. THANKS!
  • We can roll this into the Nov ’08 deployment 1 instead of waiting until the Dec ’08 deployment 2
Open Action Items (Green = effectively done, Yellow = added notes, Blue = coming week)
• RDK: Baseline schedule with:
  • Resource names, Current status, Adjustments per feedback
  • Links to the related JIRA tickets (organized tasks to match, at higher level)
  • D0Runjob upgrade task: need to understand what is involved: today?
• Post schedule/plan and basic Gantt Chart to web-accessible area
  • This afternoon with posting of slides+notes, will send URLs.
• AL/JA: Time Estimate for: 1.1.3.2 "Re-install OS on d0srv015, rename d0samgfwd5": Done
• AL/JA: Time Estimate for: 1.1.7.2 Repurpose, OS Install on new SAM head node: Done
• RDK: Add Status Overview to initial slide with 1 sentence summary (but we should avoid drilling down into it).
Current Deployment “Feature” Lists
• Deployment 1: Split Data/MC Production Services
  • Time frame: November 13-17, with 1 week+ observation before holidays
  • 1. Config: Basic Splitting of Fwd, Que Services between Data and MC Production with 2 Fwd nodes assigned to each, plus 1 Fwd dedicated to all Merging
  • 2. Fwd4 deployed (w/o virtualization)
  • 3. Fwd5 deployed
  • 4. Que2 deployed, with client software to enable parallel use of 2 QUE nodes
  • 5. New SAM Station (moved off of FWD1)
  • 6. Condor 7 via “new” 1.10.1 official release from UWisc
  • 7. FileMax increase on all Fwd nodes to handle large nJob actions
• Deployment 2: Optimize Data and MC Production Configurations
  • Time frame: December 8-10, with 1 week+ observation before holidays
  • 1. Config: Optimize Configurations separately for Data and MC Production, especially to increase Data Production “queue” length
  • 2. D0Runjob Upgrade for Data Production (being conservative until better understood)
  • 3. New SAM-Grid Release with support for new Job status value at Queuing node
Schedule v0.9.5 (Phase 1)
[Gantt chart. Rows: Fwd 4 Prep; Fwd 5 Prep; Que 2 Prep (status?); SAM’ Prep; Deploy 1; Deploy 2; VDT “new”; D0Runjob+Job Status Dev; Filemax; Metrics Summaries. Markers: Today, Th-Day, Holiday.]
Task Status (1 of 4) (Red = a critical task chain, Green = effectively done, Yellow = added notes)
• 1.1.1 Forwarding Node 4 (Fwd4) AL Wed 10/1/08 Mon 11/10/08 29d
  • 1.1.1.1 INPUT: Fwd4 Server Hardware On-site AL FEF Wed 10/1/08 Wed 10/1/08 0d
  • 1.1.1.2 Fwd4: Server Hardware OS Install AL FEF Wed 10/1/08 Thu 10/16/08 12d
  • 1.1.1.3 Fwd4: Server Hardware Burn-in AL FEF Fri 10/17/08 Fri 10/17/08 1d
  • 1.1.1.4 Fwd4: Verify Platform Installation AL JB Fri 10/17/08 Fri 10/17/08 1d
  • 1.1.1.5 Fwd4: Install VDT 1.10.1 "old"+patches AL JB Fri 10/17/08 Tue 10/21/08 3d
  • 1.1.1.6 Fwd4: Request and Install Grid Certs AL JB Tue 10/21/08 Wed 10/22/08 2d
  • 1.1.1.7 Fwd4: Install FWD-node-specific Components… AL JB Wed 10/22/08 Fri 10/24/08 3d
  • 1.1.1.8 Fwd4: Pre-Deployment As-Is Test AL JB Mon 10/27/08 Mon 11/3/08 6d
  • 1.1.1.9 Fwd4: Pre-Deployment FileMax=16k Test AL JB Tue 11/4/08 Mon 11/10/08 5d
  • 1.1.1.10 Milestone: Fwd4 Ready to Deploy AL Mon 11/10/08 Mon 11/10/08 0d
• 1.1.2 Forwarding Node 5 (Fwd5) AL Tue 10/14/08 Mon 11/10/08 20d
  • 1.1.2.1 Fwd5: d0srv015 Request Platform Prep AL AL Tue 10/14/08 Tue 10/14/08 1d
  • 1.1.2.2 Fwd5: Re-install OS on d0srv015 AL FEF Fri 10/17/08 Mon 10/20/08 2d
  • 1.1.2.3 Fwd5: Verify Platform Installation AL JB Tue 10/21/08 Tue 10/21/08 1d
  • 1.1.2.4 Fwd5: Setup VDT 1.10.1 "old"+patches AL JB Tue 10/21/08 Thu 10/23/08 3d
  • 1.1.2.5 Fwd5: Request and Install Grid Certs AL JB Thu 10/23/08 Fri 10/24/08 2d
  • 1.1.2.6 Fwd5: Install FWD-node-specific Components… AL JB Fri 10/24/08 Tue 10/28/08 3d
  • 1.1.2.7 Fwd5: Pre-Deployment As-Is Test AL JB Wed 10/29/08 Mon 11/3/08 4d
  • 1.1.2.8 Fwd5: Pre-Deployment FileMax=16k Test AL JB Tue 11/4/08 Mon 11/10/08 5d
  • 1.1.2.9 Milestone: Fwd5 Ready to Deploy AL Mon 11/10/08 Mon 11/10/08 0d
Task Status (2 of 4) (Red = a critical task chain, Green = effectively done, Yellow = added notes)
• 1.1.3 Queuing Node 2 (Que2) AL Wed 10/1/08 Mon 11/10/08 29d
  • 1.1.3.1 INPUT: Que2 Server Hardware On-site AL FEF Wed 10/1/08 Wed 10/1/08 0d
  • 1.1.3.2 Que2 Server Hardware OS Install AL FEF Wed 10/1/08 Mon 10/20/08 14d
  • 1.1.3.3 Que2 Server Hardware Burn-in AL FEF Tue 10/21/08 Tue 10/21/08 1d
  • 1.1.3.4 Que2: Verify Installation AL JB Tue 10/21/08 Wed 10/22/08 2d
  • 1.1.3.5 Que2: Setup VDT 1.10.1 "old"+patches AL JB Wed 10/22/08 Mon 10/27/08 4d
  • 1.1.3.6 Que2: Request and Install Grid Certs AL JB Tue 10/28/08 Wed 10/29/08 2d
  • 1.1.3.7 Que2: Install QUE-node-specific Components… AL JB Tue 10/28/08 Fri 10/31/08 4d
  • 1.1.3.8 Que2: Test w/1-QUE Client AL JB Mon 11/3/08 Wed 11/5/08 3d
  • 1.1.3.9 Que2: Integration Test w/2-QUE Client AL JB Thu 11/6/08 Mon 11/10/08 3d
  • 1.1.3.10 Que2: Jim_Client 2-QUE Support: Client Deploy AL DEV,JB Mon 11/10/08 Mon 11/10/08 1d
  • 1.1.3.11 Milestone: Que2 Ready to Deploy AL Mon 11/10/08 Mon 11/10/08 0d
• 1.1.4 Jim_Client Development for 2 Queue Nodes Support GG Mon 11/3/08 Wed 11/5/08 3d
  • 1.1.4.1 Jim_Client: 2-QUE Node Support: Develop, Package GG ABa Mon 11/3/08 Tue 11/4/08 2d
  • 1.1.4.2 Jim_Client: 2-QUE Node Support: Test w/o Que2 GG ABa Wed 11/5/08 Wed 11/5/08 1d
Task Status (3 of 4) (Red = a critical task chain, Green = effectively done, Yellow = added notes)
• 1.1.5 New Distinct Sam Station AL Wed 10/1/08 Fri 11/14/08 33d
  • 1.1.5.1 SAM Station: Identify Hardware For Role AL FEF Wed 10/1/08 Wed 10/15/08 11d
  • 1.1.5.2 SAM Station: Repurpose, OS Install AL FEF Thu 10/16/08 Fri 10/17/08 2d
  • 1.1.5.3 SAM Station: Verify Platform Installation AL AL Mon 10/20/08 Tue 10/21/08 2d
  • 1.1.5.4 SAM Station: Setup Station AL AL Thu 10/23/08 Wed 10/29/08 5d
  • 1.1.5.5 SAM Station: Pre-Deployment Test AL AL Thu 10/30/08 Wed 11/5/08 5d
  • 1.1.5.6 SAM Station: Deployment Plan AL AL Thu 11/6/08 Thu 11/6/08 1d
  • 1.1.5.7 Milestone: SAM Station Ready to Deploy AL Thu 11/6/08 Thu 11/6/08 0d
  • 1.1.5.8 SAM Station: Setup Context Server AL AL Thu 11/13/08 Fri 11/14/08 2d
• 1.1.6 Deployment Stage 1 AL Mon 11/10/08 Tue 11/25/08 12d
  • 1.1.6.1 Deployment 1: Plan: Split Data/MC Production Services AL ALL Mon 11/10/08 Wed 11/12/08 3d
  • 1.1.6.2 Deployment 1: Execute AL REX Thu 11/13/08 Mon 11/17/08 3d
  • 1.1.6.3 Deployment 1: Monitor AL REX Tue 11/18/08 Mon 11/24/08 5d
  • 1.1.6.4 Deployment 1: Sign-off AL REX Tue 11/25/08 Tue 11/25/08 1d
  • 1.1.6.5 MILE 1: Deployment 1 Completed AL Tue 11/25/08 Tue 11/25/08 0d
Task Status (4 of 4) (Red = a critical task chain, Green = effectively done, Yellow = added notes)
• 1.3.1 SAM-Grid Job Status Info GG Mon 10/13/08 Tue 11/11/08 22d
  • 1.3.1.1 Use "Same" Proxy for Gridftps GG PM Wed 11/5/08 Fri 11/7/08 3d
  • 1.3.1.2 New Job Status Value at QUE Node GG PM Mon 10/13/08 Fri 11/7/08 18d
  • 1.3.1.3 SAM-Grid Release with Job Status Info feature GG PM Mon 11/10/08 Tue 11/11/08 2d
  • 1.3.1.4 Upgrade D0Runjob: Test, Make Workable AL MD Mon 10/27/08 Fri 11/7/08 10d
    • What does this require to be done?
  • 1.3.1.5 Milestone: SAM-Grid Release Deployable for Data Prod AL REX Tue 11/11/08 Tue 11/11/08 0d
• 1.3.2 Slow Fwd-CAB Job Transition AL,GG Wed 10/1/08 Mon 11/17/08 34d
  • 1.3.2.1 Investigation and Recommendations GG PM Wed 10/1/08 Tue 10/14/08 10d
  • 1.3.2.2 LINK: Condor 7 Upgrade Fixes Deployed AL Mon 11/17/08 Mon 11/17/08 0d
  • 1.3.2.3 Increase FileMax value to 16k on FWD1-3 on the fly AL REX Tue 11/4/08 Tue 11/4/08 1d
  • 1.3.2.4 Add FileMax value change to FWD Install Document AL REX Wed 11/5/08 Wed 11/5/08 1d
  • 1.3.2.5 Milestone: FileMax Value Change Deployed AL Wed 11/5/08 Wed 11/5/08 0d
  • 1.3.2.6 Milestone: Known Palliatives for Slow Job Transitions Deployed AL Mon 11/17/08 Mon 11/17/08 0d
• 1.3.3 Improved H/w Uptime AL Mon 10/13/08 Mon 10/13/08 1d
  • 1.3.3.1 Consider a FWD5: Full decoupling w/o virtualization, improved robustness to FWD node failures AL AL Mon 10/13/08 Mon 10/13/08 1d
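Note on durations: the "d" values in the task status rows appear to be inclusive weekday counts between the start and finish dates (milestone and INPUT rows are listed as 0d regardless). The sketch below is one way to cross-check a row under that assumption. The function name `weekday_span` is hypothetical, and holidays are not considered.

```python
from datetime import date, timedelta


def weekday_span(start: date, finish: date) -> int:
    """Count Monday-Friday days from start to finish, inclusive.

    Assumption: schedule durations are inclusive working-day counts
    and no holidays fall inside the span.
    """
    if finish < start:
        raise ValueError("finish precedes start")
    days = 0
    current = start
    while current <= finish:
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            days += 1
        current += timedelta(days=1)
    return days


# 1.1.1 Forwarding Node 4: Wed 10/1/08 to Mon 11/10/08, listed as 29d
print(weekday_span(date(2008, 10, 1), date(2008, 11, 10)))   # 29
# 1.1.2 Forwarding Node 5: Tue 10/14/08 to Mon 11/10/08, listed as 20d
print(weekday_span(date(2008, 10, 14), date(2008, 11, 10)))  # 20
```

A milestone row with the same start and finish date would come out as 1 here, so 0d rows need to be treated as a special case.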
Metrics Summaries
• Work In Progress, Breaking Down Possible States Involved
  • We want to account for downtime: be fair, but from customer’s view
    • “Draining farm for new version of Reco” = customer driven, not D0 Grid
    • “Scheduled downtime” = charged to D0 Grid Service, discretionary (or not)
    • “SRM at Purdue jams up SAM station” = charged to D0 Grid Service
  • So far, it appears that MOST resource non-utilization in September 2008 was due to something other than slots being empty. Rework = wasted CPU cycles too.
• Questions
  • CPU vs Slots Used: Do we have a ganglia summary for D0Farm?
  • Slots Used vs Events Produced: Do we count an event only once even if resubmitted N times?
    • This would mean rework leads to busy slots, busy CPUs, and NO additional events/day
  • Rework Causes – break down observed causes… will arrange outside of this meeting
  • Complete September plot for nSubmissions per dataset?
• Stepping Back: How well are the original 16+4 issues covered in Phase 1?
  • (next week)
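To make the "Slots Used vs Events Produced" question concrete, here is a minimal sketch of one possible accounting, not the metrics code actually in use. It assumes each dataset's events are counted once, on the first successful submission, and that every other attempt is charged as rework. The `Submission` record, its fields, and the `summarize` function are all hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    dataset: str       # hypothetical dataset identifier
    slot_hours: float  # slot time consumed by this attempt
    events: int        # events in the dataset
    succeeded: bool    # whether this attempt produced usable output


def summarize(submissions):
    """Count each dataset's events once; charge all other attempts as rework."""
    total_slot_hours = sum(s.slot_hours for s in submissions)
    produced = {}      # dataset -> events, recorded on first success only
    useful_hours = 0.0
    n_submissions = {}
    for s in submissions:
        n_submissions[s.dataset] = n_submissions.get(s.dataset, 0) + 1
        if s.succeeded and s.dataset not in produced:
            produced[s.dataset] = s.events
            useful_hours += s.slot_hours
    return {
        "slot_hours": total_slot_hours,
        "events_produced": sum(produced.values()),
        "rework_slot_hours": total_slot_hours - useful_hours,
        "submissions_per_dataset": n_submissions,
    }


# Made-up numbers: dataset A needs three attempts, dataset B succeeds at once.
attempts = [
    Submission("A", 10.0, 1000, False),
    Submission("A", 10.0, 1000, False),
    Submission("A", 10.0, 1000, True),
    Submission("B", 8.0, 500, True),
]
print(summarize(attempts))
```

With these made-up inputs, 20 of the 38 slot-hours are rework: the slots and CPUs were busy, yet only 1500 events were added. The per-dataset submission counts are the quantity the nSubmissions plot would show.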