
D0 Grid Data Production Initiative: Coordination Mtg 5

Version 1.0 (pre-meeting draft), 16 October 2008. Rob Kennedy and Adam Lyon. Attending: … Outline: News (RDK has none this week); Action Items from last week; Past Week/This Week Status Check.




Presentation Transcript


  1. D0 Grid Data Production Initiative: Coordination Mtg 5
     Version 1.0 (pre-meeting draft), 16 October 2008
     Rob Kennedy and Adam Lyon
     Attending: …

  2. Outline
     • News – (RDK has none this week)
     • Action Items from last week
     • Past Week/This Week Status Check
     • Initiative Schedule v0.8 (closer to baseline)
     • Deployment “Feature List”

  3. News Items
     • Any?…

  4. Open Action Items
     • Resource-load the schedule, including vacations and other leave (RDK)
         • Mostly done. Need to replace group assignments with individual assignments… this week. Need to add status into the schedule: please do not panic when you see “nothing” done!
         • Have not assigned fractions of people, since “leveling” was already done by hand.
     • Set up JIRA status reporting for critical-path tasks; review staffing (AL)
         • Done, against schedule v0.6.
     • Investigate hardware for a FWD5 to be used at least until virtualization on FWD4 is deployed (AL)
         • Done. Will use “15” (d0srv015) as a FWD at least until virtualization is deployed. Detailed in schedule v0.8. Backfill to be considered.
     • Plan to deploy Condor 7 earlier… make it happen (RDK, GG, AL)
         • Done. Will deploy the patched “old” v1.10.1, then replace it with the “new” v1.10.1 from UWisc at the first deployment after that release is ready and tested.
     • Integrate “Slow Job Transitions” experience into the plan (RDK, AL)
         • Defer “add alarm” as non-critical, but highly desirable for a later phase.
         • Add a task “increase FILEMAX to 16k on FWD#” to the schedule. In schedule v0.6+.

  5. Past/This Week Tasks (1 of 2)
     (Red means a critical task chain, Green means effectively done, … Who is doing work?)
     (Columns: task | lead | resource | start | finish | duration in working days)
     • 1.1.1 Fwd4 Platform | AL
         • 1.1.1.1 INPUT: Fwd4 Server Hardware On-site | AL | FEF | Wed 10/1/08 | Wed 10/1/08 | 0d (DONE)
         • 1.1.1.2 Fwd4 Server Hardware OS Install | AL | FEF | Wed 10/1/08 | Fri 10/10/08 | 8d
         • 1.1.1.3 Fwd4 Server Hardware Burn-in | AL | FEF | Mon 10/13/08 | Fri 10/17/08 | 5d
     • 1.1.2 Fwd4 Grid Service Software and Configuration | AL
         • 1.1.2.1 Receive Fwd4 | AL | REX | Thu 10/16/08 | Fri 10/17/08 | 2d
         • 1.1.2.2 Prepare Deployment 1 Fwd4 Configuration | AL | REX | Thu 10/16/08 | Fri 10/17/08 | 2d
     • 1.1.3 Fwd5 Platform | AL
         • 1.1.3.1 Identify FWD5 hardware: repurpose Test Fwd Node | AL | AL | Tue 10/14/08 | Tue 10/14/08 | 1d
         • 1.1.3.2 "Re-install OS on d0srv015, rename d0samgfwd5" | AL | FEF | Wed 10/15/08 | Fri 10/17/08 | 3d
     • 1.1.4 Fwd5 Grid Service Software and Configuration | AL
         • 1.1.4.1 Receive Fwd5 | AL | REX | Thu 10/16/08 | Fri 10/17/08 | 2d
         • 1.1.4.2 Prepare Deployment 1 Fwd5 Configuration | AL | REX | Thu 10/16/08 | Fri 10/17/08 | 2d
     • 1.1.5 Que2 Platform | AL
         • 1.1.5.1 INPUT: Que2 Server Hardware On-site | AL | FEF | Wed 10/1/08 | Wed 10/1/08 | 0d (DONE)
         • 1.1.5.2 Que2 Server Hardware OS Install | AL | FEF | Wed 10/1/08 | Fri 10/10/08 | 8d
         • 1.1.5.3 Que2 Server Hardware Burn-in | AL | FEF | Mon 10/13/08 | Fri 10/17/08 | 5d
     • 1.1.6 Que2 Grid Service and Client Software and Configuration | AL
         • 1.1.6.1 Receive Que2 | AL | REX | Thu 10/16/08 | Fri 10/17/08 | 2d

  6. Past/This Week Tasks (2 of 2)
     (Continued from previous page)
     • 1.1.7 New Sam Station Platform | AL
         • 1.1.7.1 Identify Hardware For Role | AL | FEF | Wed 10/1/08 | Fri 10/17/08 | 13d
     • 1.3 Small Quick Wins
         • 1.3.1 SAM-Grid Job Status Info | GG
             • 1.3.1.1 Unlimited-time Proxy for Gridftps | GG | PM | Mon 10/13/08 | Wed 10/15/08 | 3d
             • 1.3.1.2 New Job Status at QUE Node | GG | PM | Thu 10/16/08 | Fri 10/31/08 | 12d
         • 1.3.3 Improved H/w Uptime | AL |  | Mon 10/13/08 | Mon 10/13/08 | 1d
             • 1.3.3.1 "Consider FWD5: Full decoupling w/o virtualization, improved robustness to FWD node failures" | AL | AL | Mon 10/13/08 | Mon 10/13/08 | 1d (DONE)
                 • May need to backfill the old test FWD node… or perhaps even run all “spares” in production as hot fall-back?
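The durations on slides 5 and 6 appear consistent with counting working days (Monday to Friday), inclusive of both the start and finish dates, with 0d marking milestones. For example, Wed 10/1/08 to Fri 10/10/08 covers 3 working days in the first week and 5 in the second, giving the listed 8d. A minimal Python sketch for re-checking such entries, assuming no holidays are excluded (Mon 10/13/08 is counted as a working day in the listed 5d burn-in); the helper name `working_days` and the shortened task labels are illustrative only:

```python
from datetime import date, timedelta


def working_days(start: date, finish: date) -> int:
    """Count Monday-Friday days from start to finish, inclusive of both ends.

    Assumption: no holidays are excluded; only weekends are skipped.
    """
    if finish < start:
        raise ValueError("finish precedes start")
    count = 0
    day = start
    while day <= finish:
        if day.weekday() < 5:  # weekday(): 0 = Monday ... 4 = Friday
            count += 1
        day += timedelta(days=1)
    return count


# (start, finish, listed duration) copied from slides 5 and 6;
# 0d milestones are left out because they are zero-length by convention.
TASKS = {
    "1.1.1.2 Fwd4 OS Install": (date(2008, 10, 1), date(2008, 10, 10), 8),
    "1.1.1.3 Fwd4 Burn-in": (date(2008, 10, 13), date(2008, 10, 17), 5),
    "1.1.2.1 Receive Fwd4": (date(2008, 10, 16), date(2008, 10, 17), 2),
    "1.1.3.1 Identify FWD5 hardware": (date(2008, 10, 14), date(2008, 10, 14), 1),
    "1.1.3.2 Re-install OS on d0srv015": (date(2008, 10, 15), date(2008, 10, 17), 3),
    "1.1.7.1 Identify Hardware For Role": (date(2008, 10, 1), date(2008, 10, 17), 13),
    "1.3.1.1 Unlimited-time Proxy": (date(2008, 10, 13), date(2008, 10, 15), 3),
    "1.3.1.2 New Job Status at QUE Node": (date(2008, 10, 16), date(2008, 10, 31), 12),
}

if __name__ == "__main__":
    for name, (start, finish, listed) in TASKS.items():
        computed = working_days(start, finish)
        status = "ok" if computed == listed else "MISMATCH"
        print(f"{name}: listed {listed}d, computed {computed}d, {status}")
```

With the dates above, every entry reports "ok"; a MISMATCH line would flag either a typo in the schedule or a holiday that the scheduling tool excluded.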

  7. Schedule v0.8 (Phase 1)
     [Gantt chart not reproduced. Markers: Today; Th-Day (Thanksgiving) Holiday. Rows: Fwd 4 Prep, Fwd 5 Prep, Que 2 Prep, SAM’ Prep, Deploy 1, VDT “new”, Deploy 2, Job Status Dev, Filemax, Metrics, Summaries]

  8. Current Deployment “Feature” Lists
     • Deployment 1: Split Data/MC Production Services
         • Time frame: Nov 13-17, with 1 week+ of observation before the holidays
         • 1. Config: Basic splitting of Fwd and Que services between Data and MC Production, with 2 Fwd nodes assigned to each, plus 1 Fwd node for all merging
             • 1. Fwd4 deployed w/o virtualization
             • 2. Fwd5 deployed
             • 3. Que2 deployed, with client software to enable parallel use of 2 QUE nodes
             • 4. New SAM Station (moved off of FWD1)
             • 5. Condor 7 via “old” 1.10.1 + patches
                 • New 1.10.1 official release from UWisc if available and tested in time
     • Deployment 2: Optimize Data and MC Production Configurations
         • Time frame: Dec 8-10, with 1 week+ of observation before the holidays
         • 1. Config: Optimize configurations separately for Data and MC Production, especially to increase the Data Production “queue” length
         • 2. Condor 7 via “new” 1.10.1 official release from UWisc

  9. Meeting Discussion Summary
     • …
