
Lessons Learned: The Organizers



  1. Lessons Learned: The Organizers

  2. Kinds of Lessons*
     • Operational
        • Distributing the Code
        • Making the sky data
        • Required compute resources
        • Required people resources
        • Remaking the sky data
        • Remaking the sky data
        • Distributing the data - DataServers
     • Functional
        • Problems extracting livetime history
        • Problems extracting pointing history – SAA entry/exit
     • Organizational
        • How/when to draw on expert help for problem solving
        • Sky model
        • Confluence/Workbook
     • Analysis
        • Access to standard cuts
        • GTIs
        • Livetime cubes, diffuse response
     * Or things to fix for next time

  3. Making the Sky 1
     • Code Distribution
        • Navid made nice self-installers with wrappers that took care of env vars etc.
        • Creation of distributions is semi-manual
        • Should find out how to automate – rules based
     • We needed far more compute resources than we anticipated (a rough wall-clock estimate is sketched after this slide)
        • 200k CPU-hrs for background and sky generation
        • Did sky gen (30k CPU-hrs) twice
        • ⇒ Need more compute resources under our control than planned – maxed out at SLAC with SVAC, DC2, BT, Handoff
        • Aiming for 350–400 “GLAST” boxes + call on SLAC general queues for noticeable periods
        • Berrie ran 10,000 jobs at Lyon for the original background CT runs – a horrible thing to have to do
           • Manually transferred merit files back to SLAC
        • ⇒ Extend LAT automated pipeline infrastructure to make use of non-SLAC compute farms (may or may not have to transfer files back to SLAC) – Lyon; UW; Padova?; GSFC-LHEA?
        • Speaks to maximizing sims capability
     • We juggled priority with SVAC commissioning
     • Pipeline 1 handles 2 “streams” well enough
        • More would have been tricky
     • Ate up about 3–4 TB of disk to keep all MC, Digi, Recon etc. files
     • ⇒ Pipeline 2
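A rough sense of why 350–400 dedicated boxes were the target: a minimal back-of-the-envelope sketch using the 200k CPU-hr figure from the slide. The 80% busy-fraction is an assumption to allow for queue gaps and reruns, not a measured number.

```python
# Back-of-the-envelope wall-clock estimate for the sky/background campaign.
# 200k CPU-hrs and 350-400 boxes come from the slide; the efficiency factor
# is an assumption to account for queue gaps, failures, and reruns.

def wall_clock_days(cpu_hours, n_cores, efficiency=0.8):
    """Days of wall-clock time to burn `cpu_hours` on `n_cores`,
    assuming each core is busy `efficiency` of the time."""
    return cpu_hours / (n_cores * efficiency) / 24.0

for boxes in (350, 400):
    print(f"{boxes} boxes: ~{wall_clock_days(200_000, boxes):.0f} days")
# -> roughly 26-30 days of dedicated running, i.e. about a month,
#    before counting the second pass over the sky generation.
```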

  4. Making the Sky 2
     • People resources
        • Tom Glanzman applied his BABAR expertise to minimize exposure to SLAC resource bottlenecks
           • Accessing NFS from upwards of 400 CPUs was the biggest problem
           • Use AFS and batch-node local disk as much as possible (see the sketch after this slide)
        • Made good use of SCS’ Ganglia server/disk monitoring tools
        • Developed pipeline performance plots (as shown at the Kickoff meeting)
     • Tom and I (mostly Tom) ran off the DC2 datasets
        • Some complexity due to secret sky code and configs
        • Some complexity due to last-minute additions of variables calculated outside Gleam
        • Effort front-loaded – setting up tasks
        • Now a fairly small load to monitor/repair during routine running
        • Some cleanup at the end
        • Root4 → Root5 transition disrupted the DataServer
     • Will likely need a “volunteer” for future big LAT simulations
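A minimal sketch of the “use batch-node local disk” pattern mentioned above: each job stages its input from NFS once, works on node-local scratch, and copies its result back once, instead of hitting the NFS server continuously from hundreds of cores. The paths and the process_run() helper are hypothetical placeholders, not the actual DC2 task code.

```python
import os
import shutil
import tempfile

def process_run(inp, out):
    """Placeholder for the real reconstruction/merit step (hypothetical)."""
    shutil.copy(inp, out)

def run_on_local_scratch(nfs_input, nfs_output_dir):
    scratch = tempfile.mkdtemp(dir="/scratch")        # node-local disk (assumed mount point)
    try:
        local_in = shutil.copy(nfs_input, scratch)    # one sequential NFS read per job
        local_out = os.path.join(scratch, "merit.root")
        process_run(local_in, local_out)              # heavy work stays on local disk
        shutil.copy(local_out, nfs_output_dir)        # one sequential NFS write per job
    finally:
        shutil.rmtree(scratch, ignore_errors=True)    # don't leak scratch space
```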

  5. Grab Bag
     • Great to have GBM involved!
        • Should at least have an archival copy of the GBM simulation code used
     • DC2 Confluence worked
        • Nice organization by Seth on the Forum and Analysis pages
        • Easy to use and peruse
        • Will clone for Beamtest
     • Great teamwork
        • It was really fun to work with this group
        • The secret sky made it hard to ask many people to help with problems – but that is behind us now
     • Histories
        • Pointing and livetime needed manual intervention to fix SAA passages etc. Should track that down.
     • Analysis details
        • Might have been nice to have Class A/B in merit (IMHO)
        • GTIs were a pain if you got them wrong. Tools now more tolerant. (A small GTI-check sketch follows this slide.)
        • Livetime cubes were made by hand
        • Diffuse response in FT1 was somewhat cobbled together
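Since the GTI problems come up again on the data-server slides, here is a minimal sketch of sanity-checking events against the GTI extension of an FT1 file with astropy; the file name is a placeholder. The point is only that an event is good when its TIME falls inside some [START, STOP) interval, which is easy to get wrong when the event selection and the GTIs are produced separately.

```python
import numpy as np
from astropy.io import fits

# "events.fits" is a placeholder for an FT1 file with EVENTS and GTI extensions.
with fits.open("events.fits") as hdul:
    times = hdul["EVENTS"].data["TIME"]
    gti = hdul["GTI"].data
    starts, stops = gti["START"], gti["STOP"]

# For each event, check whether it lies inside at least one good-time interval.
inside = np.zeros(len(times), dtype=bool)
for t0, t1 in zip(starts, stops):
    inside |= (times >= t0) & (times < t1)

print(f"{inside.sum()} of {len(times)} events fall inside the GTIs")
```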

  6. GSSC Data Server
     • 890 hits total during DC2
     • Repopulating the server is manual; two months of data takes about 5 hrs
     • Brings up questions:
        • What chunks of data will be retransmitted to the GSSC?
        • What are the “failure modes” for data delivery?
        • What will “Event” data look like?
        • How many versions of the data are to be kept online in the servers?

  7. LAT DataServer Usage
     • ½ of the usage came from Julie!
     • Similar questions posed as for the GSSC server

  8. Lessons
     • Statistics don’t include the “astro” data server or WIRED event display use.
     • Lessons learned
        • Problem: jobs running out of time
           • Need a more accurate way to predict runtime, or run jobs with no time limit (see the sketch after this slide)
        • Problem: need clearer notification to the user if a job fails
     • LAT Astro server never got the GTIs right
        • Hence little used, even as a west-coast US mirror
     • Were not able to implement an efficient connection to Root files (the main reason for its existence). Still needs work.
     • Unknown whether the limited use of the Event Display is significant.
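One way to attack the “jobs running out of time” problem without removing the limit entirely: base the requested wall-time on observed runtimes from earlier runs of the same task. A hedged sketch, with made-up numbers for illustration; the percentile and padding factor are assumptions, not tuned values.

```python
import numpy as np

def suggest_time_limit(past_runtimes_hrs, percentile=95, pad=1.5):
    """Suggested queue time limit (hours) from a sample of past job runtimes."""
    return float(np.percentile(past_runtimes_hrs, percentile)) * pad

observed = [2.1, 2.4, 1.9, 3.0, 2.2, 2.8, 5.5]   # illustrative runtimes only
print(f"request ~{suggest_time_limit(observed):.1f} h instead of a flat default")
```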
