Lessons Learned: The Organizers
Kinds of Lessons*
• Operational
  • Distributing the code
  • Making the sky data
    • Required compute resources
    • Required people resources
  • Remaking the sky data
  • Remaking the sky data
  • Distributing the data – DataServers
• Functional
  • Problems extracting livetime history
  • Problems extracting pointing history – SAA entry/exit
• Organizational
  • How/when to draw on expert help for problem solving
  • Sky model
  • Confluence/Workbook
• Analysis
  • Access to standard cuts
  • GTIs
  • Livetime cubes, diffuse response
* Or things to fix for next time
Making the Sky 1
• Code distribution
  • Navid made nice self-installers with wrappers that took care of environment variables, etc.
  • Creation of distributions is semi-manual; we should find out how to automate it (rules-based).
• We needed far more compute resources than we anticipated
  • 200k CPU-hrs for background and sky generation
  • Did sky generation (30k CPU-hrs) twice
  • Need more compute resources under our control than planned – maxed out at SLAC with SVAC, DC2, BT, and Handoff all running
  • Aiming for 350–400 "GLAST" boxes, plus calls on the SLAC general queues for noticeable periods (see the rough sizing sketch below)
  • Berrie ran 10,000 jobs at Lyon for the original background CT runs – a horrible thing to have to do
    • Manually transferred merit files back to SLAC
  • Extend the LAT automated pipeline infrastructure to make use of non-SLAC compute farms (may or may not have to transfer files back to SLAC) – Lyon; UW; Padova?; GSFC-LHEA?
  • Speaks to maximizing simulation capability
• We juggled priority with SVAC commissioning
• Pipeline 1 handled two "streams" well enough; more would have been tricky
• Ate up about 3–4 TB of disk to keep all the MC, Digi, Recon, etc. files
• Pipeline 2
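For scale, a rough back-of-the-envelope conversion of the quoted CPU-hours into wall-clock time on a dedicated farm; the farm sizes below are illustrative assumptions, not DC2 bookkeeping:

```python
# Rough sizing sketch: wall-clock time for the quoted 200k CPU-hours of
# background + sky generation on farms of various (assumed) sizes.
TOTAL_CPU_HOURS = 200_000      # background + sky generation (from the slide)
SKY_GEN_CPU_HOURS = 30_000     # one sky-generation pass (done twice)

for boxes in (200, 350, 400):
    wall_clock_days = TOTAL_CPU_HOURS / boxes / 24.0
    print(f"{boxes:>3} boxes -> ~{wall_clock_days:.0f} days of continuous running")

# One sky-generation pass alone, on the nominal 400-box farm:
print(f"Sky gen pass: ~{SKY_GEN_CPU_HOURS / 400 / 24.0:.1f} days")
```

Even at the full 400-box target this is roughly three weeks of continuous running, which is why spilling onto the SLAC general queues and outside farms became necessary.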
Making the Sky 2
• People resources
  • Tom Glanzman put his BABAR expertise to work minimizing our exposure to SLAC resource bottlenecks
    • Accessing NFS from upwards of 400 CPUs was the biggest problem
    • Use AFS and batch-node local disk as much as possible (see the staging sketch below)
    • Made good use of SCS's Ganglia server/disk monitoring tools
    • Developed pipeline performance plots (as shown at the Kickoff meeting)
  • Tom and I (mostly Tom) ran off the DC2 datasets
    • Some complexity due to the secret-sky code and configs
    • Some complexity due to last-minute additions of variables calculated outside Gleam
    • Effort was front-loaded in setting up tasks; after that, a fairly small load to monitor/repair during routine running, with some cleanup at the end
    • The Root4-to-Root5 transition disrupted the DataServer
  • Will likely need a "volunteer" for future big LAT simulations
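A minimal sketch of the stage-to-local-disk pattern referred to above: copy inputs from shared storage to node-local scratch once, run against the local copies, and copy results back once at the end. The paths and the process_run() hook are hypothetical placeholders, not the actual DC2 job wrapper:

```python
# Stage-in / stage-out pattern to avoid hammering shared NFS from ~400
# batch nodes at once: one bulk copy in, one bulk copy out per job.
import shutil
import tempfile
from pathlib import Path

SHARED_IN = Path("/nfs/glast/dc2/inputs")    # hypothetical shared area
SHARED_OUT = Path("/nfs/glast/dc2/outputs")  # hypothetical shared area

def run_one_job(run_id: str) -> None:
    with tempfile.TemporaryDirectory(prefix=f"dc2_{run_id}_") as scratch:
        scratch = Path(scratch)
        local_in = scratch / "in"
        shutil.copytree(SHARED_IN / run_id, local_in)   # single sequential read

        local_out = scratch / "out"
        local_out.mkdir()
        process_run(local_in, local_out)                # placeholder for the real work

        shutil.copytree(local_out, SHARED_OUT / run_id) # single sequential write

def process_run(in_dir: Path, out_dir: Path) -> None:
    ...  # simulation/recon step would go here
```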
Grab Bag
• Great to have GBM involved!
  • Should at least keep an archival copy of the GBM simulation code used
• DC2 Confluence worked
  • Nice organization by Seth on the Forum and Analysis pages
  • Easy to use and peruse
  • Will clone for the Beamtest
• Great teamwork
  • It was really fun to work with this group
  • The secret sky made it hard to ask many people to help with problems – but that is behind us now
• Histories
  • Pointing and livetime needed manual intervention to fix SAA passages, etc. Should track that down.
• Analysis details
  • Might have been nice to have Class A/B in merit (IMHO)
  • GTIs were a pain if you got them wrong; the tools are now more tolerant (see the interval sketch below)
  • Livetime cubes were made by hand
  • The diffuse response in FT1 was somewhat cobbled together
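On the GTI point: the bookkeeping is just sorted-interval arithmetic, but it is easy to get wrong. A minimal generic sketch of intersecting two GTI lists, not the actual science-tools implementation:

```python
# Good Time Interval (GTI) intersection: each list is sorted,
# non-overlapping (start, stop) pairs; an event is only valid if it lies
# inside every applicable list.
def intersect_gtis(a, b):
    """Intersect two sorted lists of (start, stop) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        stop = min(a[i][1], b[j][1])
        if start < stop:              # the two intervals actually overlap
            out.append((start, stop))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

# Example: observation GTIs clipped by an exclusion window (e.g. an SAA passage).
obs = [(0.0, 1000.0), (1200.0, 2000.0)]
good = [(0.0, 900.0), (950.0, 1800.0)]
print(intersect_gtis(obs, good))
# [(0.0, 900.0), (950.0, 1000.0), (1200.0, 1800.0)]
```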
GSSC Data Server
• 890 hits total during DC2
• Repopulating the server is manual; two months of data takes about 5 hours
• Brings up questions:
  • What chunks of data will be retransmitted to the GSSC?
  • What are the "failure modes" for data delivery?
  • What will "Event" data look like?
  • How many versions of the data are to be kept online in the servers?
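If the repopulation time scales roughly linearly with the span of data (an assumption, not something measured during DC2), the quoted 5 hours per two months extrapolates as follows:

```python
# Extrapolation of manual repopulation cost, assuming linear scaling with
# the amount of data; the longer spans are illustrative only.
HOURS_PER_MONTH = 5.0 / 2.0   # ~5 hours for 2 months, from the slide

for months in (2, 6, 12, 24):
    print(f"{months:>2} months of data -> ~{months * HOURS_PER_MONTH:.0f} hours to repopulate")
```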
LAT DataServer Usage
• Half of the usage came from Julie!
• Similar questions arise as for the GSSC server
Lessons
• Statistics don't include the "astro" data server or WIRED event display use.
• Lessons learned
  • Problem: jobs running out of time
    • Need a more accurate way to predict run time, or run jobs with no time limit (see the sketch below)
  • Problem: need clearer notification to the user when a job fails
• The LAT Astro server never got the GTIs right
  • Hence it was little used, even as a west-coast US mirror
  • We were not able to implement an efficient connection to the Root files (the main reason for its existence). Still needs work.
• Unknown whether the limited use of the Event Display is significant.
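One possible approach to the time-limit problem, sketched under assumed (not measured) per-event rates: derive the requested batch time limit from the job's event count plus a safety margin.

```python
# Predict a batch wall-clock limit from job size. The per-event rate and
# overhead below are placeholders, not measured DC2 numbers; in practice
# they would be fit from previously completed jobs.
def predict_time_limit(n_events: int,
                       sec_per_event: float = 0.5,   # assumed per-event cost
                       startup_sec: float = 300.0,   # job setup overhead
                       safety_factor: float = 1.5) -> int:
    """Return a requested wall-clock limit in seconds for the batch system."""
    estimate = startup_sec + n_events * sec_per_event
    return int(estimate * safety_factor)

# e.g. a 50k-event chunk
print(predict_time_limit(50_000))   # 37950 seconds, roughly 10.5 hours
```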