ES Slowdown, Optimization, Testing
Plan for shutdown: Timeline
• April: Focus on resolution of major outstanding issues:
  • Bulk data deployment / stable use of multiple arrays: status nominally allows effective use of multiple arrays, but it is not clear whether the high rate of execution failures (5 of 15–20) is related.
  • Notification channel issues: often result in long delays at startup and, at times, missing data (WVR/Tsys). The problem comes and goes.
  • Data capture issues: container crashes result in lost data and limit the maximum execution time. Status: fixes are in, requires testing/verification.
  • "Handover" time has grown: systems not coming up is a combination of hardware and software problems. Tzu/Nick/Emilio are working on this.
• May:
  • Start acceptance testing.
  • Focus on simulation and testing improvements begins.
  • Potential missions to work on remaining issues if needed.
  • Focus of debugging moves from issues affecting Cycle 1 to issues affecting the CSV scope of testing for future cycles.
Plan for shutdown: planned missions
• Bulk data: mission completed, but issues still remain. CSV is using a "stable" release that has troubles (typically about 5–7 executions a night fail for reasons that look like timeout issues).
• Notification channel mission: unfortunately marginalized by the power shutdown.
• Data capture mission: after this meeting.
Plan for shutdown: Obsmode Suite
• Test suite for basic, science-like executions.
• Tests SSR/SOS functionality, data recording, and a range of Cycle 1 capabilities.
• Completed late March: it is taking a while to get repeatably good datasets.
• Despite this, reasonable data is now getting to the pipeline testers (SACM group).
• This will become the SB execution regression for Cycle 1. It will be extended with Cycle 2 capabilities as they arrive and are verified (we already have a polarization "science-like" SB that will evolve this way).
• All reduction is intended to be done with the Pipeline. Where it cannot be, the results will inform the reduction of new modes.
Plan for shutdown: Software basic
• Minimal set of tests run as a weekly regression to verify functionality.
• Will likely consist of Total Power, autocorrelation, and ACA+BL correlator runs at the same time (4–5 executions):
  • Total Power raster
  • Autocorrelation raster (to be combined with the above when dual mode works)
  • ACA+BL executions of PNT, SBR, Tsys, Bandpass, PNT, Tsys, Bandpass
• These basic executions are intentionally not pipeline-reducible (too much overhead for a weekly regression).
• Initial set defined; need to iterate with ADC on the details. The initial proposal was sent to ADC and has evolved a bit.
• A toolkit is being developed that works on the MS side. Metrics will include things like detectDelayJump(threshold, timescale), detectPlatforming(threshold), etc. (see the sketch after this list).
• Also using scan, spw, and data-size metrics to make sure everything that should be there is there.
• Check the flagging fraction.
• It is assumed that this is throwaway code whose checks will eventually be reimplemented as metrics in the pipeline. Discuss the timeline?
• The idea is for Computing to run the tests and for Science to provide the pass/fail criteria.
• Contributions likely from CSV, DSO, and the ARCs, spanning SACM, DMG, and Pipeline-related staff.
• Deadline for design and execution blocks: April 30.
• Deadline for the toolkit: TBD (in progress, will likely evolve).
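A minimal sketch of what such toolkit metrics could look like. Only the function names detectDelayJump and detectPlatforming come from the slide; the signatures, the numpy-only implementation, and the assumption that phases, frequencies, and flags have already been read out of the MeasurementSet are all illustrative.

```python
# Illustrative sketch of the MS-based metric toolkit. Only the
# function names come from the plan; everything else is assumed.
import numpy as np

def detectDelayJump(phase, freq, times, threshold, timescale):
    """Flag delay jumps: a delay error appears as a phase slope vs
    frequency, so a jump shows up as a sudden change of that slope
    between integrations closer together than `timescale`.

    phase : (ntime, nchan) visibility phases [rad]
    freq  : (nchan,) channel frequencies [Hz]
    times : (ntime,) integration timestamps [s]
    threshold : slope change that counts as a jump [rad/Hz]
    """
    # Least-squares phase-vs-frequency slope for each integration.
    slopes = np.polyfit(freq, phase.T, deg=1)[0]
    dslope = np.abs(np.diff(slopes))
    dt = np.diff(times)
    jumps = np.where((dslope > threshold) & (dt <= timescale))[0]
    return times[jumps + 1]  # timestamps where a jump first appears

def detectPlatforming(spectrum, threshold):
    """Flag 'platforming' (step-like amplitude offsets between channel
    blocks) by comparing medians of adjacent 1/8-spectrum blocks."""
    blocks = np.array_split(np.asarray(spectrum), 8)
    med = np.array([np.median(b) for b in blocks])
    return bool(np.any(np.abs(np.diff(med)) > threshold))

def flaggingFraction(flags):
    """Fraction of flagged visibilities, for the pass/fail report."""
    return np.count_nonzero(flags) / flags.size
```

In real use the arrays would presumably be extracted from the MS per baseline and spw; keeping the metrics array-based makes them easier to absorb into the pipeline later, as the slide anticipates.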
Plan for shutdown: Plan for intensive
• Designed to catch issues often present in major releases.
• Again, design tools that can eventually be put into the Pipeline when time allows.
• All of these need SSR work to automate source selection for the "science target" as well as the calibrators. Special execution scripts are not needed.
• Designed to use the Pipeline (initial reduction done in the Pipeline; all tools being created are intended to be absorbed into the Pipeline eventually):
  • Frequency labels (SB created)
  • Phase transfer, phase/delay jump (mixed mode, SB created)
  • Return to phase/delay after band change (SB to be made by end of April)
  • TDM phase/delay jump and platforming detection (SB to be created by end of April, fast dumps)
  • Scan sequence stresses/latency check (SB to be created by end of April; see the sketch after this list)
• Not to be reduced, at least to first order, in the Pipeline:
  • Verify execution of all CalTargets and that the results are repeatable.
  • Includes data checks for "applied online" as well as "reduced offline" targets.
• The intensive suite will incorporate new capabilities as they come forward, with the goal of not introducing new tests but folding new features into the old tests (not a new idea…).
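As a concrete example of these checks, here is a hypothetical sketch of the scan-sequence/latency test: it verifies that the executed scan intents match the SB and flags long inter-scan gaps. The Scan fields, the 30 s default, and checkScanSequence itself are assumptions for illustration, not existing ALMA code.

```python
# Hypothetical scan-sequence/latency check (all names illustrative).
from dataclasses import dataclass

@dataclass
class Scan:
    intent: str    # e.g. "BANDPASS", "PHASE", "TARGET"
    start: float   # seconds since execution start
    end: float

def checkScanSequence(scans, expected_intents, max_gap=30.0):
    """Return (passed, problems): fail when the executed intent
    sequence deviates from the SB, or when the gap between scans
    exceeds max_gap seconds (a latency stress indicator)."""
    problems = []
    intents = [s.intent for s in scans]
    if intents != expected_intents:
        problems.append(f"sequence mismatch: {intents} != {expected_intents}")
    for prev, cur in zip(scans, scans[1:]):
        gap = cur.start - prev.end
        if gap > max_gap:
            problems.append(f"{gap:.1f}s latency before {cur.intent} scan")
    return (not problems), problems
```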
Plan for next year: SSR/SOS and unit tests
• SSR/SOS review completed Monday/Tuesday.
• High priority placed on the query interface refactor:
  • Would like to eventually migrate things into the calibrator catalog interface; the design will ease doing this at a later date.
  • Target-based queries go into the target.
• High priority placed on merging observing mode functions that currently live on the SSR/SOS side into Control where needed, making the SSR/SOS obsmode inherit from Control rather than the other way around (don't ask…). See the sketch after this list.
• Development of Sessions, Observatory Calibration Scripts, and new modes will add an ObservingStrategy layer.
• The timeline for this full refactor is ~1 year, given manpower and the need to develop some new functionality on our side.
• Development will be done in parallel branches, with the refactor worked on in one branch and separable new capabilities in another.
• ScanLists will manage the logic of execution breaks (currently the ScanList is a dumb handler).
• Unit tests will be updated as time allows.
• Development/refactor assignments: N. Phillips (SIST), observatory calibration scripts; P. Cortes (DSO), sessions and observing strategy rework; Ignacio Toledo (DSO-DA), query refactor; S. Corder (CSV), ScanList intelligence; design assignments as possible to other groups (this item is completely dependent on the refactor and is not on the critical path for Cycle 2).
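A speculative sketch of the layering the refactor aims at (all class names invented for illustration): Control owns the base observing mode, the SSR/SOS mode inherits from it rather than the reverse, ObservingStrategy sits on top for Sessions and observatory calibration scripts, and ScanList owns the execution-break logic instead of being a dumb handler.

```python
# Speculative sketch of the post-refactor layering; names invented.

class ControlObsMode:
    """Base observing mode, owned by Control: the inheritance target
    after the refactor, not the other way around."""
    def execute_scan(self, scan):
        print(f"executing {scan}")

class SsrObsMode(ControlObsMode):
    """SSR/SOS specialization: adds source/calibrator selection on
    top of Control's execution machinery."""
    def select_calibrator(self, target):
        return f"cal-for-{target}"  # placeholder query-interface call

class ScanList:
    """No longer a dumb handler: decides where execution breaks go."""
    def __init__(self, scans, max_scans_per_block=5):
        self.scans = scans
        self.max_scans_per_block = max_scans_per_block
    def blocks(self):
        n = self.max_scans_per_block
        return [self.scans[i:i + n] for i in range(0, len(self.scans), n)]

class ObservingStrategy:
    """New top layer for Sessions / observatory calibration scripts."""
    def __init__(self, mode: ControlObsMode, scan_list: ScanList):
        self.mode, self.scan_list = mode, scan_list
    def run(self):
        for block in self.scan_list.blocks():  # break between blocks
            for scan in block:
                self.mode.execute_scan(scan)
```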
Optimization/Coordination
• Who will do which work, and during what array time?
• What is the timescale for getting performance metrics into the pipeline?
• Has CSV left anything out that would help provide a long-term viable observatory operational model (>3 years)?
• Are the divisions of the testing suites appropriate/complete?
• What level of support can be provided for the refactor/unit tests?
• What is the model for getting more coordinated and complete testing into the lower level?
• Can we test with a more realistic simulation environment (better testing of interactions)?
• Can we test with better scalability considerations?