Strategy and Tools to support the exploitation model – OP-Centric View Vito Baggiolini with input from: M. Arruat, J-C. Bau, L. Burdzanowski, M. Buttner, S. Deghaye, F. Ehm, R. Gorbonosov, G. Kruk, J. Lauener, M. Pace, and others
Strategic Goals for 2014 • Enable/motivate OP to do first line diagnosis and find the right expert to call • Get CO project teams to contribute to first line diagnostic tools • Prevent failures => less need for 1st line diagnostics (not covered in this presentation)
Outline • Overview of basic diagnostic use cases • Tools available and possible improvements for each case • Tools for more advanced diagnosis • Further aspects and a summary
Different 1st Line Diagnostic Tools for Different Cases • Case 1: Failures in request/reply systems • Operator clicks on a button, expects a result but gets an error or a timeout • E.g. InCA WorkingSet, LSA Applications, Sequencer, OASIS, … • Case 2: Alarms in LASER, red entries in DiaMon • Case 3: Failures in monitoring-based systems • E.g. Logging system, PostMortem Analysis, Software Interlock, timing, FixedDisplays, … • Often background processes without GUI • Failure not immediately visible to OP • OP discovers error later in some place not directly related to fault
Case 1: Request Failures seen in GUIs • Goal: OP should not just call the responsible of the GUI… …but identify the system that caused the error and call the corresponding expert • Some integrated diagnostic tools are available (or promised) • WorkingSet/Knob diagnostics • InCA Server diagnostics (promised for April) • “Why-no-permit” functionality in Software Interlock System • JAPC Monitoring Diagnostic GUI • … • Good error messages can be very useful for OP (c.f. e-logbook) • Bad ones are not even useful for the expert
Examples of Good (because clear) error messages
Exception: [cern.japc.ParameterException] "Server 'BQBBQLHC.cfv-ua47-bqpll' is down or unreachable"
Problem loading the settings for the TCTs: TCTVA.4L2.B1: java.lang.Exception: Rejected, first point of profile different from current position!
asynchronous operation on LHC.OFSU/Orbit@LHC.USER.LHC failed cern.japc.ParameterException: Server overloaded and not able to handle all requests, will loose updates.
One long but good (because complete) error message
LHC OP logbook [Tuesday 04-Sep-2012 Morning] 2012/09/04 13:45:57 DRIVE ABORT GAP CLEANING SETTING [ERROR] DriveException [Failed to drive some parameters AbortGapCleanVB1/CleaningStrength: cern.cmw.AccessDenied: RBAC Exception: Authorization access denied because there is no matching access rule.
Transaction: [SET on AbortGapCleanVB1#CleaningStrength]
Token [serial=0x71eb758d; user=lhcop; authTime=2012.09.04 08:04:22; endTime=2012.09.04 16:03:22; application=LHC Sequencer GUI; location=CCC-LHC; locationAddress=172.18.200.124; roles=[DIAMON-Operator, SeqLhcOperator, SeqHwcOperator, SeqSpsOperator, LHC-Operator, OP-Daemon, SPS-Operator]]
RDA Info [clientHost=cs-ccr-lsa1.cern.ch; remoteUser=copera; serverName=ALLAbortGapClean.cfv-sr4-adtvb1]
RBAC Setup [policy=strict, mode="OPERATIONAL"]
RBAC Error message (cont)
RBAC Access Map excerpt for [SET on AbortGapCleanVB1#CleaningStrength]:
Class | Property | Device | Role | Application | Location | Mode | Operation
ALLAbortGapClean | CleaningStrength | AbortGapCleanVB1 | RF-Expert | * | RF-864-Control-Room | * | SET
ALLAbortGapClean | CleaningStrength | AbortGapCleanVB1 | RF-Expert | * | RF-SR4-Control-Room | * | SET
ALLAbortGapClean | CleaningStrength | AbortGapCleanVB1 | RF-Expert | * | * | NON-OPERATIONAL | SET
ALLAbortGapClean | CleaningStrength | AbortGapCleanVB1 | * | * | RF-FrontEnds | * | SET
ALLAbortGapClean | CleaningStrength | AbortGapCleanVB1 | * | * | RF-Servers | * | SET
ALLAbortGapClean | CleaningStrength | AbortGapCleanVB1 | RF-LHC-Piquet | * | * | * | SET
ALLAbortGapClean | CleaningStrength | AbortGapCleanVB1 | RF-Expert | * | CCC-LHC | * | SET
Clearly a team that got too many spurious support requests ;-)
Ambiguous Error Message
LHC OP [Tuesday 04-Sep-2012 Afternoon] 20:35:01 - 2012-09-04 20:35:01,171 [JapcExecutor-thread-31] ERROR JapcDispatcherImpl +==> Exception occured for parameter id [HC.BLM.SR5.R_ENERGY_ACQ] name [HC.BLM.SR5.R/SISMonitoring]
asynchronous operation on HC.BLM.SR5.R/SISMonitoring@ failed cern.japc.ParameterException: Notification failure.
at cern.japc.ext.cmwrda.DataTranslator.createParameterException(DataTranslator.java:269)
at cern.japc.ext.cmwrda.RDAReplyHandler.handleError(RDAReplyHandler.java:115)
at cern.cmw.rda.client.ReplyAdapter.handleException(ReplyAdapter.java:88)
at cern.cmw.rda.client.ServerConnection.handleReplyPacket(ServerConnection.java:2622)
at cern.cmw.corba.rda.ReplyHandlerPOA._invoke(ReplyHandlerPOA.java:66)
at org.jacorb.poa.RequestProcessor.invokeOperation(RequestProcessor.java:297)
at org.jacorb.poa.RequestProcessor.process(RequestProcessor.java:591)
at org.jacorb.poa.RequestProcessor.run(RequestProcessor.java:734)
Caused by: cern.cmw.IOError: [FesaInternalError - -1] Notification failure.
at cern.cmw.rda.client.ServerConnection.insertValue(ServerConnection.java:2768)
at cern.cmw.rda.client.ServerConnection.handleReplyPacket(ServerConnection.java:2610)
... 4 more
FESA or CMW problem?
Improvements for Request/Reply Failures • Provide good error messages, especially on system boundaries • Short information understandable by OP to pinpoint faulty component • Complete information (e.g. stack trace) for offline analysis by experts, with enough context information (e.g. arguments that caused the failure) • Enforce API contracts between your system and others • Check incoming arguments/data from callers (users, clients) • Check results obtained from services (Java servers, FESA device, DB, etc) • Fail early with clear error messages (don’t wait until a NullPointerException) • Good examples: Sequencer (tasks), PostMortem Analysis Framework (modules) • Actions: • Implement above recommendations • Code reviews of important API contracts • Use Unit tests / testbed also to validate error handling and error messages • Complain (and file JIRA issues) about non-comprehensible error messages and stack traces
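As a rough illustration of the "check incoming arguments and fail early with clear error messages" recommendations above, here is a minimal Java sketch; SettingsService, DeviceGateway and the error texts are hypothetical placeholders, not part of any existing CO API.

```java
// Hypothetical sketch of the "check arguments and fail early" recommendation.
// SettingsService and DeviceGateway are illustrative only, not real CO classes.
import java.util.Objects;

public class SettingsService {

    private final DeviceGateway gateway;

    public SettingsService(DeviceGateway gateway) {
        this.gateway = Objects.requireNonNull(gateway, "gateway must not be null");
    }

    /** Sets a parameter value, enforcing the API contract on entry and exit. */
    public void setParameter(String device, String property, double value) {
        // Check incoming arguments from the caller (user, client application).
        if (device == null || device.isEmpty()) {
            throw new IllegalArgumentException("device name is null/empty - cannot SET " + property);
        }
        if (Double.isNaN(value)) {
            throw new IllegalArgumentException(
                "Refusing to SET " + device + "/" + property + ": value is NaN");
        }
        try {
            gateway.set(device, property, value);
        } catch (RuntimeException e) {
            // Re-throw with enough context (device, property, value) for OP and experts.
            throw new RuntimeException("SET failed on " + device + "/" + property
                + " with value " + value + " - underlying system said: " + e.getMessage(), e);
        }
    }

    /** Minimal downstream interface, standing in for a device/server access layer. */
    public interface DeviceGateway {
        void set(String device, String property, double value);
    }
}
```

The short message at the top of the chain tells OP which component refused the request; the wrapped exception preserves the full context for offline expert analysis.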
DiaMon – Diagnostics and Monitoring tool for OP • Tool made for OP to carry out first-line diagnostics • Monitoring of metrics: OS-level, JMX, CMX • Comparison of metrics with configurable thresholds => red/yellow/green • Monitoring of processes defined in transfer.ref: Green if process is running, red if not • Corrective actions: restart processes, reboot computers • Misc Diagnostics: e.g. ping computers, login to console
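For illustration, a minimal sketch (not DiaMon code) of how a single metric can be mapped to the green/yellow/red states from two configurable thresholds; the class name and example threshold values are assumptions.

```java
// Minimal sketch of mapping a metric to a green/yellow/red state from two thresholds.
public class ThresholdCheck {

    public enum State { GREEN, YELLOW, RED }

    private final double warningThreshold;
    private final double errorThreshold;

    public ThresholdCheck(double warningThreshold, double errorThreshold) {
        this.warningThreshold = warningThreshold;
        this.errorThreshold = errorThreshold;
    }

    /** Classifies a metric value (e.g. CPU load, heap usage) against the thresholds. */
    public State evaluate(double metricValue) {
        if (metricValue >= errorThreshold) {
            return State.RED;
        }
        if (metricValue >= warningThreshold) {
            return State.YELLOW;
        }
        return State.GREEN;
    }

    public static void main(String[] args) {
        ThresholdCheck cpuLoad = new ThresholdCheck(0.7, 0.9);
        System.out.println(cpuLoad.evaluate(0.95)); // RED
        System.out.println(cpuLoad.evaluate(0.75)); // YELLOW
        System.out.println(cpuLoad.evaluate(0.40)); // GREEN
    }
}
```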
DiaMon Improvements/Next Steps • More flexibility for thresholds • Make it easier to adjust thresholds (why not from the DiaMon GUI?) • Use dynamic thresholds in some cases, based on past values(?) • Increase capability: from process monitoring to service monitoring • Problem now: a process runs (is green in DiaMon) but doesn’t provide the expected service (should be red) • Actions: • Review threshold management • System responsibles should provide sanity checks (c.f. later) • Framework providers and system providers should expose JMX/CMX metrics and thresholds
Right-click on alarms shows details: responsible, link to documentation, etc.
LASER Alarm System • Alarms • Inform OP of problems that need human intervention • Can be raised by systems (e.g. FESA devices, timing, Accelerator Logging, SIS, …) • Can show DiaMon metrics that exceed Error Threshold (if configured to generate alarm) • Assessment • Very effective mechanism to inform OP about problems • Include responsible(s) to call and link to documentation of error • OP must invest time to look after alarms • Risk of too many alarms (screen full, cannot see all/new alarms) • Improvements/Actions • LASER 2 will introduce a procedure whereby OP must approve alarms • Effort needed to clean up alarm configurations (responsible + documentation) • Mode-dependent alarms exist already but must be configured • Sanity check and procedure needed for OP to assess that LASER itself works fine
“Jvideo” for OP to see if timing works. Alarm in LASER if no timing on Front-Ends.
CERN Accelerator Logging Service – Architecture Overview [architecture diagram: Data Providers → PL/SQL API → Data Persistence → Extraction API → Data Consumers]
Key figures from the diagram:
• Data Consumers: >500 registered users, >100 custom applications, >3 million extraction requests per day
• Long-term persistence (>20 years filtered data): ~850’000 signals, ~300 data loading processes, ~4 billion records per day, ~140 GB per day, 50 TB per year stored
• Short-term persistence (7 days raw data, PL/SQL filtered data transfer to long-term store): ~250’000 signals, ~16 data loading processes, ~5.4 billion records per day, ~275 GB per day, 90 TB per year throughput
General Case of Failures in Monitoring-based Systems • Examples: • FixedDisplay not updated, device not sending updates, PM data missing, data missing for SIS permit, … • Any GUI not updating! • Theoretically the diagnostic algorithm is simple: • Start with the system where the fault was detected • Find all other involved systems • Check each and find out which one is failing • Reality is challenging: • Need to know or find all involved systems • (LSA/InCA server, DB, Logging processes, Proxies, JMS Brokers, devices, FECs, Timing system, OS resources, …) • Difficult to diagnose if intermediate systems work
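As an illustration of the "theoretically simple" algorithm above, the sketch below walks the dependency graph breadth-first from the system where the fault was detected and collects every involved system whose check fails; SystemNode and check() are hypothetical placeholders for real per-system diagnostics, not an existing API.

```java
// Hedged sketch of the diagnostic walk: start from the system where the fault was
// seen, visit its dependencies, and collect those whose sanity check fails.
import java.util.*;

public class DependencyDiagnosis {

    /** Illustrative abstraction of a system and its dependencies (assumption). */
    public interface SystemNode {
        String name();
        boolean check();                 // sanity check: true = OK
        List<SystemNode> dependencies(); // other systems this one relies on
    }

    /** Breadth-first walk over the dependency graph, returning the failing systems. */
    public static List<String> findFailing(SystemNode start) {
        List<String> failing = new ArrayList<>();
        Set<SystemNode> visited = new HashSet<>();
        Deque<SystemNode> toVisit = new ArrayDeque<>();
        toVisit.add(start);
        while (!toVisit.isEmpty()) {
            SystemNode current = toVisit.poll();
            if (!visited.add(current)) {
                continue; // already checked
            }
            if (!current.check()) {
                failing.add(current.name());
            }
            toVisit.addAll(current.dependencies());
        }
        return failing;
    }
}
```

The hard part in reality is not the traversal itself but building the dependency graph (which is what the ACET viewer below aims at) and having a meaningful check() for each system (the sanity checks discussed later).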
New ACET Runtime Dependency Viewer (alpha). Still a lot of work… More in a TC early next year.
Checking individual systems if they are OK or not • Idea: Do the same checks as system expert, but automate and require less domain knowledge • System-specific sanity checks => OK/WARN/ERROR • Functional test: Execute typical service request and check result • Checking relevant metrics • Checking configuration • Examples of functional sanity checks (from headless clients) • LSA: load test settings, modify (trim) them, save to DB, revert • FESA: trigger test event, execute RT action, notify CMW client • JMS broker: send message through broker back to you, check delay • RBAC: login as test user and check reply (e.g. token + expected roles) • Database: check access and response time • http://abwww: periodically download file and check contents and delay • …
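As one concrete illustration of such a functional sanity check (in the spirit of the abwww example above), the sketch below downloads a page, verifies its content and the response time, and maps the result to OK/WARN/ERROR; the URL, the expected content marker and the timing limits are assumptions, not an existing CO check.

```java
// Illustrative HTTP sanity check: fetch a known page, verify content and delay,
// and map the result to OK/WARN/ERROR. URL, marker and limits are placeholders.
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HttpSanityCheck {

    public enum Result { OK, WARN, ERROR }

    public static Result check(String url, String expectedContent, long warnMillis) {
        long start = System.currentTimeMillis();
        try {
            HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);
            try (InputStream in = connection.getInputStream()) {
                String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                if (!body.contains(expectedContent)) {
                    return Result.ERROR; // wrong content: the service is not healthy
                }
            }
        } catch (IOException e) {
            return Result.ERROR;         // unreachable or failed request
        }
        long elapsed = System.currentTimeMillis() - start;
        return elapsed > warnMillis ? Result.WARN : Result.OK; // slow but working
    }

    public static void main(String[] args) {
        // Hypothetical usage; the URL and marker string are placeholders.
        System.out.println(check("http://example.org/status.txt", "status=up", 2000));
    }
}
```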
Checking individual systems if they are OK or not • Examples of relevant metrics • Liveness counters (e.g. LSA Client requests, Accelerator Logging data flow, CMW updates to clients, FESA RT actions, Timing events, …) • Error indicators: error/exception rate; number of packets lost; too frequent Java Garbage collection • Checking configuration • Comparison between CCDB configuration and reality • Version Number of FESA SW, OS, Drivers, Firmware, HW(?), … • Based on configuration feedback from FECs to CCDB
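To make the "liveness counter" idea concrete, here is a rough sketch of exposing such a counter over JMX via the standard platform MBean server, so that a monitoring tool (DiaMon, jconsole) can read it remotely; CMX plays the same role for non-Java processes. The object name and counter semantics are illustrative assumptions, not an existing CO interface.

```java
// Rough sketch of a liveness counter exposed over JMX (assumed naming, not CO code).

// --- file LivenessCounterMBean.java --- (JMX naming convention: <Class>MBean)
public interface LivenessCounterMBean {
    long getUpdatesSent();
}

// --- file LivenessCounter.java ---
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class LivenessCounter implements LivenessCounterMBean {

    private final AtomicLong updatesSent = new AtomicLong();

    /** Called by the application each time it publishes an update to its clients. */
    public void recordUpdate() {
        updatesSent.incrementAndGet();
    }

    @Override
    public long getUpdatesSent() {
        return updatesSent.get();
    }

    /** Registers the counter under a JMX object name readable by monitoring tools. */
    public static LivenessCounter register(String name) throws Exception {
        LivenessCounter counter = new LivenessCounter();
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(counter, new ObjectName("demo.diagnostics:type=" + name));
        return counter;
    }
}
```

A stalled counter (no increase between two DiaMon polls) is then a much stronger red flag than "the process is still running".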
Checking individual systems – status and plans • Current situation • Sanity checks generally accepted as good idea, but very few teams have implemented them • Pioneers in this area: Accelerator Logging team • Ideas and promises from other teams (FESA, CMW liveness counters) • Infrastructure is ready to be used (DiaMon, JMX, CMX, config feedback) • Actions • Someone has to drive and coordinate this effort (me?) • System teams have to find time to implement sanity checks • Ultimate goal: sanity checks integrated in DiaMon (green = green, red = red)
Tools for Advanced Operators or CO Special Shift Workers Easy browsing and correlation of all our logfiles “Taming” the expert diagnostic tools
Easy browsing and correlation of Logfiles • Idea: • If a system has problems, there should be errors in the log files • If not, there shouldn’t be any • Current situation: our logfiles need to be improved • Expert information, often difficult to understand • Logfiles often contain spurious “normal” error messages • Logfile (“tracing”) information from Linux Syslog and FESA/CMW servers is centrally collected on cs-ccr-tracing. Java not yet. • Splunk allows for advanced searching and correlation • Improvements/Actions • Introduce logging culture (e.g. don’t tolerate spurious “normal errors”) • Reduce log volume by filtering out endlessly repeated messages • Add Java logging to Splunk • Promote Splunk among CO experts (link from CCM, training)
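As a hedged example of "filtering out endlessly repeated messages", the sketch below is a java.util.logging Filter that suppresses a message once it has already been logged within a configurable time window; it is not an existing CO component, just one possible approach.

```java
// Illustrative (assumed) log filter that drops repeats of the same message
// within a time window, to keep centrally collected logs readable.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Filter;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class RepeatedMessageFilter implements Filter {

    private final long windowMillis;
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    public RepeatedMessageFilter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    @Override
    public boolean isLoggable(LogRecord record) {
        String message = record.getMessage();
        if (message == null) {
            return true; // nothing to deduplicate on
        }
        long now = System.currentTimeMillis();
        Long previous = lastSeen.put(message, now);
        // Log the record only if the same message has not been seen recently.
        return previous == null || now - previous > windowMillis;
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("demo");
        log.setFilter(new RepeatedMessageFilter(60_000)); // suppress repeats for 1 minute
        for (int i = 0; i < 1000; i++) {
            log.warning("Connection to device XYZ lost"); // printed only once per minute
        }
    }
}
```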
“Taming” the Expert Diagnostic Tools • Every team has developed good tools and uses them • CMW Admin tool (new functionality in RDA3) • Lemon system monitoring tool • Timing diagnostic tools (dtm_diag, mtg_diag, test_tgm, ctr_test, …) • FESA navigator + expert diagnostic tool • Java Mission Control (JConsole on steroids) • Driver test suites • WorldFIP diagnostic tools • … • Improvements/Actions: • Make a user-friendly subset of the functionality available for 1st line diagnostics • Important to include feedback from CO special shift workers • Training?
Acceptance tests before accepting to support systems • Insist that systems are ready for 1st line support • 1st line diagnostic tools and documentation must be available • Sanity checks and metrics + thresholds must be configured in DiaMon • Valid responsibles (egroups) must be registered for computers, controls processes, LASER alarms, etc • Restarting (wreboot/reboot) must be enabled for OP only if not risky • If restarting is available it must improve the situation • Good error messages in GUIs and/or in logfiles • No spurious error messages (e.g. “normal errors”) • Formal acceptance procedure to ensure the above criteria are met? • Systems that don’t meet criteria won’t get 1st line support (?) • No diagnostic attempts will be made • Experts will be called directly at any time
Summary • The new exploitation model requires better 1st line diagnostic tools • A lot of good expert diagnostic tools are available • But only few diagnostic tools are suitable for OP • A common effort from the different CO projects is necessary • 1st line support coverage only for systems that “deserve” it