
Strategy and Tools to support the exploitation model OP-Centric View

Strategy and Tools to support the exploitation model OP-Centric View. Vito Baggiolini with input from: M. Arruat, J-C. Bau, L. Burdzanowski, M. Buttner, S. Deghaye, F. Ehm, R. Gorbonosov, G. Kruk, J. Lauener, M. Pace, and others. Strategic Goals for 2014.

Presentation Transcript


  1. Strategy and Tools to support the exploitation model: OP-Centric View • Vito Baggiolini with input from: M. Arruat, J-C. Bau, L. Burdzanowski, M. Buttner, S. Deghaye, F. Ehm, R. Gorbonosov, G. Kruk, J. Lauener, M. Pace, and others

  2. Strategic Goals for 2014 • Enable/motivate OP to do first line diagnosis and find the right expert to call • Get CO project teams to contribute to first line diagnostic tools • Prevent failures => less need for 1st line diagnostics (not covered in this presentation)

  3. Outline • Overview of basic diagnostic use cases • Tools available and possible improvements for each case • Tools for more advanced diagnosis • Further aspects and a summary

  4. Different 1st Line Diagnostic Tools for different Cases • Case 1: Failures in request/reply systems • Operator clicks on a button, expects a result but gets an error or a timeout • E.g. InCA WorkingSet, LSA Applications, Sequencer, OASIS, … • Case 2: Alarms in LASER, Red entries in DiaMon • Case 3: Failures in monitoring-based systems • E.g. Logging system, PostMortem Analysis, Software Interlock, timing, FixedDisplays, … • Often background processes without GUI • Failure not immediately visible to OP • OP discovers error later in some place not directly related to fault

  5. Case 1: Failures in Request/Reply systems

  6. Case 1: Request Failures seen in GUIs • Goal: OP should not just call the person responsible for the GUI, but identify the system that caused the error and call the responsible expert • Some integrated diagnostic tools are available (or promised) • WorkingSet/Knob diagnostics • InCA Server diagnostics (promised for April) • “Why-no-permit” functionality in Software Interlock System • JAPC Monitoring Diagnostic GUI • …

  7. InCA Working Set - Diagnostic tools

  8. InCA Working Set - Diagnostic toolbox

  9. InCA Working Set – JAPC/RDA diagnostics

  10. Case 1: Request Failures seen in GUIs • OP should not just call the person responsible for the GUI • But identify the system that caused the error and call the corresponding expert • Integrated diagnostic tools are available (or promised) • WorkingSet/Knob diagnostics • InCA Server diagnostics • “Why” functionality in Software Interlock System • JAPC Monitoring Diagnostic GUI • … • Good error messages can be very useful for OP (cf. e-logbook) • Bad ones are not even useful for the expert

  11. Examples of Good (because clear) error messages • Exception: [cern.japc.ParameterException] "Server 'BQBBQLHC.cfv-ua47-bqpll' is down or unreachable" • Problem loading the settings for the TCTs: TCTVA.4L2.B1: java.lang.Exception: Rejected, first point of profile different from current position! • asynchronous operation on LHC.OFSU/Orbit@LHC.USER.LHC failed cern.japc.ParameterException: Server overloaded and not able to handle all requests, will loose updates.

  12. One long but good (because complete) error message • LHC OP logbook [Tuesday 04-Sep-2012 Morning] 2012/09/04 13:45:57 DRIVE ABORT GAP CLEANING SETTING [ERROR] DriveException [Failed to drive some parameters AbortGapCleanVB1/CleaningStrength: cern.cmw.AccessDenied: RBAC Exception: Authorization access denied because there is no matching access rule. Transaction: [SET on AbortGapCleanVB1#CleaningStrength] Token [serial=0x71eb758d; user=lhcop; authTime=2012.09.04 08:04:22; endTime=2012.09.04 16:03:22; application=LHC Sequencer GUI; location=CCC-LHC; locationAddress=172.18.200.124; roles=[DIAMON-Operator, SeqLhcOperator, SeqHwcOperator, SeqSpsOperator, LHC-Operator, OP-Daemon, SPS-Operator]] RDA Info [clientHost=cs-ccr-lsa1.cern.ch; remoteUser=copera; serverName=ALLAbortGapClean.cfv-sr4-adtvb1] RBAC Setup [policy=strict, mode="OPERATIONAL"]

  13. RBAC Error message (cont) • RBAC Access Map excerpt for [SET on AbortGapCleanVB1#CleaningStrength]:
      Class             Property          Device            Role           Application  Location             Mode             Operation
      ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-Expert      *            RF-864-Control-Room  *                SET
      ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-Expert      *            RF-SR4-Control-Room  *                SET
      ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-Expert      *            *                    NON-OPERATIONAL  SET
      ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  *              *            RF-FrontEnds         *                SET
      ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  *              *            RF-Servers           *                SET
      ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-LHC-Piquet  *            *                    *                SET
      ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-Expert      *            CCC-LHC              *                SET
      • Clearly a team that got too many spurious support requests ;-)

  14. Ambiguous Error Message • LHC OP [Tuesday 04-Sep-2012 Afternoon] 20:35:01 -
      2012-09-04 20:35:01,171 [JapcExecutor-thread-31] ERROR JapcDispatcherImpl +==> Exception occured for parameter id [HC.BLM.SR5.R_ENERGY_ACQ] name [HC.BLM.SR5.R/SISMonitoring]
      asynchronous operation on HC.BLM.SR5.R/SISMonitoring@ failed
      cern.japc.ParameterException: Notification failure.
        at cern.japc.ext.cmwrda.DataTranslator.createParameterException(DataTranslator.java:269)
        at cern.japc.ext.cmwrda.RDAReplyHandler.handleError(RDAReplyHandler.java:115)
        at cern.cmw.rda.client.ReplyAdapter.handleException(ReplyAdapter.java:88)
        at cern.cmw.rda.client.ServerConnection.handleReplyPacket(ServerConnection.java:2622)
        at cern.cmw.corba.rda.ReplyHandlerPOA._invoke(ReplyHandlerPOA.java:66)
        at org.jacorb.poa.RequestProcessor.invokeOperation(RequestProcessor.java:297)
        at org.jacorb.poa.RequestProcessor.process(RequestProcessor.java:591)
        at org.jacorb.poa.RequestProcessor.run(RequestProcessor.java:734)
      Caused by: cern.cmw.IOError: [FesaInternalError - -1] Notification failure.
        at cern.cmw.rda.client.ServerConnection.insertValue(ServerConnection.java:2768)
        at cern.cmw.rda.client.ServerConnection.handleReplyPacket(ServerConnection.java:2610)
        ... 4 more
      • FESA or CMW problem?

  15. Improvements for Request/Reply Failures • Provide good error messages, especially on system boundaries • Short information understandable by OP to pinpoint the faulty component • Complete information (e.g. stack trace) for offline analysis by experts, with enough context information (e.g. arguments that caused the failure) • Enforce API contracts between your system and others • Check incoming arguments/data from callers (users, clients) • Check results obtained from services (Java servers, FESA device, DB, etc.) • Fail early with clear error messages (don’t wait until a NullPointerException) • Good examples: Sequencer (tasks), PostMortem Analysis Framework (modules) • Actions: • Implement above recommendations • Code reviews of important API contracts • Use unit tests / testbed also to validate error handling and error messages • Complain (and file JIRA issues) about incomprehensible error messages and stack traces
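The "fail early" recommendation can be illustrated with a minimal sketch. Everything in it is hypothetical and not taken from any CO system: the `ContractError` class, the `check_trim_request` function and the idea of a `limits` table are assumptions chosen only to show arguments being checked at the boundary, with a short message that names the offending parameter and value.

```python
class ContractError(ValueError):
    """Raised when a caller violates the (hypothetical) API contract."""


def check_trim_request(parameter, value, limits):
    """Validate a request at the system boundary, before doing any work.

    `limits` maps each known parameter name to its (low, high) range.
    Returns the validated (parameter, value) pair or raises ContractError.
    """
    if not isinstance(parameter, str) or not parameter:
        raise ContractError("Request rejected: parameter name is missing")
    if parameter not in limits:
        raise ContractError(
            f"Request rejected: unknown parameter '{parameter}'"
        )
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ContractError(
            f"Request rejected: value for '{parameter}' must be a number, "
            f"got {value!r}"
        )
    low, high = limits[parameter]
    if not low <= value <= high:
        raise ContractError(
            f"Request rejected: value {value} for '{parameter}' "
            f"is outside the allowed range [{low}, {high}]"
        )
    return parameter, float(value)
```

The point of the sketch is that a missing value is reported as "must be a number, got None" at the entry point, rather than surfacing later as an unrelated null-pointer style failure deep inside the server.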

  16. Case 2: Red entries in DiaMon, Alarms in LASER

  17. DiaMon – Diagnostics and Monitoring tool for OP • Tool made for OP to carry out first-line diagnostics • Monitoring of metrics: OS-level, JMX, CMX • Comparison of metrics with configurable thresholds => red/yellow/green • Monitoring of processes defined in transfer.ref: Green if process is running, red if not • Corrective actions: restart processes, reboot computers • Misc Diagnostics: e.g. ping computers, login to console
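The threshold comparison and the process check described on this slide can be sketched as follows. This is not DiaMon code: the function names and the assumption that a higher metric value is worse are illustrative only.

```python
def classify_metric(value, warn_threshold, error_threshold):
    """Map a metric value to a colour, assuming higher values are worse."""
    if warn_threshold > error_threshold:
        raise ValueError("warn_threshold must not exceed error_threshold")
    if value >= error_threshold:
        return "red"
    if value >= warn_threshold:
        return "yellow"
    return "green"


def classify_process(is_running):
    """A monitored process is green if it is running, red if it is not."""
    return "green" if is_running else "red"
```

For example, with hypothetical thresholds of 80 and 95 percent for disk usage, a value of 85 would be shown as yellow.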

  18. Right-click => popup to restart process

  19. DiaMon Improvements/Next steps • More flexibility for thresholds • Make it easier to adjust thresholds (why not from the DiaMon GUI?) • Use dynamic thresholds in some cases, based on past values (?) • Increase capability: from process monitoring to service monitoring • Problem now: a process runs (is green in DiaMon) but doesn’t provide the expected service (should be red) • Actions: • Review threshold management • System responsibles should provide sanity checks (cf. later) • Framework providers and system providers should expose JMX/CMX metrics and thresholds
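The slide leaves open how a dynamic threshold "based on past values" would be computed. One common option, shown here purely as an assumption, is the mean of recent samples plus a multiple of their standard deviation; the function name, the factor `k` and the `floor` parameter are hypothetical.

```python
import statistics


def dynamic_threshold(history, k=3.0, floor=0.0):
    """Derive a threshold from past metric values: mean + k * std deviation.

    `floor` keeps the threshold from dropping below a fixed minimum when
    the metric has been very quiet.
    """
    if len(history) < 2:
        raise ValueError("need at least two past values")
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    return max(floor, mean + k * spread)
```

A metric that has been perfectly constant gets a threshold equal to its mean, so any practical use would probably combine this with a sensible `floor` or a minimum margin.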

  20. LASER Console as used in TI

  21. Right-click on alarms shows details: responsible, link to documentation, etc.

  22. LASER Alarm System • Alarms • Inform OP of problems that need human intervention • Can be raised by systems (e.g. FESA devices, timing, Accelerator Logging, SIS, …) • Can show DiaMon metrics that exceed Error Threshold (if configured to generate alarm) • Assessment • Very effective mechanism to inform OP about problems • Include responsible(s) to call and link to documentation of error • OP must invest time to look after alarms • Risk of too many alarms (screen full, cannot see all/new alarms) • Improvements/Actions • LASER 2 will introduce a procedure whereby OP must approve alarms • Effort needed to clean up alarm configurations (responsible + documentation) • Mode-dependent alarms exist already but must be configured • Sanity check and procedure needed for OP to assess that LASER itself works fine

  23. Case 3: Failures in monitoring-based systems

  24. Timing System Overview

  25. “Jvideo” for OP to see if timing works • Alarm in LASER if no timing on Front-Ends

  26. CERN Accelerator Logging Service - Architecture Overview • [Architecture diagram; labels: Data Providers, PL/SQL API, Data Persistence, PL/SQL filtered data transfer, Extraction API, Data Consumers] • Data consumers: >500 registered users, >100 custom applications, >3 million extraction requests per day • Retention: >20 years filtered data, 7 days raw data • ~850’000 signals, ~300 data loading processes, ~4 billion records per day, ~140 GB per day, 50 TB per year stored • ~250’000 signals, ~16 data loading processes, ~5.4 billion records per day, ~275 GB per day, 90 TB per year throughput

  27. General Case of Failures in Monitoring-based Systems • Examples: • FixedDisplay not updated, device not sending updates, PM data missing, data missing for SIS permit, … • Any GUI not updating! • Theoretically the diagnostic algorithm is simple: • Start with the system where the fault was detected • Find all other involved systems • Check each and find out which one is failing • Reality is challenging: • Need to know or find all involved systems • (LSA/InCA server, DB, Logging processes, Proxies, JMS Brokers, devices, FECs, Timing system, OS resources, …) • Difficult to diagnose if intermediate systems work
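The three-step diagnostic algorithm above can be written down as a short sketch. The dependency map and the `is_ok` callback are hypothetical inputs; in practice, obtaining them is exactly the hard part the slide describes.

```python
from collections import deque


def involved_systems(start, depends_on):
    """List `start` and everything it depends on, in breadth-first order."""
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for dependency in depends_on.get(current, ()):
            if dependency not in seen:
                seen.add(dependency)
                order.append(dependency)
                queue.append(dependency)
    return order


def find_failing(start, depends_on, is_ok):
    """Check every involved system and return those that are not OK."""
    return [s for s in involved_systems(start, depends_on) if not is_ok(s)]
```

The `is_ok` callback stands in for a per-system sanity check. Without such checks, the traversal only tells the operator which systems to look at, not which one is broken.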

  28. New ACET Runtime Dependency Viewer (alpha) • Still a lot of work… • More in a TC early next year

  29. Checking individual systems if they are OK or not • Idea: Do the same checks as system expert, but automate and require less domain knowledge • System-specific sanity checks => OK/WARN/ERROR • Functional test: Execute typical service request and check result • Checking relevant metrics • Checking configuration • Examples of functional sanity checks (from headless clients) • LSA: load test settings, modify (trim) them, save to DB, revert • FESA: trigger test event, execute RT action, notify CMW client • JMS broker: send message through broker back to you, check delay • RBAC: login as test user and check reply (e.g. token + expected roles) • Database: check access and response time • http://abwww: periodically download file and check contents and delay • …
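A functional sanity check of the kind listed above (execute a typical request, check the result and the delay) could look roughly like this. It is a generic sketch, not tied to LSA, FESA, JMS or RBAC: `request` is any callable supplied by the system team, and the injectable `clock` parameter is an assumption added to make the check testable.

```python
import time


def sanity_check(request, expected, warn_after_s, clock=time.monotonic):
    """Run one functional test and return (status, message).

    status is "OK", "WARN" (correct but slow) or "ERROR".
    """
    start = clock()
    try:
        result = request()
    except Exception as exc:  # any failure of the service counts as ERROR
        return "ERROR", f"request failed: {exc}"
    delay = clock() - start
    if result != expected:
        return "ERROR", f"unexpected reply {result!r} (expected {expected!r})"
    if delay > warn_after_s:
        return "WARN", f"reply correct but slow ({delay:.3f} s)"
    return "OK", f"reply correct in {delay:.3f} s"
```

A headless client would run such checks periodically and publish the resulting status, which is what would let a monitoring tool show red for a process that is running but not providing its service.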

  30. Checking individual systems if they are OK or not • Examples of relevant metrics • Liveness counters (e.g. LSA Client requests, Accelerator Logging data flow, CMW updates to clients, FESA RT actions, Timing events, …) • Error indicators: error/exception rate; number of packets lost; too frequent Java Garbage collection • Checking configuration • Comparison between CCDB configuration and reality • Version Number of FESA SW, OS, Drivers, Firmware, HW(?), … • Based on configuration feedback from FECs to CCDB
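The two metric-based checks on this slide can be sketched in a few lines. Both functions are hypothetical: the rule that a counter going backwards means a restart (WARN) is an assumption, and the configuration dictionaries stand in for whatever CCDB and the FEC feedback actually provide.

```python
def liveness(previous_count, current_count):
    """Compare two samples of a liveness counter taken some time apart."""
    if current_count > previous_count:
        return "OK"      # the system did work since the last sample
    if current_count == previous_count:
        return "ERROR"   # counter stalled: process may run but does nothing
    return "WARN"        # counter went backwards: assumed restart


def config_mismatches(expected, actual):
    """Return {key: (expected, actual)} for every entry that differs."""
    return {
        key: (wanted, actual.get(key))
        for key, wanted in expected.items()
        if actual.get(key) != wanted
    }
```

An empty result from `config_mismatches` means the deployed versions match the configured ones; a missing entry on the deployed side is reported with `None` as its actual value.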

  31. Checking individual systems – status and plans • Current situation • Sanity checks generally accepted as good idea, but very few teams have implemented them • Pioneers in this area: Accelerator Logging team • Ideas and promises from other teams (FESA, CMW liveness counters) • Infrastructure is ready to be used (DiaMon, JMX, CMX, config feedback) • Actions • Someone has to drive and coordinate this effort (me?) • System teams have to find time to implement sanity checks • Ultimate goal: sanity checks integrated in DiaMon (green=green, red=red)

  32. Tools for Advanced Operators or CO Special Shift Workers • Easy browsing and correlation of all our logfiles • “Taming” the expert diagnostic tools

  33. Easy browsing and correlation of Logfiles • Idea: • If a system has problems, there should be errors in the log files • If not, there shouldn’t be any • Current situation: our logfiles need to be improved • Expert information, often difficult to understand • Logfiles often contain spurious “normal” error messages • Logfile (“tracing”) information from Linux Syslog and FESA/CMW servers is centrally collected on cs-ccr-tracing. Java not yet. • Splunk allows for advanced searching and correlation • Improvements/Actions • Introduce logging culture (e.g. don’t tolerate spurious “normal errors”) • Reduce log volume by filtering out endlessly repeated messages • Add Java logging to Splunk • Promote Splunk among CO experts (link from CCM, training)
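Filtering out endlessly repeated messages can be as simple as collapsing consecutive duplicates into one line with a count. The sketch below assumes the timestamp has already been stripped from each message, which real log lines would require first; the function name is hypothetical.

```python
from itertools import groupby


def collapse_repeats(messages):
    """Replace each run of identical consecutive messages by a single line."""
    collapsed = []
    for message, run in groupby(messages):
        count = sum(1 for _ in run)
        if count == 1:
            collapsed.append(message)
        else:
            collapsed.append(f"{message} (repeated {count} times)")
    return collapsed
```

Only consecutive repeats are merged, so a message that reappears after something else happened is still shown again, which preserves the order of events for correlation.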

  34. Splunk Log Search Window

  35. “Taming” the Expert Diagnostic Tools • Every team has developed good tools and uses them • CMW Admin tool (new functionality in RDA3) • Lemon system monitoring tool • Timing diagnostic tools (dtm_diag, mtg_diag, test_tgm, ctr_test, …) • FESA navigator + Expert diag tool • Java Mission Control (JConsole on steroids) • Driver test suites • WorldFIP diagnostic tools • … • Improvement/Actions: • Make a user-friendly subset of the functionality available for 1st line diagnostics • Important to include feedback from CO special shift workers • Training?

  36. Acceptance tests before accepting to support systems • Insist that systems are ready for 1st line support • 1st line diagnostic tools and documentation must be available • Sanity checks and metrics + thresholds must be configured in DiaMon • Valid responsibles (egroups) must be registered for computers, controls processes, LASER alarms, etc • Restarting (wreboot/reboot) must be enabled for OP only if not risky • If restarting is available it must improve the situation • Good error messages in GUIs and/or in logfiles • No spurious error messages (e.g. “normal errors”) • Formal acceptance procedure to ensure the above criteria are met? • Systems that don’t meet criteria won’t get 1st line support (?) • No diagnostic attempts will be made • Experts will be called directly at any time

  37. Summary • New exploitation model requires better 1st line diagnostic tools • A lot of good expert diagnostic tools are available • But only a few diagnostic tools are suitable for OP • A common effort from the different CO projects is necessary • 1st line support coverage only for systems that “deserve” it
