Advanced Event Management. Advanced Event Management Tivoli User Group November 2011 Nick Lansdowne. Agenda. Initial Growth of an Event Management Solution Advancing the solution: Event Enrichment Message Catalogue Automated Escalation. First Steps.
Advanced Event Management • Tivoli User Group, November 2011 • Nick Lansdowne
Agenda • Initial Growth of an Event Management Solution • Advancing the solution: • Event Enrichment • Message Catalogue • Automated Escalation
First Steps • Growth of Event Management Solutions • Generate useful events • Base details: Hostname, Severity, Message • Event Type: Problem/resolution • Correlation information: Resource details • Noise Reduction: • De-duplication • Automated Clearing • Correlation of resolution to problem events Out-of-the-box with Netcool/OMNIbus
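The noise-reduction steps on this slide (de-duplication, then clearing problem events when a matching resolution arrives) can be illustrated with a small Python model. This is only a sketch of the idea, not ObjectServer code: the `Event` and `EventList` classes, the identifier format and the numeric type and severity values are assumptions made for the example.

```python
from dataclasses import dataclass

PROBLEM, RESOLUTION = 1, 2   # assumed numeric codes for the two event types
CLEAR = 0                    # assumed severity value meaning "cleared"


@dataclass
class Event:
    node: str
    alert_key: str           # resource the event refers to
    severity: int
    summary: str
    type: int = PROBLEM
    tally: int = 1           # how many times this event has been seen

    @property
    def identifier(self) -> str:
        # Hypothetical de-duplication key: same node, resource and type.
        return f"{self.node}:{self.alert_key}:{self.type}"


class EventList:
    def __init__(self):
        self.events = {}     # identifier -> Event

    def insert(self, event):
        existing = self.events.get(event.identifier)
        if existing is None:
            self.events[event.identifier] = event
            return
        # De-duplication: count the repeat instead of adding a new row.
        existing.tally += 1
        existing.severity = event.severity
        existing.summary = event.summary

    def clear_resolved(self):
        # Correlate resolutions to problems on the same node and resource,
        # then mark every matched event as cleared.
        resolved = {(e.node, e.alert_key)
                    for e in self.events.values() if e.type == RESOLUTION}
        for e in self.events.values():
            if (e.node, e.alert_key) in resolved:
                e.severity = CLEAR
```

Two identical disk alerts followed by a "disk usage normal" resolution would leave one problem row with a tally of 2 and one resolution row, both cleared once `clear_resolved()` has run.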
Basic Architecture (diagram) • Operations Teams • Present Events: WebGUI Server • Process Events: Netcool/OMNIbus ObjectServer • Collect Management Information: ITM, Probes, ITCAM, ITNM
Operations Team Event Response • Context • Who and what is affected? • Interpret • What does the event mean? • What response is required? • Inform • Who should respond to the incident? • How are these questions answered? • Personal Knowledge • SharePoint – Spreadsheets – Helpdesk system • CCMDB – On-call Rota – Knowledge Base
Automate and Integrate • Context • Interpret • Inform
Operations Team Response • Context (Enrichment) • Who and what is affected? • Interpret (Message Catalogue) • What does the event mean? • What response is required? • Inform (Automated Escalation) • Who should respond to the incident?
Enrichment Who and what is affected?
Who and what is affected? • Affected Service/Solution • Support Team, rota & contact details • Details for System Owner • Details for affected Customer(s)
How to present information?
• Event Enrichment
  • Probe rules
    • Pros: Efficient, enriched as close to source
    • Cons: Static data, content
  • OMNIbus Automation
    • Pros: Self-contained solution
    • Cons: Static data, overhead on ObjectServer
  • Impact Policy
    • Pros: Dynamic data, ObjectServer overhead minimised
    • Cons: Impact infrastructure
• Drill-Through Web Page
  • Impact Operator View: Pros: Simple integration, dynamic data access
  • Custom Web Page: Cons: Proprietary development
Event Enrichment: Solution Basics
1. Event received by ObjectServer and triggers an Impact Policy
2. The Impact Policy identifies the node-specific data in the CCMDB
3. Additional data written to event in ObjectServer
Netcool/Impact Implementation • Data Source definition: • Connection to the CCMDB database • Data Type definition: • Identifies the table or view within the RDBMS • Policy definition: • Describes how events are processed, and what data is appended to those events • Event Reader definition: • Identifies which events are to be processed and which policies will be applied to those events
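The enrichment flow (select events, look up the node in the CCMDB, write extra fields back to the event) can be sketched in Python. This is an illustration of the logic only, not an Impact policy: the `CCMDB` dictionary stands in for the Data Type, and the field names (`Service`, `SupportTeam`, `Owner`, `Customer`, `Enriched`) and sample values are hypothetical.

```python
# Hypothetical stand-in for the CCMDB table or view behind the Data Type.
CCMDB = {
    "web01": {
        "Service": "Online Ordering",
        "SupportTeam": "Web Ops",
        "Owner": "owner@example.com",
        "Customer": "Retail",
    },
}


def needs_enrichment(event):
    # Plays the role of the Event Reader filter: pick events not yet processed.
    return event.get("Enriched", 0) == 0


def enrich(event, cmdb=CCMDB):
    # Plays the role of the policy: look up the node and append its details.
    enriched = dict(event)
    record = cmdb.get(event["Node"])
    if record is not None:
        enriched.update(record)
    # Flag the event so it is not selected again, even if no record was found.
    enriched["Enriched"] = 1
    return enriched
```

Marking the event as processed even when the lookup finds nothing is a design choice in this sketch; it keeps unknown nodes from being re-read on every polling cycle.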
Operator View: Solution Basics
1. Event received by ObjectServer and displayed in AEL
2. Operator initiates a custom tool that launches an Impact Operator View
3. Operator View queries CCMDB for data to display
Implementation • Data Source definition: • Connection to the CCMDB database • Data Type definition: • Identifies the table or view within the RDBMS • Operator View: • Describes what additional data is displayed along with the event details • WebGUI Tool: • Alerts tool to launch into Impact Operator View
WebGUI Tool http://impact511:9080/opview/displays/NCICLUSTER-NodeDetails.html?Node={@Node}&Severity={@Severity}&Summary={@Summary}&node={@Node}&Location={@Location}
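To make the substitution in the tool URL concrete, the following Python sketch builds the same query string from an event's fields. The base URL and parameter names are taken from the slide; the function name and the sample event are made up for illustration, and the sketch does not claim to reproduce how WebGUI itself expands the `{@Field}` placeholders.

```python
from urllib.parse import urlencode

# Base URL as shown on the slide.
BASE_URL = "http://impact511:9080/opview/displays/NCICLUSTER-NodeDetails.html"


def operator_view_url(event):
    # Same parameters, in the same order, as the tool definition on the slide
    # (including both "Node" and lower-case "node").
    params = [
        ("Node", event["Node"]),
        ("Severity", event["Severity"]),
        ("Summary", event["Summary"]),
        ("node", event["Node"]),
        ("Location", event["Location"]),
    ]
    # urlencode escapes spaces and reserved characters in the field values.
    return BASE_URL + "?" + urlencode(params)
```

Free-text fields such as Summary often contain spaces, slashes or ampersands, which is why the values are URL-encoded here rather than concatenated directly.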
Message Catalogue What does the event mean? What response is required?
What does the event mean? • Details: • Description • Source • Impact • Required Action • Owner • Data Access: • Enrichment • WebGUI Tool • Data Source: • File system • Database • SharePoint • Wiki based solution
Orb Data Message Catalogue • Orb Data solution: • Open source wiki engine (www.dokuwiki.org): • Benefits • All data is stored in plain text files (no database) • Centralised data – removes risk of obsolete documents and no distribution of document revisions • Pages can be quickly and easily created and updated • Page templates can be used to speed up page creation and promote standards • Supports images, PDFs, Word documents etc to supplement page content • Search capability to quickly find specific and related pages • Page revisions tracked and previous versions stored
Typical Implementation • URL Launched from WebGUI Tool: • URL derived from alert fields, for example: • AlertKey, Node, Identifier and populated via probe rule • Alternatively use Impact policy to reference external data source
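As a rough illustration of deriving the catalogue URL from alert fields, the Python sketch below maps an AlertKey to a DokuWiki page ID. The wiki host, the `catalogue` namespace and the choice of AlertKey as the only input are assumptions for the example; in the solution described here the value would be populated by a probe rule or an Impact policy.

```python
import re

# Hypothetical wiki location; doku.php?id=namespace:page is DokuWiki's
# basic page-addressing form.
WIKI_BASE = "http://wiki.example.com/doku.php"


def page_id(alert_key, namespace="catalogue"):
    # DokuWiki page IDs are lower-case; collapse anything else to underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", alert_key.lower()).strip("_")
    return f"{namespace}:{slug}"


def catalogue_url(event):
    # One catalogue page per distinct AlertKey in this sketch.
    return f"{WIKI_BASE}?id={page_id(event['AlertKey'])}"
```

Keying the page on a stable field such as AlertKey, rather than on free text like Summary, means every occurrence of the same condition lands on the same catalogue page.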
Extending the Message Catalogue • Use of HTML based pages: • capability to embed additional information • For example, email links • Embedded Operator Views • Ability to act on contained data
Automated Escalation Who should respond to the incident?
Who should respond to the incident? • Automated Escalation: • Which events? • How? • Visual: Event Flash, Increment severity • External: email, SMS, page • When? • Rota hours • Who to? • Contact details for support team/on-call engineer • Requires: • Repository for Rota & Contact details • Automations
How to escalate?
• OMNIbus Automation
  • Repository: Custom ObjectServer database
  • Automation: Triggers, Procedures
  • Pros: Self-contained solution
  • Cons: Duplicated data, overhead on ObjectServer
• Impact Policy
  • Repository: Integration to existing repositories
  • Automation: Impact Policies
  • Pros: No duplication of data, DSA integration, ObjectServer overhead minimised
  • Cons: Impact infrastructure
Escalation: Solution Basics
1. Event received by ObjectServer and triggers an Impact Policy
2. The Impact Policy retrieves escalation details from CCMDB & Helpdesk system
3. Email sent to on-call engineer
4. Update event to indicate escalation
Impact Implementation • Data Source definition: • Connection to the CCMDB & Helpdesk databases • Data Type & Item definition: • Identifies and links the tables/views for the required external data • Policy definition: • Describes how events are processed, the data correlation between events and repositories and initiates the escalation • Event Reader definition: • Identifies which events are to be processed and which policies will be applied to those events
Example raw data (figure): Support Team, Support Rota
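The correlation between an event, the support team and the rota can be sketched in Python as a two-step lookup. The table contents, column layout and shift times below are invented for illustration; the real data would come from the CCMDB and Helpdesk repositories.

```python
from datetime import datetime, time

# Hypothetical "Support Team" data: node -> owning team.
SUPPORT_TEAM = {"web01": "Web Ops"}

# Hypothetical "Support Rota" data: team, shift start, shift end, contact.
SUPPORT_ROTA = [
    ("Web Ops", time(8, 0), time(18, 0), "webops-day@example.com"),
    ("Web Ops", time(18, 0), time(8, 0), "webops-oncall@example.com"),
]


def in_shift(now, start, end):
    if start <= end:
        return start <= now < end
    # Shift wraps past midnight (e.g. 18:00 to 08:00).
    return now >= start or now < end


def on_call_contact(node, when):
    # Step 1: which team supports this node?
    team = SUPPORT_TEAM.get(node)
    # Step 2: who on that team is on shift at the event time?
    for rota_team, start, end, contact in SUPPORT_ROTA:
        if rota_team == team and in_shift(when.time(), start, end):
            return contact
    return None
```

Returning `None` for an unknown node or an uncovered time slot lets the caller decide on a fallback, such as a default duty manager address.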
Policy: Sending an email • Configure the EmailSender service • SMTP Server • Sending email address • Call the sendEmail function: • Including: Target email address, Subject, Message
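For readers unfamiliar with the pieces involved, the Python sketch below mirrors the same inputs: an SMTP server and sending address (the EmailSender configuration) plus a target address, subject and message. It uses Python's standard library rather than the Impact email function, and the server name, addresses and subject format are placeholders.

```python
import smtplib
from email.message import EmailMessage

# Placeholder values for the two service settings named on the slide.
SMTP_SERVER = "smtp.example.com"
SENDER_ADDRESS = "netcool@example.com"


def build_escalation_email(event, to_address):
    # Target address, subject and message body, as listed on the slide.
    msg = EmailMessage()
    msg["From"] = SENDER_ADDRESS
    msg["To"] = to_address
    msg["Subject"] = f"[Severity {event['Severity']}] {event['Node']}: {event['Summary']}"
    msg.set_content(
        f"Node: {event['Node']}\n"
        f"Severity: {event['Severity']}\n"
        f"Summary: {event['Summary']}\n"
    )
    return msg


def send(msg, server=SMTP_SERVER):
    # Hands the message to the SMTP server; requires a reachable server.
    with smtplib.SMTP(server) as smtp:
        smtp.send_message(msg)
```

Building the message separately from sending it makes the content easy to check without an SMTP server available.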
Extending the Solution • Closed loop escalation: • Feedback from escalation • Mechanism for feedback? • Standard: Event Acknowledgement from WebGUI/Native Desktop • External: Email, SMS • External Solution • Derdack message master Enterprise Alert 2011
Derdack Enterprise Alert 2011 • Automates escalation process: • Evaluation of event • Automated search for a responsible person or group • Communication and submission of escalation messages • Processing of delivery notification • Processing of responses • Communication back to the initiating system • Closed Loop Escalation via: • Email, SMS, Automated VoIP Calls, Smartphone Applications • Out-of-the-box Integrations: • ITM 6.2+, HP Operations Manager, SCOM • Custom Integration: • SOAP, HTTP (may be used for Netcool integration)
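The closed-loop idea (escalate, wait for a response, feed the acknowledgement back to the event) can be modelled in a few lines of Python. This sketch is not based on Enterprise Alert's interfaces: the 15-minute timeout, the "ACK" reply keyword, the field names and the move to the next contact when no reply arrives are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

ACK_TIMEOUT = timedelta(minutes=15)   # assumed wait before trying the next contact


def next_contact(event, now, contacts):
    """Return who to notify now, or None if no notification is due."""
    if event.get("Acknowledged"):
        return None                   # loop closed: someone has responded
    last = event.get("LastEscalated")
    if last is not None and now - last < ACK_TIMEOUT:
        return None                   # still waiting for a response
    level = event.get("EscalationLevel", 0)
    if level >= len(contacts):
        return None                   # contact list exhausted
    return contacts[level]


def record_escalation(event, now):
    # Update the event to show that an escalation message went out.
    event["EscalationLevel"] = event.get("EscalationLevel", 0) + 1
    event["LastEscalated"] = now


def record_response(event, reply):
    # Feedback path: an "ACK" reply marks the event as acknowledged.
    if reply.strip().upper() == "ACK":
        event["Acknowledged"] = 1
```

A caller would invoke `next_contact` on each polling cycle, send the message to the returned contact, and then call `record_escalation`; replies received by email or SMS would be passed to `record_response`.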
Extended Architecture (diagram) • Components: Operations Teams, WebGUI Server, Message Catalogue, EA2011, Netcool/OMNIbus ObjectServer, Netcool/Impact, CCMDB, ITM, Probes, ITCAM, ITNM • Layers: Present Events, Process Events, Collect Management Information • Flows: Closed Loop Escalation, Event Details/Tool Integration, Enrichment/Escalation
Summary • Base infrastructure: • Event generation, deduplication, correlation • Event Enrichment: • Basic context for event • Message Catalogue: • Event details and escalation information • Automated Escalation: • Continuous 24x7 service
Orb Data Services • Netcool • TEC to Netcool/OMNIbus Migration Review • Remote Rulebase Migration • Mobile Device Integration/Event Workflow Design • Meet SLAs with Automated out-of-hours escalations • Netcool/Impact – A Practical guide (workshop) • IBM Tivoli Monitoring • ITM 6.2.3 Migration • Predictive Performance and Capacity Monitoring • Custom Agent Development • SLA reporting using TDW • Monitoring MQ infrastructures • WebSphere application and infrastructure monitoring • Tivoli Automation • Implementing Security to minimise risk and overheads