BSM (OMi) 9.2x Stream-Based Event Correlation Troubleshooting
WHAT is Stream-Based Event Correlation?
Stream-based event correlation (SBEC) uses rules and filters to identify commonly occurring events or combinations of events. It helps simplify the handling of such events by automatically identifying events that can be withheld or removed, or that require a new event to be generated and displayed to the operators.
• The following types of SBEC rules can be configured:
  • Repetition Rules: Frequent repetitions of the same event may indicate a problem that requires attention.
  • Combination Rules: A combination of different events occurring together or in a particular order indicates an issue and requires special treatment.
  • Missing Recurrence Rules: A regularly recurring event is missing, for example, a regular heartbeat event does not arrive when expected.
• SBEC rules are processed in the order defined in the rules list. Modifications are executed as soon as a rule is matched, and subsequent rules see the modifications made by earlier rules.
Combination Rules
• When a combination of events occurs, sometimes in a precise order, within a short period of time, this may indicate a problem requiring corrective action, or a scenario that initially appears to be a problem but does not require any intervention by an operator. For example, a node-down event followed by a node-up event within 2 minutes usually means that a system reboot has occurred. This is typically viewed as not significant, as long as reboots do not occur too frequently, and requires no action other than the automatic cleanup of these events.
• Configuring a combination rule requires at least two filters to select the events to consider, for example, one to select events with a node-down indicator and one to select events with a node-up indicator. Certain attributes must be the same for the events to be regarded as originating from the same source, for example, the node CI and source CI must be the same. The time interval between the related events must be short, for example, a maximum of five minutes, for the events to be treated as one scenario. You can also specify whether the events must occur in a particular order for the rule to be matched and executed.
• It may be advantageous to hold back matching events during the time interval to reduce the number of unnecessary events sent to the Event Browser. Only when the required combination of events is received within the specified time period is it necessary to inform the operator. The rule's action could be to close or discard all events, or to modify the last event to report that a reboot has taken place. Alternatively, a new event can be automatically generated, and all matching events can be related to the new event as symptoms.
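The node-down/node-up scenario above can be sketched in a few lines of Python. This is an illustrative model only, not OMi's implementation: the `Event` class, its fields and the `find_reboots` function are hypothetical names, and the two-minute window is simply the value used in the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    node: str        # attribute that must be equal across the combination
    indicator: str   # "node-down" or "node-up" (hypothetical values)
    time: float      # seconds since an arbitrary start

def find_reboots(events, window=120.0):
    """Pair each node-down with the next node-up on the same node within `window` seconds."""
    pending_down = {}   # node -> node-down event still waiting for its node-up
    reboots = []
    for ev in sorted(events, key=lambda e: e.time):
        if ev.indicator == "node-down":
            pending_down[ev.node] = ev
        elif ev.indicator == "node-up":
            down = pending_down.pop(ev.node, None)
            if down is not None and ev.time - down.time <= window:
                reboots.append((down, ev))
    return reboots

events = [
    Event("web01", "node-down", 0.0),
    Event("db01", "node-down", 10.0),
    Event("web01", "node-up", 90.0),    # within 2 minutes: treated as a reboot
    Event("db01", "node-up", 400.0),    # too late: not treated as a reboot
]
reboots = find_reboots(events)
print([(down.node, up.time - down.time) for down, up in reboots])
```

The sketch enforces the order (down before up) and the "same source" condition by keying on the node name; a rule without an order requirement would need to remember both event types.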
Missing Recurrence Rules
• Events are sometimes generated regularly to indicate that no problem has occurred, for example, "alive" events indicate that a system is running. As soon as the expected regular event is not received, it can be assumed that there is a problem, for example, if a system stops reporting "alive" events every 10 minutes, it has probably stopped running.
• Configuring a missing recurrence rule requires a filter to select the events to consider, for example, to select events with "node alive" in the title. Certain attributes must be the same for the events to be regarded as originating from the same source, for example, the node CI and source CI must be the same. The allowable time interval before an expected event is considered to be missing must be specified, for example, a maximum of 10 minutes in our example.
• It may be advantageous to discard recurring events to reduce the number of unnecessary events sent to the Event Browser.
• When the expected event is not received within the specified time period, it is necessary to inform the operator that action is required. A new event can be automatically generated. All matching events can be related to the new event as symptoms.
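As a rough illustration of the idea, the check below flags every source whose last "alive" event is older than the allowed interval. It is a minimal sketch, not the product's logic: `find_missing`, the `last_seen` dictionary and the sample timestamps are hypothetical, and the 600-second default mirrors the 10-minute example above.

```python
def find_missing(last_seen, now, max_interval=600.0):
    """Return the sources whose last "alive" event is older than max_interval seconds."""
    return sorted(src for src, t in last_seen.items() if now - t > max_interval)

# Timestamp (in seconds) of the last "node alive" event per source; made-up data.
last_seen = {"web01": 1000.0, "db01": 300.0}

overdue = find_missing(last_seen, now=1200.0)
print(overdue)
```

In this sketch the caller supplies `now`, so the check has to be run periodically; nothing in the data itself signals that an event failed to arrive.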
Repetition Rules
• The repeated generation of the same event may indicate a problem. For example, more than 10 login failures for the same account within 2 minutes is typically viewed as requiring action and should create a security alert.
• Configuring a repetition rule requires a filter to select the events to consider, for example, the text "login failed" is contained within the title. Certain attributes must be the same for the events to be regarded as originating from the same source, for example, the host name of the system and the user name being used to log in must be the same. The time interval between login attempts must be short, for example, a maximum of two minutes, and there must be a minimum number of failed login attempts before the scenario is considered to be a problem.
• It may be advantageous to hold back matching events during the time interval to reduce the number of unnecessary events sent to the Event Browser. Only when the number of failed login attempts exceeds the specified threshold is it necessary to inform the operator that action is required. The rule's action could be to close or discard the failed login events, except for the last event, which is modified to report the series of failed logins. Alternatively, a new event can be automatically generated. All failed-login events can be related to the new event as symptoms.
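The failed-login example amounts to counting matching events per source inside a sliding time window. The following sketch assumes that reading; the `RepetitionRule` class and its parameters are hypothetical and only mirror the numbers from the example (more than 10 failures within 2 minutes).

```python
from collections import defaultdict, deque

class RepetitionRule:
    def __init__(self, threshold=10, window=120.0):
        self.threshold = threshold
        self.window = window
        # (host, user) -> timestamps of failed logins still inside the window
        self.recent = defaultdict(deque)

    def observe(self, host, user, time):
        """Record one failed login; return True when the count exceeds the threshold."""
        q = self.recent[(host, user)]
        q.append(time)
        while q and time - q[0] > self.window:
            q.popleft()   # drop attempts that fell out of the time window
        return len(q) > self.threshold

rule = RepetitionRule(threshold=10, window=120.0)
# Twelve failed logins for the same host and user, five seconds apart.
fired = [rule.observe("web01", "alice", i * 5.0) for i in range(12)]
print(fired.index(True))
```

Keying the counters on `(host, user)` plays the role of the "attributes that must be the same"; failures for different accounts never add up.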
Repetition Concept
Purpose: Event repetition indicates a problem.
Example: More than 3 reboots within 1 hour shall create a critical event.
[Diagram: timeline t with "Node rebooted" events 1, 2 and 3 inside the time interval]
Combination Concept
Purpose: Handle a combination of events in a certain way.
Example: When a node is down, events about failed SiS monitors should be related to the node down event.
[Diagram: timeline t with events 1 to 4 inside the time interval, labelled "SiS monitor failed", "TCP timeout occurred" and "Node down"]
Missing Recurrence Concept
Purpose: Detect that regularly received events are no longer arriving.
Example: For auditing and compliance purposes, detailed health data and statistics are collected every day using events. If these audit events do not arrive, a critical event should be sent.
[Diagram: timeline t showing events 1 and 2, two expected events marked "?", and an event "A"]
Rule processing
How the SBEC engine works
Only when receiving a new event:
For each rule, in the order defined, all input filters are checked to see whether they match the incoming event.
On every match of an input filter, a query is executed to check whether all conditions of the corresponding rule are met:
• Repetition: enough events received within the time frame
• Combination: at least one event for every filter (“event set”) received within the time frame
If all conditions are met, the actions configured in that rule are executed with immediate effect on all corresponding events.
Multiple SBEC Rules
Any number of Repetition, Combination, and Missing Recurrence Rules can be created
Processed in defined order (visible to the user, configurable)
Can be chained together
• First rule that triggers can modify events (e.g. close, discard, create new)
• Next rule in line will see event modifications
Can filter for the same events (even use the same filter)
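The chaining behaviour can be illustrated with two toy rules applied in list order to the same event. This is a sketch under simplifying assumptions: events are plain dictionaries, and the rule functions `close_reboot_events` and `escalate_open_events` are invented for the illustration.

```python
def close_reboot_events(event):
    # Rule 1: close events whose title mentions a reboot.
    if "rebooted" in event["title"]:
        event["state"] = "closed"
        return True
    return False

def escalate_open_events(event):
    # Rule 2: only acts on events that are still open.
    if event["state"] == "open":
        event["severity"] = "critical"
        return True
    return False

def process(event, rules):
    """Apply rules in list order; each rule sees changes made by earlier ones."""
    return [rule.__name__ for rule in rules if rule(event)]

event = {"title": "node web01 rebooted", "state": "open", "severity": "minor"}
triggered = process(event, [close_reboot_events, escalate_open_events])
print(triggered, event["severity"])
```

Because Rule 1 closes the event first, Rule 2 no longer matches it. Swapping the order of the list changes the outcome, which is why the rule order is configurable and worth checking when troubleshooting.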
Effect of hold back when multiple rules match the same events
Note 1: If at least one rule is holding back an event, it's held back
• Even if another rule is not holding it back
Note 2: There is one holding area for all rules
Example
Rule 1: combination rule looking for node down/node up events; holds back node down because it wants to discard it and create a reboot event instead
Rule 2: combination rule looking for node down & SiS events; does not hold back the events
Result: node down is held back as long as it is within the time window of Rule 1 (and as long as it is not released by any other rule)!
Holding area: stored in the DB if the BSM server is stopped, but not persisted if opr-backend aborts abnormally
Effect of release when multiple rules match the same events
Note 3: When a rule triggers, all the corresponding input events are removed from the holding area
• Even if another rule put them there
• Why? The rule that triggered detected a certain situation where the input events are relevant, and therefore it can be seen as the master of these events. It has the right to release or even discard them.
Note 4: If no rule was holding back an event, release has no effect
Example
• Rule 1: combination rule looking for node down/node up events; holds back node down because it wants to discard it and create a reboot event instead
• Rule 2: combination rule looking for node down & SiS events; does not hold back the events
• Rule 2 triggers after node down and one SiS monitor event were received. Releases events.
• Result: node down is no longer held back and is correlated with the SiS event.
• Note: Rule 1 might still trigger later and create the reboot event!
Effect of discard if possible when multiple rules match the same events
Note 5: Discard if possible will only have an effect if the event is still in the holding area
• If no rule was holding the event back, or if another rule already triggered and released the event, discard will have no effect (but the close operation is executed)
• If discard is possible, the event will be deleted immediately. For other rules, it will look as if the event never arrived.
Example
• Rule 1: combination rule looking for node down/node up events; holds back node down because it wants to discard it and create a reboot event instead
• Rule 2: combination rule looking for node down & SiS events; does not hold back the events
• Rule 2 triggers after node down and one SiS monitor event were received. Releases events.
• Rule 1 triggers: wants to discard the node down event, but this is not possible as it was already released by Rule 2
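The hold, release and discard-if-possible notes can be condensed into a small toy model with one shared holding area. It is only a sketch of the behaviour described on these slides; the `HoldingArea` class, its method names and the event identifiers are hypothetical.

```python
class HoldingArea:
    """One shared holding area for all rules (toy model of the notes above)."""

    def __init__(self):
        self.held = set()   # ids of events currently held back
        self.log = []       # what happened to each event, in order

    def hold(self, event_id):
        self.held.add(event_id)

    def release(self, event_id):
        # No effect if no rule was holding the event back.
        if event_id in self.held:
            self.held.remove(event_id)
            self.log.append((event_id, "released"))

    def discard_if_possible(self, event_id):
        # Discard only works while the event is still held; otherwise it is closed.
        if event_id in self.held:
            self.held.remove(event_id)
            self.log.append((event_id, "discarded"))
        else:
            self.log.append((event_id, "closed"))

area = HoldingArea()
area.hold("node-down")                  # Rule 1 holds the node-down event back
area.release("node-down")               # Rule 2 triggers first and releases it
area.release("sis-monitor-failed")      # was never held: no effect
area.discard_if_possible("node-down")   # Rule 1 triggers later: too late to discard
print(area.log)
```

The sequence reproduces the example: once Rule 2 has released the node down event, Rule 1 can only close it, so the event remains visible instead of disappearing.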
Gotchas & Best practices
Gotcha
• It's quite easy to create a simple repetition rule like this:
  • the repetition rule uses the filter: title contains “rebooted”
  • and creates a new event with the title: “system <node> rebooted 10 times in 2 hours”
  • Guess what happens...
Best Practices
• In a rule, don't create events that match the input filter of the rule
• Include a check for the event state in the filter: look for non-closed events only
• Avoid filters that are too generic (like contains “rebooted”)
• Add a custom attribute (e.g. “SBECcreated=true”) and check for it if you want to prevent a created event from being processed by following rules
• If possible, avoid matching the same events. If unavoidable, make sure you understand the hold/release/discard behavior
• When you reuse CI Hint in “Create New Event”, also reuse Node Hint.
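The gotcha is a feedback loop: the event created by the rule matches the rule's own filter. The sketch below contrasts a too-generic filter with one that follows the best practices; the dictionary layout of an event and both filter functions are hypothetical stand-ins for filters configured in the UI.

```python
def naive_filter(event):
    # Too generic: also matches the event the rule itself creates.
    return "rebooted" in event["title"]

def safer_filter(event):
    # Non-closed events only, and skip events created by an SBEC rule.
    return (
        "rebooted" in event["title"]
        and event.get("state") != "closed"
        and event.get("custom", {}).get("SBECcreated") != "true"
    )

created = {
    "title": "system web01 rebooted 10 times in 2 hours",
    "state": "open",
    "custom": {"SBECcreated": "true"},
}
original = {"title": "web01 rebooted", "state": "open"}

print(naive_filter(created), safer_filter(created), safer_filter(original))
```

With the naive filter, every generated summary event counts as one more "rebooted" event and pushes the rule toward triggering again.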
Event suppression
Purpose: All events matching a filter will be discarded from the event pipeline
Example: OMi is receiving unimportant events from a data source that is not under the control of the OpsBridge organization, so they can't be filtered out at the source
Configurable by event suppression rules consisting of
• Event Filter
• Name
• Description
• Enable/Disable
Suppression rules are processed in the event pipeline at an early stage
• Right after the resolution step, before Post-Resolution-EPI
• No further processing occurs; suppressed events are lost and are not stored in the OMi DB.
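Conceptually, suppression is a filter step that drops matching events before anything else sees them. A minimal sketch, assuming rules are dictionaries with a name, an enabled flag and a filter function (all hypothetical, as is the sample data):

```python
def suppress(events, rules):
    """Drop every event matched by an enabled suppression rule; keep the rest."""
    active = [r["filter"] for r in rules if r["enabled"]]
    return [e for e in events if not any(f(e) for f in active)]

rules = [
    {"name": "drop-lab-noise", "enabled": True,
     "filter": lambda e: e["source"] == "lab"},
    {"name": "drop-info", "enabled": False,          # disabled: has no effect
     "filter": lambda e: e["severity"] == "info"},
]
events = [
    {"source": "lab", "severity": "critical"},
    {"source": "prod", "severity": "info"},
]
kept = suppress(events, rules)
print(kept)
```

Note that the critical event from the lab source is dropped as well: a suppression filter that is too broad silently loses events, which is worth remembering when an expected event never shows up.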
Event history
An event has changed and you have no idea why? Check the event history
• Contains information about the user or component that changed event properties
Common source of unexpected changes to events: Event Forwarding & Back-Sync
Logging / debugging
Server: DPS
Process: opr-backend
Log config to enable log level “DEBUG”: /<HPBSM_root>/conf/core/Tools/log4j/opr-backend/opr-backend.properties
Log files:
• /<HPBSM_root>/log/opr-backend/opr-backend.log (default location for all logging within this process)
• /<HPBSM_root>/log/opr-backend_boot.log (for more severe issues, e.g. unhandled exceptions; everything dumped to stdout/stderr)
How to track an SBEC rule as events arrive
1. Enable DEBUG logging and open opr-backend.log
2. Make sure the event has arrived; you will see a message like this:
2013-01-17 05:51:16,726 [Thread-44] DEBUG EventChannelCiResolver.logEvent(309) - resolving event: SBEC(01b0ceb8-8189-4fdb-ae74-cf38b698b6d9), nodeHints=bsm92, relatedCiHint=bsm92, service_id=null
3. Make sure the event matches the SBEC rule:
2013-01-17 05:51:09,951 [Thread-44] DEBUG EventStreamCorrelator.evaluateEventInRule(95) - Event matches filter in rule 'SBEC 3 CRITICAL EVENT RULE'
2013-01-17 05:51:09,951 [Thread-44] DEBUG FilterConfigManagerImpl.getFilterConfig(100) - get filter configuration with id: 93c74ff4-8cd0-463a-899b-2ffd41658d0f
4. The SBEC rule is evaluated:
2013-01-17 05:51:09,958 [Thread-44] DEBUG SbecRuleEvaluatorImpl.findSbecInstances(117) - Found 1 results
2013-01-17 05:51:09,958 [Thread-44] DEBUG SbecRuleEvaluatorImpl.findSbecInstances(130) - New SbecInstance: com.hp.opr.common.streamcorrelation.result.SbecInstance@bfa28a9[RuleId=89299047-5a85-4755-945e-43e5b9ab837a,MatchedEvtSets=[com.hp.opr.common.streamcorrelation.result.MatchedEventSet@54837563[c43278af-044d-b827-ef0f-484e086159b7,[84ee3e3d-e19f-4f39-818f-c70be3746550]]]]
2013-01-17 05:51:09,958 [Thread-44] DEBUG QueryResultProcessorImpl.resultMatches(82) - Repetition Scenario evaluated: 1 of 3 events collected
2013-01-17 05:51:09,958 [Thread-44] DEBUG EventUpdater.storeCorrelations(247) - Storing correlations
5. When the second event arrives within the configured time frame, make sure it is stored:
2013-01-17 05:51:13,560 [Thread-44] DEBUG SbecRuleEvaluatorImpl.findSbecInstances(117) - Found 2 results
2013-01-17 05:51:13,560 [Thread-44] DEBUG SbecRuleEvaluatorImpl.findSbecInstances(130) - New SbecInstance: com.hp.opr.common.streamcorrelation.result.SbecInstance@5f5fc212[RuleId=89299047-5a85-4755-945e-43e5b9ab837a,MatchedEvtSets=[com.hp.opr.common.streamcorrelation.result.MatchedEventSet@7be5ca9[c43278af-044d-b827-ef0f-484e086159b7,[999d66c6-08cd-4b20-ac2c-6fc5dd19c11e, 84ee3e3d-e19f-4f39-818f-c70be3746550]]]]
2013-01-17 05:51:13,560 [Thread-44] DEBUG QueryResultProcessorImpl.resultMatches(82) - Repetition Scenario evaluated: 2 of 3 events collected
2013-01-17 05:51:13,560 [Thread-44] DEBUG EventUpdater.storeCorrelations(247) - Storing correlations
6. Make sure the third event arrives:
2013-01-17 05:51:16,753 [Thread-44] DEBUG SbecRuleEvaluatorImpl.findSbecInstances(117) - Found 3 results
2013-01-17 05:51:16,753 [Thread-44] DEBUG SbecRuleEvaluatorImpl.findSbecInstances(130) - New SbecInstance: com.hp.opr.common.streamcorrelation.result.SbecInstance@3dc63d85[RuleId=89299047-5a85-4755-945e-43e5b9ab837a,MatchedEvtSets=[com.hp.opr.common.streamcorrelation.result.MatchedEventSet@21f10672[c43278af-044d-b827-ef0f-484e086159b7,[01b0ceb8-8189-4fdb-ae74-cf38b698b6d9, 999d66c6-08cd-4b20-ac2c-6fc5dd19c11e, 84ee3e3d-e19f-4f39-818f-c70be3746550]]]]
2013-01-17 05:51:16,754 [Thread-44] DEBUG QueryResultProcessorImpl.resultMatches(82) - Repetition Scenario evaluated: 3 of 3 events collected
7. Once the rule matches, the specified actions are executed:
2013-01-17 05:51:16,754 [Thread-44] DEBUG QueryResultProcessorImpl.resultMatches(82) - Repetition Scenario evaluated: 3 of 3 events collected
2013-01-17 05:51:16,754 [Thread-44] DEBUG QueryResultProcessorImpl.processResult(53) - Rule 'SBEC 3 CRITICAL EVENT RULE' matches. Now executing Actions!
2013-01-17 05:51:16,754 [Thread-44] DEBUG BSMConnectionProvider.logOpenConnection(188) - Connection has been retrieved from pool. Number of borrowed connections is now: 1
8. A new SBEC event is created:
2013-01-17 05:51:16,811 [Thread-44] DEBUG PipelineEventPoolImpl.insertNewEvent(330) - New event being inserted into the pipeline: com.hp.opr.common.model.Event@c8dbe6a[dbf5cae8-7a31-4093-8853-1bf208a100f7,1,SBEC received 3 Critical Evenst in aminute,<null>,OPEN,CRITICAL,<null>,<null>,<null>,<null>,bsm92,<null>,<null>,<null>,com.hp.opr.common.model.ResolutionHints@2dd02796[bsm92,<null>,<null>,<null>],com.hp.opr.common.model.ResolutionHints@3cd70059[<null>,<null>,<null>,<null>],<null>,<null>,false,-1,-1,[],{},Thu Jan 17 05:51:16 MST 2013,<null>,Thu Jan 17 05:51:16 MST 2013,0,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,false,<null>,<null>,<null>,<null>,<null>,<null>]
2013-01-17 05:51:16,811 [Thread-44] DEBUG EventPipeline.reinsertEvent(440) - Event dbf5cae8-7a31-4093-8853-1bf208a100f7 is now waiting for reinsertion at step PipelineEntry
2013-01-17 05:51:16,811 [Thread-44] DEBUG EventUpdater.storeCorrelations(247) - Storing correlations
9. To troubleshoot, know the steps and how they work
If any of the steps above fails, opr-backend.log (in DEBUG mode) gives the reason.
To find corrupt people, follow the money; to find non-working SBEC events, follow the event through opr-backend.log.
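Following an event through the log by hand gets tedious, so a small script can pull out the milestones. The sketch below is an assumption-laden helper, not an HP tool: the regular expressions are derived only from the sample lines shown in the steps above, and the `trace` function name is made up. Other builds may log in a different format.

```python
import re

# Layout of the sample lines above: timestamp [thread] LEVEL source - message
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \[(?P<thread>[^\]]+)\] "
    r"(?P<level>\w+)\s+(?P<source>\S+) - (?P<msg>.*)$"
)
PROGRESS = re.compile(r"Repetition Scenario evaluated: (\d+) of (\d+) events collected")
MATCHED = re.compile(r"Rule '(.+)' matches\. Now executing Actions!")

def trace(lines):
    """Yield (timestamp, summary) for the SBEC milestones found in the log lines."""
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue
        msg = m.group("msg")
        progress = PROGRESS.search(msg)
        matched = MATCHED.search(msg)
        if progress:
            yield m.group("ts"), f"{progress.group(1)}/{progress.group(2)} events collected"
        elif matched:
            yield m.group("ts"), f"rule matched: {matched.group(1)}"

sample = [
    "2013-01-17 05:51:13,560 [Thread-44] DEBUG QueryResultProcessorImpl.resultMatches(82) - Repetition Scenario evaluated: 2 of 3 events collected",
    "2013-01-17 05:51:16,754 [Thread-44] DEBUG QueryResultProcessorImpl.resultMatches(82) - Repetition Scenario evaluated: 3 of 3 events collected",
    "2013-01-17 05:51:16,754 [Thread-44] DEBUG QueryResultProcessorImpl.processResult(53) - Rule 'SBEC 3 CRITICAL EVENT RULE' matches. Now executing Actions!",
]
milestones = list(trace(sample))
for ts, summary in milestones:
    print(ts, summary)
```

To run it against a real log, pass an open file object instead of `sample`, for example `trace(open(path))` with `path` pointing at opr-backend.log. If the counter stops short of the required number, the rule never triggered and the filter or time window is the first thing to check.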