Leveraging Ionix IT Operations Intelligence to make your Private Cloud more efficient using APIs
Bill Kuhhirte
“It is a very sad thing that nowadays there is so little useless information.” - Oscar Wilde
EMC’s Vision Begins with your Core Asset
Agenda
• Brief Technology Overview and Definition of Terms
• Points of Extensibility
  • Dynamic MODEL
    • Use Case #1 – Configuring Multiple Thresholds
    • Use Case #2 – Site Failure Analysis
  • Business Impact and Maintenance
    • Use Case #3 – Maintenance
    • Use Case #4 – Business Impact Management
  • SAM Automatic Actions and Notification List Subscribers
    • Use Case #5 – SAM Actions
    • Use Case #6 – Combining RCA with Abstract Events
  • Notification Manager
    • Use Case #7 – Advanced Event Management with Notification Manager
• Recap
• Questions
Smarts Founding Vision
• Automate the management of dynamic distributed systems
• Management by delegation
• Model-based management
• Patented technology builds intelligence into software that automatically adapts to the managed system
Value of Automated Root Cause Analysis
Up to 80% of the time to resolve a service-affecting failure can be attributed to finding the source.
• Accelerate resolution
• Lower operational costs
Terminology
• MTTI = Mean Time to Identify – the time to identify the cause of the incident
• MTTF = Mean Time to Fix – the time to actually restore service once the cause is isolated
• MTTR = Mean Time to Resolution
[Figure: incident timeline with Escalate, Identify and Restore phases, annotated with MTTI, MTTF and MTTR. Values shown: 55 min, 20 min, 75 min, 5 min, 25 min.]
The Impact of Automated Analysis: 67% MTTR improvement!
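The figure itself did not survive extraction, so the following is only one plausible reading of its numbers, offered as an assumption: MTTR = MTTI + MTTF, with MTTI dropping from 55 min to 5 min under automated analysis while MTTF stays at 20 min. Under that reading the quoted improvement can be checked with a few lines of Python (function names are illustrative, not part of any Ionix API):

```python
def mttr(mtti_min: float, mttf_min: float) -> float:
    """Mean time to resolution, assumed here to be identify time plus fix time."""
    return mtti_min + mttf_min


def improvement_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return 100.0 * (before - after) / before


# Assumed reading of the slide's figure values (minutes).
manual = mttr(mtti_min=55, mttf_min=20)      # manual identification
automated = mttr(mtti_min=5, mttf_min=20)    # automated root cause analysis

print(manual, automated, round(improvement_pct(manual, automated)))
```

The fix time is unchanged in this reading; the whole gain comes from shortening the identification phase.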
Current IT Management State
• Operational Gap
• Automation Gap
• Technology Gap
• Management Information Gap
• Business Gap
“Automating the Automatable”
• Analyze in Context: automatically analyzes any behavior, in any technology domain
• Integration Layer: automatically builds a knowledge base across infrastructure and business
• Collect
  • Auto-discovery
  • Mediation
  • 900+ certified devices
  • Adapters
Doing More with Ionix IT Operations Intelligence
• Produces an “actionable” root cause that can be refined by
  • Identifying at a more granular level of detail what is wrong
  • Identifying, based on similar events, what was done to resolve them
  • Actually taking an action (automated or user-directed) to resolve the problem or gather additional information
• Focused only on the technology domains and specific problems
  • There will always be new problems to solve
  • Some problems are specific to a very narrow set of environments or configurations
  • Some problems are simply hard to diagnose in a generic way (e.g. firmware bugs, transient conditions)
• By having a flexible framework
  • Analysis can be adaptive to your needs
  • Automation can be applied to reduce personnel costs/time
  • Can rapidly address new problems without waiting for new releases from EMC
Global Network Solution Architecture
[Diagram labels: SAM & BIM (Root Cause, Business Impacts, Cross Correlation as Applicable); Ionix CMDB; Topology; Discovery & Monitoring (NPM, IPAM/PM, MPLS); collection via SNMP, Syslog, SSH and Telnet / SNMP, ICMP & Traps / SNMP & EMS; Managed Domain (EIGRP, Firewalls, MPLS, Servers, BGP, Routers, Switches, IS-IS, OSPF).]
Definition of Terms
• Repository
  • The Repository is an in-memory database representing the topology constructed automatically by applying behavior models to the discovered infrastructure. It represents physical and logical objects in the managed environment and their relationships and is used to compute problem signatures for the Codebook.
• MODEL
  • Managed Object Definition Language (MODEL) is a language used to express the logical and physical relationships between components of the topology as well as how symptoms propagate across those relationships from the problems they relate to.
• ECIM
  • The Repository leverages the industry-standard Common Information Model defined by the DMTF, and is the first commercial implementation of this important standard. The EMC Ionix implementation of this model is called the EMC Common Information Model (ECIM). It provides a single common topological context for all of the EMC Smarts analysis tools as well as events received from 3rd party tools. This means that when an operator receives a notification of a problem they can rapidly view all the current problem information for the device regardless of the information source. The infrastructure devices and their components are also related to the logical topologies that are overlain on the physical topology. This permits impact analysis to extend to customers, business processes, geographies, etc.
Dynamic Model
• Dynamic Model is an alternate implementation of the Managed Object Definition Language (MODEL)
  • Traditional Model produces executable code
  • Dynamic Model produces a platform- and (mostly) version-independent output
  • Languages and semantics are identical, with some minor limitations
• Dynamic Model enables you to add new classes, and refine classes that are already defined in the data model libraries without needing the sources.
  • Dynamic Model can add attributes, events, and relationships to an existing class
  • New attributes are saved and restored with the repository
• Load dynamic model extensions into IP and SAM servers
• Populate attributes via ad hoc scripts, discovery scripts and SNMP polling
Use Case #1: Configuring Multiple Thresholds
• Case in point:
  • Provide 4 thresholds for FileSystem utilization with different severities
• Overview of the solution:
  • Dynamic MODEL code to configure multiple events
  • Dynamic MODEL code to adjust the UI
  • Assign severities in SAM for the new events
Use Case #1: Configuring Multiple Thresholds
• Dynamic MODEL to generate new events:
  • Define and Export the events
  • Provide default threshold values
• Solution:

    refine interface FileSystem_Performance {
        export ModerateUtilization85PercentMarkerExceeded;
        export ModerateUtilization90PercentMarkerExceeded;

        event ModerateUtilization85PercentMarkerExceeded
            "Utilization is higher than Utilization85PercentMarker and less than Utilization90PercentMarker."
            = Mounted && StorageSize > 0 &&
              UtilizationPct > Utilization85PercentMarker &&
              UtilizationPct <= Utilization90PercentMarker;

        event ModerateUtilization90PercentMarkerExceeded
            "Utilization is higher than Utilization90PercentMarker and less than Utilization95PercentMarker."
            = Mounted && StorageSize > 0 &&
              UtilizationPct > Utilization90PercentMarker &&
              UtilizationPct <= Utilization95PercentMarker;

        attribute double Utilization85PercentMarker
            "Threshold for percentage of total size currently in use."
            = 85.0;

        attribute double Utilization90PercentMarker
            "Threshold for percentage of total size currently in use."
            = 90.0;
    }
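To make the banding in the two event expressions easier to see, here is a minimal Python sketch of the same conditions. It is not MODEL and not part of the product: the function name and the dictionary of markers are hypothetical, and the 95.0 default is an assumption, since the slide references Utilization95PercentMarker without showing its default value.

```python
from typing import Dict, Optional

# Marker defaults: 85.0 and 90.0 come from the slide; 95.0 is an assumed value.
DEFAULT_MARKERS: Dict[int, float] = {85: 85.0, 90: 90.0, 95: 95.0}


def moderate_event(mounted: bool, storage_size: float, utilization_pct: float,
                   markers: Dict[int, float] = DEFAULT_MARKERS) -> Optional[str]:
    """Return the name of the event whose band contains utilization_pct, if any."""
    # Both events require a mounted filesystem with a non-zero size.
    if not mounted or storage_size <= 0:
        return None
    # Each band is open at the lower marker and closed at the upper marker,
    # matching the '>' and '<=' comparisons in the event expressions.
    if markers[85] < utilization_pct <= markers[90]:
        return "ModerateUtilization85PercentMarkerExceeded"
    if markers[90] < utilization_pct <= markers[95]:
        return "ModerateUtilization90PercentMarkerExceeded"
    return None


print(moderate_event(True, 500.0, 87.5))
print(moderate_event(True, 500.0, 92.0))
```

Because the bands do not overlap, at most one of the two events is active for a given filesystem, which is what lets each event carry its own severity in SAM.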
Use Case #1: Configuring Multiple Thresholds
• Dynamic MODEL to adjust the UI:
  • Provide an attribute (ranged) for each of the thresholds
• Solution:
  • Posted to https://community.emc.com/message/458163

    refine interface FileSystem_Performance_Setting {
        attribute double [0 .. 100] Utilization85PercentMarker
            "The lower threshold for moderate filesystem utilization expressed as a "
            "percentage of the total capacity of the filesystem."
            = 85;

        attribute double [0 .. 100] Utilization90PercentMarker
            "The higher threshold for moderate filesystem utilization expressed as a "
            "percentage of the total capacity of the filesystem."
            = 90;
        …
    }
Use Case #2 – Site Failure Analysis
• Failures of a physical location (Rack, Floor, Building, etc.)
• Desirable RCA – especially in areas with power or cooling issues
• Easy to perform, but based on undiscoverable data
• Solution posted to https://community.emc.com/message/458163

    interface Site : ICIM_Collection {
        export SiteDown;

        event SiteDown "The site is down."
            = IsSiteDown && (|ICIM_UnitaryComputerSystem(ConsistsOf)| > 0);

        propagate attribute boolean and AllUnresponsive
            = ICIM_UnitaryComputerSystem, ConsistsOf, IsUnresponsive;

        propagate attribute boolean or SuperSiteDown
            = Site, MemberOf, AllUnresponsive;

        computed attribute boolean IsSiteDown
            = SuperSiteDown ? FALSE : AllUnresponsive else AllUnresponsive;
        …
    };
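The intent of the Site model appears to be: a site is down when every system in it is unresponsive, unless an enclosing site is already down, in which case the larger site is the root cause. The Python sketch below illustrates that reading only. The class and method names are hypothetical, and it assumes that an enclosing site's ConsistsOf set includes the systems of its sub-sites.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Site:
    name: str
    # system name -> IsUnresponsive (assumed to be populated by monitoring)
    systems: Dict[str, bool] = field(default_factory=dict)
    # enclosing sites, mirroring the MemberOf relationship
    member_of: List["Site"] = field(default_factory=list)

    def all_unresponsive(self) -> bool:
        """Logical AND of IsUnresponsive over the member systems."""
        return all(self.systems.values())

    def super_site_down(self) -> bool:
        """Logical OR over enclosing sites; empty enclosing sites are ignored."""
        return any(bool(p.systems) and p.all_unresponsive() for p in self.member_of)

    def site_down(self) -> bool:
        """Mirror of the SiteDown event: non-empty, all down, no enclosing site down."""
        if not self.systems:
            return False  # the event requires at least one member system
        if self.super_site_down():
            return False  # the enclosing site explains this failure
        return self.all_unresponsive()


building = Site("Building-1", {"srv-a": True, "srv-b": True, "srv-c": False})
rack = Site("Rack-7", {"srv-a": True, "srv-b": True}, member_of=[building])
print(rack.site_down(), building.site_down())
```

With one server in the building still responding, only the rack is reported as down; once the last server stops responding, the building becomes the single reported failure and the rack is suppressed.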
Use Case #3 - Maintenance
• IT Departments normally have scheduled component or service outages
  • Want the Operations staff to ignore those conditions
  • Need the alarms to be visible again if the component or service is still unavailable after the planned window.
• Ionix SAM 8.0 and higher provides a good mechanism for handling scheduled outages
  • Provided through MBIM (Maintenance and Business Impact Manager)
  • The GUI exposes ways to configure scheduled maintenance for topology objects in SAM
• So, what if the object doesn’t exist in SAM?
  • More granular pieces:
    • Network Adapters
    • FileSystems
    • TemperatureSensors
  • Comes from an abstract event source
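The behaviour asked for above (hide alarms during the planned window, show them again if the fault outlasts it) can be sketched in a few lines of Python. This is an illustration of the rule only, not how MBIM implements it; the MaintenanceWindow type and alarm_visible function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaintenanceWindow:
    start: float     # window start, seconds since the epoch
    duration: float  # window length in seconds

    @property
    def end(self) -> float:
        return self.start + self.duration


def alarm_visible(alarm_active: bool, now: float, window: MaintenanceWindow) -> bool:
    """An active alarm is hidden only while 'now' falls inside the window."""
    if not alarm_active:
        return False
    in_window = window.start <= now < window.end
    return not in_window


window = MaintenanceWindow(start=1_000.0, duration=3_600.0)
print(alarm_visible(True, 2_000.0, window))  # during the window
print(alarm_visible(True, 5_000.0, window))  # fault persists after the window
```

Keeping the window as data (start plus duration) is what makes it practical to drive it from a configuration file or an external scheduler.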
Maintenance – API Accessibility
• The Scheduled Maintenance *is* an event and can be created using our standard APIs
  • Can even be driven by a configuration file
  • Could even be coupled to a server-side tool in SAM as a way to suppress alarms for a fixed duration
• Solution posted to https://community.emc.com/message/458163

    notiName = "NOTIFICATION-" . systemClass . "_" . systemName . "_SchedMaint" . currentTime;
    schedMaintNotiObj = create("ICS_Notification", notiName);
    schedMaintNotiObj->ClassName = "Interface";
    schedMaintNotiObj->ClassDisplayName = "Interface";
    schedMaintNotiObj->InstanceName = ifName;
    schedMaintNotiObj->InstanceDisplayName = ifName;
    schedMaintNotiObj->EventType = "MOMENTARY";
    schedMaintNotiObj->EventText = "Sched maint from: " . time(schedMaintTimeValue) .
                                   " to: " . time(maintEndTime) . ", by EXTERNAL";
    schedMaintNotiObj->EventName = "SchedMaint";
    schedMaintNotiObj->EventDisplayName = "SchedMaint";
    schedMaintNotiObj->OccurredOn = systemObj;
    schedMaintNotiObj->Severity = 5;
    schedMaintNotiObj->ClearOnAcknowledge = TRUE;
    schedMaintNotiObj->notify("maint", "EXTERNAL",
                              "Maint Window Created from External Source",
                              currentTime, schedMaintDuration);
    schedMaintNotiObj->takeOwnership("maint");
    schedMaintNotiObj->changed();
Use Case #4 – Business Impact Management
• What is the Maintenance & Business Impact Manager (MBIM)?
  • A method of implying service impacts based on relationships to topological components
  • Allows the Operations staff to prioritize simultaneous problems based on business impact
  • Handles the creation and manipulation of scheduled maintenance windows for components.
• Perceived limitations:
  • Service impacts can be calculated only against topology in SAM
  • Any event regardless of severity triggers the service impact
  • Service impact can vary depending on when the problem occurs
• All of those can be overcome through the use of the API (and a little creativity)
Notification List Processing - Term Overview
• Notification List
  • A subset of the overall set of alarms (notifications) within SAM based on the application of a filter.
• Notification List Subscriber
  • An adapter using any of the forms of API which will be sent indications of change to the notifications within a Notification List
  • The adapter may then perform any number of actions based on the reception of that data
    • Output the data to another interface
    • Manipulate the notification
    • Perform specific user-defined actions
• For more information on how to construct a Notification List subscriber take a look here: https://community.emc.com/docs/DOC-1268
• Support exists in all flavors of the API
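The two terms above amount to a filter plus a publish/subscribe fan-out. The following Python sketch shows that shape only; it does not use or imitate the SAM API, and the NotificationList class, its methods and the dictionary-based notification are all hypothetical.

```python
from typing import Callable, Dict, List

Notification = Dict[str, str]
Subscriber = Callable[[str, Notification], None]


class NotificationList:
    """A named filter over notifications plus the subscribers interested in it."""

    def __init__(self, name: str, accepts: Callable[[Notification], bool]) -> None:
        self.name = name
        self.accepts = accepts
        self._subscribers: List[Subscriber] = []

    def subscribe(self, callback: Subscriber) -> None:
        self._subscribers.append(callback)

    def publish(self, change_type: str, notification: Notification) -> bool:
        """Forward a change to every subscriber if the notification passes the filter."""
        if not self.accepts(notification):
            return False
        for callback in self._subscribers:
            callback(change_type, notification)
        return True


# Illustrative filter: only notifications whose EventName is "Failure".
failures = NotificationList("FAILURES", lambda n: n.get("EventName") == "Failure")
failures.subscribe(lambda change, n: print(change, n["InstanceName"]))
failures.publish("NOTIFY", {"EventName": "Failure", "InstanceName": "router-1"})
failures.publish("NOTIFY", {"EventName": "HighUtilization", "InstanceName": "router-2"})
```

A subscriber sees only what the list's filter admits, so narrowing the filter is the cheapest way to reduce the work each adapter has to do.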
MBIM/SAM Subtleties
• MBIM is driven by a Notification List:
  • In $SM_SITEMOD/rules/bim/bim-start-sam-sync.asl:

    BUSINESS_IMPACT_SUB {
    } do {
        …
        sub = create("GA_NLSubscription", bimDriverName."-SUB");
        sub->NLName = "ALL_NOTIFICATIONS";
        subscriberFE->SubscribesTo += sub;
        …
    }

• So, you can change the set of notifications subscribed to that will drive business impacts by changing the NL and applying your own filters
• Sidebar: What if you wanted to be granular beyond what you can express in a filter?
Sidebar: ASL Notification List Filter
• Instead of using the typical NL filter construction (using the UI or XML) we can use ASL:
  • Option to use an ASL filter is only available when creating the Notification List filter
  • The use of ASL can make a filter arbitrarily complex
  • The variable “Result” must contain a Boolean value indicating whether the event passes the filter or not
• Sample code is below ($SM_HOME/rules/ics/nl-sample-filter.asl):

    default Result = TRUE;
    default NotificationName = "";

    START do {
        notification = object(NotificationName);
        if (notification->EventName == "Failure") {
            Result = TRUE;
        } else {
            Result = FALSE;
        }
    }
Use Case #4 – Business Impact Management
• Key Points:
  • Example is done via hook script, but can be easily done through a NotificationList subscriber
  • Utilizes general (string) key/value tables of the class “GA_StringDictionary”
    • Not persistent data – will be reloaded every time SAM restarts
  • Map is based around the ElementName but can be adapted to any field of the Notification
  • Example code shows the creation for a single Customer, but the data may have any number of business impacts
• Solution posted to https://community.emc.com/message/458163
Use Case #4 – Business Impact Management

    timeNow = time();
    notiObj = notiFactory->makeNotification("Customer", custKey, "ServiceImpacted");
    notiObj->ClassName = "Customer";   // Instance class must be set
    notiObj->SourceDomainName = eventObj->SourceDomainName;
    notiObj->Severity = eventSeverity;
    notiObj->EventType = "DURABLE";    // Set the event to autoclear based on duration
    notiObj->EventText = "Customer ".customerName." impacted by device ".keyElementName."::".eventText;
    notiObj->Category = "IMPACT";
    notiObj->CausedBy += eventObj;
    notiObj->Impact = numeric(custWeight);
    notiObj->InstanceDisplayName = custKey;
    notiObj->InstanceName = customerName;
    notiResult = notiObj->notify("", "", "", timeNow);
    notiResult = notiObj->changed();
    notifInstance->Causes += notiObj;
Use Case #4 – Business Impact Management
• Sample Code Continued:
  • Similar processing required when clearing
  • Remember to call changed() to indicate to all clients that the Notification has been altered.

    // Loop through the list of Customers associated to the specified component
    //
    if (debug) { print("devCustList = ".devCustList); }
    foreach custKey (devCustList) {
        if (debug) { print("Customer Key : ".custKey); }
        notifKey = "NOTIFICATION-Customer_".custKey."_ServiceImpacted";
        custNotif = self->object(notifKey);
        if (custNotif->isNull()) {
            print("WARNING: Missing Customer Notification ".notifKey.
                  " for Device Mapping GA_StringDictionary DSLAM_AD. Device =" .keyElementName);
        } else {
            notifResult = custNotif->clear("", "", "");
            notifResult = custNotif->changed();
        }
    }
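The raise-and-clear flow in the two ASL fragments can be summarised with a small Python model. It is a sketch of the bookkeeping only, not of the SAM API: the ImpactTracker class and its methods are hypothetical, a plain dict stands in for the GA_StringDictionary, and only the notification key format is taken from the slide.

```python
from typing import Dict, List, Set, Tuple


def impact_key(customer: str) -> str:
    """Notification name, following the key format used in the slide's clear loop."""
    return f"NOTIFICATION-Customer_{customer}_ServiceImpacted"


class ImpactTracker:
    def __init__(self, device_customers: Dict[str, List[str]]) -> None:
        # Stand-in for the device -> customers string dictionary.
        self.device_customers = device_customers
        self.active: Set[str] = set()

    def device_failed(self, device: str) -> List[str]:
        """Raise one ServiceImpacted notification per customer mapped to the device."""
        keys = [impact_key(c) for c in self.device_customers.get(device, [])]
        self.active.update(keys)
        return keys

    def device_cleared(self, device: str) -> Tuple[List[str], List[str]]:
        """Clear the customer notifications; report keys that were unexpectedly missing."""
        cleared: List[str] = []
        missing: List[str] = []
        for customer in self.device_customers.get(device, []):
            key = impact_key(customer)
            if key in self.active:
                self.active.remove(key)
                cleared.append(key)
            else:
                missing.append(key)  # corresponds to the WARNING branch
        return cleared, missing


tracker = ImpactTracker({"dslam-01": ["Acme", "Globex"]})
print(tracker.device_failed("dslam-01"))
print(tracker.device_cleared("dslam-01"))
```

Because the mapping lives outside the topology, the same approach works for devices or events that SAM does not model, which is the limitation this use case sets out to remove.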
Actions within ASL/Java/Perl/C++
• Say you have a custom Notification List subscriber built
  • Want to be able to do more with it?
• Suite of actions available:
  • Executed as a MODEL method, inherently language and mostly platform independent
  • ACT_SNMP
    • Send a trap/traps/informs
    • Request data via get or getNext
  • ACT_ICMP
    • Ping the designated target IP or system
  • ACT_Mail
    • Send an SMTP mail message
  • ACT_Script
    • Execute a script (run) on the server within the Ionix directory structure and return an integer value
    • Execute the script (run_ex) and return both a result (integer) as well as any text (stdout)
  • ACT_Perl
    • Similar to the above, but invokes Perl natively
• Imagine the possibilities!
  • Feedback loops for collecting additional data for the audit log
Use Case #5 – Script Actions in SAM
[Diagram labels: SAM & BIM; NL Subscriber; Add to audit text; Results; ACT_Script run_ex(); Discovery & Monitoring; VI-SDK; Expect; Managed Domain (EIGRP, Firewalls, MPLS, Servers, BGP, Routers, Switches, IS-IS, OSPF).]
Use Case #5 – Script Actions using SAM and IP
[Diagram labels: SAM & BIM; NL Subscriber; Add to audit text; Results; IPAM/PM; ACT_Script run_ex(); Discovery & Monitoring; Expect; Managed Domain (EIGRP, Firewalls, MPLS, Servers, BGP, Routers, Switches, IS-IS, OSPF).]
Use Case #5 – Script Actions in SAM
• General Code Overview
  • After a notification is received, create an ACT_Script object (if one doesn’t already exist)
    • Should have one per NL subscriber
  • Invoke the ACT_Script, passing parameters:

        readonly script_result_t run_ex(in string parameters = "", in string stdindata = "");

    • The “parameters” argument describes a space-delimited set of arguments to be passed
      • For example “--version --output”
    • The stdindata is a string passed to the stdin of the process created to run the script
• Retrieving results
  • Results are returned in a data structure

        struct script_result_t { int result_code; string result_text; };

  • The data structure is returned as a list in ASL
  • Interpretation of the results can then interact with the domain manager
• Note: This implementation is a single thread, but you can launch new processing agents by calling GA_Driver::start() or startWithParameters() with waitForCompletion set to FALSE
• Solution posted to https://community.emc.com/message/458163
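For readers who have not used ACT_Script, the contract described above (space-delimited parameters in, optional stdin data in, an integer code and captured stdout out) resembles what Python's standard subprocess module offers. The sketch below is an analogy under that assumption, not the ACT_Script implementation; the Python run_ex function, its extra command argument and the ScriptResult type are hypothetical.

```python
import shlex
import subprocess
import sys
from typing import List, NamedTuple


class ScriptResult(NamedTuple):
    """Analogue of script_result_t: an exit code plus the captured stdout text."""
    result_code: int
    result_text: str


def run_ex(command: List[str], parameters: str = "", stdindata: str = "") -> ScriptResult:
    """Run `command` with space-delimited `parameters`, feeding `stdindata` to stdin."""
    argv = command + shlex.split(parameters)
    completed = subprocess.run(argv, input=stdindata, capture_output=True, text=True)
    return ScriptResult(completed.returncode, completed.stdout)


# Illustrative child process: echo stdin in upper case and exit with code 3.
child = [sys.executable, "-c",
         "import sys; print(sys.stdin.read().upper()); sys.exit(3)"]
result = run_ex(child, stdindata="link down")
print(result.result_code, result.result_text)
```

A subscriber built this way can append result_text to the notification's audit trail and branch on result_code, which is the feedback loop the use case describes.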
Use Case #6 – Combining RCA with Abstract Events
• Sometimes we have abstract events that we want to combine with root causes
  • Perhaps to perform accounting of various data sources
  • Perhaps you simply want to have an explanation tree
• Approach is based around three key points:
  • SAM considers any two notifications that share the same triplet (class, instance, event) to be the *same* event
  • The use of Aggregate Notifications
  • Events can be explained in SAM even if the source is not the same as the explanation
    • Domain A presents problem X causes event Y, but Y is not subscribed
    • Domain B presents event Y
    • SAM will indicate X causes Y even though they are in different domains
• Aggregate notifications
  • Active if one or more related notifications are active
  • Severity is the maximum of the related notifications
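The two rules for aggregate notifications, together with the triplet identity rule, can be expressed compactly. The Python sketch below is illustrative only: the Notification type and function names are hypothetical, it assumes a numeric severity scale on which 1 is the most severe and 5 the least (so the "maximum" severity is the numerically smallest value), and it assumes only active components contribute to the aggregate's severity.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Notification:
    class_name: str
    instance: str
    event: str
    active: bool
    severity: int  # assumed scale: 1 = most severe ... 5 = least severe

    @property
    def triplet(self) -> Tuple[str, str, str]:
        return (self.class_name, self.instance, self.event)


def same_event(a: Notification, b: Notification) -> bool:
    """Two notifications with an identical (class, instance, event) are one event."""
    return a.triplet == b.triplet


def aggregate_state(components: Iterable[Notification]) -> Tuple[bool, Optional[int]]:
    """Aggregate is active if any component is; severity is the most severe active one."""
    active = [n for n in components if n.active]
    if not active:
        return (False, None)
    return (True, min(n.severity for n in active))


parts = [
    Notification("Syslog", "eth0", "LinkFlap", active=True, severity=3),
    Notification("Trap", "eth0", "LinkDown", active=True, severity=2),
    Notification("Trap", "eth0", "CrcErrors", active=False, severity=1),
]
print(aggregate_state(parts))
```

Giving the aggregate the same triplet as the symptom that the root-cause domain already explains is what lets SAM link the analysis to the abstract events without any shared source.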
Use Case #6 – Combining RCA with Abstract Events
[Diagram labels: Smart Adapter Platform (OI); IPAM/PM; Ionix CMDB; SAM & BIM (Root Cause, Business Impacts); Discovery & Monitoring; SNMP, ICMP & Traps; Managed Domain (EIGRP, Firewalls, MPLS, Servers, BGP, Routers, Switches, IS-IS, OSPF).]
Use Case #6 – Combining RCA with Abstract Notifications
• Specific case details:
  • Specific alarms that indicate Network Adapter problems
  • Don’t want or need to know what the “TRUE” RCA happens to be
  • Configured a filter to receive just those messages in a Notification List subscriber
• In SAM:
  • AM produces some RCA
  • RCA explains NetworkAdapter_Fault::<instance>::DownOrFlapping
  • SMART Adapter Platform receives events
    • Aggregates the events to the same class, instance, event as the explained symptom
  • Causal links are formed:
    • RCA->explains->Aggregate->aggregates->abstract notifications
• Solution:
  • Posted to https://community.emc.com/message/458163
Use Case #6 – Combining RCA with Abstract Notifications

    NL {
        type: { "NL_NOTIFY" | "NL_CHANGE" | "NL_CLEAR" | "NL_DELETE" } fs
        classDisplayName: word fs
        instanceDisplayName: word fs
        eventDisplayName: word fs
        localPropObjectName: word .. eol
    } do {
        // locate properties object and extract true C:I:E
        localPropObj = self->object( "ASL_NLData" , localPropObjectName ) ;
        instance = localPropObj->get( "InstanceName" ) ;
        class = localPropObj->get( "ClassName" ) ;
        event = localPropObj->get( "EventName" ) ;
        icsNotificationFactory = object( getInstances( "ICS_NotificationFactory" )[0] ) ? IGNORE ;
Use Case #6 – Combining RCA with Abstract Notifications Cont.

        AggClassName = "Network_Adapter_Fault" ;
        AggInstanceName = instance ;
        AggEventName = "DownOrFlapping" ;
        eventObj = icsNotificationFactory->findNotification( class , instance , event ) ;
        aggObj = icsNotificationFactory->makeAggregate( AggClassName , AggInstanceName ,
                                                        AggEventName , eventObj ) ? IGNORE ;
        admin = "admin" ;
        OIDomainName = "INCHARGE-OI" ;
        aggObj->notify( admin , OIDomainName ) ;
        aggObj->changed() ;
    }
Use Case #6 – Classic ACM Example
[Diagram labels: columns Aggregated Notifications, Raw Symptoms, Aggregation, Causes. Aggregated notifications: UnitaryComputerSystem::ResourceException; SoftwareService::Major/Minor/DegradedSymptoms; ApplicationTaskCheck::Degraded. Sources: IP Server Performance Manager; Host Monitoring Software; ESX Performance Data; ACM Internal Polling; VMWare AppSpeed; Cisco Netflow; Synthetic Transaction Tests.]
Notification Manager – Operational Challenge
• Not all notifications are associated with hard failures
• Need to reduce time and effort associated with events
• Some ‘sympathetic’ alarms become root cause problems
• Recurring notifications can indicate future problems
• The sheer volume of notifications is overwhelming
• Manually customized scripting is complex and inefficient
Customers of EMC Ionix want/need an effective solution to analyze unmanaged alarms
Notification Manager – Key Values
• Converts unmanaged notifications into meaningful information
• Eliminates the need for manual event scripting and rules writing
• Improves event processing significantly
• Allows for easy, modular distribution of new event-handling policies
• Tracks and documents event policy changes automatically
• Adapts to a wide variety of event sources
Notification Manager - Sample Capabilities
• Is-Managed check
• In-Maintenance
• Calculated values for any field operators
• Hook scripts
• Clears-For (uses NCI & ECI)
• Delayed Publication
• Aggregation
• Causes/CausedBy Relationship Support
• Active/Inactive check-box
• Expiration Clearing (lifetime of event)
• Unknown Agent (create or ignore)
• Logging specifications
• Notification field setting
• Enumerated value mapping
• De-duplication
• Time-based threshold
• Dynamic-discard flag
Recap
• You should now have a good understanding of:
  • The automation capabilities in the Ionix IT Operations Intelligence suite
    • Some ideas about how those can be applied to your environment
    • Excitement to apply these ideas
  • How Dynamic Model can be used to extend the functionality of the existing suite
    • The suite is highly data-driven; you can accomplish a lot with a few small changes
  • How to use business impact weighting and maintenance windows to prioritize work
    • Helps focus the operations staff on what is important
    • Keeps known issues away from the staff until unplanned effects are noticed
• Next Steps
  • Use these techniques in your environment
  • Feedback is always appreciated
EMC Developer Network: The Essential Community for the EMC Developer
EDN: EMC Developer Network
• http://developer.emc.com
• Code, content, collaboration
• For and by developers
Accelerate your development
• Register for EDN
• Find the community for you
• Search, view, post, question or collaborate on code and content
• Participate in Open Exchange
• Meet or link with other developers