Session Code: ARC332. Design for Operations: Health Model and Instrumentation. Alexander (Sasha) Nosov sashan@microsoft.com Brian Reistad brianrei@microsoft.com Microsoft Corporation.
DSI Architecture (ARC230): Design for Operations [Diagram labels: Dynamic Data Center; Hardware (Servers, Storage, Networking); Windows; Your Application; Managed System; Managed Node; Remote Node Mgmt; Local Node Mgmt; System Level Management; Management Tools; Dev Tools; Dynamic System Services; Your System Definition; SDM Service; SDM Store. Related sessions: Settings (ARC333), Health (ARC332), Tasks (ARC334).]
Agenda • Problem Domain • Health Model • Instrumentation Technologies • Automating the Health Model
Application Availability Problem • Service application stops for an unclear reason • User receives no warning or information on how to correct it • Telephone rings in the Help Center
What Your Customers Expect • Business today depends upon the computing and network infrastructure • Customers expect that their services or applications are secure, available and never lose data • An actionable warning is received before failure • The root cause of problems can be quickly determined • Failure conditions don’t impact their users • Their environment can be managed with fewer people
Why We Are Not There Today • Applications are not designed with operations in mind • Poor quality of instrumentation • Limited structure and discovery • No clear correlation between instrumentation, root cause and corrective actions • Low signal/noise ratio • Limited infrastructure • Barrier to entry is high for developers • Limited OS support to automate problem detection and resolution • Limited feedback loop from support services
What is the Health Model? • Holistic view of the application’s potential problems • How your service may fail from the end user’s perspective • State diagram that captures transitions to different levels of degradation • Stopped • Healthy • Service Totally Unavailable • Service Partially Unavailable [multiple of these] • Instrumentation is driven by states and transitions • User guidance on what to do in failure cases • The model benefits: • Help desk personnel • Admins and IT pros • Product developers
What is a Health State? • Definition • Description of the state (what’s working, what’s not) • Severity from the app perspective • Detection • What are the different entry points into the state (e.g. events, thresholds, state changes, external checks) • What are the dependencies that are relevant for this state transition • Diagnosis • How to determine the root cause of why we’re in this state • Recovery • What actions should be taken to return to the operational state • Verification • How to verify that the application is still in the bad state • How to verify that the application has recovered from the unhealthy state (after correction)
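The five headings above can be captured as a simple record, so that a health model can be checked for gaps before it is handed to operations. The following is a minimal sketch, not part of the original deck: the class names, the field names and the `CannotReachStore` example state are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    INFO = "Info"
    WARNING = "Warning"
    ERROR = "Error"


@dataclass
class HealthState:
    """One state of a health model (hypothetical layout following the slide)."""
    name: str
    description: str      # Definition: what is working, what is not
    severity: Severity    # Definition: severity from the app perspective
    entry_events: list[str] = field(default_factory=list)        # Detection
    dependencies: list[str] = field(default_factory=list)        # Detection
    diagnosis_steps: list[str] = field(default_factory=list)     # Diagnosis
    recovery_actions: list[str] = field(default_factory=list)    # Recovery
    verification_steps: list[str] = field(default_factory=list)  # Verification

    def missing_sections(self) -> list[str]:
        """Return the slide headings that have no content yet."""
        sections = {
            "Detection": self.entry_events,
            "Diagnosis": self.diagnosis_steps,
            "Recovery": self.recovery_actions,
            "Verification": self.verification_steps,
        }
        return [heading for heading, items in sections.items() if not items]


# Hypothetical, partially documented state.
state = HealthState(
    name="CannotReachStore",
    description="Reads succeed from cache; writes fail.",
    severity=Severity.ERROR,
    entry_events=["STORE_CONNECT_FAILED"],
    recovery_actions=["Restart the connection pool"],
)
gaps = state.missing_sections()
print(gaps)
```

A review step could reject any state for which `missing_sections()` is not empty, which is one way to make sure every state carries diagnosis and verification guidance.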
Terminal Server Example [Diagram: three clients, terminal servers TS1, TS2 and TS3, and a Session Directory Server; a failure is marked with an X.] Problem: the clients cannot connect to a pre-existing session
Terminal Server Example (cont.) • Definition • The Terminal Server X failed to join the Session Directory. The clients cannot connect to pre-existing sessions in the Session Directory. Instead they are connected to new sessions. • Severity = Error • Detection • 12 different Error Events • EVENT_CALL_TSSDRPCSEVEROFFLINE_FAIL • EVENT_SESSIONDIRECTORY_NAME_INVALID • EVENT_SESSIONDIRECTORY_UNAVAILABLE • EVENT_FAIL_RPCBINDINGSETAUTHINFOEX • . . . • Verification (still in the bad state) • Inspect Session Directory Server configuration (list of Terminal Servers) • Diagnosis • Different dependencies identified in different entry points (i.e. events) • Check: RPC, SD server running, correct configuration for SD Server, network connectivity, DNS resolution • Recovery • Refresh SD Settings on Terminal Server to force rejoin • Verification (after recovery) • Information event reported on operation success • EVENT_JOIN_SESSIONDIRECTORY_SUCCESS [State diagram: State = “Healthy” and State = “Can’t Talk to SD Server”]
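The two states and their entry events can be sketched as a small transition function. This is an illustration only: the event names are the ones listed on the slide (just four of the 12 error events appear there), while `next_state` and `replay` are hypothetical helpers, and `SOME_UNRELATED_EVENT` is a made-up event used to show that unknown events leave the state unchanged.

```python
HEALTHY = "Healthy"
CANT_TALK_TO_SD = "Can't Talk to SD Server"

# Error events named on the slide; the remaining ones are elided there.
ERROR_EVENTS = {
    "EVENT_CALL_TSSDRPCSEVEROFFLINE_FAIL",
    "EVENT_SESSIONDIRECTORY_NAME_INVALID",
    "EVENT_SESSIONDIRECTORY_UNAVAILABLE",
    "EVENT_FAIL_RPCBINDINGSETAUTHINFOEX",
}
SUCCESS_EVENT = "EVENT_JOIN_SESSIONDIRECTORY_SUCCESS"


def next_state(current: str, event: str) -> str:
    """Return the health state after observing one event."""
    if event in ERROR_EVENTS:
        return CANT_TALK_TO_SD
    if event == SUCCESS_EVENT:
        return HEALTHY
    return current  # events outside the model do not change the state


def replay(events, start=HEALTHY):
    """Apply events in order and return the full state history."""
    state = start
    history = [state]
    for event in events:
        state = next_state(state, event)
        history.append(state)
    return history


history = replay([
    "EVENT_SESSIONDIRECTORY_UNAVAILABLE",
    "SOME_UNRELATED_EVENT",  # hypothetical event, ignored by the model
    "EVENT_JOIN_SESSIONDIRECTORY_SUCCESS",
])
print(history)
```

Keeping the transition logic as a pure function makes it easy to replay a recorded event log against the model and compare the result with what operators actually observed.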
Instrumentation Technologies • Events (Event Log) • Report occurrences of exceptional conditions, record changes • Traces (ETW) • Trace execution of key operations • Probes (WMI) • Expose complex internal state of applications • Expose methods to correct unhealthy states • Perf Counters (Perflib) • Expose simple numeric values for performance monitoring and thresholding • Watson messages (Corporate Error Reporting) • Centrally collect records of failures to provide feedback to product teams
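Thresholding on a performance counter usually means more than a single comparison, because one spike should not raise an alarm. The sketch below assumes a rule of the form "N consecutive samples above a limit"; the function name, the sample values and the rule itself are illustrative assumptions and do not come from Perflib or any specific monitoring product.

```python
def first_breach(samples, threshold, consecutive=3):
    """Return the index of the sample that completes a run of `consecutive`
    values above `threshold`, or None if no such run exists."""
    run = 0
    for index, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return index
    return None


# Hypothetical CPU-usage samples (percent), one per polling interval.
cpu = [10, 95, 96, 40, 91, 92, 93]
breach_at = first_breach(cpu, threshold=90, consecutive=3)
print(breach_at)
```

Requiring several consecutive samples is one simple way to improve the signal-to-noise ratio that the deck lists as a problem with current instrumentation.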
Consider Privacy • Any instrumentation can pose a security or privacy risk • Exposure of at-risk items must comply with your corporate privacy guidelines At-risk items: • Passwords, before or after encryption or hashing • User or account names, or SIDs • Security keys or access tokens • User data (network, file system, etc.) • Configuration information not immediately relevant to code execution (enterprise policies applied, other software patch level, etc.)
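One way to apply this guidance is to scrub at-risk fields before an event payload leaves the application. The sketch below is an assumption-laden illustration: the key list, the `[REDACTED]` placeholder and the `redact` helper are hypothetical, and a real deployment would take the list from the corporate privacy guidelines mentioned above.

```python
# Hypothetical set of field names treated as at risk (compared case-insensitively).
AT_RISK_KEYS = {"password", "user", "account", "sid", "token", "key"}


def redact(fields: dict) -> dict:
    """Return a copy of the event payload with at-risk values masked."""
    return {
        name: "[REDACTED]" if name.lower() in AT_RISK_KEYS else value
        for name, value in fields.items()
    }


payload = {"User": "alice", "operation": "join", "Password": "hunter2"}
safe = redact(payload)
print(safe)
```

The original payload is left untouched, so the application can still use it internally while only the masked copy is written to a log.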
Event Log Enhancements • Structured and Schematized events • Common Viewing, Configuration and Querying of Event logs and Trace logs • Scales to support application logs • No need for proprietary logs • Filtering and real time notifications • Forwarding and collection of events across multiple machines • Firewall friendly, using SOAP protocol • The event viewer leverages the new features
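To illustrate why structured, schematized events make filtering straightforward, the sketch below models an event as a typed record and filters on its fields instead of parsing message text. It is not the Windows Event Log API: the `Event` record, the `query` helper, the provider names and the event IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    provider: str   # component that raised the event
    event_id: int
    level: str      # e.g. "Error", "Information"
    data: dict      # named payload fields defined by the event's schema


def query(events, *, provider=None, level=None):
    """Return events matching every criterion that was supplied."""
    return [
        e for e in events
        if (provider is None or e.provider == provider)
        and (level is None or e.level == level)
    ]


# Hypothetical log contents.
log = [
    Event("MyApp", 1001, "Error", {"server": "TS1"}),
    Event("MyApp", 1002, "Information", {"server": "TS1"}),
    Event("OtherApp", 7, "Error", {"queue": "print"}),
]
my_errors = query(log, provider="MyApp", level="Error")
print([e.event_id for e in my_errors])
```

Because the payload is a set of named fields, the same query can run unchanged over events forwarded from several machines.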
WMI Enhancements • Definition: Probes = access to internal state • Exposes existing properties and methods • Needed for monitoring rules • Manual access from command shell available • Easily exposed using attribution scheme • Leverages .NET reflection • Schematized instrumentation catalog • Identified by URI • Existing WMI providers automatically published as probes • Remote SOAP access to probes
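The idea of marking members with an attribute and discovering them by reflection can be sketched in Python with a decorator and the `inspect` module. This is an analogy only, not the WMI or .NET mechanism: the `probe` decorator, the `OrderService` class and the `probe://` URI format are hypothetical.

```python
import inspect


def probe(uri):
    """Mark a method as a probe identified by `uri` (hypothetical scheme)."""
    def mark(func):
        func._probe_uri = uri
        return func
    return mark


class OrderService:
    """Hypothetical application component with internal state."""

    def __init__(self):
        self._queue = ["o1", "o2", "o3"]

    @probe("probe://orderservice/queue-length")
    def queue_length(self):
        return len(self._queue)

    def internal_helper(self):
        return "not exposed"


def discover_probes(obj):
    """Build a catalog {uri: bound method} by reflecting over `obj`."""
    catalog = {}
    for _, member in inspect.getmembers(obj, predicate=inspect.ismethod):
        uri = getattr(member, "_probe_uri", None)
        if uri is not None:
            catalog[uri] = member
    return catalog


catalog = discover_probes(OrderService())
for uri, read in catalog.items():
    print(uri, read())
```

Only the marked method appears in the catalog, so the developer decides what is exposed while the monitoring side needs no knowledge of the class beyond the URI.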
Monitoring and Autorecovery • Workflow: • Detect problems before users call • Speed diagnosis of root cause • Automatic corrective actions where possible • Components: • Knowledge captured in Health Model • Problem detection, diagnostics and resolution data • Instrumented application • Validated by the Health Model • Monitoring infrastructure: • MOM agent • Windows Monitoring Service • Result – enterprise ready application • Higher service availability • Higher admin efficiency/low cost • Higher users’ trust in your product
Monitoring with Microsoft Operations Manager (MOM) • MOM is Microsoft’s enterprise management solution today • Framework for implementing health model • Enables health monitoring of distributed applications from one console • Key features: • Scalable architecture / network efficient • Automatic discovery / deployment to servers • Natively consumes many data types: events, performance data, custom application logs • Centralized view of a distributed system • Reporting • Enables higher IT service quality at a reduced operational cost
Delivering Knowledge with Management Packs • Implementation of health model • Built by product owners and experts • Creates MOM “Alerts” • Indication of a detected condition that requires administrator investigation / action • Contain embedded knowledge: aid diagnosis • Appear in MOM console, email or pager notifications • Basic Alerts from state transitions • Advanced Alerts from scripts, e.g. • Synthetic transactions • Security and configuration verification
Monitoring with “Longhorn” • Monitoring capabilities built into the OS • Event filtering and correlation • Forwarding events and alerts • Correlation of events and data • Automated actions and notification • Rich set of rule types and libraries • Common service enables monitoring of • Health, security, performance and configuration • No extra deployment • Monitoring is part of the application’s setup • Application manifest includes monitoring rules • Admin can customize default rules, including actions • Your investments in MOM management packs will carry forward
Monitor Application Health • Build monitoring rules to correct the problems automatically
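A monitoring rule that corrects a problem automatically can be thought of as three callbacks: detect, recover, verify. The sketch below shows that shape under stated assumptions. It is not MOM or "Longhorn" rule syntax, and the `Rule` record, the `run_rule` helper and the in-memory `service` stand-in are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    detect: Callable[[], bool]   # True when the unhealthy state is observed
    recover: Callable[[], None]  # corrective action
    verify: Callable[[], bool]   # True when the application is healthy again


def run_rule(rule: Rule) -> str:
    """Evaluate one rule; raise an alert only if automatic recovery fails."""
    if not rule.detect():
        return "healthy"
    rule.recover()
    return "recovered" if rule.verify() else "alert"


# Hypothetical stand-in for a server that must join a directory.
service = {"joined": False}

rejoin = Rule(
    name="rejoin-directory",
    detect=lambda: not service["joined"],
    recover=lambda: service.update(joined=True),
    verify=lambda: service["joined"],
)

first = run_rule(rejoin)   # problem detected and corrected
second = run_rule(rejoin)  # nothing to do on the next pass
print(first, second)
```

Separating verification from recovery matters: the administrator is paged only when the corrective action did not bring the application back to a healthy state.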
Summary: What gets better • Lower manual cost of problem detection, root cause analysis and resolution • Higher service availability using health monitoring and automatic recovery How? • Health Model drives the quality and quantity of information • Instrumentation consistent across components • The instrumentation is discoverable before runtime • Admin controls the levels of diagnostics dynamically • Feedback to improve your product’s next release. • Enhanced management infrastructure in the OS
Call to Action • Visit booth 19 in the Microsoft Pavilion • Great opportunity to drill into technical details with the developers and program managers • Try Hands-on Labs 401, 406, 407, 408 • See next slide for more info • Ask The Experts: • Tuesday 7 pm – 9 pm in Hall G, H • Design for operations: • Build the model for your application • Have your technical support use and test it • Write and deploy Management Packs • Get ready for Longhorn: install the PDC build and create your own manageable application
Resources • Longhorn documentation and whitepapers • www.microsoft.com/windowsserver2003/technologies/management/dsi/designops.mspx • Windows Management Instrumentation Preview • Windows Event Log Preview • Task Scheduler Service Preview • Event Forwarding Service Preview • Monitoring Service Preview • HOL-401 Health Modeling and Instrumentation • MOM training • HOL-406 Building MOM Management Packs to Manage .NET Applications • HOL-408 Monitoring SQL Server with the SQL Server management pack • HOL-407 Extending MOM using the Microsoft Connector Framework and SDK • Web Sites: • http://pdcbloggers.net • http://msdn.microsoft.com/pdc/ • Management Community Forum http://www.microsoft.com/windowsserver2003/community/centers/management/default.mspx
Questions? Don’t forget to submit your feedback