Caché Performance Troubleshooting
Part II: The System
Vik Nagjee
Product Manager, Kernel Technologies
System Performance: Limiting Factors
• CPU
• Memory
• I/O
  • Disk
  • Network
System: system-wide metrics
• CPU: CPU Utilization
  • Unix: %USER + %SYS
  • Windows: %PROCESSOR
  • OpenVMS: [MON.SYS] CPU Busy
• Memory: Available Memory
  • Unix: Freemem, pi/po
  • Windows: Available Memory, Page Reads
  • OpenVMS: [MON.PAGE]
• I/O (Disk, Network): Latency and/or Queuing
  • Unix: avwait + avserv / queue
  • Windows: seconds/Read, Current Disk Queue
  • OpenVMS: [MON.DISK] RespT, QLen
Caché system-wide metrics
• Resource categories: CPU; Memory; DB I/O (CACHE.DAT, WIJ, Journals); Non-DB I/O (Files, Network)
• Metrics: Routine Commands; Global References; Physical Block Reads; Physical Block Writes; Journal Writes; Time-to-run or other Application Specific Metrics
Significance: Caché system-wide metrics
• CPU: How busy is your database?
  • Routine Commands
  • Global References
• Memory: How well is your application using database cache?
  • Cache Efficiency
  • Physical Block Reads
  • Global References
• DB I/O (CACHE.DAT, WIJ, Journals): How well is your disk system responding?
  • Physical Block Reads
• What are your users experiencing?
  • Time-to-run and other Application Specific Metrics
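The slide pairs Cache Efficiency with Physical Block Reads and Global References. A minimal sketch of that relationship follows, assuming cache efficiency is read as global references per physical block read (a ratio, not a percentage). The function name and sample numbers are hypothetical, and the figure your tooling reports may be computed differently (for example, by also counting block writes).

```python
import math


def cache_efficiency(global_refs: float, phys_block_reads: float) -> float:
    """Global references served per physical block read.

    Assumption: efficiency = GloRefs / PhysBlkRds over the same interval.
    A higher ratio means more references were satisfied from database cache.
    """
    if phys_block_reads <= 0:
        # No physical reads in the interval: every reference hit the cache.
        return math.inf
    return global_refs / phys_block_reads


# Hypothetical per-second samples: 250,000 global references, 500 block reads.
print(cache_efficiency(250_000, 500))
```

Tracking this ratio over time matters more than any single value: a sustained drop against your own baseline suggests the cache no longer fits the load.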
Collecting: system-level metrics, system-wide
• Unix: sar | glance | nmon, iostat | vmstat, top | topas
• Windows: Resource and Performance Monitor, logman, Process Explorer
• OpenVMS: PERFDAT, T4, MONITOR
• Network management software: OpenView, Tivoli, BMC, OpenNMS, Nagios, PRTG Traffic Monitor, etc.
Collecting: system-wide Caché metrics
• CPU: Routine Commands, Global References
• Memory: Cache Efficiency
• DB I/O (CACHE.DAT, WIJ, Journals): Physical Block Reads
• Time-to-run and other Application Specific Metrics
Collecting Caché metrics: ^GLOSTAT
• %SYS>DO ^GLOSTAT
Collecting Caché metrics: ^pButtons
• %SYS>DO ^pButtons
• Installed in %SYS since 2008.2, but the latest version (currently 1.15c) is available at ftp://ftp.intersystems.com/pub/performance/
• Can be automated via TASKMGR
• Low overhead: logs data that’s already available
• Documented in the Caché Monitoring Guide
Notes on using ^pButtons
• Profiles are configurable:
  • Create custom duration and interval combinations
  • Add or delete from the OS-level metric collection
• Collect the logs into one easy-to-use .html file:
  %SYS>DO Collect^pButtons
• Preview a currently running profile’s data:
  %SYS>DO Preview^pButtons(runid)
  • Available at any point while the profile is running
  • May result in some truncated data
Collecting Caché metrics: Monitors
• Caché History Monitor – SYS.History
  • Collect Caché metrics and user-defined metrics over time
  • Stored in your Caché database
  • Query or export the data using a variety of methods
• Caché System Monitor – %Monitor.Health
  • Monitor the system health of your database
  • Alerts on abnormal metrics based on configurable criteria
  • Alerts from the System Monitor in cconsole.log:
    04/01/13-13:55:55:847 (13897) 1 [SYSTEM MONITOR] CPUusage Warning: CPUusage = 82 ( Warnvalue is 75)....(repeated 1 times)
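If you want to feed System Monitor alerts into your own tooling, the cconsole.log line shown above can be picked apart with a regular expression. This is only a sketch: the pattern is inferred from that single sample line, so the field names (`sensor`, `level`, `threshold_name`, and so on) are hypothetical labels, and other alert types may be laid out differently.

```python
import re
from typing import Optional

# Pattern inferred from the one sample alert line on the slide (an assumption,
# not a documented log format).
ALERT_RE = re.compile(
    r"^(?P<timestamp>\d{2}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}:\d{3}) "
    r"\((?P<pid>\d+)\) (?P<severity>\d) \[SYSTEM MONITOR\] "
    r"(?P<sensor>\w+) (?P<level>\w+): (?P=sensor) = (?P<value>[\d.]+) "
    r"\(\s*(?P<threshold_name>\w+) is (?P<threshold>[\d.]+)\)"
)


def parse_alert(line: str) -> Optional[dict]:
    """Return the alert fields as a dict, or None if the line does not match."""
    match = ALERT_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["value"] = float(fields["value"])
    fields["threshold"] = float(fields["threshold"])
    return fields


SAMPLE = (
    "04/01/13-13:55:55:847 (13897) 1 [SYSTEM MONITOR] CPUusage Warning: "
    "CPUusage = 82 ( Warnvalue is 75)....(repeated 1 times)"
)

alert = parse_alert(SAMPLE)
print(alert["sensor"], alert["level"], alert["value"], alert["threshold"])
```

Lines that do not match return `None`, so the parser can be run over the whole log and non-alert entries simply skipped.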
Collecting Caché metrics: SNMP/WMI
• SNMP, WMI, WSMON
• Documented in the Caché Monitoring Guide
• Caché metrics are exposed via SNMP, WMI, or Web services
  • NOTE: Future focus is on SNMP
• Add CUSTOM application-specific metrics to be exposed
• Use your EXISTING network management infrastructure to collect and alert on Caché metrics, your application metrics, and operating system metrics
System-level clues to performance issues
• CPU
  • Lack of processing cycles (0% CPU idle)
  • Blocked processes (run queue or device queuing)
• Disk
  • Abnormal disk I/O rate
  • Queuing on devices
  • Higher than normal latency on busy disk
• Memory
  • Lack of free memory
  • Hard page faults
Caché-level clues to performance issues
• GloRefs and/or RouCmds
  • Higher than normal?
    • Your app will be using more CPU…
    • Are there extraneous processes or more users?
  • Lower than normal?
    • Your app may be struggling with another problem (slow disk)
    • Concurrency issues
    • Blocked users upstream on the network
Caché-level clues to performance issues
• PhysBlkRds
  • Higher than normal?
    • Cache size doesn’t match current load
    • Use of CACHETEMP is forcing more disk reads for other data
  • Lower than normal?
    • Maybe that’s OK
    • App is struggling elsewhere, such as lack of CPU cycles
    • If coupled with abnormally low GloRefs, maybe a disk latency issue
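"Higher than normal" and "lower than normal" only mean something against a baseline you have already captured. The sketch below classifies current GloRefs, RouCmds, and PhysBlkRds rates against such a baseline and attaches the clues from the slides. The 25% tolerance, the function names, and the sample numbers are all hypothetical; pick thresholds that suit your own system.

```python
# Clues taken from the slides, keyed by (metric, direction).
CPU_HIGH = "app is using more CPU; check for extraneous processes or more users"
CPU_LOW = "app may be struggling elsewhere: slow disk, concurrency, blocked users upstream"
HINTS = {
    ("GloRefs", "higher"): CPU_HIGH,
    ("GloRefs", "lower"): CPU_LOW,
    ("RouCmds", "higher"): CPU_HIGH,
    ("RouCmds", "lower"): CPU_LOW,
    ("PhysBlkRds", "higher"): "cache size may not match load; CACHETEMP use may force extra reads",
    ("PhysBlkRds", "lower"): "may be OK, or app is struggling elsewhere (e.g. lack of CPU cycles)",
}


def classify(current: float, baseline: float, tolerance: float = 0.25) -> str:
    """Return 'higher', 'lower', or 'normal' relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    ratio = current / baseline
    if ratio > 1 + tolerance:
        return "higher"
    if ratio < 1 - tolerance:
        return "lower"
    return "normal"


def find_clues(current: dict, baseline: dict, tolerance: float = 0.25) -> list:
    """List (metric, direction, hint) for every metric outside the tolerance."""
    status = {name: classify(current[name], base, tolerance)
              for name, base in baseline.items()}
    findings = [(name, s, HINTS.get((name, s), ""))
                for name, s in status.items() if s != "normal"]
    # Combined clue from the slide: low reads together with low GloRefs.
    if status.get("GloRefs") == "lower" and status.get("PhysBlkRds") == "lower":
        findings.append(("GloRefs+PhysBlkRds", "lower", "possible disk latency issue"))
    return findings


# Hypothetical per-second rates.
baseline = {"GloRefs": 100_000, "RouCmds": 500_000, "PhysBlkRds": 1_000}
current = {"GloRefs": 50_000, "RouCmds": 480_000, "PhysBlkRds": 400}
for metric, direction, hint in find_clues(current, baseline):
    print(f"{metric}: {direction} than normal -> {hint}")
```

The hints are starting points for investigation, not diagnoses; the combined check shows why the metrics need to be read together rather than one at a time.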
Application clues!
• All the above coupled with application-level clues lead to solutions:
  • Are users complaining?
  • Is the rate of application activity the same?
  • Are batch jobs/print jobs/screen refreshes completing in a timely manner?
  • Are your interfaces queuing?
Comparing metrics – add App Metric
[Chart: application metric overlaid on the system metrics, labeled per interval: 0.7, 0.8, 0.8, 0.9, 0.8 /min/user]
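The chart labels are expressed per minute per user, which makes intervals with different user counts comparable. A minimal sketch of that normalization follows, assuming the application metric is a count of completed units of work; the function name and the sample figures are hypothetical.

```python
def per_user_rate(events: int, minutes: float, users: int) -> float:
    """Application events per minute per active user."""
    if minutes <= 0 or users <= 0:
        raise ValueError("minutes and users must be positive")
    return events / (minutes * users)


# Hypothetical interval: 2,400 completed transactions in 60 minutes, 50 users.
rate = per_user_rate(2_400, 60, 50)
print(f"{rate:.1f}/min/user")
```

If this rate holds steady while GloRefs or CPU climb, the extra load is coming from more users or extra processes rather than from each user doing more work.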
Key points
• Many important metrics are available for capture
• Capture the metrics at all times
• Many tools/methods for capturing metrics
• Include application-level metrics in your capture
• Analysis for capacity or troubleshooting begins with understanding your application’s effects on the system
Thanks for attending!
You can reach me at: vik@intersystems.com
Q&A