Best Practices for Maintaining Your Oracle RAC Cluster. Scott Jesse - Customer Support Director, RAC, Storage & RAC Assurance, Oracle; Bryan Vongray - Senior Principal Technical Support Engineer, Oracle.
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Program Agenda • Introduction • Best Practices and Recommendations • Additional Tips and Resources • Q&A
Oracle Services Enabling the success of your Oracle hardware and software investments • Oracle Experts Helping You Succeed with Your Oracle Investments • Complete Support for Oracle Hardware, Software, and Engineered Systems • Mission Critical Support Services for All Oracle Applications and Technologies • Your Complete Training Source for Oracle Hardware and Software • Extend Your Oracle Investments to the Cloud with Value, Choice, and Confidence
Oracle Premier Support Comprehensive Coverage Tools and Resources Service and Support Product Innovation Quickly diagnose and resolve issues • Expert technical support • Rapid-response field service • Lifetime Support Get the most out of your Oracle products with proactive services • Oracle knowledgebase • Product health checks • My Oracle Support Community Keep pace with change and capitalize on new opportunities • Updates • New releases • Tools to assist with patching and upgrades
Get Proactive Portfolio: an integral component of your Premier Support Contract. Helping you get the most value from Oracle Premier Support
Oracle Support: Best Practices and Recommendations for Grid Infrastructure and RAC
Top Reasons for Logging SRs and Extended SR Resolution times What we see very frequently: • Rediscovery of Known Issues • Complexities in Diagnostics Collection • Escalations due to insufficient diagnostic data
Objectives Introduction to common “tools” used in RAC Environments to: • Decrease the number of Reactive Issues with RAC • Promote Best Practices • Maintain a “healthy” RAC environment • Increase satisfaction with RAC • Reduce the overall Service Request resolution time when dealing with Reactive Types of Issues
Topics of Discussion • Preventing Known Issues • RACcheck • Dealing with Reactive Issues • OSWatcher Black Box • Cluster Health Monitor • Procwatcher • Trace File Analyzer Collector • Q&A
RACcheck – RAC Configuration Audit Tool • Proactive self-service method for customers to perform Health Checks on their RAC systems • Validation and System-specific feedback on: • Configuration issues that can impact the system • Best Practices/Success Factors that are not being adhered to • Documentation on checks for ease of knowledge transfer • Upgrade Readiness functionality for 11.2.0.3+ upgrades • Installed with 11.2.0.4 Database • Developed with the specific goal of automating the promotion of Best Practices, Success Factors and Successful Upgrades
RACcheck Easy to Install, Easy to Execute • Download latest RACcheck version – MOS Note: 1268927.1 • Transfer raccheck.zip to a single node • Extract raccheck.zip • Execute raccheck • Follow the prompts • Execution times are generally less than 10 min for a 2 node cluster!
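The extract-and-execute steps on this slide can be sketched in a few lines. This is only an illustration, not part of RACcheck: the helper name `extract_raccheck` and the directory paths are hypothetical, and the sketch assumes the archive contains an executable named `raccheck`, as the slide implies.

```python
import zipfile
from pathlib import Path


def extract_raccheck(archive: Path, target_dir: Path) -> Path:
    """Extract the downloaded archive and return the path of the raccheck script."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
    # Assumption: the executable is named "raccheck" somewhere inside the archive.
    matches = sorted(p for p in target_dir.rglob("raccheck") if p.is_file())
    if not matches:
        raise FileNotFoundError("no raccheck executable found in archive")
    script = matches[0]
    # zipfile does not restore the execute bit, so set it for the owner.
    script.chmod(script.stat().st_mode | 0o100)
    return script
```

Once extracted, the returned script would be started interactively on the node (for example with `subprocess.run([str(script)])`) so that the prompts mentioned on the slide can be answered.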
RACcheck Output – What to get out of it? • Easy-to-Read HTML Report • Indication of System Health Score • All findings backed up with MOS Notes and Documentation References • Proactive Patch recommendations • Upgrade Readiness Report similar in format • Report Comparison Functionality
RACcheck Supported Environments Note: Similar utilities exist for Exadata and Oracle Database Appliance (ODA). These utilities are respectively called exacheck and ODAcheck
OSWatcher Black Box • Shell scripts to collect and archive OS metrics • Executes UNIX utilities including vmstat, iostat, netstat, etc., at regular intervals to track and trend OS metrics • Ability to graph collected metrics • Simple to install, extremely lightweight • Output is requested for reactive node reboot and performance issues • Runs on ALL platforms except Windows • MOS Note: 301137.1 - OS Watcher Black Box User Guide
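The idea of running an OS utility at regular intervals and archiving its output can be illustrated with a minimal sketch. This is not OSWatcher's implementation (OSWatcher is a set of shell scripts); the function name `sample`, the `###` timestamp marker, and the archive file layout are all hypothetical.

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path


def sample(command: list[str], archive: Path, samples: int, interval_s: float) -> None:
    """Run `command` `samples` times, appending timestamped output to `archive`."""
    with archive.open("a") as out:
        for i in range(samples):
            result = subprocess.run(command, capture_output=True, text=True)
            # Hypothetical marker so each snapshot can be located by time later.
            out.write(f"### {datetime.now().isoformat(timespec='seconds')}\n")
            out.write(result.stdout)
            if i < samples - 1:
                time.sleep(interval_s)
```

A call such as `sample(["vmstat"], Path("vmstat.dat"), samples=3, interval_s=30)` would append three timestamped `vmstat` snapshots, which is the kind of history that makes trending and after-the-fact diagnosis possible.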
Cluster Health Monitor (CHM) • Collects OS metrics in real time at 5 second intervals (11.2.0.3) • Installed with Grid Infrastructure 11.2.0.3 and above • OS Metrics include Memory, Swap, I/O, CPU as well as stats on “interesting” processes • Data retention depends on cluster size; retention for a 4 node cluster is about 1 day • Requested for reactive Node Reboot and Performance related issues • Both OSWBB and CHM are recommended where available • Text interface is the default; a GUI is available on OTN • Additional Info available in MOS Note: 1328466.1 – CHM FAQ
Procwatcher • Procwatcher is a proactive support tool (written by RAC BDE) to continuously examine and monitor Oracle database and RAC processes at a specified interval. • Generates session wait, lock, and latch reports and collects stacks from any "problem" processes • Ability to collect stack traces of specific processes using Oracle Tools and OS Debuggers
Procwatcher Usage Scenarios • Session level hangs or severe contention in the database/instance • Severe performance issues • Instance evictions and/or RAC timeouts • Clusterware or DB processes stuck or consuming high CPU • ORA-4031 and SGA memory management issues • ORA-4030 and DB process memory issues • RMAN slowness/contention during a backup
Procwatcher vs. Traditional Methods Traditional Methods • An expensive systemstate dump is required for hangs and performance slowdowns • If the issue was not caught in time by the customer, support will request the necessary information again • The customer needs to get the correct data at exactly the right time • If the data was not gathered during the first outage, we must “wait” for a recurrence of the issue Long Outages Multiple Interactions Manual Intervention Multiple Outages
Procwatcher vs. Traditional Methods Procwatcher Approach • Always running in the background (at a configurable interval) • Gathers diagnostic data on important and problematic processes • Simple script with 4 basic commands: start, stop, status, and pack (for ease of upload to SRs) • Non-invasive and quick; history of diagnostics is preserved (1 week by default) Automated Surgical Simple Fast
Procwatcher Collateral • RAC and Cluster “aware” • Proven in the Field • Runs on ALL major UNIX Platforms • Availability and Additional Information: • Install Guide, Known issues and Download MOS Note: 459694.1 • Troubleshooting 4031 errors MOS Note: 1355030.1 • Troubleshooting DB contention MOS Note: 1352623.1 • Typically reduces SR resolution time for performance related issues
Trace File Analyzer Collector (TFA) • Diagnostic collection utility to simplify diagnostic data collection on Oracle Clusterware/Grid Infrastructure and RAC systems • Collects diagnostic data for all CRS/GI and RAC components on all cluster nodes with a single command executed from a single node • Trims diagnostic files around incident time to reduce data upload size • Packaged with 11.2.0.4 – Also available standalone • Additional Information can be found in MOS Note: 1513912.1
TFA vs. Traditional Diagnostics Collection Traditional Diagnostics Collection • Files must be manually copied/archived from multiple locations on a given system, and the process must be repeated for all nodes • Diagnostics for a given node often exceed 350 MB; for a 4 node cluster this is 1.4 GB of data to upload to MOS • The customer needs to get the correct data covering the correct time • If the data was not properly gathered during the first outage, we must “wait” for a recurrence of the issue • Node Reboot: output of diagcollection.sh, OSWatcher, CHMOS output • Instance Eviction: Alert Logs, DB Trace Files, output of diagcollection.sh, OSWatcher, CHMOS output
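The upload-size figure on this slide is simple arithmetic, shown here as a small worked example. The function name is illustrative only, and the conversion assumes decimal units (1 GB = 1000 MB), which is what the slide's 1.4 GB figure implies.

```python
def total_upload_mb(per_node_mb: float, nodes: int) -> float:
    """Total diagnostic data to upload when every node contributes per_node_mb."""
    return per_node_mb * nodes


total = total_upload_mb(350, 4)
print(f"{total:.0f} MB = {total / 1000:.1f} GB")  # prints "1400 MB = 1.4 GB"
```

Because the total grows linearly with the node count, untrimmed collections become proportionally harder to upload as clusters get larger.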
TFA vs. Traditional Diagnostics Collection TFA Approach • Proper diagnostics for a particular incident are collected, packaged, and trimmed around the incident time with a single command (or optionally automatically) • One .zip file is generated per cluster node, and all are consolidated on a single node for ease of upload • Greatly reduced file size, containing only data for a particular incident, for ease of upload • TFA is capable of automatically performing data collection when an incident is detected with Real Time Automatic Diagnostic Collection • Node Reboot (manual collection) • # tfactl diagcollect -crs -os -node all -for <incident time> • Instance Eviction (manual collection) • # tfactl diagcollect -all -node all -for <incident time> • Alternatively enable Real Time Automatic Diagnostic Collection
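The two manual collections on this slide differ only in which components they request. A small sketch of assembling the `tfactl diagcollect` argument list makes that explicit. The helper `diagcollect_args` is hypothetical; it uses only the flags shown on the slide (`-crs`, `-os`, `-all`, `-node`, `-for`) and passes the incident time through unchanged, since the slide does not specify its format.

```python
def diagcollect_args(components: list[str], incident_time: str, nodes: str = "all") -> list[str]:
    """Build the argument list for a manual tfactl diagcollect run."""
    args = ["tfactl", "diagcollect"]
    args += [f"-{c}" for c in components]  # e.g. "crs" -> "-crs"
    args += ["-node", nodes, "-for", incident_time]
    return args


# Placeholder time string, exactly as it appears on the slide.
node_reboot = diagcollect_args(["crs", "os"], "<incident time>")
instance_eviction = diagcollect_args(["all"], "<incident time>")
print(" ".join(node_reboot))
# prints "tfactl diagcollect -crs -os -node all -for <incident time>"
```

Keeping the arguments as a list rather than a single string means they could be handed to `subprocess.run` without shell quoting issues once a real incident time is substituted.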
TFA Requirements Note: BASH shell and JRE 1.5 or higher are required on ALL platforms
Other Helpful Utilities and Tools • oratop - Utility for Near Real-time Monitoring of Databases, RAC and Single Instance – Note: 1500864.1 • SQLT (SQLTXPLAIN) – A tool to assist in diagnostics and tuning of a SQL statement – Note: 215187.1
Recap Recommendations • We encourage you to proactively take advantage of the tools outlined in this presentation to provide: • Increased system stability, reliability, and performance • Fewer reactive issues with RAC • Decreased resolution time on reactive issues • Reduced complexity and elimination of problems with Grid Infrastructure and RAC upgrades
Q&A and Important Support Resources • Discover more about Get Proactive: • Get Proactive with Oracle Database (Doc ID 1389167.1) • MOS Community: https://communities.oracle.com/portal/server.pt/community/scalability_rac/253 • Product Information Center (PIC): Oracle Database 11g Release 2 Information Center (Doc ID 988222.1) • Upgrade/Maintenance Advisors: Patching & Maintenance Advisor: Database (DB) Oracle Database 11.2.0.x (Doc ID 331.1)