Monitoring best practices & tools for running highly available databases

Monitoring best practices &toolsfor runninghighly available databases Miguel Anjo & Dawid Wojcik DM meeting – 20.May.2008

Oracle Real Application Clusters

Architecture RAC1 RAC2 RAC5 RAC6 RAC3 RAC4

Highly Available databases – Oracle ‘services’ • Resources distributed among Oracle services • Applications assigned to dedicated service • On node failure, resources re-distributed

Highly Available databases – Apps and DB Release cycle Development service Validation service Production service Production service version 10.2.0.n Production service version 10.2.0.(n+1) • Applications’ release cycle • Database software release cycle Validation service version 10.2.0.(n+1)

Why monitor? • Monitor (n.) • Computer Science. A program that observes, supervises, or controls the activities of other programs. • Need to keep all components in healthy state • We are prepared for single failures, some double failures • Commitment to give 24/7 best effort service • SW misbehavior affecting performance • Trends might indicate need to grow system • Security breaches

Monitoring participants Presentation title - 7

Monitoring participants Presentation title - 8

What we monitor 25 database clusters 124 servers, 450 cores, 150 disk-arrays, 2000 disks at Tier0 10 Tier1 sites for Streams replication 150+ Oracle ‘services’ / applications 2000+ user schemas 1M+ connections/day

PDB-Backup • 2 node cluster • Using Oracle Clusterware • Running: • RACMon (monitoring agents) • StreamMon (monitoring agents) • Backups • Scripts repository • Monitored by Lemon. Set as Critical in Operator procedures

Monitored components Servers Accessibility CDB state Tools: Lemon + RACMon + OEM Disk arrays Accessibility State given by controller Firmware, disk state, disk size, disk speed Tools: Lemon + RACMon Database SW Clusterware state Service accessibility Space available Oracle Streams Tools: RACMon + OEM + StreamMon Database usage OS CPU, I/O User Sessions, CPU, I/O User quotas, tablespace usage Bad usage (short connections, bind variables) Table fragmentation Tools: RACMon, Reports

Best practises (I) No overhead to DB (monitored object) Monitor as much as possible Presentation layer simple & compact Possibility to drill down

Best practises (II) Hierarchy of alarms and notifications Simplicity  reliability Centralized version vs. deployed everywhere Independent blocks (monitoring, dashboard, reporting) for HA

Monitoring tools • Monitoring tools • Lemon, SLS • Basic Monitoring (in house development) • SQL scripts (reactive monitoring) • RACMon (in house development, openlab) • StreamMon (in house development , openlab) • OEM – Oracle Enterprise Manager (Grid Control) - openlab • Service oriented monitoring tools • Experiment reports • DB Availability & Performance Pages

Basic monitoring • Checking every 5 minutes • Each failure  e-mail with error • 3 consecutive failures  SMS • Almost perfect for single instance databases • Limitations • On RAC, system survives to single HW failures • Users connect to ‘service’, not database instance • No other components (storage, clusterware) monitoring • Missing dashboard view

DBA monitoring • SQL scripts – reactive monitoring (ad-hoc monitoring) • Pros: • Easy to use • Fast real time information • Cons: • No global overview • Diagnosing single problem • Requires expert knowledge

RACMon requirements • Reliable (24/7) • Easy to use and configure • Provides up to date information (frequent runs) • Centralized – no configuration or deployment on RAC side • Web interface (RAC monitoring dashboard) – one common place for RACs’ status • Monitoring of Oracle services (DB and user level) and Oracle clusterware • Monitoring of ASM instances (diskgroups and failgroups) • Monitoring other parts of the infrastructure – backups, storage, … (easy extensibility) • Notification send via emails & SMSs to DBAs • Availability numbers (over extended periods of time) • Disabling monitoring for specific machines or clusters (scheduled and unscheduled intervention logbook)

RACMon Architecture

RACMon - examples

RACMon • Pros/Features: • Customized for our environment • Gives an overview of all our HW and RACs • Configurable alerts (via email and SMS) and alert levels (production or non-production systems) • Drill down details available via multiple links to other types of monitoring software (OEM, Lemon, StreamMon) • Cons: • Requires manpower for development

Oracle Streams “Oracle Streams enables the propagation and management of data, transactions and events in a data stream either within a database, or from one database to another.”

StreamMon

StreamMon • Streams availability and usage monitoring • Build in alerting in case of any error in streams stack • Pros: • Monitoring of all T1 sites in one place (streams monitoring not available in any other tool, including OEM) • Convenient and easy to use web interface • Advanced plotting utilities • Cons: • Required manpower for development (currently in maintenance only) • Uses not-standard libraries, requires customized server

Oracle Enterprise Manager • Architecture: • Agent running on each server uploads information to central repository, if repository is not available, it caches data • Management Service provides insight into any monitored target details • Management Service based on set-up metrics and policies sends e-mails (SMSes) • Proactive monitoring possible (actions based on problem diagnostics)

Oracle Enterprise Manager • Oracle Enterprise Manager Grid Control features

Oracle Enterprise Manager • Pros: • Highly configurable alerts, metrics and notification policies • Advanced and easy to use web interface • Easy drill down • External product – fully supported • Cons: • Universal – requires more navigation • No global overview (per target oriented) • Customization for many target requires much work • Bugs may by intrusive (e.g. affecting streams, excessive memory/CPU consumption, storage, DB instances) • Manpower required for maintenance and configuration • Not reliable enough for 24/7 monitoring

Weekly reports • Targeted to experiment DBAs and Coordinators • Information about • Bookkeeping – Application names, contacts • Resource usage – Sessions, CPU, Logical and Physical I/O • Security: Connection errors, expiring passwords, not used schemas • Space: consumed, fragmentation, recycle bin • Bad usage: short connections, queries missing bind variables

Weekly reports PHP scripts Generate report over last 7 days Specific to one RAC cluster

Weekly reports

Weekly reports • Current functionality • Simple way to visualize whole DB usage • Concentrates on main users (dynamic) • Easy to spot problems (color coded) • Very good feedback from our users • Now working on user configurable reports

DB availability and performance page • PHP, aggregation of other tools • Requested by experiments • Dashboard of “current” DB activity • Almost real time monitoring (up to last hour) • Application resource usage • No extra load • uses SLS, RACMon, StreamMon, weekly reports • Possibility to drill down

DB availability and performance page

Summary • Many monitoring components developed for our environment • Out of the box tools not sufficient • Open frameworks – new features easily added • Feedback given to Oracle Enterprise Manager development (openlab) • Very good feedback from T1s and experiments • Components included in experiment dashboards, WLCG ServiceMaps, SLS

Monitoring best practices & tools for running highly available databases