NoCOUG Fall Conference 2005. High Availability & Disaster Recovery: A 360° View. Alok Pareek. Agenda: Introduction; High Availability - 2005; Industry Shift from MTTF to MTTR; Understanding transaction failure models; Evaluating HA technologies; Real-World HA Examples; About GoldenGate
NoCOUG Fall Conference 2005 • High Availability & Disaster Recovery: A 360° View • Alok Pareek
Agenda • Introduction • High Availability - 2005 • Industry Shift from MTTF to MTTR • Understanding transaction failure models • Evaluating HA technologies • Real-World HA Examples • About GoldenGate • Questions & Answers
Speaker Introduction/Background • GoldenGate Software Architect • Transactional Data Management and High Availability Solutions • 10 years in Oracle kernel development • Data Storage Technologies • Recovery and High Availability Group • High Speed Data Movement • Key Features • Cross-platform/heterogeneous transportable tablespaces (10g) • Multi-threaded redo generation (9.2) • Implementation of multiple block size caches (9.0) • Tablespace point-in-time recovery (TSPITR) (8.0) • Owner of LGWR process (redo generation) algorithms (8i to 10.2)
High Availability, Disaster Recovery • Definition • Ratio of system uptime to the sum of uptime and downtime: Availability = MTTF / (MTTF + MTTR) • Challenges • Addressing Performance vs. Reliability in computer systems • Hardware faults, software bugs, and human errors are realities in any complex system deployment • Enterprise applications need to function 24x7 • Disasters are no longer a distant threat • Inadequate planning to handle outages • Shift in industry (and academic research) focus • From fault tolerance to reducing MTTR
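The availability ratio on this slide can be worked through numerically. The sketch below is illustrative only: the MTTF and MTTR figures are assumed values, not numbers from the talk, and the helper names are hypothetical. It shows why the focus on MTTR is reasonable: halving MTTR yields the same availability as doubling MTTF.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR), as defined on the slide."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected downtime per year implied by an availability ratio."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Assumed example figures (not from the talk).
baseline = availability(mttf_hours=1000, mttr_hours=10)
longer_mttf = availability(mttf_hours=2000, mttr_hours=10)   # failures half as often
shorter_mttr = availability(mttf_hours=1000, mttr_hours=5)   # repairs twice as fast

for label, a in [("baseline", baseline),
                 ("2x MTTF", longer_mttf),
                 ("0.5x MTTR", shorter_mttr)]:
    print(f"{label:10s} availability={a:.6f} "
          f"downtime/year={annual_downtime_hours(a):.1f} h")
```

Doubling MTTF usually means more reliable hardware or software, while halving MTTR is often achievable through faster failover and recovery procedures.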
HA/DR – Systematic View • 1) Active database (no outage) • 2) Unplanned outage • System failure: node death, power failure • Data failure: physical media, logical corruption • 3) Planned outage • System changes: upgrades, migrations • Data changes: maintenance
1) High Availability Concerns – No Outage (Active Database) • Performance • DSS vs. OLTP • Conflicting requirements • Mixed workload • Data validation • Data transformation
2) Availability Concerns – Unplanned Outage • Causes: system failure, data failure • Common approaches • Database restore/recovery • RAID • Shared disk clusters • Standby database
Understanding Database Failures • Failure points • Statement • Process • Instance • Database • Site • Failure types • Physical (Media, corruption, inconsistency amongst redundant copies) • Logical (Incorrect DML, out-of-synch, accidental table drop) • Failure Handling • Automatic • Manual
3) Availability Concerns – Planned Outage • Causes: system changes, data changes • Selected windows of downtime • Phased approach to maintenance
Differentiating HA Technologies • Diagram labels: Roll Forward / File Protection; High Availability and Disaster Recovery • Technologies compared: • Conventional backup/recovery • RAID: multiple hard disks behaving as a single large, fast drive (mirrors/stripes/duplexing/parity) • Snapshots • Block-level database replication • Change-level database replication • Remote mirroring solutions • Transactional Data Management
HA Technologies & Tradeoffs • Block based database replication • Standby kept in constant recovery (inactive) mode • Useful for strict disaster recovery only, not HA • Cannot be used for reporting in recovery mode • No write access for distributed load balancing • Application response times suffer after failover • Cannot address availability across heterogeneous systems • Change based database replication • Trigger or log based • Not optimized for real time performance • Intrusive, Complex • Cannot address availability across heterogeneous systems
HA Technologies & Tradeoffs • Remote mirroring solutions • Volume managers maintain mirrors of local writes on a set of remote volumes • Useful for file protection • Physical distance to remote volumes is a critical limitation • No protection from logical corruption, or storage stack corruption • Message based • Logical writes sent by primary host over IP to remote hosts (synchronously/asynchronously) • Write ordering must be maintained by primary host • Remote volumes are standby-only; applications cannot access them • No protection from logical corruption • Hardware based • Storage arrays propagate IOs to storage arrays at a secondary site • Secondary arrays are inaccessible during replication • No protection from logical corruption • Only useful for block availability during DR
HA Technologies & Tradeoffs • Transactional Data Management • Addresses low latency in HA hybrid computing environments (built on 1-safe) • Management of transactional streams: captures, transforms, routes, delivers and verifies data transactions in real time • Real time, heterogeneous, data integrity, low impact • Use cases for HA, DR, data integration, distributed computing • Not for file-level replication • Flow (diagram): Source → log-based data capture (can be selective) → route → transform (if needed) → apply → Target, with sub-second latency
HA/DR – Solution Examples • 1) Active database (no outage) • 2) Unplanned outage • System failure: node death, power failure • Data failure: physical media, logical corruption • 3) Planned outage • System changes: upgrades, migrations • Data changes: maintenance
Real-World HA Configuration (Active) • Uni/Bi-directional configuration: if one system fails, the other is immediately available and ready to take over. • For… • Highest Availability • Maximized ROI on hardware (transaction balancing) • Data Centers with Active/Passive • Example areas: • 24x7 (ATMs, Online Banking) • Online Retail • Real-time Analytics, Warehousing • Diagram labels: Primary “Master”; Secondary “Master”; Active; Live/Active
Real-World HA Configuration (Transaction Off-Loading) • Improve availability and performance of transaction processing by offloading query load to lower-cost DBs/platforms • For… • Horizontal scalability • Improved performance • Example areas: • Online Reservations • Online Lookups • Diagram labels: Processing Commit Transactions; Lookup Query Handling; Active; Live/Active
Real-World DR Configuration (Failover) • An HA implementation that captures and applies data to a failover system in real time. • For… • Fast failover (No restore) • Do root-cause analysis later! • Surgical Repair (Dynamic, Selective undo) • Example areas: • 24x7/mission-critical applications • Strict SLA requirements • Diagram labels: Active Database; Unplanned outage (System failure, Data failure); Failover Database
Real-World Configuration (Switchover) • Zero-Downtime Migrations • Rolling Upgrades • Zero-Downtime Maintenance • For… • 24x7 availability • Reduced windows for system maintenance • Example areas: • Can’t afford downtime to do in-place upgrade • Diagram labels: Current Database; Planned Outage (System Changes, Data Changes); “Switchover” Database
About GoldenGate Software • GoldenGate Software is a privately held software company that offers Transactional Data Management solutions. • Established, Loyal Customer Base • 250 customers, 1500+ solutions implemented, in 35 countries • Leading Industry Solutions • 18,000-node ATM network with 24/7 availability • Achieving paperless enterprise for this visionary healthcare provider • Saving $ millions with real-time DW and zero-downtime migrations • Database tiering handles an average of 300,000 updates/hour, with peaks at 800,000/hour
GoldenGate & Transactional Data Management • Provides guaranteed capture, routing, transformation, delivery, and verification of data transactions across heterogeneous environments in real time. • TDM must be: • Real time: moves with sub-second latency • Heterogeneous: moves transactions across different databases and platforms • Transactional: maintains transaction integrity • GoldenGate differentiates on: • Performance: handles thousands of transactions per second with very low impact on IT systems • Extensibility & Flexibility: open architecture to meet demanding customer needs and data environments • Reliability: supports continuous operations and availability
How GoldenGate Works • Capture: Committed changes are captured as they occur by reading the transaction logs. • Trail files: Stage and queue data for routing. • Route: Data is compressed and encrypted for routing to targets. • Delivery: Applies transactional data with guaranteed integrity. • GoldenGate Veridata™: Reports on data discrepancies between in-use DBs.
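The capture, trail, route, and delivery stages on this slide can be illustrated with a toy pipeline. This is a minimal sketch of the general log-based technique, not GoldenGate's implementation or API: the log record layout, the function names, and the in-memory queue standing in for trail files are all assumptions, and encryption is omitted to keep the example dependency-free.

```python
import json
import zlib
from collections import deque

# Hypothetical transaction log records: (txn_id, operation, key, value).
LOG = [
    (1, "put", "a", 10),
    (2, "put", "b", 20),
    (1, "commit", None, None),
    (2, "rollback", None, None),   # never reaches the target
    (3, "put", "a", 11),
    (3, "commit", None, None),
]


def capture(log):
    """Read the log in order and emit a transaction only once it commits."""
    pending = {}
    for txn_id, op, key, value in log:
        if op == "commit":
            yield {"txn": txn_id, "changes": pending.pop(txn_id, [])}
        elif op == "rollback":
            pending.pop(txn_id, None)          # discard uncommitted work
        else:
            pending.setdefault(txn_id, []).append([op, key, value])


def route(txn):
    """Serialize and compress a transaction for transport."""
    return zlib.compress(json.dumps(txn).encode("utf-8"))


def deliver(payload, target):
    """Apply a whole transaction to the target, all changes or none."""
    txn = json.loads(zlib.decompress(payload).decode("utf-8"))
    staged = dict(target)                      # work on a copy
    for op, key, value in txn["changes"]:
        if op == "put":
            staged[key] = value
    target.clear()
    target.update(staged)                      # publish the result
    return txn["txn"]


trail = deque(capture(LOG))                    # stand-in for trail files
target = {}
applied = []
while trail:
    applied.append(deliver(route(trail.popleft()), target))

print(applied, target)
```

Staging only committed transactions is what keeps the target transactionally consistent: the rolled-back transaction 2 is never shipped.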
Questions & Answers Contact Information: Alok Pareek www.goldengate.com (415) 777-0200 apareek@goldengate.com 301 Howard Street, Suite 2100, San Francisco CA 94105
Lag points address logical failures • One-to-One: Source → Target • One-to-Many: Source → Target (current), Target (1 hour lag), Target (1 day lag)
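A small simulation can show why lagged targets help with logical failures such as an accidental table drop. The sketch below is an assumption-laden illustration, not a product feature description: the `LaggedTarget` class, the change-record layout, and the timestamps are hypothetical. Each target applies only changes older than its configured lag, so a bad change that has already reached the current target has not yet reached the lagged copies.

```python
from dataclasses import dataclass, field

HOUR = 3600
DAY = 24 * HOUR


@dataclass
class LaggedTarget:
    """Hypothetical replica that applies changes only after a fixed delay."""
    lag_seconds: int
    data: dict = field(default_factory=dict)
    applied_through: int = -1   # timestamp of the last change applied

    def apply_until(self, changes, now):
        """Apply changes committed at or before (now - lag), in order."""
        cutoff = now - self.lag_seconds
        for ts, key, value in changes:         # changes sorted by timestamp
            if self.applied_through < ts <= cutoff:
                if value is None:
                    self.data.pop(key, None)   # None models a drop/delete
                else:
                    self.data[key] = value
                self.applied_through = ts


# Assumed change stream: a table is loaded, then dropped by mistake.
changes = [
    (0, "orders", "1200 rows"),
    (100_000, "orders", None),                 # accidental drop
]
now = 100_600                                  # ten minutes after the drop

current = LaggedTarget(lag_seconds=0)
hour_lag = LaggedTarget(lag_seconds=HOUR)
day_lag = LaggedTarget(lag_seconds=DAY)
for t in (current, hour_lag, day_lag):
    t.apply_until(changes, now)

print("current :", current.data)
print("1h lag  :", hour_lag.data)
print("1d lag  :", day_lag.data)
```

If the mistake is noticed within the lag window, apply can be stopped on a lagged target before the bad change arrives, leaving an intact copy to repair from.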