Making Docbases Highly Available & Disaster Recoverable


Who We Are • Aaron Weber – Bill Kullmann • Services Platform Technology Group • Stewardship of HA/DR Best Practices

Agenda • HA Overview • DR Overview • Basic Configuration Examples • Docbroker Configurations • Docbase State & Consistency Constraints • Fail-Over Scenarios • Scalability • 5.3 Session Support • Q & A

Documentum High Availability • Highly Available Infrastructure • HA Assessment • Types of Failure • Establishing Priorities • Defining HA Criteria • HA Advantages and Disadvantages • Documentum Solutions • Applied Solutions • Example Scenarios (Later in Presentation)

Highly Available Infrastructure • Loosely coupled component architecture with redundancy at every component level • Fault tolerant to the lowest component level • Automated failover (where possible) • Continuous monitoring and reliable notification • No one-size-fits-all solution • Must be carefully engineered to complement your business requirements • Most common and cost effective configuration is the two-node configuration

HA Assessment • Balance costs with benefits • Risks involved with loss of productivity • Loss of revenue • Opportunity cost related to system downtime • Cost of implementing solutions to prevent business disruption

Types of Failure • Most Common Causes of Failures • Human Errors • Planned Downtime • Environmental Reasons • Hardware Failures • Software Failures (by far the most common cause) • Source: IEEE Study, 1995

Establishing Priorities • Rules Governing High Availability • Before protecting any other component, protect your data • Your largest investment should land here • Deploy only as much high availability as you really need • Make a critical assessment of all systems, data and content • Assume nothing • Set bounded criteria for system architecture and always challenge limits • Only a tested system provides business continuity • Test every component on a regularly scheduled basis • Build your solution to be so resilient that it never has to fail over

Defining HA Criteria • Permissible work loss • Specified as time, e.g., 1 day, 1 hour, 15 min, etc. • Time to Fail Over • Or “time to recover” from a failure - specified as time • Required Up Time • Hours/day, days/week, e.g., 8x5, 24x7, etc. • Permissible SCHEDULED Fail-back Time • Will the business tolerate a lengthy fail-back? • Cost • Usually last in priority and the biggest variable

Availability vs. Reliability • Availability is usually described in “nines” • “Five Nines” is considered optimal (99.999%) • Availability is a function of two basic factors: • Mean Time Between Failures (MTBF) and • Mean Time To Repair (MTTR) • Both are usually measured in hours • Availability is described by the following equation: • Availability = MTBF / (MTBF + MTTR) • Which implies that “scheduled downtime” is to be included in a “nines” measurement and/or SLA
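A minimal sketch of the relationship between MTBF, MTTR and availability, using the standard steady-state formula availability = MTBF / (MTBF + MTTR). The function name and the sample figures are illustrative assumptions, not values from the presentation.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, chosen only for illustration:
# a failure every 999 hours on average, one hour to restore service.
example = availability(mtbf_hours=999.0, mttr_hours=1.0)
print(f"{example:.3%}")
```

Because MTTR sits in the denominator, shortening the time to restore raises availability even when the failure rate itself does not change.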

Reliability • MTTR is really the mean time to restore rather than to repair • It includes the following five activities: • Failure Detection • Failure Notification • Vendor/User Response • Repair/Replacement • Recovery/Restart/Reboot • The first four items can be significantly reduced or eliminated by redundant components and automatic reconfiguration

Translating the Metrics
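A minimal sketch of how a “nines” figure translates into permitted downtime per year, assuming a 525,960-minute year (365.25 days). The function name is an illustrative assumption.

```python
MINUTES_PER_YEAR = 525_960  # 365.25 days * 24 hours * 60 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted at the given number of nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** (-nines)  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in range(1, 6):
    percent = 100.0 * (1.0 - 10.0 ** (-n))
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines ({percent:.3f}%): {minutes:,.1f} minutes/year")
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why the cost of the last nine or two tends to dominate the HA budget.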

Is Five-Nines Necessary? • Example: • A regional docbase/solution in use 12 hours per day, 5 days a week, 52 weeks a year • Requires availability of 187,200 minutes out of a possible 525,960 minutes per year • Equates to a 35.6% availability requirement
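The arithmetic behind this example can be checked with a short sketch. The function name and its parameters are illustrative assumptions; the 12x5x52 schedule and the 525,960-minute year come from the slide.

```python
MINUTES_PER_YEAR = 525_960  # 365.25 days, as on the slide


def required_availability(hours_per_day: int, days_per_week: int,
                          weeks_per_year: int = 52) -> tuple[int, float]:
    """Return (required minutes per year, fraction of the whole year)."""
    required_minutes = hours_per_day * days_per_week * weeks_per_year * 60
    return required_minutes, required_minutes / MINUTES_PER_YEAR


minutes, fraction = required_availability(hours_per_day=12, days_per_week=5)
print(f"{minutes:,} minutes required, {fraction:.1%} of the year")
```

The required 187,200 minutes work out to roughly 35.6% of the calendar year, far below what a “nines” target measured against the whole year would demand.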

Setting Reasonable Goals • Failover should be transparent to the users • No more than an immediate refresh or re-login • Failover should be quick, ideally less than 2 minutes • A good design can yield immediate failover • Minimal or No Human Intervention • No human intervention should always be the primary goal • Minimal or ideally no data loss during disruption • Guaranteed access to the same data set • If your data is stale, all HA goals are moot

HA Advantages and Disadvantages • Advantages • Nearly 100% uptime • Contributes to performance and scalability • Nearly no data loss in a system failure • Backups of systems and data can be performed “live” • Disadvantages • Higher complexity and cost • Effective only when supported by the entire architecture • Special training requirements • Requires redundant hardware components • Change Control Processes must honor the integrity of HA architecture

Documentum Solutions • Documentum Server Clusters (Server Sets) • Supports both active/active and active/passive clusters • Ability to cluster the Content Server to create a high availability solution • Provides partial HA coverage • Provides increased scalability • Easy to manage using latest Documentum Administrator client • Redundant Docbrokers • Ability to automatically reroute users to surviving Content Servers • Docbrokers can be configured to load balance user connections across multiple Content Servers • Replication • Provide RW or RO coverage, depending upon setup and specific objects
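A minimal sketch of the rerouting idea behind redundant docbrokers: a client asks each docbroker in turn and is handed a surviving Content Server. This is a toy simulation, not Documentum's API. The `Docbroker` class, `locate_server` function, and server names are all hypothetical, and the random choice merely stands in for whatever load-balancing policy is actually configured.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Docbroker:
    """Hypothetical stand-in for a docbroker and the servers known to it."""
    name: str
    up: bool = True
    # Content Server name -> True if that server is currently alive.
    servers: dict = field(default_factory=dict)


def locate_server(docbrokers, rng=random):
    """Ask each docbroker in order; return a live Content Server or None."""
    for broker in docbrokers:
        if not broker.up:
            continue  # docbroker unreachable: fall through to the next one
        live = [name for name, alive in broker.servers.items() if alive]
        if live:
            return rng.choice(live)  # naive spreading across survivors
    return None


primary = Docbroker("docbroker1", servers={"eCS1": True, "eCS2": True})
backup = Docbroker("docbroker2", servers={"eCS1": True, "eCS2": True})

primary.up = False               # first docbroker fails
backup.servers["eCS1"] = False   # one Content Server fails as well
print(locate_server([primary, backup]))
```

With the first docbroker down and one Content Server lost, the client still reaches the remaining server through the second docbroker, which is the behavior the two-node configuration is meant to provide.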

Applied Solutions • Content Server and Docbroker • Redundant Server Sets (two-node configurations) • Data • Hot Standby, Advanced Mirroring Solutions • Content • Replication, Advanced Mirroring Solutions • Presentation Layer • Redundant, Load Balanced Servers • Network • Redundant High Speed Network Circuits over totally independent paths

Disaster Recovery Infrastructure • Totally separate, independent and redundant facility and systems located far from the primary data center • No shared resources are permitted between primary data center and DR site • Requires separate and independent high-speed network infrastructure for WAN replication • Failover must be deliberate, not totally automated • Requires intensive planning, new procedures, policies, training, special teams and long-term financial commitment • DR site must be ready for failover and occupation at any moment

Basic Disaster Recovery Site Criteria • Disaster Recovery is NOT the same as High Availability • DR assumes a total loss of your production data center • DR servers are typically miles apart from the production data center and are separate, totally independent systems • Systems must be protected from natural disaster or physical harm • Systems must have the capacity to assume full normal and emergency workloads • Primary and DR system maintenance and upgrades must be coordinated to ensure complete compatibility • Clients are affected • Roll-back to normal state requires scheduled maintenance

The Extended Team

Service Level Agreement • Docbase SLA • A roll-up of all dependent SLAs • HA • Time to recover from a Single Point of Failure • Permissible work loss • DR • Time to fail-over to DR • Permissible work loss • Multiple, standard Docbase SLAs possible
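One common way to read “a roll-up of all dependent SLAs” is as a product of component availabilities. The sketch below assumes the components are in series (the docbase is up only when every one of them is up) and fail independently; both are simplifying assumptions, and the component names and figures are hypothetical.

```python
from math import prod


def rolled_up_availability(component_availabilities) -> float:
    """Availability of a chain of components that must all be up at once."""
    values = list(component_availabilities)
    if any(not 0.0 <= a <= 1.0 for a in values):
        raise ValueError("availabilities must be fractions between 0 and 1")
    return prod(values)


# Hypothetical dependent SLAs, each at "three nines".
stack = {
    "network": 0.999,
    "web app server": 0.999,
    "content server": 0.999,
    "rdbms": 0.999,
    "storage": 0.999,
}
docbase_sla = rolled_up_availability(stack.values())
print(f"Docbase SLA ceiling: {docbase_sla:.3%}")
```

Under these assumptions the rolled-up figure is always lower than the weakest dependent SLA, so a Docbase SLA cannot promise more than the tiers beneath it deliver.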

Basic HA System Diagram • [Diagram: DeskTop Clients and Web Clients; Load Balance; two App Svr nodes; three eCS nodes; RDBMS Cluster of two RDBMS nodes; storage labeled EMC MP1, EMC MP2 Celera, EMC MP3 Centera]

Asymmetric DR Platform Diagram

Dual HA Asymmetric DR System Diagram • [Diagram: Production Site and DR Site linked by EMC Replication; four RDBMS, five eCS and four App Svr nodes across the two sites; each site with storage labeled EMC MP1, EMC MP2 Celera, EMC MP3 Centera and its own Load Balance; DeskTop Clients and Web Clients]

Hybrid DR System Diagram • [Diagram: Production Site and DR Site linked by EMC Replication; four RDBMS, five eCS and four App Svr nodes across the two sites; each site with storage labeled EMC MP1, EMC MP2, EMC MP3 Centera and its own Load Balance; DeskTop Clients and Web Clients]

Cross DR System Diagram • [Diagram: Production Site and DR Site linked by EMC Replication; four RDBMS, three eCS and four App Svr nodes across the two sites; paired labels A, B, C, D appearing at both sites; each site with storage labeled EMC MP1, EMC MP2, EMC MP3 Centera and its own Load Balance; DeskTop Clients and Web Clients]

HA Distributed System Diagram • [Diagram: Production Site A and Production Site B; two RDBMS, four eCS and four App Svr nodes; storage labeled Celera and EMC MP1; a Load Balance at each site; DeskTop Clients and Web Clients] • Jobs run at site A only! • Centera cannot be in a distributed store until 5.4

Legend • A process in running, inactive, failed, or recovery mode • RDBMS BFJ: RDBMS instance with tables for docbases B, F, J, N, R, … • Xn: the nth Content Server root process for docbase X • Process experiencing greater than normal load because of some failover • WAS n: the nth Web Application Server process • SAN: Storage, Storage Area Network type • NAS: Storage, Network Attached Storage type • Needed so multiple processes on different machines can read/write the same area • CAS: Storage, Content Addressable Storage type • Centera-type storage with the ability to enforce retention policies in hardware • Local: Local Storage, ordinary disk attached to server • For binaries and non-essential files such as logs, reports, temp space, etc. • NIC: Network Interface Card • Clustered Servers • DP1B1: a Domain/Partition (or single n-way hardware server) • Naming convention: D = Domain, P = Production, 1 = Site 1, B = Stack B, 1 = first domain in stack

Example Cross System Partitioning • [Diagram: Production Site (Prod 1) and “DR” Site (Prod 2), each with stacks A and B] • Database Tier (domains DP1A1, DP1B1, DP2A1, DP2B1): labeled Active/Passive Cluster and Active/Passive DR; RDBMS instances AEI, BFJ, CGK, DHL shown in every domain • Content Server Tier (domains DP1A2, DP1B2, DP2A2, DP2B2): labeled Active/Active HA and AP/PA DR; Content Servers A1 to A4, B1 to B4, C1 to C4, D1 to D4, E1 to E4, F1 to F4, G1 to G4 • Web App Server Tier (domains DP1A3, DP1B3, DP2A3, DP2B3): labeled Active/Active HA and Active/Active DR; WAS 1 to WAS 4 at each site

Example Cross System Partitioning: Storage & Network Overlay • [Diagram: the same domains (DP1A1 through DP2B3), RDBMS instances, Content Servers and WAS processes as the previous slide, shown at HA Site 1 and HA Site 2 with storage and network overlaid] • Storage labels: SAN (multiple), NAS A through NAS G (two of each), Local, Centera CAS (two) • Replication labels: RDBMS or SAN replication, Snap Mirror ?, Centera Replication • NIC labels on the domains

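Before Docbrokers • [Diagram labels: machine M1; Content Server eCS 1 for docbase A; DMCL; NIC IP1; client DMCL]

Basic Docbroker • [Diagram labels: machine M1; Content Server eCS 1 for docbase A; DMCL; docbroker 1; NIC IP1; client DMCL]

Multiple Servers • [Diagram labels: machine M1; Content Servers eCS 1 and eCS 2 for docbase A; DMCL; docbroker 1; NIC IP1; client DMCL]

Multi Server • [Diagram labels: machines M1 and M2; Content Servers eCS 1 and eCS 2 for docbase A; a DMCL on each machine; docbroker 1; NIC IP1 and NIC IP2; client DMCL]

Redundant Docbrokers • [Diagram labels: machines M1 and M2; Content Servers eCS 1 and eCS 2 for docbase A; a DMCL on each machine; two docbrokers labeled 1; NIC IP1 and NIC IP2; client DMCL]

Private Docbrokers • [Diagram labels: machines M1 and M2; Content Servers eCS 1 and eCS 2 for docbase A; a DMCL on each machine; two private docbrokers labeled P1; two docbrokers labeled 1; NIC IP1 and NIC IP2; client DMCL]

Multiple NIC Docbrokers • [Diagram labels: machines M1 and M2; Content Servers eCS 1 and eCS 2 for docbase A; a DMCL on each machine; private docbrokers P1 and P2 on each machine; docbrokers 1 and 2 on each machine; NIC IP1, NIC IP2, NIC IP3, NIC IP4; client DMCL]

Multiple NIC & CS Docbrokers • [Diagram labels: RDBMS Cluster; machines M1 and M2; Content Servers eCS 1, eCS 2, eCS 3, eCS 4 for docbase A; a DMCL on each machine; private docbrokers P1 and P2 on each machine; docbrokers 1 and 2 on each machine; NIC IP1, NIC IP2, NIC IP3, NIC IP4; client DMCL]

Example VM Partitioning • [Diagram labels: machines M1 and M2; virtual machines VM1 through VM6; Content Servers eCS 1, eCS 2, eCS 3, eCS 4 for docbase A; four DMCLs; private docbrokers P1 and P2; docbrokers 1 and 2; NIC IP1, NIC IP2, NIC IP3, NIC IP4; client DMCL]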

Docbase State • RDBMS (tables) plus File System (content files) • Data Ticket = 800000AC • /fsvr/<docbA>/cont_store_01/<dbID>/80/00/00/AC • [Diagram: database tables beside a directory tree; Dir 80 contains Dir 00, which contains Dir 00 and Dir 01, which contain files 00, 01, …, AC, …, FF]
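The slide's example maps data ticket 800000AC to the directory path 80/00/00/AC. A minimal sketch of that mapping, assuming the ticket is given as eight hexadecimal digits split into four two-digit components; the function name is hypothetical, and the sketch illustrates only the example shown, not Content Server's internal algorithm.

```python
def ticket_to_relative_path(data_ticket_hex: str) -> str:
    """Split an 8-hex-digit data ticket into four two-digit path components."""
    digits = data_ticket_hex.strip().upper()
    if len(digits) != 8 or any(c not in "0123456789ABCDEF" for c in digits):
        raise ValueError("expected exactly 8 hexadecimal digits")
    return "/".join(digits[i:i + 2] for i in range(0, 8, 2))


relative = ticket_to_relative_path("800000AC")
# The placeholders <docbA> and <dbID> are left exactly as on the slide.
print(f"/fsvr/<docbA>/cont_store_01/<dbID>/{relative}")
```

The point for backup planning is that the database holds only the ticket, while the content itself lives at the derived location on the file system, so the two must be restored to compatible points in time.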

Constraint on Backup/Restore • [Diagram: DCTM Managed Transaction Time Line with RDBMS Backups and File System Backups, each a Table/File pair, at points labeled A, B, C and X, Y, Z; the interval between backups is labeled “Keep as small as possible”] • OK for files to point to missing metadata • NOT OK for metadata to point to missing files • Wasted backup if the RDBMS can’t roll forward • Database can roll forward to the point of a compatible FS backup
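The consistency rule on this slide can be sketched as a set comparison between the content paths referenced by restored metadata and the files actually present after a restore. The function name and sample paths are hypothetical; a real check would read the references from the docbase and the file list from the storage area.

```python
def check_restore(metadata_refs, files_on_disk) -> dict:
    """Classify a restored docbase as consistent or not.

    metadata_refs: content paths referenced by the restored RDBMS tables.
    files_on_disk: content paths present in the restored file system.
    """
    refs = set(metadata_refs)
    files = set(files_on_disk)
    dangling = refs - files  # metadata pointing to missing files: NOT OK
    orphans = files - refs   # files with no metadata: tolerated
    return {
        "consistent": not dangling,
        "dangling": sorted(dangling),
        "orphans": sorted(orphans),
    }


# File system newer than the database: an extra file, but nothing missing.
ok = check_restore({"80/00/00/AB"}, {"80/00/00/AB", "80/00/00/AC"})
# Database newer than the file system: metadata refers to a missing file.
bad = check_restore({"80/00/00/AB", "80/00/00/AC"}, {"80/00/00/AB"})
print(ok["consistent"], bad["consistent"], bad["dangling"])
```

This is why the file system backup should be at least as recent as the point the database is rolled forward to: orphaned files are harmless, dangling references are not.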

Distributed Constraints • [Diagram: Time Line across Location 1 and Location 2, with Table/File backup points labeled A, B, C, I, J, K and X, Y, Z] • Database can roll forward to the point of the EARLIEST compatible FS backup
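With content stored at several locations, the database may only be rolled forward as far as the earliest of the locations' compatible file system backups; otherwise metadata would point to files that one location has not restored. A minimal sketch, assuming synchronized clocks and one usable backup timestamp per location; the function name, location names and timestamps are hypothetical.

```python
from datetime import datetime


def roll_forward_target(fs_backup_times: dict) -> datetime:
    """Latest point the database can safely be rolled forward to.

    fs_backup_times maps each location to the time of its most recent
    usable file system backup. The safe target is the earliest of these.
    """
    if not fs_backup_times:
        raise ValueError("at least one location is required")
    return min(fs_backup_times.values())


backups = {
    "location1": datetime(2005, 6, 1, 2, 0),
    "location2": datetime(2005, 6, 1, 1, 30),
}
print(roll_forward_target(backups))
```

Rolling forward past the earlier timestamp would leave the slower location with dangling references, so staggered backup schedules directly increase the permissible work loss.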

Replicated Constraints • [Diagram: Location 1 and Location 2, each with Table/File backup points labeled 1A, 1B, 1C, 1X, 1Y, 1Z and 2A, 2B, 2C, 2X, 2Y, 2Z, and roll-forward points 1, 2, 3] • CAREFUL! Server times must be synchronized • Must be able to roll forward to 1, 2, or 3

Replicated Docbase Failure • 3 cases • Master, Intermediate, Tail • If docbase is restored to last backup • dangling references -> bad UI behavior • Can’t restore all docbases to same point • excessive work loss • there is no ‘same point’ worldwide

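Starting Point (Backed up) • [Diagram legend: R = miRror object, F = reFerence object, replica, C = Catalog entry, J = Job, JI = Job Instance; master link, source link, mirror link] • [Diagram: three docbases, each with R, F, J, JI and C objects connected by the links above]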

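A Restored • [Diagram legend: R = miRror object, F = reFerence object, replica, C = Catalog entry, J = Job, JI = Job Instance; master link, source link, mirror link] • [Diagram: the same three docbases after one is restored from backup; the restored docbase shows fewer JI and C objects than at the starting point]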
