When A Good Cluster Goes Bad
Agenda
• Objectives
• Cluster Impact
• Cluster Foundations
• Prevention
Cluster Impact (Tales from a DBA)
Not always positive; Not always negative
• What business need drove the clustering decision?
  • Was there a business driver?
• What cluster solution did you choose?
  • VERITAS, ServiceGuard, Oracle Real Application Clusters
• How has clustering helped your organization achieve its goals?
“No, only the servers that need access to the private network are on that VLAN.”
• Client background: 24x7 operation, consolidated ERP solution (S/C, Finance, Payroll, Online Web Store)
• HW background: Linux on Power, IBM SVC controllers, single Gb VLAN for the cluster interconnect, single VLAN for the public/VIP interface
• Impact: STONITH (Shoot The Other Node In The Head)
• The client's network team consolidated all private networks onto a single VLAN to simplify management.
“We do not support running RAC in an Active/Active mode. If a failure occurs we will allow HACMP to move the resources from Node A to Node B, and then have HACMP restart Oracle services.”
• Client background: 24x7 operation, CERNER application
• HW background: IBM 595 with LPARs, EMC storage, redundant virtual IO network interfaces
• Impact: an estimated 15 minutes to recover from failover
“Our procedures for recovery are ITIL compliant. We can be up and running on our standby server in 20 minutes.”
• Client background: 24x7 operation, Internet services company
• HW background: IBM 595 with LPARs, EMC storage
• Impact: an estimated 20 minutes to recover from failover
• Client traffic: 600 million web hits/day
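To put the 20-minute failover in terms of traffic, here is a rough back-of-the-envelope sketch. It assumes the 600 million hits/day arrive evenly across the day, which real web traffic does not; the function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def hits_during_outage(hits_per_day: float, outage_minutes: float) -> float:
    """Estimate how many hits arrive during an outage.

    Assumption: traffic is spread uniformly over the day, so this is
    only an order-of-magnitude figure (peak hours would be worse).
    """
    hits_per_second = hits_per_day / SECONDS_PER_DAY
    return hits_per_second * outage_minutes * 60


if __name__ == "__main__":
    affected = hits_during_outage(600_000_000, 20)
    print(f"{affected:,.0f} hits arrive during a 20-minute failover")
```

At this volume a 20-minute recovery touches on the order of eight million requests, which is why "20 minutes" and "24x7" sit uneasily together.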
Client: “We have a 30-minute window for unplanned outages per year.”
TUSC: “How many outages have you had so far?”
Client: “Approximately 2-3 hours per month.”
• Client background: 24x7 operation, Internet services company
• HW background: x86-64 gear, 6-node RAC cluster, and 3PAR storage
• Impact: upper management is planning to move the site to MS SQL Server and MS Cluster.
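The gap between the stated outage window and the reported outages is easier to see as arithmetic. This is a minimal sketch assuming a 365-day year and that "2-3 hours per month" holds for all twelve months; the helper names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def yearly_downtime_minutes(hours_per_month: float) -> float:
    """Convert a monthly outage figure into minutes per year."""
    return hours_per_month * 60 * 12


def availability(downtime_minutes_per_year: float) -> float:
    """Fraction of the year the service is up."""
    return 1 - downtime_minutes_per_year / MINUTES_PER_YEAR


if __name__ == "__main__":
    budget = 30  # minutes per year, from the client's stated window
    print(f"target availability: {availability(budget):.4%}")
    for hours in (2, 3):
        actual = yearly_downtime_minutes(hours)
        print(
            f"{hours} h/month -> {actual:.0f} min/year, "
            f"{actual / budget:.0f}x the budget, "
            f"availability {availability(actual):.3%}"
        )
```

Under these assumptions the client is running at roughly 48 to 72 times its annual outage budget.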
Client: “Each time we add our 5th and 6th node to the cluster, we have node evictions.”
• Client background: 24x7 operation, trading company
• HW background: x86-64 gear, 6-node RAC cluster; SPOFs (shared Oracle homes for CRS, ASM and database; shared file system for all IO, including OCR and voting)
• Impact: unable to maintain SLAs with 4 nodes; cannot add capacity reliably.
Cluster Foundation (SPOF, STONITH, and Other Vagaries)
Why is the foundation important?
• Balance
• Scale
• Performance
SPOF: Single Points of Failure
• Avoiding them has been a design rule (not a guideline) since the inception of clustering.
• The difference with RAC: the minimum configuration allows for SPOFs.
• The business challenge: “High Availability” becomes “Hardly Available.”
• SPOFs are not limited to the Oracle database or the infrastructure that supports it.
  • People can be single points of failure.
  • Procedures can be single points of failure.
  • Development can be a single point of failure.
• Had the client added resources to enable Parallel Concurrent Processing, several SPOFs would have been eliminated.
SPOF: Single Points of Failure
• There is never a good reason to implement a cluster with a SPOF.
  • The business will suffer.
  • IT will suffer.
• Eliminate them.
  • This does not require vast planning or enormous resources.
  • It does require a good plan.
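One low-cost way to start such a plan is to keep an inventory of everything the cluster depends on, with how many independent copies of each exist, and flag anything with fewer than two. The sketch below is illustrative only: the inventory entries and counts are hypothetical examples in the spirit of the cases above, not data from any client.

```python
from typing import Dict, List


def find_spofs(inventory: Dict[str, int]) -> List[str]:
    """Return the components that have fewer than two independent copies."""
    return sorted(name for name, count in inventory.items() if count < 2)


if __name__ == "__main__":
    # Hypothetical inventory; people and procedures are listed on purpose,
    # since SPOFs are not limited to hardware and software.
    inventory = {
        "cluster nodes": 6,
        "interconnect VLAN": 1,
        "public/VIP VLAN": 1,
        "Oracle home (shared)": 1,
        "file system for OCR and voting": 1,
        "staff who can run the failover procedure": 1,
    }
    for component in find_spofs(inventory):
        print(f"SPOF: {component}")
```

Six nodes do not help if everything underneath them exists exactly once.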
STONITH
• STONITH happens.
• Veritas, ServiceGuard, HACMP, Sun Cluster
  • It is a built-in feature to provide service integrity.
• RAC
  • It is a built-in feature to provide service integrity.
• When it appears, it is an indication that your cluster ecosystem is not right.
  • It does not mean the cluster is faulty.
  • Something in your environment has changed, and the cluster is working to stabilize the environment.
• In a balanced configuration this should be a non-issue.
• In an unbalanced configuration this will bring visibility, if not outright exposure.
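To make the idea concrete, here is a deliberately simplified sketch of how a cluster might decide which side survives when the interconnect splits the nodes into groups that can no longer see each other. It is not the algorithm of any particular product. The rule used here (largest group wins, ties go to the group holding the lowest node number) is an assumption chosen for illustration, and real clusterware also consults voting or quorum devices.

```python
from typing import List, Set


def surviving_partition(partitions: List[Set[int]]) -> Set[int]:
    """Pick the group of nodes that keeps running after a split.

    Assumed rule: the largest group wins; on a tie, the group that
    contains the lowest node number wins. Every partition must be non-empty.
    """
    return max(partitions, key=lambda p: (len(p), -min(p)))


def nodes_to_fence(partitions: List[Set[int]]) -> List[int]:
    """Nodes outside the surviving group are the ones that get 'shot'."""
    survivor = surviving_partition(partitions)
    return sorted(n for p in partitions if p is not survivor for n in p)


if __name__ == "__main__":
    # A six-node cluster where nodes 5 and 6 lose contact with the rest.
    split = [{1, 2, 3, 4}, {5, 6}]
    print("survivors:", sorted(surviving_partition(split)))
    print("fenced:   ", nodes_to_fence(split))
```

The point of the sketch is that eviction is a decision the cluster makes to protect the data, not a malfunction: the fenced nodes may be perfectly healthy apart from the network path between them and the survivors.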
Operations
• Clusters are unique.
• Develop operational guidelines that enable better support.
• Establish methodologies that help cross-functional groups identify and diagnose problems.
• Establish appropriate monitoring.
• Eliminate split-brain in operations/triage.
• Document, document, and document.
Prevention (Make a plan, Then Work the plan)
Prevent, Detect, Capture, Resume, Analyze
Prevention
• In a clinical sense
  • A very technical approach
• In a business sense
  • “If you are going through hell, keep going.” Winston Churchill
What key assumptions do we need to document about deploying RAC?
A) Crystal-clear business objectives that you are anticipating RAC will solve.
B) Application code will need to be changed to support specific RAC capabilities.
C) The migration, while documented, will be full of surprises and roadblocks. RAC crosses the UNIX/DBA/network/storage barriers, and not gently.
D) Operational procedures must change to support RAC.
E) RAC cannot be changed to support how you would like to operate.
F) By design, RAC will determine who is and who is not fit to be in the cluster. It will do this autonomously and efficiently.
What key assumptions do we need to document about deploying RAC? (continued)
G) See item (A). RAC may negatively impact current SLAs that you cannot meet with a single instance.
H) Your data model (1500 tables) and apparent lack of an ERWIN diagram, physical or logical, does not lend itself easily to application partitioning. You will be guessing.
I) The current row-level locking, as developed and migrated from Sybase, most likely will not work well in your single-database, multi-instance RAC environment.
J) RAC is a magnifying glass: it brings into focus all the bad things your application can and will do.
K) RAC is a multiplier: with a single instance/database, one Oracle instance determines who has access to the database resources. RAC allows multiple instances to manage who has access to the database resources, block by block.
L) Most shared RAC clusters that leverage application partitioning group access methods by node. Batch work will go to Node A, HTTP work will go to Node C, PowerBuilder work will go to Node B.
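The partitioning in item (L) can be pictured as a simple routing table. The sketch below only illustrates the idea; in a real RAC deployment this kind of placement is normally expressed through database services rather than application code. The workload-to-node mapping comes from item (L), while the fallback rule (use the alphabetically first surviving node) is an assumption made for the example.

```python
from typing import Dict, Set

# Workload-to-node preferences as described in item (L).
PREFERRED_NODE: Dict[str, str] = {
    "batch": "A",
    "http": "C",
    "powerbuilder": "B",
}


def route(workload: str, live_nodes: Set[str]) -> str:
    """Return the node that should take this workload.

    Uses the preferred node when it is up; otherwise falls back to the
    alphabetically first surviving node (an assumed, simplistic rule).
    """
    if not live_nodes:
        raise RuntimeError("no surviving nodes to route to")
    preferred = PREFERRED_NODE[workload]
    if preferred in live_nodes:
        return preferred
    return sorted(live_nodes)[0]


if __name__ == "__main__":
    print(route("http", {"A", "B", "C"}))  # all nodes up
    print(route("http", {"A", "B"}))       # Node C has been evicted
```

Note what happens in the second call: once Node C is gone, HTTP work lands on the batch node, which is exactly the kind of cross-workload contention item (J) warns about.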
Prevention
• Cluster solutions, by design, help the business and the IT department be more agile.
• Ensure that your expectations of your cluster can be delivered by your cluster choice.
• Today, with 10g and 11g, RAC is a very stable business solution.
• In some cases VERITAS and ServiceGuard can assist by adding functionality.