DM216 The Three Ps of Replication in a High Availability Environment

DM216The Three Ps of Replication in a High Availability Environment • Tony T. Sheng • Technical Team Lead • PLW Data Architecture • Bell Atlantic • tony.t.sheng@bellatlantic.com

Who am I? Overview of Configuration Best Practices Pitfalls The Principles Q & A The Three Ps of Replication in a High Availability Environment

Who am I? Tech Lead for a DBA team ASE/Rep Server administration for DEV/TEST Development of processes for schema changes and custom data conversion to support application Infrastructure and configuration design Production Support for poor performance Around 5 years of experience with Sybase products The Three Ps of Replication in a High Availability Environment

Overview of PLW/CW configuration Homogenous dataservers Bi directional replication No stored procedures, no function strings yet No routes, one RS serves PDB->RDB No Sybase Warm Standby Configuration Overview

R4 HA NJ NJ PA PA AD AD AD AD CU CU CU CU FC FC FC FC SO SO SO SO W1 W1 W1 W1 Configuration Overview

R4 to R2 W1CP NJ NJ PA PA AD AD AD AD CU CU CU CU FC FC FC FC SO SO SO SO W1 W1 W1 W1 W1CP Configuration Overview

R4 to R2 W1CP (in more detail) NJ W1 NJ W1 PA W1 PA W1 W1DC DCW1 DEW1 W1DE MDW1 W1MD VAW1 W1VA W1WV WVW1 W1CP Configuration Overview

CW WQMS WQMS WQMS WQMS WQMS Configuration Overview

Overview of Configuration Why is this so confusing? TUXEDO, XA Server for 2-phase commit Application architecture start up is too long Can use the Rep Server to queue for maintenance and availability Alternative to System 12 Wanted DBAs to be disconnected from the failover Configuration Overview

Design Rep Server On another host to get data out of the Rep Agent fast Its own host to reduce network/CPU bottleneck Lots of disk so you don’t run out of SD Hold lots of transactions in the SD if you need to Data Model Login and timestamp field on every table Could be used in the future for conflict resolution Best Practices

Design Data Server Hardware layout and software configuration all to accommodate more users in case of failover Each database resides on its own set of disks, no controller/disk contention Use 11.5 Rep Agent and Rep Server Hold lots of days in transaction logs if you need to Best Practices

Standards Data Definition Language Each table has unique index and primary key that are relevant (used for auto generation of repdefs) Object name allows generation of a repdef that is less than 30 characters UNIX host and Dataserver naming standards Each host has a isXXXXXX name Each ASE servers is a SQLNAAXXXXXX Best Practices

Standards UNIX host and Data Server naming standards Each host has a isXXXXXX name is001277 Each ASE servers is a SQLNAAXXXXXX SQL1LW001277, SQL1CW002147 Can have a backupserver, XA Server and Rep Server REP1LW001277, REP2LW001277, REP1CW002147 SYB_BACKUP1LW001277 Best Practices

Standards Replication Objects Repdefs - rmachinedatabaseobject r1001931w1mdtlb Subscriptions - smachinedatabaseobject s1001929w1mdtlb Function strings names would also be standard Best Practices

Standards Every Sybase server is configured identically. Filesystems, partitions, disk names, database names Configuration changes between the type of application, ie. sqljumbo, replarge, etc. Flat File Site Specific Information (SSI) Flat files to map database names and replication flow rep_env.${hostname} PRS:PDS:PDB:RRS:RDS:RDB Best Practices

Standards Allows getting a quick diagnosis Assists in getting a quick handle on replication flow Assists to drill down to specific objects that have issues The key is to recover quickly from problems Best Practices

Operational Procedures Who updates where ? Replication is on the whole time and is bidirectional Transactions executed by the maintenance user are not replicated (see send_maint_xacts_to_replicate under sp_config_rep_agent) Best Practices

Operational Procedures Who updates where ? Application user groups are GRANTed/REVOKEd permission based upon operating state (normal versus failover) One dataserver is always considered the REAL ONE, data on this server can be assumed to be correct Best Practices

Operational Procedures Constraints Used in the data model to keep integrity of the data Can bring the Rep Server thread down because of the sequence of the SQL We have been having OPS drop them on the replicate side when necessary Probably should include a DROP/ADD constraint process during the failover Best Practices

Operational Procedures Autocorrection Can help get rid of duplicate key errors that are stuck in the queue when synching the data. Can’t be used with Replicate Minimal Columns (which we don’t do but probably should) Best Practices

Monitor You must monitor!! Rep Agent and Transaction Logs Ran into some bugs with Rep Agent hanging Check that the transaction log dumps are not growing in size This would infer that the Rep Agent is not moving through the transaction log and moving the secondary truncation pointer Best Practices

Monitor Replication Server DSI Threads Had a few days where the thread(s) were down (OPS error) Takes time to recover from this to drain the queue Vulnerable if failover is needed. You would have the potential for in flight transactions. (BIG headache) Best Practices

Monitor Some kind of replication heartbeat rep_verify process Heartbeat table every 10 minutes Best Practices

Monitor Extra Monitoring Rep Server queue throughput Difference between first and last block every hour Capture trends for load, additional data, peak volumes, future capacity Best Practices

Monitor Rep Server queue throughput Best Practices

Monitor Other various monitoring Latency Have not noticed too much latency but could assimilate data with queue throughput rs_lastcommit timestamp on replicate database - the last time an update flowed through select getdate(), db_name, origin, datediff(ss,origin_time,dest_commit_time)from rs_lastcommit where origin>0 and origin_time > "Jan 1 1900" Best Practices

Monitor Other various monitoring Aggregate information for our rollup replication Best Practices 89354 ---- md 1449 ---- wv 328029 ---- pa 63383 ---- dc 25255 ---- de 126779 ---- va 98929 ---- nj 733178 SQL1R2051972 ----------- 733178 SQL1R2051973 ----------- 733178

Tools and Utilities droprep.sh [ $DBNAME | ALL ] Utility to drop all replication components for a database or all databases on the dataserver Large schema changes when it is easier to rebuild all replication Drops subs/repdefs Shutdown Rep Agents and disable Remove secondary truncation point (most important) Best Practices

Tools and Utilities Standard process to build replication Establish connection for PDB/RDB Turn on Rep Agent for PDB Build repdefs/subs Set replicate bit Grant permissions to maintenance user Best Practices

Tools and Utilities REPON/REPOFF.sh Utility to temporarily turn off and on replication Shutdown Rep Agents and disable secondary truncation point Resets truncation point, turns on Rep Agent Best Practices

Tools and Utilities dump/rcp/load Set of scripts to synch up databases from primary to replicate Make sure no users on the replicate so the database can be loaded Must reset generation id, reset truncation point, reconfigure Rep Agent to the correct Rep Server (two way) Best Practices

Tools and Utilities REPdelta Process which adds/drops/synchs replication objects to database schema changes Reads from flat files that have been generated by data model process Best Practices

Recovering Data from Operational Errors Queue was down and a failover was executed Have to figure out what updates were held at the PDB and what updates were executed at the RDB. Timestamp field can help here. Pitfalls

Application Doing Very Bad Things A HUGE Transaction One large update took more than 75000 locks on the replicate database. Rep Server would open the transaction, flow, get an error for insufficient locks, rollback the transaction and then try again (for 36 hours) Had to shutdown the RDS and reconfigure the number of locks and let the Rep Server replicate the transaction Got lucky because the RDS was not in use Pitfalls

Application Doing Very Bad Things Moving the Primary Key of a Unique Index Column to another value This is a documented restriction. Causes the DSI thread to go down because the order of the transaction causes a duplicate key violation Must skip the transaction see Chapter 8 of the Rep Server Admin Guide Pitfalls

Application Doing Very Bad Things Moving the Primary Key of a Unique Index Column to another value Pitfalls Table cu_ln: clustered unique index is on cu_f, cu_ln_f update cu_ln set a.cu_ln_f = (select a.cu_ln_f + max(b.cu_ln_f) from cu_ln b where b.cu_f = 23732123) from cu_ln a where a.cu_f = 982323433 go

Application Doing Very Bad Things Moving the Primary Key of a Unique Index Column to another value the workaround involves making all the updates to a temp table and then inserting them and deleting the old rows Pitfalls 1. select * into #getoldculn from cu_ln where cu_f = OLD 2. update #getoldculn set cu_ln_f = (select cu_ln_f + max(cu_ln_f) from #getoldculn where cu_f= OLD 3. insert into cu_ln select * from #getoldculn where cu_f = OLD

DBA functions Tried to move the logsegment off a device onto another device using sp_dropsegment and sp_extendsegment alter database xlvww1ma log on SYBLDEV26 = 500 go use xlvww1ma go sp_dropsegment "logsegment", "xlvww1ma","SYBLDEV04" go sp_extendsegment "default", "xlvww1ma","SYBLDEV04" go Pitfalls

DBA functions Tried to move the logsegment off a device onto another device using sp_dropsegment and sp_extendsegment Ended up getting 605 and 692 errors on the log segment of the database. Should have turned off the Rep Agent before any of this work. Recovered by reloading the databases. (painful) Pitfalls

DBA functions Bad constraint added manually Had a DSI thread that was up but no data would flow. Maintenance user spid IOs never increased. Queue would drain when autocorrection was turned on Found a bad constraint on the replicate database (most likely put in by manual process: C_AD_gssl_adr_rng_1 adr FOREIGN KEY (adr_f) SELF REFERENCES adr(adr_f) Pitfalls

Synching databases You will probably have to synch your data periodically Skipped transactions Incorrect failover Make this a documented standard procedure so it is readily available. Use autocorrection to avoid the first few duplicate key errors from data that is in the log Pitfalls

A Replication Lab Have a restricted development lab that mimics production Owned by DBAs, no other teams have access Same DS/RS configuration Number of databases, Rep Servers, rs_config, sp_configure Used to isolate replication problems Mock deploy replication fixes Test out new replication topologies Recreate production problems The Principles

Send all SQL through the SW lifecycle When we get burned, 90% is from SQL that didn’t go through the software lifecycle. Be stringent about application transaction size every SQL is reviewed by DBA team for: Performance standards (valid SARGs, good showplan, etc.) Replication Server standards (no update table set col1=‘T’ where col1=‘R’) -> sends into a big update transaction on the replicate The Principles

More Principles We Have Learned Have good standards Assist you in debugging quickly Start simple with Rep Server Monitor, Monitor, Monitor Know your Operating Procedures Step by Step of failover/failback process The Principles

Replicating data through different schemas combination of function strings/operational steps depending on the type of change Configuration that is similar to System 12 move to hot/cold configuration where the disks are shared and switched no Rep Server for “local” failover but still for DR mode data replication What Is Next?

Q & A

DM216 The Three Ps of Replication in a High Availability Environment