Learn about CallManager Database Architecture, Replication Flow Diagram, and Troubleshooting techniques to ensure seamless data replication.
Cisco CallManager Database Replication Vajrender (Sunny) Akkera
Agenda • CallManager Database Architecture • DB Replication Flow Diagram • What could possibly break DB replication • How to verify if DB Replication is broken • Troubleshooting Database Replication issues • Replication Logs • Closing
DB Architecture : Install/Upgrade • In 5.0 and 5.1 • The publisher upgrade migrates data prior to reboot to the new version. • The subscriber starts replication setup after it is upgraded and rebooted. • Replication setup pushes data from the publisher to the subscriber. The subscriber’s local database is ready for failover only after replication is complete.
DB Architecture : Install/Upgrade • In 6.X+ • The publisher upgrade migrates data and performs an ontape (Informix utility) backup prior to reboot to the new version. • The subscriber upgrade gets the publisher ontape backup via SFTP and restores that data to the subscriber. (This gets the data close in content, which is imperative for services reading data locally.) The subscriber starts replication setup after the upgrade and reboot. • Replication setup audits the data and pushes differences between the publisher and subscriber to the subscriber. Change notification is sent to the local services for each change. The local database is ready before replication is complete. The replication setup timeout is settable via the CLI, for example “utils dbreplication setrepltimeout 900” (15 minutes). • User Facing Features (listed on a later slide) are backed up locally on all servers prior to upgrade and reboot and restored after reboot so that any changes made by users during the upgrade are not lost.
User Facing Features (UFF) This Data can be written into the local DB • Call Forward All (CFA) • Message Waiting Indication (MWI) • Privacy Enable/Disable • Do Not Disturb Enable/Disable (DND) • Extension Mobility Login (EM) • Monitor (for future use, currently no updates at the user level) • Hunt Group Logout • Device Mobility • CTI CAPF status for end users and application users • Credential hacking and authentication
DB Architecture: Replication 6.X • Replication is now fully meshed. A change on any server gets propagated to every other server. • Only UFF data is writeable on a subscriber, so that is the only data that will replicate from a subscriber. • Logically, most data is still hub-and-spoke from a replication perspective, since most data is still only updateable on the publisher. • Replication queues on the subscriber are now used. • Perfmon counters for replication are now used on subscribers. • Replication now impacts data availability and change notification.
Steps to DB Replication These steps are done automatically by the replication scripts when the system is installed. When we do a “utils dbreplication reset all”, these steps are done again. • Tear down the replication (‘CDR DELETE Server’) - this can cause corruption of the syscdr database. • Define the publisher - this sets it up to start replicating. • Define the template on the publisher and realize it - this tells the publisher which tables to replicate. • Define each subscriber. • Realize the template on the subscriber - this tells the subscribers which tables they will get/send data for. • Synchronize the data. When we look at the log files, we see output from steps 3, 4, and 5. Each subscriber defines itself, but the realize and sync steps show up in the ‘dbl_repl_output_Broadcast_.logfile’. There may be one subscriber, or many, in the "batch".
What could possibly break Replication • Connectivity issues between nodes • Host files mismatch • Communication on UDP port 8500 not in phase 2 • DNS not configured properly (forward/reverse lookup) • NTP not reachable • ‘A Cisco DB’ and ‘A Cisco DB Replicator’ not running/working • Cisco Database Layer Monitor (Dbmon) hung/stopped
DB Replication Troubleshooting • How do we verify if replication is broken • Commands to diagnose and fix replication • If you cannot fix it, what trace files do we collect • If the customer needs an RCA, we would have to run the special ‘ercollect’ script on the server.
How to verify if Replication is broken? • Replication failure alert • Replication status counter not being in good state (can be watched proactively) • CLI for replication status shows tables suspect or missing servers. • CM Database Status Report under Unified Reporting • Verify the output for “utils dbreplication runtimestate” on the publisher. (if the command is available)
How to verify if Replication is broken? What the replication state counter means: • 0 = Initialization • 1 = Number of replicates is not correct (old sys) • 2 = Replication is good • 3 = Replication is bad • 4 = Replication setup did not succeed (this meaning is for 5.1.3 and all 6.X versions).
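As a small illustration, the counter values listed on this slide can be kept in a lookup table so that scripts report the meaning rather than the bare number. This is only a sketch; the function name `describe_replicate_state` is hypothetical and the wording of each meaning is taken from the slide.

```python
# Meanings are taken from the slide; the helper name is hypothetical.
REPLICATE_STATE = {
    0: "Initialization",
    1: "Number of replicates is not correct (old sys)",
    2: "Replication is good",
    3: "Replication is bad",
    4: "Replication setup did not succeed",
}


def describe_replicate_state(value):
    """Return the slide's description for a Replicate_State counter value."""
    return REPLICATE_STATE.get(value, f"Unknown state {value}")


if __name__ == "__main__":
    for state in range(5):
        print(state, "=", describe_replicate_state(state))
```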
How to verify if Replication is broken?
show perf query class "Number of Replicates Created and State of Replication"
admin:show perf query class "Number of Replicates Created and State of Replication"
==>query class :
- Perf class (Number of Replicates Created and State of Replication) has instances and values:
ReplicateCount -> Number of Replicates Created = 348
ReplicateCount -> Replicate_State = 2
How to verify if Replication is broken?
admin:utils dbreplication runtimestate
DB and Replication Services: ALL RUNNING
Cluster Replication State: Replication status command started at: 2010-05-13-15-53
Replication status command COMPLETED 427 tables checked out of 427
No Errors or Mismatches found.
DB Version: ccm7_1_3_10000_11
Number of replicated tables: 427
Cluster Detailed View from PUB (2 Servers):
                         PING        REPLICATION REPL. DBver&  REPL. REPLICATION SETUP
SERVER-NAME IP ADDRESS   (msec) RPC? STATUS      QUEUE TABLES  LOOP? (RTMT) & details
----------- ------------ ------ ---- ----------- ----- ------- ----- -----------------
Publisher   14.128.62.72 0.063  Yes  Connected   0     match   N/A   (2) PUB Setup Completed
subscriber  14.128.62.73 0.384  Yes  Connected   0     match   N/A   (2) Setup Completed
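If the `show perf query class` output is saved to a text file, the two counters can be pulled out with a short script. The sketch below assumes the output keeps the `ReplicateCount -> name = value` shape shown on this slide; the function name `parse_replicate_counters` is hypothetical.

```python
import re

# Matches "ReplicateCount -> <counter name> = <integer>" as shown on the slide.
COUNTER = re.compile(r"ReplicateCount\s*->\s*(.+?)\s*=\s*(\d+)")


def parse_replicate_counters(output):
    """Return {counter name: integer value} for every ReplicateCount entry."""
    return {name: int(value) for name, value in COUNTER.findall(output)}


if __name__ == "__main__":
    sample = (
        "ReplicateCount -> Number of Replicates Created = 348\n"
        "ReplicateCount -> Replicate_State = 2\n"
    )
    print(parse_replicate_counters(sample))
```

A `Replicate_State` other than 2 on any node is the cue to continue with the troubleshooting steps.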
Troubleshooting Steps • Verify Connectivity • Verify Host Files are in sync. • Connectivity on UDP port 8500 • Verify NTP reachability and Network Validation • Is the publisher failing to define the template or realize the template • DB Replication Commands
Troubleshooting : Verify Connectivity Utils network connectivity This command can take up to 3 minutes to complete. Continue (y/n)?y Running test, please wait ... . Network connectivity test with the publisher completed successfully. Note : Command can be run only on the Subscribers Utils network host <hostname/ipaddress> • Verifies DNS resolution Utils network ping <hostname/ipaddress> • Helps verify connectivity between nodes.
Troubleshooting : Verify Host Files • /etc/hosts • /etc/services • /home/informix/.rhosts • /usr/local/cm/db/informix/etc/sqlhosts
Troubleshooting : Verify Host Files admin:show tech network hosts -------------------- show platform network -------------------- /etc/hosts File: #This file was generated by the /etc/hosts cluster manager. #It is automatically updated as nodes are added, changed, removed from the cluster. 127.0.0.1 localhost 14.128.62.3 CM613 14.128.62.6 CM613SUB
Troubleshooting : Verify Host Files admin:show tech dbstateinfo Database State Info Output is in cm/trace/dbl/showtechdbstateinfo20593.out admin:file view activelog cm/trace/dbl/showtechdbstateinfo20593.out (Hit ‘e’ to go to the end of the file) #SQL Hosts: g_hdr group - - i=1 g_cm613_ccm6_1_3_1000_16 group - - i=2 cm613_ccm6_1_3_1000_16 onsoctcp CM613 cm613_ccm6_1_3_1000_16 g=g_cm613_ccm6_1_3_1000_16 b=32767 g_cm613sub_ccm6_1_3_1000_16 group - - i=3 cm613sub_ccm6_1_3_1000_16 onsoctcp CM613SUB cm613sub_ccm6_1_3_1000_16 g=g_cm613sub_ccm6_1_3_1000_16 b=32767 # .rhosts: localhost CM613 CM613SUB
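One way to check that host files are in sync is to save the `/etc/hosts` section of `show tech network hosts` from each node and compare the entries. The following is a minimal sketch under that assumption; `parse_hosts` and `hosts_mismatch` are hypothetical helper names, and the same idea could be applied to the sqlhosts and .rhosts sections.

```python
def parse_hosts(text):
    """Parse hosts-file text into {ip: (name, ...)}, ignoring comments."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        entries[ip] = tuple(names)
    return entries


def hosts_mismatch(pub_text, sub_text):
    """Return {ip: (publisher names, subscriber names)} where the two differ."""
    pub, sub = parse_hosts(pub_text), parse_hosts(sub_text)
    return {
        ip: (pub.get(ip), sub.get(ip))
        for ip in sorted(pub.keys() | sub.keys())
        if pub.get(ip) != sub.get(ip)
    }


if __name__ == "__main__":
    pub = "127.0.0.1 localhost\n14.128.62.3 CM613\n14.128.62.6 CM613SUB\n"
    sub = "127.0.0.1 localhost\n14.128.62.3 CM613\n"
    print(hosts_mismatch(pub, sub))
```

An empty result means both nodes list the same addresses and names.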
Troubleshooting : Data Access Failure admin:utils firewall list ACCEPT tcp -- CM613SUB anywhere tcp dpt:cm613_ccm6_1_3_1000_16 ACCEPT udp -- CM613SUB anywhere udp dpt:1500 ACCEPT tcp -- CM613SUB anywhere tcp dpt:1501 ACCEPT udp -- CM613SUB anywhere udp dpt:1501 … • This example above is from a pub (CM613) where CM613SUB is the sub. Sub should have similar entries for pub. If they do not, it is probably a network issue. • TCP port 1501 is used by callmanager database at the time of migration (upgrade). • Ensure all servers in cluster have good status (TCP and ACCEPT on port 1500 and is named by server). Else Verify the Cluster Manager Logs. - File list activelog platform/log/clustermgr* - File view activelog platform/log/clustermgr00000002.log Example : 06/14/2010 23:22:03.009 clm|HMAC_SHA1 match failed IP(14.128.62.6)| (Failed) 03/25/2010 06:52:39.864 clm|hostname: CM613SUB state POLICY_INJECTED| (Success)
Troubleshooting : Data Access Failure // Cluster Manager Log (file list activelog platform/log/clustermgr*)
03/25/2010 06:52:24.547 clm|exec'ing: /root/.security/drf/setdrfdetails.sh
03/25/2010 06:52:24.636 clm|Binding to /usr/local/platform/conf/clm/unix_socket
03/25/2010 06:52:24.636 clm|creating 2 state machines
03/25/2010 06:52:24.637 clm|succeeded to create sm for: CM613SUB
03/25/2010 06:52:24.637 clm|exec'ing: sudo /root/.security/ipsec/disable_ipsec.sh --desthostName=CM613SUB --op=delete
03/25/2010 06:52:26.215 clm|hostname: CM613SUB state INITIATOR|
03/25/2010 06:52:26.356 clm|exec'ing: /etc/init.d/iptables start
03/25/2010 06:52:27.340 clm|ignoring initiation from other side peer hostname(CM613SUB)
03/25/2010 06:52:33.804 clm|exec'ing: /etc/init.d/iptables start
03/25/2010 06:52:35.750 clm|for initator(CM613SUB): entering the policy injected state
03/25/2010 06:52:39.864 clm|hostname: CM613SUB state POLICY_INJECTED
Troubleshooting : Data Access Failure
admin:utils network capture port 8500
Executing command with options: size=128 count=1000 interface=eth0 src= dest= port=8500 ip=
22:09:10.479943 CM613.8500 > CM613SUB.8500: isakmp: phase 2/others ? #71[C] (DF)
22:09:10.481232 CM613SUB.8500 > CM613.8500: isakmp: phase 2/others ? #71[C] (DF)
22:09:15.474954 CM613SUB.8500 > CM613.8500: isakmp: phase 2/others ? #71[C] (DF)
22:09:15.475677 CM613.8500 > CM613SUB.8500: isakmp: phase 2/others ? #71[C] (DF)
• Verify the communication is in phase 2 in both directions (pub->sub, sub->pub). If you have multiple nodes in the cluster, all the nodes must be in ‘phase 2’ with every other node in the cluster. • You can also check the CM system logs to confirm that the server is in the ‘policy injected’ state with the other nodes.
Troubleshooting : Verify NTP reachability and Network Validity
admin:utils diagnose test
Log file: /var/log/active/platform/log/diag4.log
Starting diagnostic test(s)
===========================
test - disk_space : Passed (available: 849 MB, used: 4998 MB)
skip - disk_files : This module must be run directly and off hours
test - service_manager : Passed
test - tomcat : Passed
test - validate_network : Passed
test - system_info : Passed (Collected system information in diagnostic log)
test - ntp_reachability : Passed
test - ntp_clock_drift : Passed
test - ntp_stratum : Passed
Diagnostics Completed
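When a cluster manager log is long, it can help to reduce it to the last reported state per peer plus any HMAC failures. The sketch below assumes the `clm|hostname: <name> state <STATE>` and `clm|HMAC_SHA1 match failed IP(<address>)` line shapes seen in the examples on these slides; `summarize_clm_log` is a hypothetical helper, and other log line types are simply ignored.

```python
import re

STATE = re.compile(r"clm\|hostname:\s*(\S+)\s+state\s+(\w+)")
HMAC_FAIL = re.compile(r"clm\|HMAC_SHA1 match failed IP\(([\d.]+)\)")


def summarize_clm_log(text):
    """Return (last state per peer hostname, list of IPs with HMAC failures)."""
    last_state = {}
    hmac_failures = []
    for line in text.splitlines():
        state = STATE.search(line)
        if state:
            last_state[state.group(1)] = state.group(2)
            continue
        failure = HMAC_FAIL.search(line)
        if failure:
            hmac_failures.append(failure.group(1))
    return last_state, hmac_failures


if __name__ == "__main__":
    log = (
        "03/25/2010 06:52:26.215 clm|hostname: CM613SUB state INITIATOR|\n"
        "03/25/2010 06:52:39.864 clm|hostname: CM613SUB state POLICY_INJECTED\n"
    )
    states, failures = summarize_clm_log(log)
    print(states, failures)
```

A peer whose last state is not `POLICY_INJECTED`, or any entry in the failure list, points back at the connectivity checks.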
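The "phase 2 in both directions" check can be sketched as a script over saved capture output. This assumes the packet lines look like the ones on this slide (`SRC.8500 > DST.8500: isakmp: phase N`); the function names are hypothetical. Note that a capture taken on one node only shows traffic to and from that node, so pass only the node pairs that capture could have seen.

```python
import re

PACKET = re.compile(r"(\S+)\.8500 > (\S+)\.8500: isakmp: phase (\d)")


def phase2_directions(capture):
    """Return the set of (source, destination) pairs seen in phase 2."""
    return {(src, dst) for src, dst, phase in PACKET.findall(capture) if phase == "2"}


def missing_phase2(capture, nodes):
    """List (source, destination) pairs among nodes with no phase 2 packet."""
    seen = phase2_directions(capture)
    return sorted(
        (a, b) for a in nodes for b in nodes if a != b and (a, b) not in seen
    )


if __name__ == "__main__":
    capture = (
        "22:09:10.479943 CM613.8500 > CM613SUB.8500: isakmp: phase 2/others ? #71[C] (DF)\n"
        "22:09:10.481232 CM613SUB.8500 > CM613.8500: isakmp: phase 2/others ? #71[C] (DF)\n"
    )
    print(missing_phase2(capture, ["CM613", "CM613SUB"]))
```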
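For repeated checks, saved `utils diagnose test` output can be scanned for any module that did not pass. The sketch below assumes the `test - <module> : <result>` and `skip - <module> : <reason>` line shapes shown on this slide. The slide only shows "Passed" results, so treating any other first word as "needs review" is an assumption, and the function names are hypothetical.

```python
import re

RESULT_LINE = re.compile(r"^(test|skip)\s+-\s+(\w+)\s*:\s*(.*)$")


def diagnose_results(output):
    """Return {module: first word of its result, or 'Skipped'}."""
    results = {}
    for line in output.splitlines():
        match = RESULT_LINE.match(line.strip())
        if not match:
            continue
        kind, module, detail = match.groups()
        if kind == "skip":
            results[module] = "Skipped"
        else:
            words = detail.split()
            results[module] = words[0] if words else ""
    return results


def modules_to_review(results):
    """Modules whose result is neither Passed nor Skipped (assumed to need a look)."""
    return sorted(m for m, r in results.items() if r not in ("Passed", "Skipped"))


if __name__ == "__main__":
    sample = (
        "test - validate_network : Passed\n"
        "test - ntp_reachability : Passed\n"
        "skip - disk_files : This module must be run directly and off hours\n"
    )
    print(modules_to_review(diagnose_results(sample)))
```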
Troubleshooting : Is the publisher failing to define the template or realize the template • Verify the logs to see at what point is the replication failing. admin:file list activelog /cm/trace/dbl date det 15 Jun,2010 10:45:17 <dir> dblj 15 Jun,2010 10:45:17 <dir> ncsj 15 Jun,2010 10:45:17 <dir> sdi 19 Nov,2009 18:53:44 1,847 2010_09_15_11_14_58_ne042_ccm_164_ccm8_6_0_96000_16_dbl_repl_cdr_define.log 19 Nov,2009 18:59:57 299,786 2010_09_15_13_10_20_dbl_repl_cdr_Broadcast.log 19 Nov,2009 18:59:57 1,261 2010_09_15_13_10_20_dbl_repl_output_Broadcast.log • Will explain more on this from slides 48-52
DB Replication Commands Utils dbreplication status • This command displays the status of database replication by comparing the database content of the subscribers to the publisher. It indicates whether the servers in the cluster are connected and whether the data is in sync. Utils dbreplication stop • This command stops automatic replication setup on the local server, waits for the replication timeout, and stops the automatic replication setup again. • You would want to wait it out before running the following (reset) commands. • This command is typically run prior to running ‘reset’. • Stop the replication on the subs first and then on the pub. • After we stop on the pub, it waits for the ‘repl timeout’ to start replication. • We would have to reset to initiate replication, as all the automatic setup processes are stopped.
DB Replication Commands Utils dbreplication repair • This command repairs data that is out of sync. • This command is run when “utils dbreplication status” shows the servers connected but a few tables out of sync. Syntax: utils dbreplication repair {all | hostname} Utils dbreplication reset • This command can be used to tear down and rebuild replication when the system has not set up properly. Syntax: utils dbreplication reset {all | hostname}
DB Replication Commands Utils dbreplication setrepltimeout Syntax : utils dbreplication setrepltimeout timeout Timeout - The new database replication timeout, in seconds. Value Range is between 300 and 7200. • The default database replication timeout equals 5 minutes (value of 300). • This timer comes into effect for both the replication stop and reset replication commands. For reset, it waits for the timer after defining the servers and then realizes the template. • When the first subscriber requests replication with the pub, this timer will be set. • When the timer expires, the first sub plus other subs that requested replication within that time period begin data replication with the pub in a "batch". • For large clusters, you can use the command to increase the default timeout value, so more subs will be included in the batch. • This timer should be set on the publisher after publisher has been upgraded and booted up on the upgraded partition, but before first sub has been switched over to new release. Then, when the first sub requests replication, the pub will set the timer based on this new value. Note: It is recommended you restore this value back to the default of 300 (5 minutes) once the entire cluster is upgraded successfully and subs have successfully set up replication.
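To make the timer behaviour concrete, here is a small sketch: one helper validates a value against the 300 to 7200 second range before building the CLI string, and another models which subscribers would land in the first "batch". The batching model is a simplification of the bullets above, not the product's actual logic; in particular, counting a request that arrives exactly at expiry as part of the batch is an assumption, and both function names are hypothetical.

```python
MIN_TIMEOUT = 300   # seconds, lower bound and default from the slide
MAX_TIMEOUT = 7200  # seconds, upper bound from the slide


def setrepltimeout_command(seconds):
    """Build the CLI command string, rejecting out-of-range values."""
    if not MIN_TIMEOUT <= seconds <= MAX_TIMEOUT:
        raise ValueError(
            f"timeout must be between {MIN_TIMEOUT} and {MAX_TIMEOUT} seconds"
        )
    return f"utils dbreplication setrepltimeout {seconds}"


def first_batch(request_times, timeout=MIN_TIMEOUT):
    """Subscribers whose request falls within `timeout` seconds of the first one.

    request_times maps subscriber name to request time in seconds.
    """
    if not request_times:
        return []
    start = min(request_times.values())
    return sorted(
        name for name, t in request_times.items() if t - start <= timeout
    )


if __name__ == "__main__":
    requests = {"sub1": 0, "sub2": 200, "sub3": 400}
    print(setrepltimeout_command(900))
    print(first_batch(requests, 300))  # default timer
    print(first_batch(requests, 900))  # larger timer includes more subs
```

With the default timer only the subscribers that request replication within the first five minutes share a batch, which is why a larger value is suggested for large clusters.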
DB Replication Commands admin:show tech repltimeout -------------------show tech repltimeout ------------------- The Replication timeout is set to 300 seconds • This command helps you determine the ‘repltimeout’ set on the cluster
DB Replication Commands Utils dbreplication runtimestate • This command helps to make sure the Publisher is able to communicate with all the subscribers DBLRPC service aka Database Replicator. Verify the RPC column. • Typically run before running the ‘reset’ command.
admin:utils dbreplication runtimestate
DB and Replication Services: ALL RUNNING
Cluster Replication State: Replication status command started at: 2010-05-13-15-53
Replication status command COMPLETED 427 tables checked out of 427
No Errors or Mismatches found.
DB Version: ccm7_1_3_10000_11
Number of replicated tables: 427
Cluster Detailed View from PUB (2 Servers):
                         PING        REPLICATION REPL. DBver&  REPL. REPLICATION SETUP
SERVER-NAME IP ADDRESS   (msec) RPC? STATUS      QUEUE TABLES  LOOP? (RTMT) & details
----------- ------------ ------ ---- ----------- ----- ------- ----- -----------------
Publisher   14.128.62.72 0.063  Yes  Connected   0     match   N/A   (2) PUB Setup Completed
subscriber  14.128.62.73 0.384  Yes  Connected   0     match   N/A   (2) Setup Completed
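The "verify the RPC column" step can be sketched as a parser over the detailed cluster view. It assumes each server row has the column order shown in the sample output (name, IP, ping, RPC, status, queue, tables, loop, state in parentheses, details); other releases may lay the table out differently, and rows that do not fit the pattern are skipped, so an empty result is only meaningful if every server was parsed. The function names are hypothetical.

```python
import re

ROW = re.compile(
    r"^(?P<server>\S+)\s+(?P<ip>\d+\.\d+\.\d+\.\d+)\s+(?P<ping>\S+)\s+"
    r"(?P<rpc>\S+)\s+(?P<status>\S+)\s+(?P<queue>\d+)\s+(?P<tables>\S+)\s+"
    r"(?P<loop>\S+)\s+\((?P<state>\d+)\)\s+(?P<details>.*)$"
)


def parse_runtimestate_rows(output):
    """Return one dict per server row found in the detailed cluster view."""
    rows = []
    for line in output.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


def servers_needing_attention(rows):
    """Servers where RPC is not Yes, status is not Connected, or state is not 2."""
    return [
        r["server"]
        for r in rows
        if r["rpc"] != "Yes" or r["status"] != "Connected" or r["state"] != "2"
    ]


if __name__ == "__main__":
    sample = (
        "Publisher 14.128.62.72 0.063 Yes Connected 0 match N/A (2) PUB Setup Completed\n"
        "subscriber 14.128.62.73 0.384 Yes Connected 0 match N/A (2) Setup Completed\n"
    )
    rows = parse_runtimestate_rows(sample)
    print(len(rows), servers_needing_attention(rows))
```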
DB Replication Commands – Last Resort Utils dbreplication clusterreset • This command can be used to debug database replication, but should only be used if "utils dbreplication reset all" has previously been tried and has failed to restart replication on the cluster. • This command will tear down and rebuild replication for the entire cluster. • After using this command, each sub needs to be rebooted. • Also, once the subs have been rebooted, you must go to the pub and issue the CLI command "utils dbreplication reset all". • RCA cannot be determined once you run this command. Syntax : utils dbreplication clusterreset Utils dbreplication dropadmindb • This command drops the Informix syscdr database on any server in the cluster. • You should run this command only if database replication reset or clusterreset fails to define a particular node in the replication process. • RCA cannot be determined. Syntax : utils dbreplication dropadmindb
DB Replication Command : Example Utils dbreplication status • Good Status • Check the output to be sure each server is connected, and no tables are suspect • The status should list all the subscribers as being connected at the top of the file, and no tables are suspect
SERVER ID STATE STATUS QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_bldr_ccm4_ccm 2 Active Local 0
g_bldr_ccm5_ccm 3 Active Connected 0 Sep 6 16:27:15
DB Replication Command : Example • Bad Status – Servers out of Sync • If the RTMT counter value for replication state is 2 or 3 for all nodes of the cluster, then replication is set up. • A replication state of 3 indicates that a few tables are out of sync. • You would run a ‘dbreplication repair’ to clear this issue. (Slide 31)
SERVER ID STATE STATUS QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_bldr_ccm4_ccm 2 Active Local 0
g_bldr_ccm5_ccm 3 Active Connected 0 Sep 6 16:27:15
---------- Suspect Replication Summary ----------
For table: ccmdbtemplate_bldr_ccm4_ccm_1_27_processnodereplication is suspect for node(s):g_bldr_ccm5_ccm
For table: ccmdbtemplate_bldr_ccm4_ccm_1_34_replicationdynamicreplication is suspect for node(s):g_bldr_ccm5_ccm
-------------------------------------------------
DB Replication Command : Example • Bad Status – Replication not set up properly • One or more servers show a "Quiescent" or "Dropped" status. This status is not necessarily bad, as the server could have been shut down or be in the middle of replication. • If the server’s replication status is “Failed”, it is in a bad state. • This would typically show a replicate state of 0 or 4. • You would run a ‘dbreplication reset’ to clear this issue.
SERVER ID STATE STATUS QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_bldr_ccm4_ccm 2 Active Local 0
g_bldr_ccm5_ccm 3 Active Dropped 636 Sep 10 14:01:20
Possible causes : • A communications problem/network error (publisher and subscriber cannot talk). • One or more ports required by the database are not opened on the firewall. • Host files not set up properly.
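The decision flow across these example slides (state 3 leads to repair, state 0 or 4 leads to reset, anything other than Local or Connected deserves a look) can be sketched as follows. The row parser assumes server group names start with `g_`, as in the samples; the helper names are hypothetical, and the suggestion for state 0 should be read with care because 0 also means replication setup may simply still be initializing.

```python
import re

ROW = re.compile(r"^(g_\S+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\d+)")


def parse_status_rows(output):
    """Parse SERVER / ID / STATE / STATUS / QUEUE rows from the status table."""
    rows = []
    for line in output.splitlines():
        match = ROW.match(line.strip())
        if match:
            server, sid, state, status, queue = match.groups()
            rows.append({"server": server, "id": int(sid), "state": state,
                         "status": status, "queue": int(queue)})
    return rows


def not_connected(rows):
    """Servers whose STATUS is neither Local nor Connected."""
    return [r["server"] for r in rows if r["status"] not in ("Local", "Connected")]


def suggest_command(replicate_state):
    """Map a Replicate_State value to the command the slides suggest, if any."""
    if replicate_state == 3:
        return "utils dbreplication repair all"
    if replicate_state in (0, 4):
        return "utils dbreplication reset all"
    return None  # state 2 is good; state 1 is not covered by these slides


if __name__ == "__main__":
    table = (
        "g_bldr_ccm4_ccm 2 Active Local 0\n"
        "g_bldr_ccm5_ccm 3 Active Dropped 636 Sep 10 14:01:20\n"
    )
    print(not_connected(parse_status_rows(table)), suggest_command(4))
```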
Commands introduced in CM 7.X Utils dbreplication forcedatasyncsub • This command forces a subscriber server to have its data restored from data on the publisher server. • Use this command only after you have run the utils dbreplication repair command several times, but the utils dbreplication status command still shows non-dynamic tables that are not in sync. • This command can take a significant amount of time to execute and can affect the system-wide IOWAIT. • This command takes a database backup of the publisher server and restores that data into the database on the subscriber server. • This command erases all existing data on the subscriber server and replaces it with the database from the publisher server, which makes it impossible to determine the original root cause for the subscriber server tables going out of sync. • After you run this command, you must restart the restored subscriber servers. • This command causes an outage on the server where it is run. • This command is used as a last resort and once used, RCA cannot be done. Syntax : utils dbreplication forcedatasyncsub {all|hostname}
Commands introduced in CM 7.X Utils dbreplication quickaudit • This command runs a quick database check on selected content of dynamic tables: - Number of configured devices, directory numbers, and users - Number of mobility devices changing device pool - Number of extension mobility users logged in - Number of active extensions with DND set - Number of active extensions with MWI set - Number of active extensions with CFA set Syntax : utils dbreplication quickaudit nodename | all Utils dbreplication dropadmindbforce • Drops the Informix syscdr database on the server on which it is run. Syntax : utils dbreplication dropadmindbforce
Commands introduced in CM 7.X Utils dbreplication repairreplicate • This command repairs mismatched data between cluster nodes and changes the node data to match the publisher data. • It does not repair replication setup. Syntax : utils dbreplication repairreplicate replicatename [nodename]|all Utils dbreplication repairtable • This command repairs mismatched data between cluster nodes and changes the node to match the publisher data. • It does not repair replication setup. Syntax : utils dbreplication repairtable tablename [nodename]|all
Replication Logs From the Publisher • File get activelog cm/log/informix/*dbl_repl*.log • File get activelog cm/trace/dbl/*dbl_repl*.log • File get activelog cm/log/informix/ccm.log* • File get activelog cm/log/informix/ats/* • File get activelog cm/log/informix/ris/* • File get activelog cm/trace/dbl/sdi/dbmon*.txt • File get activelog cm/log/info • Run ‘utils dbreplication status’ and file get activelog cm/trace/dbl/sdi/ReplicationStatus* • File get activelog cm/trace/dbl/sdi/ReplicationRepair* • File get activelog cm/trace/dbl/sdi/replication_scripts_output.log • utils diagnose test output (file get activelog /platform/log/diag2.log)
Replication Logs From the Subscribers • File get activelog cm/log/informix/ccm.log* • File get activelog cm/trace/dbl/sdi/dbmon*.txt • File get activelog cm/log/informix/ats/* • File get activelog cm/log/informix/ris/* Download the following unified reports • Database Status • Cluster Overview • Replication Debug
Replication Logs admin:file list activelog /cm/trace/dbl date det 15 Jun,2010 10:45:17 <dir> dblj 15 Jun,2010 10:45:17 <dir> ncsj 15 Jun,2010 10:45:17 <dir> sdi 19 Nov,2009 18:53:44 1,847 dbl_repl_cdr_define_subscriber_ccm7_1_3_10000_11-2009_11_19_18_53_21.log 19 Nov,2009 18:59:57 299,786 dbl_repl_cdr_Broadcast_2009_11_19_18_58_44.log 19 Nov,2009 18:59:57 1,261 dbl_repl_output_Broadcast_2009_11_19_18_58_44.log
Replication Logs : Sample Define
[# cat 2010_09_15_11_14_58_ne042_ccm_164_ccm8_6_0_96000_16_dbl_repl_cdr_define.log
passed dbname [ccm6_1_0_9901_391]
dbname passed[ccm6_1_0_9901_391] local_dbname [ccm6_1_0_9901_391]
-------Inside deleteQuiescent-------
subscriber name: g_nw104a_196_ccm
sucmd to execute [su -c 'cdr list serv > /tmp/cdr_list_serv_local_quiescent' - informix]
-------Exiting deleteQuiescent-------
sucmd_err [su -c 'ulimit -c 0;cdr err --zap' - informix ]
Executing [su -c 'ulimit -c 0;cdr define server --connect=nw104a_196_ccm --idle=0 --init --sync=g_nw104a_212_ccm g_nw104a_196_ccm --ats=/var/log/active/cm/log/informix/ats --ris=/var/log/active/cm/log/informix/ris;' - informix]
After Executing [su -c 'ulimit -c 0;cdr define server --connect=nw104a_196_ccm --idle=0 --init --sync=g_nw104a_212_ccm g_nw104a_196_ccm --ats=/var/log/active/cm/log/informix/ats --ris=/var/log/active/cm/log/informix/ris;' - informix]
---------------START-------------
-------Inside getServCountonpublisher-------
sucmd to execute [su -c 'cdr list serv > /tmp/cdr_list_serv_local' - informix]
---Inside-------locateFailure servcount_on_publisher is [1] sleeptime is[10]
SERVER ID STATE STATUS QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_nw104a_196_ccm 17 Active Local 0
g_nw104a_212_ccm 2 Active Connected 0 Sep 24 16:43:20
Count on node [g_nw104a_196_ccm] is [1] count_on_publisher [1]
-------LocateFailure-------Returning--------------servcount_on_publisher is [1]
--------------END-------------
sucmd [su -c 'ulimit -c 0;cdr err -a' - informix >> /usr/local/cm/db/cdr_err_define.out 2>&1]
size of cdr_err.out is [64]
Replication Logs : Sample Define In the above 2010_09_15_11_14_58_ne042_ccm_164_ccm8_6_0_96000_16_dbl_repl_cdr_define.log output: • The servers show Local or Connected, which is good. • The size of cdr_err.out is [64], which is good.