DSM Scalability Considerations for Unicenter NSM r11

Last Updated June 5, 2006

Best Practice Summary – see notes • Polling 50k local objects in one DSM is fine for r11 • Manage polling so that it does not exceed 600 polls per second • The –m parameter must be configured to allow this load • Keep average poll cycle use above 20% and below 50% of the poll time window • More than 100 DSMs can report to one MDB

Detailed DSM Performance

Objectives • Understand issues affecting DSM performance • Understand issues affecting scalability • Consider architectural options • Recommendations

Issues affecting DSM performance

Understand issues affecting DSM performance • Hardware • Local vs remote DSM(s) • Cold start vs. warm start • Electronic proximity to hosts • Network configuration and congestion • Number of hosts • Number of managed objects • Polling configuration

Hardware • See Hardware Requirements in NSM r11 Implementation Guide for latest guidance

Hardware • Does hardware matter? • 30,000 objects ~= 2 subnets with 50 objects per host

Local vs remote DSM(s) • For smaller implementations a local DSM on the MDB machine is OK • For larger implementations, remote DSM(s) should be strongly considered • DSM should be electronically close to what it polls and may connect to a remote MDB

Local vs remote DSM(s)

Multiple Remote DSMs • Multiple remote DSMs have a synergistic effect

Local vs remote DSM(s) • Combining a local DSM with a remote DSM gives a weaker effect than multiple remote DSMs

Cold start vs. warm start • Set “WarmStart=yes” option in %AGENTWORKS_DIR%\services\config\atmanager.ini • Warm start uses previously discovered objects • Reduces MDB access time • Reduces discovery process time • Must still confirm status

Cold start vs. warm start • Startup is measured as the time until the DSM settles (DSM start complete)

Cold start vs. warm start • Startup elapsed times

Electronic proximity to hosts • Standard best practice: no more than 3 hops • High performance LAN access to hosts and MDB • Avoid WAN links • Given a choice, put a DSM close to what it polls rather than close to its MDB • Missed traps are an indication of excessive load or a busy network – reduce the distance that polls and traps travel

LAN Polling

Network configuration and congestion • DSM should usually handle whole subnets • Fast/stable path to MDB • Network utilization • Errors, timeouts, and retries • Missed traps must be addressed • Poll cycle must have free time for load peaking • Size counts

WAN Polling

Number of hosts • Affects startup and first stage discovery • Affects total DSM object population • Affects DSM host configuration

Number of objects • Each managed host may spawn dozens of objects (agents, watchers) • Split DSMs to keep the number of objects constrained • Split DSMs to keep them electronically close to what they poll • Obrowser and query with no argument display objects – the number of objects actually polled is usually lower

Polling configuration – see notes • Polling interval • Polling rate for r11 DSM sustained at up to 1,000 polls/second (laboratory only – do not exceed 600) • Speeds discovery (?) • Not needed for status polling • A polling interval of 10 to 20 minutes is still best practice • 50,000 pollable objects at a 10 minute polling interval is about 80 polls/second • Timeouts are critical • Assume a timeout of 10 seconds and 2 retries = 30 second delay • A DSM thread waits for a reply or a timeout on each SNMPGET • IP policy makes extensive use of SNMPGET
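The two figures on this slide can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes that every object is polled exactly once per interval and that each retry waits the full timeout, which matches the slide's "timeout 10, retry 2 = 30 second delay".

```python
def polls_per_second(objects: int, interval_seconds: float) -> float:
    """Average poll rate if every object is polled once per interval."""
    return objects / interval_seconds


def worst_case_wait(timeout_seconds: float, retries: int) -> float:
    """Seconds a polling thread is tied up by a host that never answers.

    Assumption: the first attempt and every retry each wait the full timeout.
    """
    return timeout_seconds * (1 + retries)


# 50,000 pollable objects at a 10 minute interval
rate = polls_per_second(50_000, 10 * 60)
print(f"{rate:.1f} polls/second")   # 83.3 polls/second

# timeout 10 seconds, 2 retries
delay = worst_case_wait(10, 2)
print(f"{delay:.0f} second delay")  # 30 second delay
```

The result of roughly 83 polls/second is the slide's "about 80", well below the 600 polls/second ceiling. The 30 second figure shows why timeouts matter: each unreachable host can hold a polling thread for that long.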

Polling configuration • Calculating polling rates • Target no more than 50% and no less than 20% MaxPollRate utilization • Example with MaxPollRate at 200/sec: a five minute interval is 300 seconds, so do not attempt more than 30k polls per interval (300 seconds × 0.50 × 200 polls per second = 30k objects polled every 5 minutes) • Configure MaxPollRate in the [aws_snmp] section of atservices.ini
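The sizing rule on this slide can be written as a small helper. This is a sketch only: the function names are hypothetical, and it assumes one poll per object per interval. The 200 polls/second rate and the 20% and 50% targets come from the slide.

```python
def poll_budget(max_poll_rate: float, interval_seconds: float,
                target: float = 0.50) -> int:
    """Objects that can be polled per interval at a target utilization."""
    return int(round(max_poll_rate * interval_seconds * target))


def window_utilization(objects: int, max_poll_rate: float,
                       interval_seconds: float) -> float:
    """Fraction of the poll time window needed to poll every object once."""
    return objects / (max_poll_rate * interval_seconds)


# MaxPollRate of 200/sec with a five minute (300 second) interval
print(poll_budget(200, 300, 0.50))          # 30000  upper target (50%)
print(poll_budget(200, 300, 0.20))          # 12000  lower target (20%)
print(window_utilization(30_000, 200, 300)) # 0.5
```

By the slide's guidance, a DSM whose object count falls between the two budgets is within the recommended 20% to 50% utilization band for that interval.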

Issues affecting scalability

Issues affecting scalability • Hardware: What hardware is available? Can it support MDB + DSM? • Network: How electronically close are managed objects? Is there capacity to handle polling and trap traffic? How reliable is the network? • Geographic proximity: Do managed objects exist on the other side of a WAN? • Polling: What are the polling requirements?

Issues affecting scalability • Type of host activity • Web server • Application server • Database server • Batch server

Architectural options

Architectural Options • Local DSM • Fine for smaller shops • Add remote DSMs as necessary • Add remote DSMs to improve performance • Use several smaller DSMs • Closer to managed objects (most important tuning choice!) • Faster startup • More robust (no single point of failure) • Reduces the effect of an outage • Bridged MDBs • Distribute MDBs for better DSM access – not critical unless bandwidth to the MDB is limited and update activity is high

Questions?
