Troubleshooting Methods for UCS Customer POCs and Labs August 2012
Agenda • Why would we need this presentation? • Overview of some recurring items to address • Infrastructure Items • Adapter and IOM systems troubleshooting • Server systems troubleshooting • Operating systems troubleshooting • Chassis systems troubleshooting • Fabric Interconnect systems troubleshooting
Presentation Goals • Often there are “lessons learned” that can be shared to simplify the POC process • Some of the known bugs and operational details get lost in the depths of planning customer scenarios • This is not a review of how to develop a testing plan, nor a script for running a POC • The key goal is to share information so that we put our best foot forward • All required UCS system training and real-world hands-on experience is assumed • This is a living presentation: the goal is to keep it updated with these lessons learned and common bug issues, and to present it to the field on request
Basic Connection Points • Data plane and many control plane functions run as an Active/Active cluster • User access to UCSM is Active/Standby through the Virtual IP (VIP) • These management connections are also where the blade CIMC connections are reached, via the unified I/O of UCS • Blade CIMCs are actually NAT entries on the mgmt port • Use the UCSM client (Centrale) or the CLI to manage and troubleshoot UCS • [Diagram labels: UCS VIP, FI-A IP, FI-B IP]
What Runs on NX-OS Kernel • The fundamental items on a UCS are the Managed Information Tree (MIT) and the Application Gateways (AGs) that do the work • This is for any UCS form factor device • [Diagram labels: XML API, MIT, and the AGs listed below] • Blade/Rack AG: BIOS, RAID, CPLD, Boot Method, BMC Setup, Alerting, etc. • Switch AG: Ether Port, Networks, QoS Policy, Security Policy, Linkages to Server NICs, Network Segments, etc. • Fabric AG: Storage Segments, VSAN Mappings, F Port Trunking, F Port Channeling, Zoning **, etc. • NIC AG: # NICs, Networks to Tie in, QoS and Security Policy, # HBAs, VSANs to Tie in, QoS and Security Policy, etc. • Other AGs: VMM AG, etc.
Following the Progress of AG Work • A given process runs through many stages (FSM stages) • Some stages can be skipped if they are not needed or depending on the type of action (Shallow vs. Deep) • Almost all actions contain a verification step confirming that the action completed • Logs are retained • View and Monitor
FSM Return Codes and System Faults • These feed into the normal fault policy of UCSM • FSM faults are just one type; refer to the link below for a listing of types • Highly recommend at least becoming familiar with the layout of the UCS faults and error message reference at the URL below • Severity can change over the life of a fault • For POC labs, recommend eliminating Critical, Major, and Minor faults • Others will be present in the normal course of all the actions waiting and being performed • http://www.cisco.com/en/US/partner/docs/unified_computing/ucs/ts/faults/reference/ErrMess.html
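The POC recommendation above (no Critical, Major, or Minor faults) can be sketched as a simple filter. This is only an illustration: the fault records and their field names are hypothetical and do not represent the UCSM data model or API.

```python
# Severities the slide recommends eliminating before a POC.
BLOCKING = {"critical", "major", "minor"}

def blocking_faults(faults):
    """Return the faults whose severity should be cleared before the POC."""
    return [f for f in faults if f["severity"].lower() in BLOCKING]

# Hypothetical fault records, for illustration only.
faults = [
    {"id": 1, "severity": "major", "descr": "example fault A"},
    {"id": 2, "severity": "info", "descr": "example fault B"},
    {"id": 3, "severity": "warning", "descr": "example fault C"},
]

print([f["id"] for f in blocking_faults(faults)])
```

Because severity can change over the life of a fault, a check like this would need to be repeated rather than run once.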
UCS 2xxx IO Module (Fabric Extender) • Each UCS IOM in a UCS 5100 Blade Server Chassis is connected to a 6000 Series Fabric Interconnect for redundancy or bandwidth aggregation • The Fabric Extender provides 10GE ports to the Fabric Interconnect • Link physical health checks and chassis discovery occur over these links • [Diagram labels: UCS 6000 Series FI A, UCS 6000 Series FI B, UCS 5100 Series Blade Server Chassis (back), UCS 2xxx Series IOM]
Interface Statistics and Reports • Various points of monitoring and visibility • [Screenshot label: IOM]
[Screenshot labels: History, Statistics Breakdown, Live/now] • Visibility into many counters within UCS • These counters only count up and are shown as raw numbers • Use the Delta to monitor the changes over the collection intervals
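Since the counters only count up, the useful signal is the change between collection intervals. A minimal sketch of that delta calculation follows; the sample readings are made up for illustration.

```python
def deltas(samples):
    """Turn cumulative counter readings into per-interval changes."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]

# Hypothetical cumulative error counts, one reading per collection interval.
rx_errors = [1200, 1200, 1204, 1250]

print(deltas(rx_errors))  # [0, 4, 46]
```

A flat raw number (delta of 0) means nothing new happened in that interval, even if the raw count looks large.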
Recurring Items – UCS Infrastructure • Setting a system baseline • Most of our initial issues in CPOC situations are due to firmware • All system components must be on the same firmware package version • Host and mgmt firmware policies are excellent tools for this, rather than going server by server • [Screenshot: viewing the components of a package] • When demo FIs arrive, individually set them to a common UCSM version and erase the configurations before attempting to join them in a cluster
Recurring Items – UCS Infrastructure • Setting a system baseline • First get the package on the UCS • Create the right FW packages for the POC • Can check conformance to package in a single screen
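The conformance idea (every component on the same package version) can be sketched as below. The component names and version strings are placeholders chosen for illustration; in practice the UCSM conformance screen mentioned above is the source of truth.

```python
def nonconforming(inventory, target_version):
    """Return component names whose running version differs from the target."""
    return sorted(
        name for name, version in inventory.items() if version != target_version
    )

# Hypothetical inventory: component name -> running firmware version.
inventory = {
    "fabric-interconnect-A": "2.0(2q)",
    "fabric-interconnect-B": "2.0(2q)",
    "chassis-1/iom-1": "2.0(2q)",
    "chassis-1/blade-3/cimc": "1.4(3s)",
}

print(nonconforming(inventory, "2.0(2q)"))
```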
Recurring Items – UCS Infrastructure • Setting a system baseline • Can upgrade via the Bundle Mechanism at the POC start • The bundle option covers both update and activate, so it handles the whole upgrade • This is totally disruptive, so do not use this method during the POC, once staging is complete
Recurring Items – UCS Infrastructure • Upgrade prep, checkpoints, and cleanup when uptime is key (not a POC) • Implement a management interface monitoring policy • Prior to upgrading one fabric, disable all of its upstream data and FC ports • Also disable its mgmt interface (KVM traffic then uses the fabric that will not be taken down) • This forces traffic to the fabric that will stay up (and allows quick recovery if there is an error) • Upgrade the fabric, then restore its uplinks and mgmt interface • Repeat on the peer fabric, but only after the cluster state shows HA_READY: in the CLI, connect to local management and run “show cluster extended”
Recurring Items – UCS Infrastructure • Discovery Policy vs. Re-Acknowledgement behavior • The discovery policy is just that: a floor on the number of links required before a chassis will be discovered • The link policy dictates bringing up port-channels from the IOMs to the FIs, after discovery • The chassis must then be re-acknowledged (disruptive to blade connectivity) for all connections beyond discovery to be used • Always re-acknowledge the chassis after it is discovered, or after any cabling changes
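One way to read the discovery and re-acknowledgement bullets is as the toy model below. It is a deliberate simplification and an assumption about the behavior described on the slide, not a description of UCSM internals: links beyond the discovery floor are treated as unused until the chassis is re-acknowledged.

```python
def chassis_state(links_cabled, discovery_floor, reacknowledged):
    """Toy model: returns (discovered, links_in_use)."""
    discovered = links_cabled >= discovery_floor
    if not discovered:
        return False, 0
    if reacknowledged:
        return True, links_cabled
    # Assumption: before re-acknowledgement only the floor count is in use.
    return True, discovery_floor

print(chassis_state(4, 2, reacknowledged=False))  # (True, 2)
print(chassis_state(4, 2, reacknowledged=True))   # (True, 4)
print(chassis_state(1, 2, reacknowledged=False))  # (False, 0)
```

The middle case is the one that catches people in POCs: four cables are connected, but only part of the bandwidth is available until the disruptive re-acknowledge is done.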
Recurring Items – UCS Infrastructure • Multicast behavior • In all current versions, IGMP snooping is enabled and cannot be turned off • Only 224.0.0.X is flooded within the UCS • This is fundamentally different from traditional switches, which flood • We need an upstream PIM router or IGMP snooping querier for proper multicast flow beyond the new-flow timeout (~180 seconds)
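A quick way to tell whether a test group falls in the flooded 224.0.0.X range, and therefore whether it depends on an upstream querier, is a membership check against 224.0.0.0/24. This sketch uses only the Python standard library; the function name is illustrative.

```python
import ipaddress

# 224.0.0.X, the only range the slide says is flooded within the UCS.
FLOODED_RANGE = ipaddress.ip_network("224.0.0.0/24")

def is_flooded(group):
    """True if the multicast group is in the 224.0.0.X range."""
    return ipaddress.ip_address(group) in FLOODED_RANGE

print(is_flooded("224.0.0.5"))   # True
print(is_flooded("239.1.1.1"))   # False
```

Groups that return False here are the ones that need the upstream PIM router or IGMP snooping querier to keep flowing past the timeout.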
Recurring Items – UCS Infrastructure • It is always best to review the release notes as preparation • This is the PRIMARY method by which we notify the field of issues to be aware of • They can be large given the product breadth, but for a POC or install they are a great starting step
Adapter and IOM Systems Troubleshooting • Displayed from FI “A”
CiscoLive-A# connect nxos
CiscoLive-A(nxos)# show interface brief
CiscoLive-A(nxos)# show interface fex-fab
• Output callouts: ports to blade adapters; internal VLAN interface for management; 10 Gig links to Chassis 1; 10 Gig links to Chassis 2 • Eth X/Y/Z, where X = chassis number, Y = mezz card number, Z = IOM port number
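The Eth X/Y/Z naming can be unpacked programmatically when scanning many interfaces. The sketch below follows the field meanings given on the slide (chassis, mezz card, IOM port); the class and function names are illustrative, not part of any Cisco tool.

```python
import re
from typing import NamedTuple

class FexHostPort(NamedTuple):
    chassis: int
    mezz_card: int
    iom_port: int

# Accepts "Eth2/1/8", "Ethernet2/1/8", or "ethernet 2/1/8".
_PATTERN = re.compile(r"^Eth(?:ernet)?\s*(\d+)/(\d+)/(\d+)$", re.IGNORECASE)

def parse_fex_port(name):
    """Split an Eth X/Y/Z interface name into its three numeric fields."""
    match = _PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"not an Eth X/Y/Z interface name: {name!r}")
    return FexHostPort(*(int(part) for part in match.groups()))

print(parse_fex_port("Eth2/1/8"))
```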
Path Tracing through UCS • How to locate the MAC of the interfaces • Find the adapter of interest in UCSM or from the NXOS CLI • In the example, the MAC address was found on Fabric Interconnect A. It should not be visible on Fabric Interconnect B. If it is, the customer is doing per-flow/per-packet load balancing at the host level, which is not allowed on UCS B-Series
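The "one fabric only" rule can be checked mechanically once the MAC tables from both fabric interconnects have been collected. The sketch below assumes the MAC lists were already extracted by hand; the sample addresses are hypothetical.

```python
def normalize(mac):
    """Reduce any common MAC notation to 12 lowercase hex digits."""
    return "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")

def fabrics_seeing(mac, table_a, table_b):
    """Return which fabrics ("A", "B") have learned the given MAC."""
    key = normalize(mac)
    seen = []
    if key in {normalize(m) for m in table_a}:
        seen.append("A")
    if key in {normalize(m) for m in table_b}:
        seen.append("B")
    return seen

# Hypothetical MAC tables gathered from each fabric interconnect.
table_a = ["0025.b500.001b", "0025.b500.002a"]
table_b = ["0025.b500.002c"]

print(fabrics_seeing("00:25:B5:00:00:1B", table_a, table_b))  # ['A']
```

A result of `['A', 'B']` for the same MAC is the symptom the slide warns about.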
Displaying the VIF Path on the UCS • From the UCS CLI • From UCSM
VIF Path Details • From FI NXOS: show interface veth 752 • show int vfc 756 • [Screenshot callout: Management link]
VIFs for FC Adapters • All VIFs associated with an EthX/Y/Z interface are pinned to the fabric port that the EthX/Y/Z interface is pinned to • Check the VLAN to VSAN mapping (show vlan fcoe)
FarNorth-A(nxos)# show vifs interface ethernet2/1/8
Interface      VIFS
-------------- ---------------------------------------------------------
Eth2/1/8       veth1241, veth1243, veth9461, veth9463,
FarNorth-A(nxos)# sh int vethernet 9463
vethernet9463 is up
    Bound Interface is Ethernet2/1/8
    Hardware: VEthernet
    Encapsulation ARPA
    Port mode is access
    Last link flapped 1week(s) 1day(s)
    Last clearing of "show interface" counters never
    1 interface resets
FarNorth-A(nxos)# show vifs interface vethernet9463
Interface      VIFS
-------------- ---------------------------------------------------------
veth9463       vfc1271,
FarNorth-A(nxos)# show int vfc1271
vfc1271 is up
    Bound interface is vethernet9463
    Hardware is Virtual Fibre Channel
    Port WWN is 24:f6:00:0d:ec:d0:7b:7f
    Admin port mode is F, trunk mode is off
    snmp link state traps are enabled
    Port mode is F, FCID is 0x710005
    Port vsan is 100
FarNorth-A(nxos)# show vlan fcoe
VLAN      VSAN      Status
--------  --------  --------
1         1         Operational
100       100       Operational
Common Helpful Outputs • All baseline troubleshooting should be done from connect nxos
CiscoLive-A(nxos)# show fcdomain domain-list vsan 100
Number of domains: 3
Domain ID    WWN
---------    -----------------------
0x24 (36)    20:64:00:0d:ec:20:97:c1 [Principal]
0x40 (64)    20:64:00:0d:ec:ee:ef:c1
0xdc (220)   20:64:00:0d:ec:d0:7b:41 [Local]
CiscoLive-A(nxos)# show flogi database vsan 100
--------------------------------------------------------------------------
INTERFACE  VSAN  FCID      PORT NAME                NODE NAME
--------------------------------------------------------------------------
vfc703     100   0xdc0002  20:00:00:25:b5:00:00:1b  20:00:00:25:b5:00:00:2a
vfc725     100   0xdc0000  20:00:00:25:b5:10:10:01  20:00:00:25:b5:00:00:0e
vfc731     100   0xdc0001  20:00:00:25:b5:10:20:10  20:00:00:25:b5:00:00:2c
CiscoLive-A(nxos)# show zoneset active
zoneset name ZS_mn_bootcamp_v100 vsan 100
  zone name Server-1-Palo vsan 100
  * fcid 0xdc0000 [pwwn 20:00:00:25:b5:10:10:01]
  * fcid 0x2400d9 [pwwn 21:00:00:20:37:42:4a:b2]
CiscoLive-A(nxos)# show fcns database vsan 100
VSAN 100:
--------------------------------------------------------------------------
FCID      TYPE  PWWN                     (VENDOR)    FC4-TYPE:FEATURE
--------------------------------------------------------------------------
0x2402ef  N     50:06:01:6d:44:60:4a:41  (Clariion)  scsi-fcp:target
0x2400d9  NL    21:00:00:20:37:42:4a:b2  (Seagate)   scsi-fcp:target
0x400002  N     50:0a:09:88:87:d9:6e:b7  (NetApp)    scsi-fcp:target
0x40000e  N     10:00:00:00:c9:9c:de:9f  (Emulex)    ipfc scsi-fcp:init
0xdc0000  N     20:00:00:25:b5:10:10:01              scsi-fcp:init fc-gs
0xdc0001  N     20:00:00:25:b5:10:20:10              scsi-fcp:init fc-gs
0xdc0002  N     20:00:00:25:b5:00:00:1b              scsi-fcp:init
Total number of entries = 6
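A common follow-up to these outputs is to cross-check which logged-in initiators actually appear in the active zoneset. The sketch below does that with the flogi rows and zone members copied from the capture above; the parsing assumes the five whitespace-separated columns shown there and is not a general NX-OS parser.

```python
# Data rows copied from the "show flogi database vsan 100" capture.
FLOGI_ROWS = """\
vfc703 100 0xdc0002 20:00:00:25:b5:00:00:1b 20:00:00:25:b5:00:00:2a
vfc725 100 0xdc0000 20:00:00:25:b5:10:10:01 20:00:00:25:b5:00:00:0e
vfc731 100 0xdc0001 20:00:00:25:b5:10:20:10 20:00:00:25:b5:00:00:2c
"""

# pWWNs listed in the active zoneset capture.
ZONED_PWWNS = {"20:00:00:25:b5:10:10:01", "21:00:00:20:37:42:4a:b2"}

def parse_flogi(text):
    """Parse flogi data rows into dictionaries, skipping anything else."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[0].startswith("vfc"):
            iface, vsan, fcid, pwwn, nwwn = fields
            rows.append({"interface": iface, "vsan": int(vsan),
                         "fcid": fcid, "pwwn": pwwn, "nwwn": nwwn})
    return rows

def unzoned(rows, zoned_pwwns):
    """Interfaces that logged in but are not members of the active zoneset."""
    return [r["interface"] for r in rows if r["pwwn"] not in zoned_pwwns]

print(unzoned(parse_flogi(FLOGI_ROWS), ZONED_PWWNS))  # ['vfc703', 'vfc731']
```

In this capture only vfc725 is zoned, so the other two initiators would log in to the fabric but not see any targets.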
NPV FC Views • No FC services running in NPV mode • FCIDs assigned from the core NPIV switch • NP port to the core switch must be up and assigned to the proper VSANs
FarNorth-B(nxos)# show npv flogi-table
--------------------------------------------------------------------------------
SERVER                                                                  EXTERNAL
INTERFACE  VSAN  FCID      PORT NAME                NODE NAME                INTERFACE
--------------------------------------------------------------------------------
vfc1205    100   0x240007  20:00:00:25:b5:00:00:0a  20:00:00:25:b5:00:00:06  fc2/1
vfc1206    100   0x240006  20:00:00:25:b5:00:00:09  20:00:00:25:b5:00:00:06  fc2/1
vfc1210    100   0x240008  20:00:10:25:b5:00:00:09  20:00:00:10:b5:00:00:09  fc2/2
vfc1238    100   0x240002  20:00:00:25:b5:00:00:10  20:00:00:25:b5:00:00:0f  fc2/1
vfc1240    100   0x240003  20:00:00:25:b5:00:00:04  20:00:00:25:b5:00:00:0f  fc2/2
Total number of flogi = 5.
FarNorth-B(nxos)# show npv status
npiv is enabled
disruptive load balancing is disabled
External Interfaces:
====================
  Interface: fc2/1, VSAN: 100, FCID: 0x240000, State: Up
  Interface: fc2/2, VSAN: 100, FCID: 0x240001, State: Up
  Number of External Interfaces: 2
Server Interfaces:
==================
  Interface: vfc1205, VSAN: 100, State: Up
  Interface: vfc1206, VSAN: 100, State: Up
  Interface: vfc1210, VSAN: 100, State: Up
  Interface: vfc1238, VSAN: 100, State: Up
  Interface: vfc1240, VSAN: 100, State: Up
  Interface: vfc1270, VSAN: 100, State: Up
  Interface: vfc1272, VSAN: 100, State: Up
  Interface: vfc1280, VSAN: 100, State: Up
  Interface: vfc1284, VSAN: 100, State: Up
  Number of Server Interfaces: 9
FarNorth-B(nxos)# show int brief
-------------------------------------------------------------------------------
Interface  Vsan   Admin  Admin   Status      SFP    Oper  Oper    Port
                  Mode   Trunk                      Mode  Speed   Channel
                         Mode                             (Gbps)
-------------------------------------------------------------------------------
fc2/1      100    NP     off     up          swl    NP    2       --
fc2/2      100    NP     off     up          swl    NP    2       --
fc2/3      1      NP     off     sfpAbsent   --     --            --
fc2/4      1      NP     off     sfpAbsent   --     --            --
fc2/5      1      NP     off     sfpAbsent   --     --            --
fc2/6      1      NP     off     sfpAbsent   --     --            --
fc2/7      1      NP     off     sfpAbsent   --     --            --
fc2/8      1      NP     off     sfpAbsent   --     --            --
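Because FCIDs come from the core NPIV switch through the NP uplinks, it is often useful to see how server logins are spread across those uplinks. The sketch below counts logins per external interface using the server and uplink pairs from the flogi-table capture above; the function name is illustrative.

```python
from collections import Counter

# (server interface, external interface) pairs from the capture above.
NPV_FLOGI = [
    ("vfc1205", "fc2/1"),
    ("vfc1206", "fc2/1"),
    ("vfc1210", "fc2/2"),
    ("vfc1238", "fc2/1"),
    ("vfc1240", "fc2/2"),
]

def logins_per_uplink(rows):
    """Count server logins carried by each NP uplink."""
    return Counter(external for _server, external in rows)

print(dict(logins_per_uplink(NPV_FLOGI)))  # {'fc2/1': 3, 'fc2/2': 2}
```

If an uplink that shows as up in `show npv status` carries no logins at all, its VSAN assignment is a good place to start looking.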
Server Systems Troubleshooting • Server Upgrade Items • Do NOT do a BIOS recovery as a mechanism to perform a BIOS upgrade • Do this through the update method (M3 Blades) or the Host FW package • In general, we want the CIMC version to be greater than the BIOS version, so that the CIMC properly understands the data returned from the BIOS (delta in documentation today) • All firmware components must be from the same B (blade components) and C (rack components) packages, matched to the A (infrastructure) package
Server Systems - CIMC Booting Issues • Corrupt CIMC Firmware • POST Failure • Not completing boot • Connecting to CIMC in band to test connectivity • Manually reboot CIMC • **Note: today there is a bug on B230 and B440 where a CIMC-only reboot can negatively affect network performance on VMware hosts
Connecting to CIMC • A quick test to verify the health • This is a very low level data point • Source of blade issue reporting
CiscoLive-A# connect cimc 1/1
Trying 127.5.1.1...
Connected to 127.5.1.1.
Escape character is '^]'.
CIMC Debug Firmware Utility Shell
__________________________________________
Debug Firmware Utility
__________________________________________
Command List
__________________________________________
alarms cores exit help [COMMAND] images mctools memory messages network obfl post power sensors sel fru mezz1fru mezz2fru tasks top update users version
__________________________________________
Notes:
"enter Key" will execute last command
"COMMAND ?" will execute help for that command
__________________________________________
Rebooting the CIMC • Non disruptive to data path ** • ** with exception of the current bug on VMware environments
Server Systems Troubleshooting • KVM Access • Independent of Centrale • UCS AAA Login
Server Systems - Memory • This will show errors detected and reported by BIOS and the CIMC • These are also stored in the System Event Log (SEL) • Uncorrectable errors are an issue; correctable errors are being handled by ECC
CiscoLive-A /chassis/server # show sel 3/1 | include Memory
487 | 03/18/2011 00:16:49 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 0, DIMM Socket: 4, Channel: C, Socket: 0, DIMM: C4 | Asserted
5f1 | 04/16/2011 09:53:12 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 3, DIMM Socket: 7, Channel: A, Socket: 0, DIMM: A7 | Asserted
731 | 04/21/2011 01:59:28 | BIOS | Memory #0x02 | Correctable ECC/other correctable memory error | RUN, Rank: 1, DIMM Socket: 1, Channel: B, Socket: 0, DIMM: B1 | Asserted
732 | 04/21/2011 10:50:55 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 2, DIMM Socket: 6, Channel: A, Socket: 0, DIMM: A6 | Asserted
799 | 04/29/2011 02:50:31 | BIOS | Memory #0x02 | Correctable ECC/other correctable memory error | RUN, Rank: 0, DIMM Socket: 0, Channel: B, Socket: 0, DIMM: B0 | Asserted
79a | 04/29/2011 04:41:33 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 3, DIMM Socket: 3, Channel: B, Socket: 0, DIMM: B3 | Asserted
Server Systems - Memory • We want to be notified of both correctable errors (for failure prediction) and uncorrectable errors via a threshold policy
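When a SEL has many memory entries, tallying them per DIMM makes the suspect parts stand out. The sketch below assumes the pipe-separated layout seen in the SEL capture for this deck (seven fields, with the DIMM label at the end of the sixth); the three sample lines are taken from that capture, and the function name is illustrative.

```python
from collections import defaultdict

SEL_LINES = """\
487 | 03/18/2011 00:16:49 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 0, DIMM Socket: 4, Channel: C, Socket: 0, DIMM: C4 | Asserted
731 | 04/21/2011 01:59:28 | BIOS | Memory #0x02 | Correctable ECC/other correctable memory error | RUN, Rank: 1, DIMM Socket: 1, Channel: B, Socket: 0, DIMM: B1 | Asserted
79a | 04/29/2011 04:41:33 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 3, DIMM Socket: 3, Channel: B, Socket: 0, DIMM: B3 | Asserted
"""

def count_by_dimm(text):
    """Tally correctable and uncorrectable memory events per DIMM label."""
    counts = defaultdict(lambda: {"correctable": 0, "uncorrectable": 0})
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 7 or not fields[3].startswith("Memory"):
            continue  # not a memory SEL entry in the assumed layout
        is_uncorrectable = fields[4].lower().startswith("uncorrectable")
        kind = "uncorrectable" if is_uncorrectable else "correctable"
        dimm = fields[5].rsplit("DIMM:", 1)[1].strip()
        counts[dimm][kind] += 1
    return dict(counts)

for dimm, tally in sorted(count_by_dimm(SEL_LINES).items()):
    print(dimm, tally)
```

A DIMM with a growing correctable count is the failure-prediction case; any uncorrectable count is a replacement candidate.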
Stress Testing and Baseline • Cycling through servers, performing testing on deployed hardware • Evacuate the VMs from a given server and put it in maintenance mode • Mount the e2e diagnostic .ISO and reboot the server to it • Run utilities to stress test the memory and CPU • Test 1: ./burnin/bin/stress -c 8 -i 4 -m 2 --vm-bytes 128M -t 100s -v • Test 2: ./burnin/bin/pmemtest -a -l 1000000000 • Test 3: ./burnin/bin/stream • Test 4: ./burnin/bin/cachebench -rwbsp -x1 -m24 -d5 -e1 • DO NOT RUN THE DISK STRESS (will corrupt the existing RAID) • Record the results • Remove the .ISO, reboot into VMware, and exit maintenance mode • Identify any suspect devices from the tests and plan for maintenance of those items
Server Systems Troubleshooting • Initial Server Deployment or Suspected Issues • Example results from one customer POC:
           B200-M2 / X5570 / 96G    B230-M1 / X6550 / 256G
Test #1    1m 40s                   1m 40s
Test #2    50s                      1m 20s
Test #3    5m 4s                    5m 11s
Test #4    13m 45s                  13m 46s
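Recorded run times like these are easier to compare once converted to seconds. The sketch below uses the B200-M2 column as a baseline and flags tests that run noticeably slower; the 10% tolerance and the sample run are assumptions for illustration, not values from the presentation.

```python
import re

def to_seconds(text):
    """Convert a duration such as '1m 40s' or '50s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?\s*(?:(\d+)s)?", text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return minutes * 60 + seconds

# Baseline: B200-M2 / X5570 / 96G results from the example POC.
BASELINE = {"Test #1": "1m 40s", "Test #2": "50s",
            "Test #3": "5m 4s", "Test #4": "13m 45s"}

def slow_tests(run, baseline, tolerance=0.10):
    """Names of tests that took more than (1 + tolerance) x the baseline."""
    return [name for name, reference in baseline.items()
            if to_seconds(run[name]) > to_seconds(reference) * (1 + tolerance)]

# Hypothetical results from another server of the same model.
run = {"Test #1": "1m 40s", "Test #2": "1m 5s",
       "Test #3": "5m 10s", "Test #4": "13m 50s"}

print(slow_tests(run, BASELINE))  # ['Test #2']
```

Test 1 is started with `-t 100s` in the stress command, so its 1m 40s result is fixed by the timeout and is not a useful comparison point.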
Operating Systems Troubleshooting • Windows Items • With the latest BIOS on B230 and B440 M1, the PCI devices are ordered correctly on a 1.4 to 2.0 upgrade, but interfaces can be renumbered regardless (fix coming) • We can define the PCI order, but the adapter definitions presented to the OS depend on the order in which you map the VIC driver to them • Red Hat Items • We have very good control over these, using /etc/sysconfig/network-scripts to map the HW address to the eth number • There are kernel parameters which can affect performance; contact the TME teams directly • ESX Items • In-box drivers occasionally need to be updated • This is due to the time sync requirements for in-box deployments (the lag can be 6+ months)
Chassis Troubleshooting • Intra-chassis component communications • Inter-Integrated Circuit communications (I2C) • The System Management Bus (SMBus) was a later subset • Multi-master bus for simple communications between system elements • In use inside a standard industry server, and also between chassis components (inside a single chassis only) • I2C bug cases, with some components coming too close to certain margins • Locking the I2C bus • Creating spurious noise on the bus • Manifests as unpredictable behavior • What does this mean for POC and initial customer deployments? • Be certain to run software at or later than 1.4(3s), which includes SW fixes for these situations • For additional HW margin: • Power supplies should be ordered as MFG_NEW if possible • IO Modules that are 2104 should be ordered as MFG_NEW if possible
Fabric Interconnect Troubleshooting • 6100 Top Considerations • P*V count limit: 3k prior to UCS 1.4(1), 6k as of UCS 1.4(1), 14k as of UCS 1.4(3q) • VIF limits can be very restrictive in C-Series implementations • 6200 Top Considerations • 32k P*V count limit at UCS v2.x • Multicast when using port channels upstream (only do this on UCS v2.0(2) and later)
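A rough headroom check against the P*V limit can be sketched as follows. Two assumptions are made here and should be verified against the release documentation: P*V is read as the sum, over all ports, of the number of VLANs carried on each port, and "k" is taken as 1,000. The port names and VLAN counts are hypothetical.

```python
# Limits from the slide, with "k" assumed to mean 1,000.
PV_LIMITS = {"6100": 14_000, "6200": 32_000}

def pv_count(vlans_per_port):
    """Assumed definition: total of VLANs carried across all ports."""
    return sum(vlans_per_port.values())

def pv_headroom(vlans_per_port, model):
    """Remaining P*V budget for the given FI model (negative if over)."""
    return PV_LIMITS[model] - pv_count(vlans_per_port)

# Hypothetical design: 40 ports, each trunking 300 VLANs.
design = {f"port-{n}": 300 for n in range(1, 41)}

print(pv_count(design))             # 12000
print(pv_headroom(design, "6100"))  # 2000
print(pv_headroom(design, "6200"))  # 20000
```

The point of the exercise is that trunking every VLAN to every port consumes the budget quickly, particularly on the 6100.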
Fabric Interconnect Troubleshooting • Gathering Tech Support Files • We have the ability to gather the tech support data from UCSM to your localhost • Always recommend gathering these files when sending questions to the various internal mailers
Fabric Interconnect Troubleshooting • Gathering any Core Dumps • Once the TFTP core exporter is configured, core dumps are moved off the system • Move exported cores to the trash can
Fabric Interconnect Troubleshooting • Viewing data plane traffic within the UCS • We can SPAN from most sources within the UCS • Can SPAN the physical and virtual interfaces • [Diagram label: hardware or software analyzer]