Protocol Analysis in a Complex Enterprise: The Importance of “The Art of Recognition.”
June 16th, 2009
Hansang Bae, Senior VP | Citi (f.k.a. Citigroup), hbae@nyc.rr.com
SHARKFEST '09, Stanford University, June 15-18, 2009
Challenges:
• As it turns out, size does matter!
• Citi’s branch network spans 5,000+ locations in the US.
• Citi’s network infrastructure includes 30,000+ devices.
• 300,000 users are located in over 100 countries.
• The number of servers in use is mind-numbingly large!
• Compliance/Security Quagmire
• Doing a full packet capture is difficult.
• Tools in use include NetVCR and Opnet’s ACE.
• Wireshark is the only approved protocol analyzer at Citi. It dislodged past market leaders.
Act I: Much Ado about Nothing!
• Old medical school saying: When you hear hooves beating, think horses and not zebras!
• The server SA reports extreme slowness during file transfers.
• What are the top issues that come to mind?
• The server SA started a ping script, and it showed...
• Lessons Learned:
• Learn to recognize what should and should not change as you go through the trace files.
• RFC1323 was not in play because they are on the same switch!
• Take a few minutes to scan the trace files. Learn to trust your brain’s ability to spot differences.
• Know how protocols work so you can rule out red herrings. This is what separates “techs” from “engineers.”
• Try not to filter. If you had, you might have missed the “arp” frames in this trace. This is different from capturing in “promiscuous” mode.
Act II: Taming the SSH
• Logging into a server via ssh takes over two minutes.
• What are the top issues that come to mind for slow telnet/ssh login?
• Let’s capture and find out. Packet captures are like Shakira’s hips. They don’t lie!
• Lessons Learned:
• Scroll through the trace to look for patterns. Again, trust your brain.
• Develop a technique: a list of common filters to run through when troubleshooting, e.g. tcp.flags==0x02 (SYN only), tcp.analysis.flags
• Don’t forget UDP. What important function runs on UDP?
• Do not blindly trust the TCP analysis. Wireshark can only know what you feed it. It too suffers from GIGO (Garbage In, Garbage Out).
• Use the graphical tools available in Wireshark. A picture *IS* worth a thousand words!
• Capture placement is important. If I had captured at the client, I would still be wondering why there is a delay!
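The "list of common filters" idea can be kept as a small script. The sketch below is only an illustration: the checklist entries and their labels are my own picks, not the presenter's list, and the `dns` entry is a guess at one UDP service worth checking. The filter strings are standard Wireshark display filter fields, and `-r` / `-Y` are tshark's options for reading a file and applying a display filter. Nothing is executed; the function only builds the command lines.

```python
# Hypothetical troubleshooting checklist. Labels are invented for this sketch;
# the filter strings are standard Wireshark display filter expressions.
CHECKLIST = [
    ("SYN-only packets (new connection attempts)",
     "tcp.flags.syn == 1 && tcp.flags.ack == 0"),
    ("Anything Wireshark's TCP analysis flagged", "tcp.analysis.flags"),
    ("Retransmissions", "tcp.analysis.retransmission"),
    ("Zero-window events", "tcp.analysis.zero_window"),
    ("TCP resets", "tcp.flags.reset == 1"),
    ("All UDP traffic (do not forget UDP)", "udp"),
    ("DNS lookups (assumed example of a UDP service)", "dns"),
]


def tshark_commands(capture_path):
    """Return one tshark argv list per checklist entry (nothing is run)."""
    return [["tshark", "-r", capture_path, "-Y", display_filter]
            for _label, display_filter in CHECKLIST]


# Print the commands so they can be pasted into a shell one at a time.
for (label, _), argv in zip(CHECKLIST, tshark_commands("SlowSSHLoging2.pcap")):
    print(f"# {label}")
    print(" ".join(f"'{a}'" if " " in a else a for a in argv))
```

Running the same short list against every trace makes it easier to notice when one of the counts looks different from what you are used to seeing.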
Act III: To Stream or Not to Stream?
• Application developers report extreme slowness when ftp’ing a file.
• What are the top issues that come to mind for slow ftp sessions?
• Lessons Learned:
• Scroll through the trace to look for patterns. Again, trust your brain.
• Develop a technique: a list of common filters to run through when troubleshooting, e.g. tcp.flags==0x02 (SYN only), tcp.analysis.flags
• Buffer tearing is pretty common. Applications are constantly trying to do TCP’s job. App bytes can help you identify it. Learn to recognize it! (Oracle, MS SQL, Sybase, they all do it.)
• Understand what “streaming” really means. TCP *HAS NO* byte boundaries.
• Use the graphical tools available in Wireshark. A picture *IS* worth a thousand words!
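To make "TCP has no byte boundaries" concrete, here is a minimal sketch of what an application has to do if it wants message boundaries on top of a byte stream. The 4-byte length prefix is an assumed framing scheme chosen for illustration; it is not taken from the ftp trace or from any of the databases named above. The receiver is fed the stream in 3-byte pieces to mimic data arriving in chunks that have nothing to do with how the sender wrote it.

```python
import struct


def frame(message: bytes) -> bytes:
    """Prefix a message with a 4-byte big-endian length (assumed scheme)."""
    return struct.pack("!I", len(message)) + message


class Deframer:
    """Rebuild whole messages from arbitrarily sized chunks of a byte stream."""

    def __init__(self):
        self._buffer = bytearray()

    def feed(self, chunk: bytes):
        """Add a chunk and return every message that is now complete."""
        self._buffer.extend(chunk)
        messages = []
        while len(self._buffer) >= 4:
            (length,) = struct.unpack("!I", self._buffer[:4])
            if len(self._buffer) < 4 + length:
                break  # the rest of this message has not arrived yet
            messages.append(bytes(self._buffer[4:4 + length]))
            del self._buffer[:4 + length]
        return messages


# Two application writes become one undifferentiated stream of bytes.
stream = frame(b"HELLO") + frame(b"WORLD!")
deframer = Deframer()
received = []
for i in range(0, len(stream), 3):  # deliver in 3-byte pieces
    received.extend(deframer.feed(stream[i:i + 3]))
print(received)
```

The point of the sketch is that one write on the sender does not equal one read on the receiver, so an application that assumes it does will behave badly once the network starts splitting or merging its data.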
Act IV: Window’s Tale
• Call center servers are not able to keep up with call volume after a data center migration.
• The servers are not getting the data fast enough, causing a backlog. What simple change can increase the throughput?
• The path after the migration is longer by 50 ms.
• Lessons Learned:
• If latency is causing a problem, look for RFC1323-related problems.
• Know what affects a transfer’s throughput: buffer tearing, window sizes, or packet loss.
• Use the graphical plots to zoom in on the problem, so let’s look at the window size. Should we look at the receive or send window?
• Argue your case. If you’re right, you’re right! But you had better be right. You earn your “cred” over time, but you can blow it in one shot!
• Use the graphical tools available in Wireshark. A picture *IS* worth a thousand words! See next page.
Act IV: Window’s Tale (con’t)
• Use STATISTICS, IO GRAPH to bring up this graph. Modify the highlighted items to bring up this view.
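The link between latency and window size in Act IV comes down to one piece of arithmetic: a sender can have at most one window of data in flight per round trip, so throughput is capped at window / RTT. The sketch below works that out. The 65,535-byte figure is the largest window TCP can advertise without RFC 1323 window scaling; the 1 ms "before" round-trip time and the 100 Mbit/s target are assumptions picked for illustration, since the slides give only the extra 50 ms.

```python
UNSCALED_MAX_WINDOW = 65_535   # bytes; largest window without RFC 1323 scaling
ASSUMED_RTT_BEFORE = 0.001     # seconds; hypothetical pre-migration RTT
RTT_AFTER = ASSUMED_RTT_BEFORE + 0.050  # the path is longer by 50 ms


def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput: one full window per round trip."""
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(target_bps: float, rtt_seconds: float) -> float:
    """Window required to sustain target_bps (the bandwidth-delay product)."""
    return target_bps * rtt_seconds / 8


before = max_throughput_bps(UNSCALED_MAX_WINDOW, ASSUMED_RTT_BEFORE)
after = max_throughput_bps(UNSCALED_MAX_WINDOW, RTT_AFTER)
print(f"ceiling before the move: {before / 1e6:.1f} Mbit/s")
print(f"ceiling after the move:  {after / 1e6:.1f} Mbit/s")

# Assumed target rate of 100 Mbit/s, purely for illustration.
needed = window_needed_bytes(100e6, RTT_AFTER)
print(f"window needed for 100 Mbit/s: {needed:,.0f} bytes")
```

Under these assumptions the same 64 KB window that was harmless on a short path becomes the bottleneck once 50 ms is added, which is why window scaling is the first thing to check.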
Act V: A User’s Complaint
• Smith Barney Financial Consultants are complaining of slow page load times for their home page. The problem is sporadic and random, but it happens often enough that it’s impacting their productivity.
• The problem is widespread and not easily reproducible... Where do you start? What do you do? “Who you gonna call?”
• What’s common in the problem? Home page; use of load balancer; common backend servers; affecting many users.
• What’s the job of a load balancer?
• Where should we take the trace?
• What “bad things” can happen if you are using a load balancer with Source NAT configured?
Act V: A User’s Complaint (con’t)
• Lessons Learned:
• Start by looking at what infrastructure is in common for all users experiencing the problem.
• What identifies a TCP connection? A 2-tuple? A 4-tuple?
• Remember that sequence numbers are nothing more than the number of bytes transferred. An acknowledgement is nothing more than an indication of how much of the data you received. If you receive something outside of what’s expected, something went horribly wrong!
• When you have a 22,000 user base, an ephemeral port range of 1024-5000 can be exhausted quickly.
• Sometimes, you have to resort to turning off “relative sequence numbers” for analysis. This is especially true when a load balancer, or any device that NATs, is in the data path.
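The sequence and acknowledgement arithmetic described above is small enough to write down. This is a minimal sketch with made-up numbers, not values from LBProblemNew.pcap: it shows how a relative sequence number is derived from an absolute one, and what acknowledgement number a well-behaved peer should send back. SYN and FIN each consume one sequence number, and everything wraps at 2^32.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def relative_seq(absolute_seq: int, initial_seq: int) -> int:
    """Bytes sent so far, measured from the connection's initial sequence number."""
    return (absolute_seq - initial_seq) % SEQ_SPACE


def expected_ack(seq: int, payload_len: int, syn: bool = False,
                 fin: bool = False) -> int:
    """Acknowledgement number a peer sends after receiving this segment."""
    return (seq + payload_len + int(syn) + int(fin)) % SEQ_SPACE


# Hypothetical values chosen to show the wrap-around case.
isn = 4_294_967_290
print(relative_seq(5, isn))                # 11 bytes past the ISN
print(expected_ack(isn, 0, syn=True))      # a SYN is acknowledged with ISN + 1
print(expected_ack(4_294_967_295, 1))      # wraps to 0
```

When two different connections end up sharing the same addresses and ports behind a NAT, their initial sequence numbers differ, and that mismatch is only visible in the absolute values. An acknowledgement that does not equal the expected value is the "something went horribly wrong" signal.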
Act V: A User’s Complaint (con’t)
• Lessons Learned (con’t): (Turn off relative sequence numbers)
• Frames 1-8 contain the orderly close of a connection.
• Frame 9, which occurs approx. 14 seconds later, is an attempt by a ‘new’ client to open a connection to the LB. (Frame 10 is the LB translated request to the web server.)
• Frame 11 is an acknowledgement for the prior connection. This occurs because the web server still has this socket in FIN-WAIT. (Frame 12 is the translated request, LB to client.)
• Frames 13 and 14 are the RST generated by the client and the translated request, respectively.
• Frames 15-18 contain a connection creation. This is allowed to occur because of the RST. However, this causes the client to pause for approx. 3 seconds.
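The port exhaustion point from the Act V lessons can be put into numbers. With source NAT, the load balancer talks to the real web server from a single address, so the source port is the only part of the 4-tuple left to vary. The sketch below assumes the 1024-5000 range applies to that source-NAT address and assumes a 60-second period during which a used port cannot be reused. Both are illustrative assumptions; the slides do not state a hold time.

```python
def ephemeral_port_count(low: int, high: int) -> int:
    """Number of ports in an inclusive range."""
    return high - low + 1


def max_new_connections_per_second(port_count: int, hold_seconds: float) -> float:
    """Sustained connection rate before ports must be reused too early."""
    return port_count / hold_seconds


PORTS = ephemeral_port_count(1024, 5000)
ASSUMED_HOLD_SECONDS = 60   # hypothetical time a closed socket lingers
USERS = 22_000              # user base quoted in the slides

rate = max_new_connections_per_second(PORTS, ASSUMED_HOLD_SECONDS)
print(f"usable ports: {PORTS}")
print(f"sustainable rate: {rate:.1f} new connections per second")
print(f"users per available port: {USERS / PORTS:.1f}")
```

Under these assumptions there are more than five users for every available port, so a port being reused while the web server still remembers the previous connection is not a rare event.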
Act VI: As You Log It
• After a data center migration, an application was no longer able to support the production traffic. The new data center was separated by 11 ms round-trip latency. Before the move, both servers were located in the same DC.
• Naturally, the first inclination was to blame the network! After all, the problem started after the migration.
• The application generates a 3-byte “alert” message followed by another small packet with the actual data.
• What should be the first problem that comes to your mind?
• What looked like a slam-dunk turned out to be quite complicated!
• In the Army, we had a saying: Be, Know, Do. It applies to packet analysis.
• At the end of the day, in-depth knowledge of how TCP should work allowed us to find the problem.
Act VI: As You Log It (con’t)
• Lessons Learned:
• Nagle and Delayed Acknowledgment deadlock is very common when TCP is used to shuttle small amounts of data.
• This can be a “killer” when trading programs are involved.
• Turning on application-level logging can help, but don’t forget to turn it off!
• Know what impact you can have if you decide to log. For us router-jockeys, it’s equivalent to doing a “debug ip ospf” on a production backbone router. Hint: not a good idea. It’s a self-correcting error: if you do it once, you’ll never do it again!
• If you know how TCP really works, you can argue your point with conviction because deep down inside, you know you’re right.
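The Nagle and delayed acknowledgment interaction bites exactly the pattern described in Act VI: a tiny write followed by a second small write. Two common mitigations are sketched below. Neither is claimed to be what was done in this case; the slides do not say how the problem was resolved. The first disables Nagle on the socket with the standard `TCP_NODELAY` option. The second avoids the pattern entirely by handing the alert and the data to TCP in a single write. The function names are made up for this sketch.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def coalesce(alert: bytes, data: bytes) -> bytes:
    """Join the small alert and its data so they go out in one write."""
    return alert + data


# Hypothetical 3-byte alert and payload, for illustration only.
sock = make_low_latency_socket()
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY enabled:", bool(nodelay))
sock.close()

message = coalesce(b"ALR", b"actual data")
print(len(message), "bytes would be passed to a single send call")
```

Coalescing is usually the gentler fix, since it leaves Nagle in place for everything else the application sends, but it requires a change to the application rather than a socket option.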
Appendix: IPs used in the examples
• ACT I: ICMP_BHNew*pcap
• 192.168.1.1 and 192.168.1.254 are servers on the same switch.
• ACT II: SlowSSHLoging2.pcap
• 192.168.1.1 is the client. 172.16.50.50 is the ssh server. 192.168.75.75 and 192.168.200.200 are NIS+ servers.
• ACT III: SlowFtpAnon.pcap
• 10.10.10.10 is the ftp server. The 192.168.1.1 client is pulling the file from the server.
• ACT IV: MQSlow.pcap
• 172.16.50.50 is the MQ server. 192.168.1.1 is the MQ client. The server is pushing the file to the client.
• ACT V: LBProblemNew.pcap
• 10.2.53.102 and 10.17.97.111 are users in different branches. 172.16.10.10 and 172.16.20.20 belong to the load balancer. 172.16.254.254 is the real web server. 172.16.10.10 is the end-user-facing IP of the LB, and 172.16.20.20 is the IP used by the LB for source NAT’ing when talking to the real web server.
• ACT VI: DCMove_*.pcap
• 192.168.1.102 and 172.16.1.125 are the two servers involved in the transfer. Both send data independently of one another.
• Please email me at hbae@nyc.rr.com if you would like “The Tool” Visio macro.