1 / 29

Portal Postmortem Using Java Thread Dumps to Diagnose Application Failures

Portal Postmortem Using Java Thread Dumps to Diagnose Application Failures. Joseph Dane University of Hawaii. © 2005 University of Hawaii. Overview. Motivation Concepts Examples Resources. Why? When? Who?. Why: because you’ve got a flaky application, and you don’t know the cause

odeda
Download Presentation

Portal Postmortem Using Java Thread Dumps to Diagnose Application Failures

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Portal PostmortemUsing Java Thread Dumps to Diagnose Application Failures Joseph Dane University of Hawaii © 2005 University of Hawaii

  2. Overview • Motivation • Concepts • Examples • Resources

  3. Why? When? Who? • Why: because you’ve got a flaky application, and you don’t know the cause • This is your only choice (except, perhaps …) • When: now. Or possibly later. • You do need a running JVM • If you’ve got a dead JVM, you might have a core • Who: java developers • requires some familiarity with java • threading, locks, etc. in particular • but, no guru level knowledge required

  4. General Principles • The goal is not to solve problems, but to add to the pool of knowledge about the problem • Some sysadmin skills required/useful • Finding PIDs, sending signals, munging text • For some problems (e.g. G.C.), this will all be a complete waste of time • For other (fairly rare) problems, this will immediately point out the problem, e.g. deadlocks • “Postmortem”, in that we can (must, in fact) wait until a problem presents

  5. What can we learn? • Resource contention • network (database, RSS feed, etc.) • java locks • CPU • Not so good for • memory leaks

  6. Terminology • “thread” -- a single execution path through a program • “method” -- a java function call • “stack” -- the context of the currently called methods/functions • “frame” -- a single method call, or a single line in a stack • “up/down” -- “up” is toward the TOS, “down” is toward the start of the program • “TOS” -- “top of stack”, or the currently executing method • “monitor” -- a java lock • only one thread can hold a given lock at a time

  7. The Stack public void A() { Z(); B(); } public void B() { C(); } public void C() { … } public void Z() {} In words: A calls Z, then B, which calls C. C is at the TOS, and is the currently executing method. Z was called previously by A, but is not currently executing, so is no longer on the stack. C B A

  8. The Thread Dump • A snapshot of all activity in the program at a given instant • All JVMs can produce them, in different ways and formats • A text file, in various formats • All include a stack trace per live thread. • Some include explicit lock information. • Generating a thread dump • UNIX: send the QUIT signal to the JVM • Linux: send to the root of the process tree • Mac OS: apparently some flakiness in 1.4 JVM signals • TIP: use -XX:+JavaMonitorsInStackTrace on Sun 1.3.1 • Windows: CTRL-Break • Need to arrange for the service to interact with the desktop. • Redirect output to a file. • jstack (tiger) • Not available for windows? Can also show native frames. • Generating the dump is usually safe.

  9. A Stack Trace "tcpConnection-8443-0" daemon prio=5 tid=0x005469d0 nid=0x19b2800 in Object.wait() [f1112000..f1112ac0] at java.lang.Object.wait(Native Method) - waiting on <0x650c01f0> (a java.lang.Object) at com.caucho.server.TcpServer.accept(TcpServer.java:648) - locked <0x650c01f0> (a java.lang.Object) at com.caucho.server.TcpConnection.accept(TcpConnection.java:211) at com.caucho.server.TcpConnection.run(TcpConnection.java:132) at java.lang.Thread.run(Thread.java:552)

  10. The Thread Dump, Pictorially Time T1 T2 T3 T4 T5 T6

  11. Deadlock example Found one Java-level deadlock: ============================= "T2": waiting to lock monitor 0x0180ed00 (object 0x2a563978, a java.lang.String), which is held by "T1" "T1": waiting to lock monitor 0x0180ecdc (object 0x2a5639a8, a java.lang.String), which is held by "T2" Java stack information for the threads listed above: =================================================== "T2": at hawaii.jasig.DL$Locker.run(DL.java:47) - waiting to lock <0x2a563978> (a java.lang.String) - locked <0x2a5639a8> (a java.lang.String) at java.lang.Thread.run(Thread.java:613) "T1": at hawaii.jasig.DL$Locker.run(DL.java:47) - waiting to lock <0x2a5639a8> (a java.lang.String) - locked <0x2a563978> (a java.lang.String) at java.lang.Thread.run(Thread.java:613)

  12. The tool • Far too many threads for human consumption. • Written in a good text processing language. • Main concept: group related threads together. • Simplest definition of “related” lexical identity of stack frame. • Keep it Simple • Work on the tool ahead of time, verify that it works as expected. • Basic process: learn, hypothesize (I.e. guess), test, repeat.

  13. Sample tool output 1: at com.pipeline.email.engine.ExternalSessionIMAP4.close(ExternalSessionIMAP4.java:1920) 1: at java.lang.Object.clone(Native Method) 1: at org.apache.xml.utils.TreeWalker.startNode(TreeWalker.java:274) 2: at com.pipeline.email.engine.ExternalSessionIMAP4.close(ExternalSessionIMAP4.java:1914) 2: at com.iplanet.server.http.servlet.NSServletSession.writePrivate(Native Method) 3: at org.jasig.portal.UserInstance.renderState(UserInstance.java:282) 5: at com.pipeline.email.engine.ExternalSessionIMAP4.getPersonalNamespace(ExternalSessionIMAP4.java:243) 6: at java.net.PlainSocketImpl.socketAccept(Native Method) 14: at com.pipeline.email.engine.ExternalSessionIMAP4.getJMCacheSubfolders(ExternalSessionIMAP4.java:599) 21: at java.lang.Thread.sleep(Native Method) 47: at java.net.SocketInputStream.socketRead(Native Method) 262: at java.lang.Object.wait(Native Method) total of 365 threads

  14. Step 0 -- Preparation • Monitoring • Manual, e.g. “top” “ps”, etc. • Automatic, e.g. MRTG • Learn about normal behavior. • Number of threads? Distribution? • Response plan • Notifications. Who gets called? • Gather wide expertise. What might be changed? A new switch or firewall? Has registration started? • Analysis plan • Access to the source is important, probably essential • Organize. New directory for each incident, each dump in a separate file. Keep notes as you work.

  15. Step 1 -- Something’s Wrong! • Generate the dump. Maybe a couple of dumps, but one is probably OK for now. • Maybe clean the dump file. • depends on how smart your tool is • Check the system’s vital signs. • Running out of memory? • CPU bound? Idle CPU often means a concurrency problem, or an unavailable resource.

  16. Step 2 -- What’s happening? • 10/27/2004. production portal becomes non-responsive • First, try looking at the tops of the stacks. See anything odd? [EXAMPLE 1] (2004102701) 1: at com.pipeline.grouptools.lib.GTSession.initialize(GTSession.java:203) 2: at java.util.HashMap.put(HashMap.java:324) 6: at java.net.PlainSocketImpl.socketAccept(Native Method) 20: at java.lang.Thread.sleep(Native Method) 47: at java.net.SocketInputStream.socketRead(Native Method) 133: at java.lang.Object.wait(Native Method) 235: at netscape.ldap.util.ConnectionPool.getConnFromPool(ConnectionPool.java:208) total of 444 threads

  17. [EXAMPLE 2] (ex2) 3: at java.net.PlainSocketImpl.socketAccept(Native Method) 6: at java.lang.Thread.sleep(Native Method) 16: at java.net.SocketInputStream.socketRead0(Native Method) 33: at java.lang.Object.wait(Native Method) total of 58 threads • Often the tops are not that useful • Lots of things call Object.wait. Best to ignore this for now. • 16 Threads insocketRead0. Try looking down the stack interesting traces ... 16 interesting traces 17 frames down ... 10: at edu.hawaii.portal.channels.jasig.ReallySlowChannel. setRuntimeData(ReallySlowChannel.java:47) total of 10 threads

  18. Thread 2 Thread 3 Thread 4 Thread 1 edu.hawaii.portal.channels.jasig.ReallySlowChannel.setRuntimeData

  19. Thread 1 Thread 2 Thread 3 Thread 4 edu.hawaii.portal.channels.jasig.ReallySlowChannel.setRuntimeData

  20. Reality is usually more complicated … [EXAMPLE 3] (2004120101) 6: at java.net.PlainSocketImpl.socketAccept(Native Method) 9: at com.example.foo.Bar(Bar.java:559) 9: at com.iplanet.server.http.servlet.NSServletSession.ostream_flush(Native Method) 9: at org.jasig.portal.UserInstance.renderState(UserInstance.java:267) 15: at java.lang.Thread.sleep(Native Method) 36: at org.jasig.portal.UserInstanceManager.getUserInstance(UserInstanceMana ger.java:76) 72: at com.example.foo.Foo.setFoo(Foo.java:596) 99: at java.net.SocketInputStream.socketRead(Native Method) 139: at com.example.foo.Foo.getFoo(Foo.java:534) 142: at com.iplanet.server.http.session.IWSHttpSession.removeAllObjects(IWSHt tpSession.java:705) 232: at java.lang.Object.wait(Native Method) total of 788 threads

  21. When TOS isn’t enough • Sometimes you need to identify threads which pass through certain methods . • “pass through” meaning “currently executing” • Call these methods (and threads) “interesting”. • Some threads may be blocked in the interesting methods. Some may have made it through to other methods.

  22. “interesting” example public class Example { private Object aLock = new Object(); public void foo() { synchronized (aLock) { bar(); } } private void bar() { // calls other methods … } }

  23. Let’s call removeAllObjects “interesting” and see what we get … interesting traces ... 212 interesting traces interesting TOS ... 70: at com.example.Foo.setUserFoo(Foo.java:596) 142: at com.iplanet.server.http.session.IWSHttpSession.removeAllObjects(IWSHttpSession.java:705) • 212 threads made it into or past removeAllObjects. • of these, 142 are blocked inside that method (why? dunno) and 70 have made it past, and are now blocked in setUserFoo • Time to check the source …

  24. setUserFoo(): synchronized( aLock ) { … } getUserFoo(): synchronized( aLock ) { … } _getUserFoo(): synchronized( aLock ) { … }

  25. So, let’s find all locations where aLock is obtained • Call these methods “interesting”, find all threads passing through these methods. 1: at java.net.SocketInputStream.socketRead(Native Method) 9: at com.example.foo.Foo.getUserFoo(Foo.java:559) 72: at com.example.foo.Foo.setUserFoo(Foo.java:596) 139: at com.example.foo.Foo.getUserFrob(Foo.java:534) total of 221 threads Bingo! One thread in socketRead! Find that thread!

  26. nid=0xc9 suspended [0x8dc77000..0x8dc79668] at java.net.SocketInputStream.socketRead(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:85) at oracle.net.ns.Packet.receive(Unknown Source) at oracle.net.ns.NetInputStream.getNextPacket(Unknown Source) at oracle.net.ns.NetInputStream.read(Unknown Source) at oracle.net.ns.NetInputStream.read(Unknown Source) at oracle.net.ns.NetInputStream.read(Unknown Source) at oracle.jdbc.ttc7.MAREngine.unmarshalUB1(MAREngine.java:726) at oracle.jdbc.ttc7.MAREngine.unmarshalSB1(MAREngine.java:698) at oracle.jdbc.ttc7.Oclose.receive(Oclose.java:105) at oracle.jdbc.ttc7.TTC7Protocol.close(TTC7Protocol.java:533) at oracle.jdbc.driver.OracleStatement.close(OracleStatement.java:639) … TIP: Learn to recognize the unexceptional

  27. Resources • jstack, jconsole • BEA analysis tool. • uses XSLT to produce graphs, analysis • http://dev.bea.com/products/wlplatform81/articles/thread_dumps.jsp • sample code • http://www2.hawaii.edu/~jdane/jasig

  28. Finally • If you haven’t considered this method (as a “real” method), you should. • Success rate: 20/50/30 • don’t neglect QC

More Related