On The Trail of a JVM Bug
I recently had the opportunity to track a bug past my application code into the JVM and beyond. I learned a few things along the way.
I found a bug in an application I was working on that I narrowed down to a concurrency issue. Threads were deadlocked somewhere. I couldn’t figure it out in the application so I wrote the most minimal app I could that reproduced the problem. Here’s that code (I removed various System.out… calls for brevity):
public class Foo {
    private int counter;

    public static void main(String[] args) throws Exception {
        new Foo().run();
    }

    public void run() throws Exception {
        Thread[] threads = new Thread[2];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread() {
                public void run() {
                    for (int i = 0; i < 1000000000; i++) {
                        increment();
                    }
                }
            };
        }
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    private synchronized void increment() {
        counter++;
    }
}
The code above deadlocks: both threads hang in the counting loop with zero CPU usage. I didn’t see any reason that should happen, so I sent the code sample to some coworkers. They didn’t see a problem either. At this point I started trying to isolate environmental issues. I was running the following hardware, OS, and JVM:
MacBook Pro 2.3GHz Intel Core i7, 8GB 1333MHz DDR3
OS X 10.6.8 with 64 bit enabled
Java version 1.6.0_29
Java(TM) SE Runtime Environment (build 1.6.0_29-b11-402-10M3527)
Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02-402, mixed mode)
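When threads hang with zero CPU like this, the usual first diagnostic is a thread dump (jstack [pid], or kill -QUIT [pid] against the running JVM). The same information is available in-process via Thread.getAllStackTraces; a minimal sketch (class name is mine, not from the original app):

```java
import java.util.Map;

public class DumpThreads {
    public static void main(String[] args) {
        // Print each live thread's name, state, and top stack frame --
        // a rough in-process version of what jstack shows externally.
        Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> e : dump.entrySet()) {
            Thread t = e.getKey();
            StackTraceElement[] stack = e.getValue();
            System.out.println(t.getName() + " [" + t.getState() + "]"
                    + (stack.length > 0 ? " at " + stack[0] : ""));
        }
    }
}
```

In the hang described here, a Java-level dump alone wasn't conclusive, which is why we moved on to a native core dump.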
The deadlock didn’t happen if I used a Lock instead of a synchronized method. It also didn’t happen if I ran the JVM in 32 bit mode, only in 64 bit mode.
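For reference, here is a sketch of the Lock-based variant that didn't deadlock. The class name, the getter, and the smaller loop bound are mine; the structure otherwise mirrors the original reproduction:

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockCounter {
    private int counter;
    private final ReentrantLock lock = new ReentrantLock();

    public void run() throws Exception {
        Thread[] threads = new Thread[2];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread() {
                public void run() {
                    for (int i = 0; i < 1000000; i++) {
                        increment();
                    }
                }
            };
        }
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    private void increment() {
        // Explicit lock instead of a synchronized method; unlock in
        // finally so the lock is released even if counter++ ever threw.
        lock.lock();
        try {
            counter++;
        } finally {
            lock.unlock();
        }
    }

    public int getCounter() {
        return counter;
    }
}
```

Functionally the two versions are equivalent; the fact that only the synchronized version hung was a strong hint the problem was below the application.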
I started considering the possibility of a JVM issue. That’s not a conclusion to jump to lightly, so I sent the problem to a prominent Java expert, who in turn involved two others. I don’t have their permission to share their names. One of them was able to reproduce the issue on similar hardware.
Next we tried running on two other JVMs: openjdk version 1.7.0-ea and soylatte java version 1.6.0_03-p3. There was no deadlock on either. It was starting to look a lot like a JVM problem.
Once we determined it wasn’t a problem with the application code, we ran the app until it deadlocked and then forced a core dump. To force a core dump from the command line:
- ulimit -c unlimited – set this to allow core dumps
- kill -SIGSEGV pid – send a segmentation fault signal to force the app to crash
The core file is saved in /cores/core.[pid]. There is also a *.crash file saved in [user home]/Library/Logs/DiagnosticReports.
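An aside worth noting: before blaming the JVM or kernel, it's cheap to confirm the hang isn't an ordinary Java-level monitor deadlock, which the JVM can report directly through ThreadMXBean. A minimal sketch (findDeadlockedThreads returns null when no threads are deadlocked on monitors or ownable synchronizers):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Returns the IDs of deadlocked threads, or null if there are none.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            System.out.println("No Java-level deadlock detected");
        } else {
            for (ThreadInfo info : mx.getThreadInfo(ids)) {
                System.out.println(info.getThreadName()
                        + " blocked on " + info.getLockName());
            }
        }
    }
}
```

In a hang like this one, where the threads are stuck in a mach_msg call rather than waiting on each other's monitors, this check would likely report nothing, which itself points the finger below the Java layer.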
The crash file analysis showed this info for the two threads:
Thread 24: Java: Thread-1
0   libSystem.B.dylib   0x00007fff89f88d7a mach_msg_trap + 10
1   libSystem.B.dylib   0x00007fff89f893ed mach_msg + 59
2   libclient64.dylib   0x000000010100d903 jio_snprintf + 37641
...

Thread 25: Java: Thread-2
0   libSystem.B.dylib   0x00007fff89f88d7a mach_msg_trap + 10
1   libSystem.B.dylib   0x00007fff89f893ed mach_msg + 59
2   libclient64.dylib   0x000000010100db03 jio_snprintf + 38153
...
Both threads are waiting on a message from the kernel. That suggests the kernel failed to deliver the message rather than a JVM bug, but we aren’t ready to declare any definite answer yet. The next step is the core dump analysis. Wherever this leads, and whether or not we definitively identify the problem, it’s been an interesting experience.
In my experience, non-Java applications can sometimes lock up the same way, in what appears to be a missed message. You might want to send this to Apple; I’m sure the Mac-Port guys would be very interested in it. (macosx-port-dev-request@openjdk.java.net)