Bug 212262 - JRE can hang with circular dependencies
Summary: JRE can hang with circular dependencies
Status: RESOLVED FIXED
Alias: None
Product: Equinox
Classification: Eclipse Project
Component: Framework
Version: 3.3.1
Hardware: All All
Importance: P3 normal
Target Milestone: 3.5 M7
Assignee: Thomas Watson CLA
QA Contact:
URL:
Whiteboard:
Keywords: Documentation
Duplicates: 215834 221329 227587 301640 331818 354844 362154 364202 369917 377609 389659 394363
Depends on:
Blocks:
 
Reported: 2007-12-07 08:49 EST by John Wells CLA
Modified: 2013-09-30 08:43 EDT

See Also:


Attachments
Sun JDK 1.5.0_12 stack dumps on failure (91.10 KB, text/plain)
2007-12-07 08:50 EST, John Wells CLA
Jrockit 1.5.0_12 with stack traces when failure happens (106.27 KB, text/plain)
2007-12-07 08:56 EST, John Wells CLA
Stacks showing a ghost lock (6.63 KB, text/plain)
2008-08-31 18:16 EDT, Stephan Herrmann CLA
deadlock in disabled checkCerts (24.86 KB, text/plain)
2009-01-20 11:48 EST, Stephan Herrmann CLA
work in progress (6.04 KB, patch)
2009-04-22 15:10 EDT, Thomas Watson CLA
work in progres 2 (5.51 KB, text/plain)
2009-04-22 16:37 EDT, Thomas Watson CLA
updated patch (7.03 KB, text/plain)
2009-04-23 12:58 EDT, Thomas Watson CLA
final patch (8.29 KB, patch)
2009-04-23 15:09 EDT, Thomas Watson CLA

Description John Wells CLA 2007-12-07 08:49:14 EST
Build ID: I20070625-1500

Steps To Reproduce:
1. Have a sufficiently complex application
2. Run it many times (this happens approximately 1/50 runs)
2a.  Use Sun JDK 1.5.0_12 or Jrockit 1.5.0_12
3. Wait until it gets stuck


More information:
I will attach a file that contains the stack dumps when this happens.  In the Sun case it seems to get stuck in "defineClass" while in the jrockit case it seems to get stuck elsewhere.  When this happens we see several threads stuck waiting for the Classloader.

Other observations:
1.  We have not been able to reproduce this problem when we set osgi.classloader.singleThreadLoads=false (we went up to 3000 4 minute runs).
2.  The application we have uses Spring, ActiveMQ and our own declarative system for wiring things.  It is fairly complex.  We have *not* seen this with less complex suites.
3.  This is *not* the PermGen bug described here: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6320642.  We have run this test with the PermGen space jacked high (256m) and the problem still occurs.  Furthermore, one of the characteristics of that failure is the jdk spins and we are seeing no CPU load at all when this happens.
4.  There is no Java deadlock.  No Java lock cycles are causing this.  Instead the JRE seems to be stuck in native code!

We are continuing to investigate with our own JRockit team (they are getting into the debugging of the native code) but I figured there might be people here who have seen this or know what we can do to fix it!
Comment 1 John Wells CLA 2007-12-07 08:50:28 EST
Created attachment 84728 [details]
Sun JDK 1.5.0_12 stack dumps on failure
Comment 2 John Wells CLA 2007-12-07 08:56:20 EST
Created attachment 84729 [details]
Jrockit 1.5.0_12 with stack traces when failure happens
Comment 3 John Wells CLA 2007-12-07 09:00:41 EST
BTW, we *know* this happens on Sun and BEA JVMs.  We have not tried other VMs so we don't know whether or not this happens elsewhere.  Like... for example with the IBM JVM ;-)
Comment 4 Thomas Watson CLA 2007-12-07 10:24:14 EST
I don't see this in the dumps, but in your system do you have other classloaders that sit on top and/or do not participate in the global lock?  This is likely a red herring but I thought the Spring-OSGi stuff used its own classloaders sometimes for proxy classes.  If that is true maybe that classloader is locked and the guts of the VM are unable to lock it while defining a class in another thread.

Again the dumps do not seem to indicate this, but it was a thought I had while reading through your dumps.  My other guess is that the VM does not like us releasing the lock that the native VM established.  I know there used to be Sun VM bugs around this but I thought they had gotten fixed in the latest 1.5 VMs after 1.5.0_08.

(In reply to comment #3)
> BTW, we *know* this happens on Sun and BEA JVMs.  We have not tried other VMs
> so we don't know whether or not this happens elsewhere.  Like... for example
> with the IBM JVM ;-)
> 

The IBM vm is quite different with respect to locking the classloader from the native VM while loading a class (it does not lock the classloader).  Without testing the complex environment we cannot say for certain, but if the native VM lock is causing the issue then it should not be an issue on the IBM VM.
Comment 5 John Wells CLA 2007-12-11 10:19:29 EST
We have been "spinning" the test suite that causes the problem with the IBM VM and have not seen the problem in approximately 4000 runs.  So I think you are correct that this bug does *not* show up in the IBM VM.  We are continuing to spin it (probably all through the week) just to be "sure".
Comment 6 Thomas Watson CLA 2008-04-21 10:45:12 EDT
See bug 227587 for a detailed description of why this deadlock occurs.
Comment 7 Thomas Watson CLA 2008-04-21 10:46:16 EDT
*** Bug 227587 has been marked as a duplicate of this bug. ***
Comment 8 Thomas Watson CLA 2008-04-21 11:11:31 EDT
Note that bug 121737 introduced the osgi.classloader.singleThreadLoads option.  Please see that bug for details on what that option does and how it was supposed to solve the deadlock issues.  It appears the option is pretty much useless with the classloader->classname locking strategies of some of the latest VMs.
Comment 9 Andy Piper CLA 2008-05-23 04:49:00 EDT
From the JR team:

I have looked through the source code attached to this CR and found two circular dependencies. These could lead to deadlocks under some conditions. It is far from sure that this is your problem, but they are worth looking into:

STATE_OBJECT_FACTORY and STATE_OBJECT_FACTORY_IMPL:

============== eclipse/osgi/service/resolver/StateObjectFactory.java:

public interface StateObjectFactory {
...
  public static final StateObjectFactory defaultFactory = new StateObjectFactoryImpl();

============== org/eclipse/osgi/internal/resolver/StateObjectFactoryImpl.java:

public class StateObjectFactoryImpl implements StateObjectFactory {
...

----------------------------------------------------------

In the StateObjectFactory interface, we initialize a static field with a call to the StateObjectFactoryImpl constructor, and StateObjectFactoryImpl implements the StateObjectFactory interface. This means that if two threads, A and B, try to initialize the two classes at exactly the same time, we could get a deadlock (classes are initialized the first time a static field in them is read or modified, or the first time a method in them is called):

* A starts to initialize class StateObjectFactory
* B starts to initialize class StateObjectFactoryImpl
* A comes to 'new StateObjectFactoryImpl()'
    To call this function, class StateObjectFactoryImpl must be initialized,
    but another thread (B) is currently loading it.
* A will therefore wait until B has initialized class StateObjectFactoryImpl
* B comes to 'implements StateObjectFactory'
    Before a class is initialized, its superclass must be initialized
    but another thread (A) is currently loading it.
* B will therefore wait until A has initialized class StateObjectFactory

We have a deadlock!
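
For illustration, the interleaving above can be forced in a small standalone program. This is only a sketch and not Equinox code: the names InitCycleDemo, Base and Derived are made up, and it uses a class/subclass pair, for which the superclass-first rule of JLS 12.4 applies directly (interfaces follow somewhat different initialization rules). The 500 ms pause is an assumption to make the bad timing reliable, and daemon threads are used so the JVM can still exit.

```java
import java.util.concurrent.CountDownLatch;

public class InitCycleDemo {
    // Signals that thread A has entered Base's static initializer.
    static final CountDownLatch baseInitStarted = new CountDownLatch(1);

    static class Base {
        static final Base DEFAULT;
        static {
            baseInitStarted.countDown();
            sleepQuietly(500); // widen the race window so B can start on Derived
            DEFAULT = new Derived(); // needs Derived to be initialized
        }
    }

    // Initializing Derived first requires its superclass Base to be initialized.
    static class Derived extends Base {
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if both threads are still blocked after the timeout. */
    public static boolean provokeDeadlock() {
        Thread a = new Thread(() -> {
            Object unused = Base.DEFAULT; // triggers initialization of Base
        }, "A");
        Thread b = new Thread(() -> {
            try {
                baseInitStarted.await(); // wait until A is inside Base's initializer
            } catch (InterruptedException e) {
                return;
            }
            new Derived(); // triggers initialization of Derived, then Base
        }, "B");
        a.setDaemon(true); // daemon threads let the JVM exit despite the hang
        b.setDaemon(true);
        a.start();
        b.start();
        try {
            a.join(2000);
            b.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return a.isAlive() && b.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("deadlocked: " + provokeDeadlock());
    }
}
```

A thread dump taken while both threads are stuck shows no Java-level lock cycle, because class initialization locks are not ordinary monitors; that would be consistent with the "no Java deadlock" observation in the description.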

We have a similar case, but that might not be a problem:

CONDITION and BOOLEAN_CONDITION:

============== org/osgi/service/condpermadmin/Condition.java

public interface Condition {
...
  public final static Condition TRUE  = new BooleanCondition(true);
  public final static Condition FALSE = new BooleanCondition(false);
...
}
final class BooleanCondition implements Condition{
...

----------------------------------------------------------
(This one might not be a problem, since BooleanCondition is not a public class and can only be created from inside Condition.)

INVESTIGATING FURTHER:

The most interesting dependency is the first one, StateObjectFactory => StateObjectFactoryImpl => StateObjectFactory

The best option would be to get rid of this dependency, for example by not setting the variable in the class initializer, but instead having an initializer method somewhere. One could have an initializer method that is called explicitly (not through a static block or something) before the class is used and that sets
StateObjectFactory.defaultFactory = new StateObjectFactoryImpl();

Note that just having a static variable does not give a dependency, so:
public class StateObjectFactory{
...
  public static StateObjectFactory defaultFactory;

is ok since we don't try to initialize any StateObjectFactory objects.

ADDING PAUSES:

Another interesting thing to try could also be to add pauses in the class loading. In the classes above, add this to the absolute top of the classes:

  static{
    try{
      Thread.sleep(1000);  // widen the race window during class initialization
    }catch(Exception e){}
  }

This will make class initializing wait for a second before continuing. If the deadlock depends on any of these classes, this should make the 'bad timing' happen more often. If thread A tries to initialize StateObjectFactory, then we have a full second for another thread to initialize StateObjectFactoryImpl to see the deadlock. This could be an interesting thing to see if we can reproduce the issue more often.

Another idea on how to remove the circular dependency:

public interface StateObjectFactory {
...
  private static final StateObjectFactory defaultFactory = null;
  public static StateObjectFactory getFactory(){
    if (defaultFactory == null){
      defaultFactory = new StateObjectFactoryImpl();
    }
    return defaultFactory;
  }

That removes the dependency. Now you can initialize StateObjectFactory without initializing StateObjectFactoryImpl. Besides, I think it looks nicer as well =)
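
A note on the snippet above: fields declared in an interface are implicitly public static final, so the lazily assigned field cannot live in the interface itself (and static interface methods only exist from Java 8 on). A minimal sketch of the same idea with the lazy field moved into a separate holder class could look like this. All names here (LazyFactoryDemo, Factory, FactoryImpl, Factories) are hypothetical stand-ins, not the real Equinox types.

```java
public class LazyFactoryDemo {
    // Hypothetical stand-in for StateObjectFactory.
    public interface Factory {
    }

    // Hypothetical stand-in for StateObjectFactoryImpl.
    public static final class FactoryImpl implements Factory {
        public static int created = 0; // counts constructor calls for the demo

        FactoryImpl() {
            created++;
        }
    }

    // Hypothetical holder class; the interface no longer refers to the implementation.
    public static final class Factories {
        private static Factory defaultFactory; // not final, assigned on first use

        public static synchronized Factory getFactory() {
            if (defaultFactory == null) {
                defaultFactory = new FactoryImpl();
            }
            return defaultFactory;
        }
    }

    public static void main(String[] args) {
        // Both calls return the same lazily created instance.
        System.out.println(Factories.getFactory() == Factories.getFactory());
    }
}
```

With this shape, initializing the interface never triggers initialization of the implementation class; the implementation is only touched when getFactory() is first called.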
Comment 10 Thomas Watson CLA 2008-05-23 08:12:02 EDT
(In reply to comment #9)
Andy, thanks again for the detailed information.  I doubt the circular dependency here is causing any deadlock though because the StateObjectFactory class is loaded very early in the launch of the framework and it should be initialized before you get to running any code from Bundles in the framework.

But I will try some of the suggestions you mentioned to force a deadlock to see if this is a general issue for other classes with a similar pattern.
Comment 11 Stephan Herrmann CLA 2008-06-07 16:38:42 EDT
I, too, am having problems with osgi.classloader.singleThreadLoads=true.

I had a reproducible situation where I couldn't even see any tricky
circularity. Basically, I had a deadlock between two threads trying to use 
a ProgressManager where ProgressManager$1.updateFor was trying to load
interface IJobChangeEvent, both threads locking on the same classloader.
Apparently, the loader.wait() statement in BundleLoader.lock() did not
sufficiently release the lock.
It looked scary in the debugger, since a perfectly simple method call
could not be stepped into, debugger deterministically lost the connection
to its debug target at this statement (reflecting the observation that
the hang occurs inside the native JVM code).

I read that you already understood the cause, which is good.

So far our tool *required* singleThreadLoads=true to run fairly smoothly.
Given that this strategy is dead, too, does anybody have any suggestions
for now, what I should tell our users? I see these options:
 * Whenever a deadlock occurs, kill eclipse and *toggle* the
   singleThreadLoads flag, as it seems to choose between two potential
   deadlocks, which *usually* don't both occur together.
 * Strictly avoid using a Sun VM (which seems to be the only JVM most
   people have installed)? But which ones don't natively lock the
   classloader and are known to work well for Eclipse? IBM? Others?

Is there anything else that can be done?
Comment 12 Thomas Watson CLA 2008-06-07 19:52:46 EDT
(In reply to comment #11)
> Is there anything else that can be done?
> 

Not officially, until Java SE 7 comes out, which should allow us to develop a grid of delegating class loaders without deadlocking as part of the modularity work of JSR 277.

In the mean time the only thing I can think of you trying is the Sun VM options described in bug 121737 comment 8.  I know of some products using these options successfully.  You can also go add your vote to the very popular Sun bug http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4670071
Comment 13 Stephan Herrmann CLA 2008-06-30 13:59:10 EDT
(In reply to comment #12)
> In the mean time the only thing I can think of you trying is the Sun VM options
> described in bug 121737 comment 8.  I know of some products using these options
> successfully.  You can also go add your vote to the very popular Sun bug
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4670071

Let me report that we moved to adding those two obscure Sun vm options
to eclipse.ini (successfully using p2's EclipseTouchpoint ;-) )
and thus could move away from single thread loads.

The system appears to be stable in this configuration.

Another positive feedback: due to the hooks added in bug 208591
switching between different locking strategies didn't affect our
own implementation since our hooks simply execute in whatever locking
context they are invoked in. I like that architecture ;-)
Comment 14 Stephan Herrmann CLA 2008-08-31 18:16:15 EDT
Created attachment 111365 [details]
Stacks showing a ghost lock

(In reply to comment #13)
> Let me report that we moved to adding those two obscure Sun vm options
> to eclipse.ini (successfully using p2's EclipseTouchpoint ;-) )
> and thus could move away from single thread loads.
> 
> The system appears to be stable in this configuration.

FWIW: I just observed a deadlock in the debugger where the blocking
lock was never taken. More specifically:
As mentioned we are using these VM-args:
  -XX:+UnlockDiagnosticVMOptions -XX:+UnsyncloadClass
The attached stack trace shows two threads interlocked in an 
unearthly fashion:

Thread "main" has the classname lock and waits for a DefaultClassLoader lock,
which is supposed to be owned by Thread "Worker-2".

Thread "Worker-2" tries to obtain the classname lock, but is *not*
executing any code that should cause a DefaultClassLoader lock =:-0
(the debugger shows that all frames mentioning DefaultClassLoader
indeed happened at the very instance that "main" is waiting for)

I guess we should read this as: the -XX.. options are still buggy
in a way that the VM still takes a lock which is invisible in the stack
trace (or perhaps: had been taken without being ever returned or s.t.).

Reporting here just for completeness. *Most* of the time the machine
runs smoothly, at least significantly better than with singleThreadLoads.
Comment 15 Thomas Watson CLA 2008-09-02 10:09:09 EDT
If you get into a situation where you can reproduce this hang then please try the following configuration option to see if it goes away:

osgi.support.class.certificate=false

From the stack trace it appears to be an issue when the VM is checking the certificates of the class on the main thread and needs the classloader lock to do so.  But I'm not sure where the thread Worker-2 is holding onto the class loader lock.  Perhaps it is as you say and the VM is holding the lock under the covers when it should not be.
Comment 16 Stephan Herrmann CLA 2009-01-20 11:48:42 EST
Created attachment 123096 [details]
deadlock in disabled checkCerts

While most of the time things work fine, I added 
   -Dosgi.support.class.certificate=false
to my command line, just in case. 
My eclipse.ini also has:
   -XX:+UnlockDiagnosticVMOptions
   -XX:+UnsyncloadClass
config.ini has:
   osgi.classloader.lock=classname
   osgi.classloader.singleThreadLoads=false

A while ago I recorded a stack dump of a deadlock that
happened despite this precaution. Deadlock occurred while
trying to enter ClassLoader.checkCerts. Is the switch
intended to stop this invocation or were you talking about
a different call chain?

In another thread, loadClassInternal() again took a lock,
despite being told not to do so.

Still nothing reproducible on my side, it _usually_ works :-/
Comment 17 Thomas Watson CLA 2009-04-22 10:55:03 EDT
We should look at using the new Java 7 method ClassLoader.registerAsParallelCapable() to fix the deadlock issues.  If this method is available and returns true then we should lock on class name instead of on the class loader object.

see http://download.java.net/jdk7/docs/api/java/lang/ClassLoader.html#registerAsParallelCapable()
Comment 18 Thomas Watson CLA 2009-04-22 15:10:07 EDT
Created attachment 132826 [details]
work in progress

Also see http://openjdk.java.net/groups/core-libs/ClassLoaderProposal.html

Java 7 is a ways off yet.  I would like to add some support for this into 3.5 but I think it may have to remain disabled by default with an option to enable it.  This patch illustrates what I am thinking.  The patch is untested.  Need to get Java 7 setup on my Windows machine.  Unfortunately there are no Java 7 builds available for the Mac :(
Comment 19 Thomas Watson CLA 2009-04-22 16:37:53 EDT
Created attachment 132840 [details]
work in progres 2

The registerAsParallelCapable method is a static protected method.  This forces me to use getDeclaredMethod and setAccessible in order to use it with reflection.  This code now does the right thing on Java 7.  Still need to do some testing to try to force a deadlock.
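
For readers following along, the reflective call described above could be sketched roughly as follows. This is not the attached patch: the class name ParallelCapableLoader and the helper tryRegister are made up for illustration. The lookup is done from inside a ClassLoader subclass because the method is protected and registers its calling class.

```java
import java.lang.reflect.Method;

public class ParallelCapableLoader extends ClassLoader {
    // True if this VM offers registerAsParallelCapable() and registration succeeded.
    public static final boolean PARALLEL_CAPABLE = tryRegister();

    private static boolean tryRegister() {
        try {
            // The method is protected static, so getMethod() would not find it.
            Method register = ClassLoader.class.getDeclaredMethod("registerAsParallelCapable");
            register.setAccessible(true);
            // Static method: no receiver object is needed.
            return Boolean.TRUE.equals(register.invoke(null));
        } catch (Exception e) {
            // Pre-Java 7 VM (no such method) or access denied: keep the old locking.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("parallel capable: " + PARALLEL_CAPABLE);
    }
}
```

On a VM without the method the lookup fails with NoSuchMethodException and the flag stays false, so the same code can run on older VMs.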
Comment 20 Thomas Watson CLA 2009-04-23 12:18:46 EDT
Using the original testcase from bug 121737 I am able to reproduce the deadlock on Java 7 without the parallel option enabled.  Once I enable this option I cannot get the deadlock to happen, even when trying to force the deadlock by breaking in the class loader with the debugger.
Comment 21 Thomas Watson CLA 2009-04-23 12:58:05 EDT
Created attachment 132976 [details]
updated patch

Updated patch to include javadoc for ParallelClassLoader interface.  While this is not true API (it is considered the SPI for the framework hooks), it is nice to give a description of how this interface is used by the ClasspathManager.  I tried to leave any mention of Java 7 APIs out of the description since this API is not final and will not be available for some time.
Comment 22 Thomas Watson CLA 2009-04-23 15:09:51 EDT
Created attachment 133002 [details]
final patch

Did a review with John.  I had a typo in the ParallelClassLoader.isParallelCapable() method name.  I also added a note to the interface stating that the interface is an interim API and subject to change.
Comment 23 Thomas Watson CLA 2009-04-23 15:47:19 EDT
Renaming bug to reflect the content of this bug report and fix more accurately.

We need to document that the osgi.classloader.singleThreadLoads option is deprecated and useless on modern VMs (1.5 or greater) because of the native VM class name locking.  We also need to document the new option "osgi.classloader.type".  If set to "parallel" and run on Java 7 then the OSGi class loader will lock on the class name instead of the class loader when finding/defining classes.
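
To make the difference between the two locking strategies concrete: locking on the class loader serializes every load through one monitor, while locking on the class name gives each class its own lock object. The following is only a rough sketch of the second idea, under the assumption of a simple map of lock objects; ClassNameLocks and lockFor are hypothetical names and this is not the code in the patch.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ClassNameLocks {
    // One lock object per class name instead of one lock for the whole loader.
    private final ConcurrentMap<String, Object> locks = new ConcurrentHashMap<String, Object>();

    /** Returns the lock object for the given class name, creating it on first use. */
    public Object lockFor(String className) {
        Object fresh = new Object();
        Object existing = locks.putIfAbsent(className, fresh);
        return existing != null ? existing : fresh;
    }

    public static void main(String[] args) {
        ClassNameLocks locks = new ClassNameLocks();
        // Two threads loading different classes no longer contend on one monitor.
        synchronized (locks.lockFor("com.example.Foo")) {
            System.out.println("holding the lock for com.example.Foo only");
        }
    }
}
```

Two threads that load different classes through the same loader then synchronize on different objects, which removes the loader-wide monitor from the lock cycle.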
Comment 24 Andy Piper CLA 2009-04-28 05:42:56 EDT
Hey Tom, related question - is this problem known to exist or not exist on the IBM VM? Thanks -- andy
Comment 25 Thomas Watson CLA 2009-04-28 09:18:49 EDT
(In reply to comment #24)
> Hey Tom, related question - is this problem known to exist or not exist on the
> IBM VM? Thanks -- andy
> 

No, the IBM VM does not lock the class loader object natively.  But in Equinox we still lock the class loader object when defining classes by default.  This can still lead to a rare case of deadlock when circularity is involved.  You can enable the same class name locking strategy that we use on Java 7 by setting the following property on IBM VM (or any other VM that does not lock the class loader natively):

osgi.classloader.lock=classname
Comment 26 Thomas Watson CLA 2010-02-01 09:51:30 EST
*** Bug 221329 has been marked as a duplicate of this bug. ***
Comment 27 Thomas Watson CLA 2010-02-03 11:38:25 EST
*** Bug 301640 has been marked as a duplicate of this bug. ***
Comment 28 Hasan Ceylan CLA 2010-02-10 04:53:31 EST
Is this fix included in Eclipse 3.6M5?

For one particular workspace I started it ~10 times, of which 9 locked up.
Comment 29 Hasan Ceylan CLA 2010-02-10 04:55:39 EST
It seems like 3.6M5 has the patch. It still locks up for me, and very frequently.
Comment 30 Andy Piper CLA 2010-02-10 05:44:05 EST
The patch only works if you are running on Java SE 7. Are you?
Comment 31 Hasan Ceylan CLA 2010-02-10 05:48:22 EST
I tried with openjdk 1.6 and sun jdk 1.6.

What's the solution for 1.6?
Comment 32 Andy Piper CLA 2010-02-10 05:52:43 EST
There isn't one. Fundamentally it's a problem in the VM.
You could try with JRockit as this is much less susceptible to the problem.
Comment 33 Andy Piper CLA 2010-02-10 05:53:41 EST
   -XX:+UnlockDiagnosticVMOptions
   -XX:+UnsyncloadClass

 also works tolerably well for many people.
Comment 34 Hasan Ceylan CLA 2010-02-10 06:07:39 EST
Thanks for the directions....
Comment 35 Thomas Watson CLA 2010-10-21 17:25:19 EDT
*** Bug 215834 has been marked as a duplicate of this bug. ***
Comment 36 Thomas Watson CLA 2010-12-04 10:47:51 EST
*** Bug 331818 has been marked as a duplicate of this bug. ***
Comment 37 Nicolas Rouquette CLA 2011-03-15 13:35:11 EDT
(In reply to comment #33)
>    -XX:+UnlockDiagnosticVMOptions
>    -XX:+UnsyncloadClass
> 
>  also works tolerably well for many people.

Except for EMF's EPackage registry which does unsynchronized classloader operations that can trip this bug, even with the above settings.

https://bugs.eclipse.org/bugs/show_bug.cgi?id=340061
Comment 38 Thomas Watson CLA 2011-08-16 17:06:17 EDT
*** Bug 354844 has been marked as a duplicate of this bug. ***
Comment 39 Dani Megert CLA 2011-10-31 06:08:13 EDT
*** Bug 362154 has been marked as a duplicate of this bug. ***
Comment 40 Thomas Watson CLA 2011-11-21 14:26:29 EST
*** Bug 364202 has been marked as a duplicate of this bug. ***
Comment 41 Thomas Watson CLA 2012-01-31 08:47:11 EST
*** Bug 369917 has been marked as a duplicate of this bug. ***
Comment 42 Dani Megert CLA 2012-04-25 06:19:06 EDT
*** Bug 377609 has been marked as a duplicate of this bug. ***
Comment 43 Thomas Watson CLA 2012-09-18 02:11:00 EDT
*** Bug 389659 has been marked as a duplicate of this bug. ***
Comment 44 Thomas Watson CLA 2012-11-15 08:29:56 EST
*** Bug 394363 has been marked as a duplicate of this bug. ***