Bug 396730 - Classloader deadlock related to org.eclipse.birt.core.framework.osgi.OSGILauncher$ChildFirstURLClassLoader
Status: NEW
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: BIRT
Version: 2.6.1
Hardware: PC Windows XP
Importance: P3 normal with 1 vote
Target Milestone: ---
Assignee: Birt-ReportEngine-inbox@eclipse.org CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-12-17 05:23 EST by Olivier LE JACQUES CLA
Modified: 2014-11-20 17:27 EST

See Also:


Attachments
Dump of a deadlock crash (39.41 KB, text/plain)
2012-12-17 05:23 EST, Olivier LE JACQUES CLA
Birt servlet (1.28 KB, text/xml)
2012-12-26 04:15 EST, Olivier LE JACQUES CLA
OC4J threads dump (23.65 KB, text/plain)
2013-02-12 11:05 EST, Nicolas Lecroart CLA
patch to resolve the thread dead lock (20.99 KB, application/octet-stream)
2013-02-20 20:39 EST, Wei Yan CLA
BIRT 2.5.2 classloader deadlock on Sun JVM (19.21 KB, text/plain)
2014-11-20 17:27 EST, Volker Kleinschmidt CLA

Description Olivier LE JACQUES CLA 2012-12-17 05:23:28 EST
Created attachment 224794 [details]
Dump of a deadlock crash

When I initialize the BIRT report engine, a circular deadlock occurs at random.

Please find the JVM dump attached.

Configuration:
WebLogic 10.3.2
JRockit JVM 1.6.0.14

Regards, Olivier
Comment 1 Gang Liu CLA 2012-12-17 21:34:29 EST
Could you please tell me more about how to reproduce this bug?
Comment 2 Olivier LE JACQUES CLA 2012-12-21 10:14:05 EST
I don't know exactly how to reproduce it.

This bug happens when Spring initializes a BIRT servlet that we define. We call the method Platform.startup() during the initialization.

We use Spring 3.0.5 with a context loader listener.

The thread that initializes the servlet deadlocks with the thread of Spring's context loader listener.
Comment 3 Gang Liu CLA 2012-12-25 05:18:17 EST
Could you please attach the birt servlet?
Comment 4 Olivier LE JACQUES CLA 2012-12-26 04:15:26 EST
Created attachment 225050 [details]
Birt servlet
Comment 5 Olivier LE JACQUES CLA 2012-12-26 04:17:33 EST
Please find the attached birt servlet.

Here is the method called to initialize the report engine :

public void startReportEngine() {
    final EngineConfig config = new EngineConfig();
    config.setEngineHome("");
    initLogger(config);
    final IPlatformContext context = new PlatformServletContext(this.servletContext);
    config.setPlatformContext(context);
    try {
        Platform.startup(config);
    } catch (final BirtException e) {
        LOGGER.error(e.getMessage());
    }
    final IReportEngineFactory factory = (IReportEngineFactory) Platform
            .createFactoryObject(IReportEngineFactory.EXTENSION_REPORT_ENGINE_FACTORY);
    this.birtEngine = factory.createReportEngine(config);
}
Comment 6 yc d CLA 2013-01-04 01:21:25 EST
I retested the scenario you described above, using WebLogic with its JRockit JVM and Spring to initialize the report engine.

I cannot reproduce the deadlock issue.

I think there is something wrong with the way the report engine is initialized. Our report engine is a singleton, so there should be only one report engine instance in the heap, but the test case script creates a new report engine instance on each initialization.
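
For illustration only, the "create the engine once and reuse it" idea can be sketched in plain Java as below. This is a generic sketch and not BIRT code: SingleInstanceHolder is a hypothetical helper, and the factory passed to it stands in for whatever actually creates the report engine.

```java
import java.util.function.Supplier;

// Hypothetical helper (not part of BIRT): creates the wrapped object at most once
// and hands the same instance back on every later call.
final class SingleInstanceHolder<T> {
    private final Supplier<T> factory; // assumed to build the engine (or any costly object)
    private T instance;

    SingleInstanceHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    // synchronized so that two threads initializing concurrently cannot both run the factory
    synchronized T get() {
        if (instance == null) {
            instance = factory.get();
        }
        return instance;
    }

    public static void main(String[] args) {
        SingleInstanceHolder<Object> holder = new SingleInstanceHolder<>(Object::new);
        System.out.println(holder.get() == holder.get());
    }
}
```

A servlet would keep one such holder (for example in a static field) and call get() from its initialization code instead of building a new engine each time.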
Comment 7 Nicolas Lecroart CLA 2013-02-12 11:02:29 EST
Hello,

We are facing a similar problem on AIX using the BIRT 2.6.2 report engine in an EAR file running on OC4J (see the attached thread dump). It seems quite similar to the deadlock situations reported in https://bugs.eclipse.org/bugs/show_bug.cgi?id=287102 with Jetty/Tomcat (supposed to be fixed in 2.5.1; could you please give us more details on what has been fixed?).

With bug 287102 and the problem reported above, it seems that the OSGILauncher$ChildFirstURLClassLoader is involved in deadlocks in various environments. As per the source code, this class "alters regular ClassLoader delegation and will check the URLs used in its initialization for matching classes before delegating to it's parent."
It also seems that setting the EngineConstants.APPCONTEXT_CLASSLOADER_KEY has an impact on the deadlock frequency. In our application, we always set the appcontext classloader to the EAR classloader, as we want the BIRT engine to be able to load our classes.
Is it possible that the approach implemented by the OSGILauncher$ChildFirstURLClassLoader generates deadlocks by altering the classloading order and causing threads to take locks in an unexpected order while loading classes?
How come the OSGILauncher$ChildFirstURLClassLoader plays a role when the application server is loading a class that is not related to BIRT? Is it because the OSGi classloader is added to the classloader hierarchy of the application server when BIRT starts? If so, then setting the appcontext classloader to your application classloader will typically create a classloader cycle. Is there another way to make our classes available to BIRT without using the appcontext classloader, one that would not suffer from this issue?
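
To make the "child-first" delegation described above concrete, here is a minimal generic sketch. It is not the BIRT OSGILauncher source; ChildFirstSketchLoader is a hypothetical name, and the sketch only shows the general technique of searching the loader's own URLs before asking the parent.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical illustration of child-first delegation (NOT the BIRT implementation).
class ChildFirstSketchLoader extends URLClassLoader {

    ChildFirstSketchLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        // This loader is not registered as parallel capable, so the lock object
        // returned here is the loader itself.
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    // Child first: search this loader's own URLs.
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // Not found locally: fall back to the regular parent-first lookup.
                    c = super.loadClass(name, false);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        ChildFirstSketchLoader loader =
                new ChildFirstSketchLoader(new URL[0], ClassLoader.getSystemClassLoader());
        // With no URLs of its own, every request ends up at the parent.
        System.out.println(loader.loadClass("java.lang.String") == String.class);
    }
}
```

Note that the loader holds its own lock while it calls into the parent. That is the property that matters for lock ordering when another thread enters the same pair of loaders from the opposite direction.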

Thanks for your help.
Comment 8 Nicolas Lecroart CLA 2013-02-12 11:05:34 EST
Created attachment 226930 [details]
OC4J threads dump
Comment 9 Nicolas Lecroart CLA 2013-02-19 03:48:08 EST
Hi,

Could somebody tell us if we can expect some help on this? This deadlock is happening on a production system on a regular basis and our customer is waiting for feedback. 

Thanks.
Comment 10 Wei Yan CLA 2013-02-20 20:39:37 EST
Created attachment 227375 [details]
patch to resolve the thread dead lock
Comment 11 Nicolas Lecroart CLA 2013-02-26 04:28:09 EST
Thanks for answering.

Could you please provide a word of explanation about the suggested fix as I have some doubts about it?

My understanding of the problem is that the OSGILauncher.ChildFirstURLClassLoader which is created with the application classloader as parent is creating a classloader cycle and can cause a deadlock (see detailed deadlock scenario description below). I do not see why the modified OSGILauncher you provided would not cause the same problem to happen. The ChildFirstURLClassLoader was replaced by a FrameworkClassLoader which does not pass the application classloader to the super constructor and does not get it back by calling getParent but it still keeps a reference to it and uses it in the loadClass method in the exact same way the ChildFirstURLClassLoader was doing. There is also a synchronized block which was removed but this one was not the cause of the deadlock.
Am I missing something?

I do not have enough knowledge of BIRT to suggest a fix, but could you explain why the registered URLStreamHandler factory needs to call into the OSGILauncher.ChildFirstURLClassLoader, as it seems to be the root cause of the problem?

Detailed deadlock scenario:

Classloader hierarchy in OC4J:

JRE bootstrap classloader
JRE extension classloader
API classloader
OC4J classloader
System classloader
Global classloader
Application classloader

Thread T1 tries to load a resource using the System CL which delegates all the way up to the JRE extension CL (taking locks on all classloaders). The registered URLStreamHandler factory calls into OSGILauncher.ChildFirstURLClassLoader which tries to delegate to the Application CL which is already locked by thread T2.

Thread T2 asks the Application CL to load a class, takes a lock on it and traverses the CL hierarchy all the way up until it reaches the System CL, which is already locked.

Note also that:
1) The modified OSGILauncher class extends PlatformLauncher which is not found in BIRT 2.6.2 source code archive (extends clause can be removed I guess).
2) Line 716 is incorrect and must be changed to avoid NPE.

Best regards
Comment 12 Wei Yan CLA 2013-02-26 14:16:24 EST
Actually, I find that there are two application class loaders here that cause the deadlock:

BIRT application (BIRT), the hierarchy is:
  global_class_loader  (load_global)
  application_class_loader (load_birt)

Another application (APP), the hierarchy is:
  global_class_loader (load_global)
  application_class_loader (load_app)

One thread in APP tries to load some resource; it locks the class loaders as follows:

 lock load_app
 lock load_global
 use the OSGi URL handler to load the URL, which delegates to load_birt
 try to lock load_birt

Another thread in BIRT is loading a class; its lock sequence is:
 lock load_birt
 try to lock load_global
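
The opposite lock order described above can be reproduced in isolation with the following sketch. It is only a model, under the assumption that each class loader behaves like a single mutual-exclusion lock: loadGlobal and loadBirt are plain ReentrantLock objects standing in for the global and BIRT application class loaders, and LockOrderSketch is a hypothetical name. A timed tryLock replaces the blocking acquisition so that the program reports the cycle instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the scenario: two locks taken in opposite order by two threads.
class LockOrderSketch {

    // Returns how many of the two threads could not obtain their second lock.
    static int blockedThreads() throws InterruptedException {
        ReentrantLock loadGlobal = new ReentrantLock(); // stands in for the global class loader
        ReentrantLock loadBirt = new ReentrantLock();   // stands in for the BIRT application class loader
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        AtomicInteger blocked = new AtomicInteger();

        // APP thread: already holds the global loader, then needs the BIRT loader.
        Thread app = new Thread(() -> attempt(loadGlobal, loadBirt, bothHoldFirst, bothTried, blocked));
        // BIRT thread: holds the BIRT loader, then delegates up to the global loader.
        Thread birt = new Thread(() -> attempt(loadBirt, loadGlobal, bothHoldFirst, bothTried, blocked));
        app.start();
        birt.start();
        app.join();
        birt.join();
        return blocked.get();
    }

    private static void attempt(ReentrantLock first, ReentrantLock second,
                                CountDownLatch bothHoldFirst, CountDownLatch bothTried,
                                AtomicInteger blocked) {
        first.lock();
        try {
            // Wait until both threads hold their first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // A real class loader would block here forever; the timeout makes the cycle observable.
            if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                second.unlock();
            } else {
                blocked.incrementAndGet();
            }
            // Keep the first lock until both threads have finished their attempt.
            bothTried.countDown();
            bothTried.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("threads unable to take their second lock: " + blockedThreads());
    }
}
```

In this model both threads fail to take their second lock. With blocking acquisition, as in synchronized class loader methods, neither would ever proceed.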
Comment 13 Wei Yan CLA 2013-02-26 14:44:52 EST
A similar bug is:
http://www-01.ibm.com/support/docview.wss?uid=swg1IV25687
Comment 14 Wei Yan CLA 2013-02-26 14:47:40 EST
A potential workaround is to use a dedicated application server for BIRT.
Comment 15 Nicolas Lecroart CLA 2013-02-26 16:11:14 EST
I indeed forgot to mention that the scenario involves two applications. However:
1) Using a dedicated server for BIRT is way too restrictive and does not fit in our deployment scheme.
2) I think this scenario can happen even with BIRT running in a dedicated server, as the classloader cycle would still exist:
Thread1 tries to load a resource using getSystemResource, locks the System CL and calls the URLStreamHandler, which will attempt to lock the Application CL (not yet locked, as it was not needed until now!). In the meantime, Thread2, which is trying to load a class, locks the Application CL and can't obtain the lock on the System CL.

Can you think of a way to modify the BIRT class loading strategy that would avoid this bug? What would prevent using a custom classloader that does not delegate to the Application classloader to implement the URLStreamHandler?
Comment 16 Wei Yan CLA 2013-02-26 18:21:07 EST
Actually, it is not related to the BIRT class loader but to weblogic's PolicyClassLoader.

The system class loader is not locked during getResource. Comparing java.lang.ClassLoader.getResource() with oracle.classloader.PolicyClassLoader, you will find that Oracle's getResource is synchronized while the system one is not. So in a dedicated BIRT server, the system class loader won't deadlock with the application class loader.

It only happens if there are multiple application class loaders.
Comment 17 Nicolas Lecroart CLA 2013-03-06 16:56:15 EST
I don’t agree.

I really don’t see why the fact that oracle.classloader.PolicyClassLoader.getResource is synchronized and java.lang.ClassLoader.getResource is not makes you think that the deadlock won’t happen if the BIRT WAR (or more generally an application that embeds BIRT) is the only module deployed on the OC4J application server. The classloader cycle created by the URLStreamHandler registered by BIRT will still exist and might still lead to the same kind of scenario. Maybe another member of the BIRT dev team could take a fresh look at this discussion and provide a third point of view.

Now, even if you were right, would that make it more acceptable? I don’t think it is a minor thing if a BIRT module can’t be run safely next to another module, and I am sure we are not the only users to do that.

I am willing to take on my own time to try to find a solution but I need technical guidance. Once more, can you think of a way to modify the org/eclipse/osgi/framework/internal/protocol/StreamHandlerFactory that would avoid this problem? 

Thanks
Comment 18 Wei Yan CLA 2013-03-06 17:49:36 EST
I think avoiding dynamic class loading from /osgi/framework/internal/protocol/StreamHandlerFactory could resolve this issue.
Comment 19 Nicolas Lecroart CLA 2013-04-04 09:09:34 EDT
It is a bit unfortunate we did not get more support on this but here is a possible workaround that we will suggest to our customer:

As the Javadoc explains (http://docs.oracle.com/javase/6/docs/api/java/net/URL.html), when some code tries to access a resource that uses a certain protocol, the JVM tries to create a protocol handler by invoking the URLStreamHandlerFactory (if one was registered). If there is no URLStreamHandlerFactory, or if it can’t locate an appropriate protocol handler, the JVM reads the value of the java.protocol.handler.pkgs system property (com.evermind.protocol package for the IBM AIX JVM) and interprets it as a list of packages from which the protocol handler could be loaded. If this fails, the JVM finally tries to load the protocol handler from a default system package (sun.net.www.protocol package for the IBM AIX JVM).

The URLStreamHandlerFactory registered by BIRT 2.6.2 tries to guess whether the JVM will be able to find a protocol handler by looking at the java.protocol.handler.pkgs system property, and only uses the OSGILauncher$ChildFirstURLClassLoader if it thinks the JVM won’t be able to find one. However, the BIRT code does not include the default system package in its preliminary check (most probably because it is implementation dependent). For the IBM JVM, that package happens to contain all the main protocol handlers, so BIRT starts traversing the classloader cycle created by this OSGILauncher$ChildFirstURLClassLoader when there is no reason to do so.

By redefining the java.protocol.handler.pkgs system property to be com.evermind.protocol|sun.net.www.protocol (for the IBM JVM), we can change the behaviour of the BIRT code so that it does not do this when asked to provide a protocol handler for one of the default protocols. This can be done by starting the application server with the "-Djava.protocol.handler.pkgs=com.evermind.protocol|sun.net.www.protocol" option. I guess a similar workaround can be applied to other application servers by adapting the value of this property.
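
As a rough illustration of the lookup rule from the URL Javadoc, the sketch below expands a java.protocol.handler.pkgs value into the handler class names the JVM would look for, following the documented "<package>.<protocol>.Handler" naming scheme. HandlerLookupSketch and candidateHandlerClasses are hypothetical names, and this is not the check that BIRT performs; it only shows why adding sun.net.www.protocol to the property makes a handler for a default protocol such as http resolvable by name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the handler class names derived from a
// java.protocol.handler.pkgs value for one protocol.
class HandlerLookupSketch {

    static List<String> candidateHandlerClasses(String pkgsProperty, String protocol) {
        List<String> names = new ArrayList<>();
        if (pkgsProperty == null) {
            return names; // property not set: no package-based candidates
        }
        // Package prefixes are separated by a vertical bar.
        for (String pkg : pkgsProperty.split("\\|")) {
            String trimmed = pkg.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed + "." + protocol + ".Handler");
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String pkgs = "com.evermind.protocol|sun.net.www.protocol";
        for (String name : candidateHandlerClasses(pkgs, "http")) {
            System.out.println(name);
        }
    }
}
```

With the value suggested in the workaround, the candidates for http are com.evermind.protocol.http.Handler and sun.net.www.protocol.http.Handler.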
Comment 20 Volker Kleinschmidt CLA 2014-11-20 17:27:00 EST
Created attachment 248803 [details]
BIRT 2.5.2 classloader deadlock on Sun JVM

This type of classloader deadlock happens not only on the IBM and JRockit JVMs, but also on the Sun JVM - see the attached deadlock report using BIRT 2.5.2, i.e. after the fix for bug 287102, the original ticket where this issue was reported.

The problem, as Wei laid it out, is with one classloader that's delegating normally, and invoking the ChildFirstURLClassLoader during URL resolution within StreamHandlerFactory, whereas another class is delegating from ChildFirstURLClassLoader up to the parent.

This bug is not in a resolved state, and there is no indication that the proposed patch ever made it into the product, nor that it resolved the issue. Can this please be looked at again? The state of NEW is certainly wrong for this ticket.