Bug 249930 - Deadlock with JavaModelManager$PerProjectInfo
Summary: Deadlock with JavaModelManager$PerProjectInfo
Status: VERIFIED FIXED
Alias: None
Product: JDT
Classification: Eclipse Project
Component: Core (show other bugs)
Version: 3.4   Edit
Hardware: PC Windows XP
: P3 critical with 1 vote (vote)
Target Milestone: 3.5 M3   Edit
Assignee: Jerome Lanneluc CLA
QA Contact:
URL:
Whiteboard:
Keywords:
: 263094 265920 (view as bug list)
Depends on:
Blocks:
 
Reported: 2008-10-07 06:07 EDT by Xu XIANG CLA
Modified: 2009-11-09 17:06 EST (History)
15 users (show)

See Also:


Attachments
Thread Dump (30.91 KB, text/plain)
2008-10-07 06:07 EDT, Xu XIANG CLA
no flags Details
Proposed workaround (4.82 KB, patch)
2008-10-09 12:32 EDT, Jerome Lanneluc CLA
no flags Details | Diff
Fix backported to R3_4_maintenance branch (4.88 KB, patch)
2009-02-06 06:39 EST, Jerome Lanneluc CLA
no flags Details | Diff
3.4.3Thread Dump (18.18 KB, text/plain)
2009-03-13 10:15 EDT, Ashwani Kr Sharma CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Xu XIANG CLA 2008-10-07 06:07:47 EDT
Created attachment 114401 [details]
Thread Dump

Build ID: I20080617-2000

The deadlock occurs multiple times in the manual and automated tests.Please find the thread dump in attachment. The deadlock is between worker-58 and ModalContext due to reverse lock/monitor order
(org.eclipse.jdt.internal.core.JavaModelManager$PerProjectInfo + WorkManager lock).
Comment 1 Jerome Lanneluc CLA 2008-10-07 06:39:47 EDT
I believe this is a the same problem as reported in bug 246136, i.e. "ModalContext" should never have reached this point since it takes a "project modification" scheduling rule which conflicts with the notification of resource changes. So there must be a misbehaving scheduling rule in the stack of "ModalContext". I'm not sure how to identify such misbehaving scheduling rule.

Moving to Platform/Resources for further investigation.
Comment 2 Xu XIANG CLA 2008-10-07 07:20:26 EDT
It looks like that the notification of resource is done asynchronously in a separate job with scheduling rule <null>:
org.eclipse.core.internal.events.NotificationManager$NotifyJob.run(org.eclipse.core.runtime.IProgressMonitor) line: 39
Comment 3 Szymon Brandys CLA 2008-10-07 09:05:25 EDT
(In reply to comment #2)
> It looks like that the notification of resource is done asynchronously in a
> separate job with scheduling rule <null>:

It's because the job is used just to trigger the notification.
Comment 4 Xu XIANG CLA 2008-10-07 09:31:50 EDT
(In reply to comment #1)
> ... "ModalContext" should never have reached this point since it takes a "project
> modification" scheduling rule which conflicts with the notification of resource
> changes. So there must be a misbehaving scheduling rule in the stack of
> "ModalContext"...

then (accordingly to comment #3) there is no "misbehaving scheduling rule", this means also that there is no scheduling rule conflict between ModalContext and worker-58. ModalContext can run with workspace root as rule.

Now the question is where is the root cause of this deadlock, in platform or JDT?

P.S. I don't think scheduling rule <null> is a good idea. It will change the lock order (resource lock --> workManager lock) potentially, e.g. if someone tries to resource modification in the notification thread.
- Is this allowed?
- Does Eclipse have a general recommendation/guideline to avoid deadlock, at least for the common locks/monitors?
Comment 5 Szymon Brandys CLA 2008-10-07 09:51:06 EDT
(In reply to comment #4)
> Now the question is where is the root cause of this deadlock, in platform or
> JDT?

I am investigating, however there are some other candidates (wtp, com.sap.caf.designer?)

> P.S. I don't think scheduling rule <null> is a good idea. It will change the
> lock order (resource lock --> workManager lock) potentially, e.g. if someone
> tries to resource modification in the notification thread.
> - Is this allowed?

No. See WorkManager#checkIn, each attempt to modify the resource tree during notification will fail.

> - Does Eclipse have a general recommendation/guideline to avoid deadlock, at
> least for the common locks/monitors?

I think that Jerome is right and this issue is causes by misbehaving scheduling rule. To see how to build a correct rule, see the javadoc of ISchedulingRule.

Comment 6 John Arthorne CLA 2008-10-07 11:01:41 EDT
This isn't the same as bug 246136. The thread performing the notification doesn't have any scheduling rule so there is no rule conflict here.

The problem here is very simple. The lock on class JavaModelManager$PerProjectInfo is acquired during a resource change event by a listener. Another thread that owns this lock wants to modify the workspace, but we do not allow other threads to modify the workspace during resource change events. The *only* way the platform could prevent deadlock here is to allow the other thread to modify the workspace during the resource change event. This is not allowed by the IResourceChangeEvent contract and would likely cause breakage for many listeners.

re comment #4 : yes, we prevent resource changes from occurring during resource change events (an exception is thrown if a listener tries to modify the workspace during an event). This prevents the inversion of the lock order that you mention, which would as you say cause deadlocks.
Comment 7 Szymon Brandys CLA 2008-10-07 11:42:26 EDT
Moving to JDT Core.
Comment 8 Jerome Lanneluc CLA 2008-10-07 11:48:26 EDT
There is no way for a client to know that the workspace modification is not allowed in a thread other than the notification thread. Even if there were such API (e.g. isTreeLocked() would work in all threads), this would not solve the problem since the notification could still occur between the isTreeLocked() check and the workspace modification.

Back to Platform/Resources.
Comment 9 Szymon Brandys CLA 2008-10-07 12:07:06 EDT
(In reply to comment #8)
> There is no way for a client to know that the workspace modification is not
> allowed in a thread other than the notification thread. 

That's why we should avoid synchronized blocks in resource change listener, or use them only if we are sure that client's code will not access them during notification. This is how I understand John's comment.
Comment 10 John Arthorne CLA 2008-10-07 12:47:15 EDT
Jerome, can you describe what you expect us to do differently here? One thread is inside a resource change event, and another thread is trying to modify the workspace. Are you suggesting we should let the other thread modify the workspace during the event?
Comment 11 Jerome Lanneluc CLA 2008-10-07 12:50:32 EDT
You cannot tell clients to NOT use such an important feature of the Java language. Without synchronized blocks, clients have no way to ensure the consistency of their data.
Comment 12 Jerome Lanneluc CLA 2008-10-07 12:52:28 EDT
The thread that modifies the workspace made sure to use a scheduling rule before taking the PerProjectInfo lock. I would expect that the resource notification to not run when this rule is used.
Comment 13 John Arthorne CLA 2008-10-07 18:31:59 EDT
> You cannot tell clients to NOT use such an important feature of the Java
language.

Anyone using this language feature needs to understand the risk of deadlock from using it. If you call third party code that isn't under your control from such a synchronized block, and you also acquire the same lock in a callback where you don't know the state of the calling thread, you are inherently deadlock prone. It happens in this case that your lock is interacting with a lock in platform resources, but a similar deadlock could occur between your lock and some other plugin that uses the same pattern.

> The thread that modifies the workspace made sure to use a scheduling rule
> before taking the PerProjectInfo lock. I would expect that the resource
> notification to not run when this rule is used.

This means no resource change events would occur during long-running operations. For example if you did a CVS checkout or full build that took several minutes, then no resource change events would occur during that time for any change. This has very strange behaviour - for example you would double-click in an editor ruler to add a breakpoint, and the breakpoint icon would only appear several minutes later when the change event ran.

Incidentally platform resources has been concurrent like this since Eclipse 3.0, there is nothing changed here recently. I think this particular deadlock might be caused by the extra synchronization added to address bug 235778.
Comment 14 Jerome Lanneluc CLA 2008-10-08 04:23:15 EDT
(In reply to comment #13)
> Anyone using this language feature needs to understand the risk of deadlock
> from using it. 
I'm perfectly aware of this risk. This is why I tried to acquire the Platform/Resources lock before acquiring my lock (by running with a scheduling rule). It looks like this is not the correct Platform/Resources lock.

> If you call third party code that isn't under your control from
> such a synchronized block, and you also acquire the same lock in a callback
> where you don't know the state of the calling thread, you are inherently
> deadlock prone. 
Isn't it exactly what Platform/Resources is doing by calling a resource listener inside the resource notification lock and also acquiring the same lock in another thread when the client tries to modify a resource?

> It happens in this case that your lock is interacting with a
> lock in platform resources, but a similar deadlock could occur between your
> lock and some other plugin that uses the same pattern.
If this Platform/Resources lock was available to clients, I would be able to acquire it before acquiring my lock.

> This means no resource change events would occur during long-running
> operations. For example if you did a CVS checkout or full build that took
> several minutes, then no resource change events would occur during that time
> for any change. This has very strange behaviour - for example you would
> double-click in an editor ruler to add a breakpoint, and the breakpoint icon
> would only appear several minutes later when the change event ran.
OK, so maybe the scheduling rule is not the correct lock to acquire. But the client needs the lock that the resource notification takes. Then it is the responsibility of the client if this lock is taken for too long.

> Incidentally platform resources has been concurrent like this since Eclipse
> 3.0, there is nothing changed here recently. 
Sure, but that doesn't mean that the design was correct in 3.0 and that it should not be changed now.

> I think this particular deadlock
> might be caused by the extra synchronization added to address bug 235778.
If you have another solution to this problem, I'll take it. But this won't solve this particular Platform/Resources issue.

Comment 15 Jerome Lanneluc CLA 2008-10-09 12:08:34 EDT
It looks like there is no easy solution on Platform/Resources side. I entered bug 250311 though. In the meantime, we will workaround the issue in JDT/Core.
Comment 16 Jerome Lanneluc CLA 2008-10-09 12:32:09 EDT
Created attachment 114697 [details]
Proposed workaround
Comment 17 Jerome Lanneluc CLA 2008-10-10 05:49:45 EDT
Workaround released for 3.5M3

(Note there is no regression test since this is a concurrency issue. To verify that the fix is in, one can only check the code)
Comment 18 Olivier Thomann CLA 2008-10-27 16:07:25 EDT
Verified for 3.5M3 using I20081026-2000
Comment 19 Vasil Vasilev CLA 2008-11-07 09:03:28 EST
We need the workarond downported to 3.4.1 and available also in 3.4.2. What is the procedure for requesting a patch?
Comment 20 Philipe Mulet CLA 2008-11-07 09:44:40 EST
3.4.1 is already out. Best we can do if we decide to backport the fix for 3.4.2 is for you to consume an early 3.4.2 build (they are occurring weekly already).
Comment 21 Philipe Mulet CLA 2008-11-07 09:45:22 EST
Note that you can also construct the backport patch and attach it to this bug.
Comment 22 Vasil Vasilev CLA 2008-11-17 03:25:12 EST
What does it mean "you can also construct"? Can you explain me in more details what are the exact steps that I should do in order to have the fix in 3.4.2? (Sorry, I am a novice in this process)
Comment 23 Philipe Mulet CLA 2008-11-17 05:19:58 EST
Given Jerome attached a patch for 3.5, you can grap it and try to apply it to 3.4, then correct issues, and once you tested it, you can repost it back on this bug so it may be contributed to 3.4.2.

Also read http://wiki.eclipse.org/JDT_Core_Committer_FAQ
Comment 24 Jerome Lanneluc CLA 2008-11-17 05:29:43 EST
Also note that this fix as ripples on clients. For example, this caused bug 251629. So bug 251629 would also need to be backported.
Comment 25 Jerome Lanneluc CLA 2009-02-06 06:34:23 EST
*** Bug 263094 has been marked as a duplicate of this bug. ***
Comment 26 Jerome Lanneluc CLA 2009-02-06 06:39:41 EST
Created attachment 124948 [details]
Fix backported to R3_4_maintenance branch

Products are hitting this deadlock, with no workaround. So this needs to be backported to 3.4.x
Comment 27 Jerome Lanneluc CLA 2009-02-06 06:46:33 EST
Fix released in R3_4_maintenance.
Comment 28 Stephan Herrmann CLA 2009-02-10 11:25:38 EST
(In reply to comment #27)
> Fix released in R3_4_maintenance.

Will this be contained in SR2?
I'm puzzled because I see that recent versions in the maintenance
branch already say "3.4.3" (MANIFEST.MF and buildnotes) and
"post 3.4.2 release" (../batch/messages.properties).

Has it been decided, which version will be JDT/Core's contribution to 3.4.2?

Would I run into any known risks, when adopting this fix into our branch?
How about the other recent backport in bug 261510 (it seems less critical?)?
Comment 29 kiril mitov CLA 2009-03-13 04:31:22 EDT
*** Bug 265920 has been marked as a duplicate of this bug. ***
Comment 30 kiril mitov CLA 2009-03-13 09:32:27 EDT
(In reply to comment #27)
> Fix released in R3_4_maintenance.
> 

Hi Jerome, 
If this was release on 2009-02-06 should it also be in the build from 2009-02-11? Because this fix is not currently in JDT 3.4.4.v_894_R34x which comes with eclipse 3.4.2 build.



Comment 31 Ashwani Kr Sharma CLA 2009-03-13 10:13:58 EDT
Hi,

This issue still occurs with elipse with version : "3.4.3.R34x_v20081215-1030".
Please look into it.

Added new thread dump.

Regards,
Ashwani K Sharma
Comment 32 Ashwani Kr Sharma CLA 2009-03-13 10:15:54 EDT
Created attachment 128715 [details]
3.4.3Thread Dump

Thread dump for 3.4.3 release.
Comment 33 Jerome Lanneluc CLA 2009-03-24 11:45:26 EDT
(In reply to comment #30)
> If this was release on 2009-02-06 should it also be in the build from
> 2009-02-11? Because this fix is not currently in JDT 3.4.4.v_894_R34x which
> comes with eclipse 3.4.2 build.
The fix is indeed in the R3_4_maintenance branch but it was not included in 3.4.2. There is no 3.4.3 planned, but if there were one, the fix would be in a 3.4.3 maintenance release. 

(In reply to comment #31)
> This issue still occurs with elipse with version : "3.4.3.R34x_v20081215-1030".
I'm not sure where this version come from, but I suspect you're looking at a plugin version which is different from the Eclipse version. Since there is no build containing this fix, it is expected that you still see the problem.
Comment 34 Olivier Thomann CLA 2009-11-09 17:06:14 EST
*** Bug 293821 has been marked as a duplicate of this bug. ***