Bug 243547 - Deadlock with JPA Validator
Summary: Deadlock with JPA Validator
Status: CLOSED FIXED
Alias: None
Product: WTP Source Editing
Classification: WebTools
Component: wst.sse (show other bugs)
Version: unspecified   Edit
Hardware: PC Windows XP
: P2 critical (vote)
Target Milestone: 3.0.1   Edit
Assignee: Nitin Dahyabhai CLA
QA Contact: Nitin Dahyabhai CLA
URL:
Whiteboard: PMC_approved
Keywords:
Depends on:
Blocks:
 
Reported: 2008-08-08 04:55 EDT by Gary Karasiuk CLA
Modified: 2008-08-28 14:57 EDT (History)
8 users (show)

See Also:
david_williams: pmc_approved+
raghunathan.srinivasan: pmc_approved+
thatnitind: pmc_approved? (naci.dai)
david_williams: pmc_approved?
neil.hauge: pmc_approved+
thatnitind: pmc_approved? (kaloyan)
thatnitind: review+
cbridgha: review+


Attachments
javacore showing deadlock (311.43 KB, application/octet-stream)
2008-08-08 04:55 EDT, Gary Karasiuk CLA
no flags Details
Condensed version of dump (9.14 KB, text/plain)
2008-08-12 13:37 EDT, Brian Vosburgh CLA
no flags Details
possible patch (2.58 KB, patch)
2008-08-13 16:17 EDT, Nitin Dahyabhai CLA
no flags Details | Diff
javacore for reopened defect (219.86 KB, application/octet-stream)
2008-08-26 08:43 EDT, Gary Karasiuk CLA
no flags Details
next attempt (230.54 KB, application/octet-stream)
2008-08-26 08:57 EDT, Gary Karasiuk CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Gary Karasiuk CLA 2008-08-08 04:55:23 EDT
Created attachment 109502 [details]
javacore showing deadlock

I applied the patch for bug 241713 and I was in the process of verifying it. The patch fixes the original deadlock, but the deadlock has moved to a different place.

I have attached a stack trace. The deadlock is between threads Worker-13 (JPA Validator) and Worker-18 (not sure, maybe the HTML validator). 

Worker-13 is waiting for a ModelManagerImpl lock as shown by this line:
at org/eclipse/wst/sse/core/internal/model/ModelManagerImpl.isShared(ModelManagerImpl.java:1518(Compiled Code))

Worker-18 has the MML lock based on this line in it's stack trace:
at org/eclipse/wst/sse/core/internal/model/ModelManagerImpl.releaseFromRead(ModelManagerImpl.java:1832(Compiled Code))

But Worker-18 is blocked on Worker-13:
2LKMONINUSE      sys_mon_t:0x38A4A828 infl_mon_t: 0x38A4A860:
3LKMONOBJECT       org/eclipse/jpt/core/internal/GenericJpaModel@04CD36F8/04CD3704: owner "Worker-13" (0x394B7A00), entry count 1
3LKWAITERQ            Waiting to enter:
3LKWAITER                "Worker-1" (0x320F1300)
3LKWAITER                "Worker-4" (0x32124400)
3LKWAITER                "Worker-11" (0x39499000)
3LKWAITER                "Worker-18" (0x3B888100)
Comment 1 Gary Karasiuk CLA 2008-08-08 04:57:51 EDT
Hi Nitin, I'm adding you to this defect since sse is also involved in this deadlock.
Comment 2 Gary Karasiuk CLA 2008-08-08 05:02:15 EDT
I think I originally opened this against the wrong component, I wanted the defect to be assigned to the JPA validator.
Comment 3 David Williams CLA 2008-08-11 11:29:21 EDT
raising priority to P2 since this should be addressed/understood before we release. 
Comment 4 Brian Vosburgh CLA 2008-08-12 13:37:10 EDT
Created attachment 109808 [details]
Condensed version of dump

Contains only stack traces for Worker-13 and Worker-18.
Comment 5 Brian Vosburgh CLA 2008-08-12 13:40:56 EDT
Added Chuck to CC list - maybe he can help.

Here is what Dali is doing:

Worker-13:
The WTP validation framework has started the JpaValidator in a Job
and passed it an IProject that, I'm assuming, has a Dali facet.
The JpaValidator requests the corresponding JpaProject from 
JptCorePlugin and, since this is the first time anything has
requested the JpaProject, the GenericJpaModel builds a new one.
While building the JpaProject, Dali encounters the persistence.xml
file and instantiates a JpaXmlResourceModel, which, in turn,
gets a PersistenceArtifactEdit and asks for its resource, a
PersistenceResource. The PersistenceResource is not loaded, so
the PersistenceArtifactEdit calls load(Map), which is implemented
by TranslatorResourceImpl. This method was changed just last
week, so I'm not sure what it should be doing, but it ends
up in ModelManagerImpl trying to acquire a lock.

Worker-18:
Something called by the ModelManagerImpl has triggered a Java
change event, which Dali is trying to handle (possibly
JSPTranslation.release()); but it cannot
handle this change until the JpaProject being built in Worker-13
has been built. Unfortunately, it looks like ModelManagerImpl
probably already has the lock that it is also waiting for in
Worker-13.

Summary:
  Worker-13
    locking GenericJpaModel, building a new JpaProject
    ModelManagerImpl is waiting on the lock acquired in Worker-18
  Worker-18
    ModelManagerImpl has acquired a lock, releasing something
    waiting for GenericJpaModel, which is locked in Worker-13

To figure out how to fix this, a better understanding
of what is causing the deadlock in ModelManagerImpl would be helpful.

Maybe Chuck can help with this?
Comment 6 Chuck Bridgham CLA 2008-08-12 15:21:32 EDT
Has this been reproduced on the latest 301 driver containing my recent changes related to EMF resource loading?
Comment 7 Neil Hauge CLA 2008-08-13 10:47:39 EDT
(In reply to comment #6)
> Has this been reproduced on the latest 301 driver containing my recent changes
> related to EMF resource loading?
> 

Gary...Can you verify this?

Comment 8 Chuck Bridgham CLA 2008-08-13 11:07:55 EDT
I verified it - yes i does...

The ModelManager locks any access during load/release of a document

I'm worried about the resource change listener on the JpaModelManager
It seems this could happen from a number of threads during the 
JpaModelManager.getJpaProject() call.  Can this processing be deferred if a lock is detected? 

 at org/eclipse/jpt/core/internal/JpaModelManager.javaElementChanged(JpaModelManager.java:431)
4XESTACKTRACE          at org/eclipse/jpt/core/internal/JpaModelManager$JavaElementChangeListener.elementChanged(JpaModelManager.java:525)
4XESTACKTRACE          at org/eclipse/jdt/internal/core/DeltaProcessor$3.run(DeltaProcessor.java:1545)


Comment 9 Brian Vosburgh CLA 2008-08-13 11:25:20 EDT
We are usually notified of Java changes asynchronously in standalone jobs. Can't ModelManagerImpl release its stuff asynchronously, in dispatched jobs?
Comment 10 Nitin Dahyabhai CLA 2008-08-13 16:17:06 EDT
Created attachment 109929 [details]
possible patch

(In reply to comment #8)
> The ModelManager locks any access during load/release of a document

Usually it's only a problem when loading/releasing the same model, but in the case of discarding the model as is being done here, the ModelManager has to both manipulate the its internal record keeping for the model, the matching compilation unit, and the underlying text file buffer in a way that appears atomic.  If we absolutely had to, I could see putting the discarding ofthe compilation unit's working copy into a Job (see attached patch).  We'd never be able to do the same for creating the working copy, though.

The attached patch will not be applied unless everyone else tests it out successfully, though, and even then it is just a workaround for this one case. Adapter factories can be contributed by third parties, and their implementation could always require their own locks.
Comment 11 David Williams CLA 2008-08-13 23:44:12 EDT
Nitin, I think your suggested patch makes some conceptual sense. Here's my take on the issues here. (I'm sure there's others). 

Releasing a model _is_ done from a synchronized block .... two of them in fact, the SYNC of model manager (wich is 'global' to ModelManger, and then also on the sharedObject of ModelManger (which is specific to only the resource/model involved). [I am, btw, talking about the two points were discardModel is called, which eventually calls release on all the Adapter Factories and subsequently adapters (usually). 

And .... the first rule of synchronized blocks? ... don't call foreign code which might acquire it's own locks, unbeknownst to you. So, this sequence in MM violates that rule a ton of times, as it calls all those adapter/factory release methods. 

As far as I know, there's only two ways out of this dilemma ... one is to have a scheduling rule and require all clients acquire that rule before they call release (kind of late to require that at this point for this release!). The other is to simply spec that implementors of 'release' _can not_ lock or have their own synchronized block ... that if they are tempted to, they would need to run that work in a job, as you are suggesting for this JSP Translator case. Not the best API, but ... not sure there are other solutions. 

Note, trying to run the ModelManger's release in it's own dispatched job would, I think, just continue to move the threading problem around, and someplace else will deadlock. (because, that release could not be "free" job, to be run at absolutely any 'ol time independent of anything else ... you'd have to keep track of pending releases, and not allow new models to be created if there was a pending release for that resource ... hence, another deadlock opportunity. 

The only problem with my analysis is I have no idea of the implications of running the JSPTranslator's release in it's own job as your patch does ... will it matter to the Java Model Manager that there is a working copy pending release, when another request comes in to provide that working copy? I simply do not know. It would work, though, if it didn't matter to Java Models. 
And would, at least, provide a pattern to recommend to other clients of ModelManager if their adapter/factories are implicated in deadlocks. 

So, to resolve this bug ... how often does this happen? Sounds frequent? 
If so, I think interested parties should try your patch as you say and see if solves the immediate problem ... then there'd be a few days to test in a final 3.0.1 build?  

Or, what's the work-around? To disable JPA validation? Pretty bad, I know, but if there is any work-around, we might have to take it over these other "unknown" changes. 







Comment 12 Nitin Dahyabhai CLA 2008-08-15 11:56:16 EDT
Gary, Brian, Chuck, any news?
Comment 13 Chuck Bridgham CLA 2008-08-15 14:04:11 EDT
I have gone through the scenarios described in the adopter product testing several times, and I can't reproduce with this patch applied.  looks good
Comment 14 Chuck Bridgham CLA 2008-08-15 14:29:08 EDT
I agree we probably need to revisit these issues in the future, but since this
validation can happen asynchronously - this solution works fine.

I approve after testing.
Comment 15 David Williams CLA 2008-08-15 14:56:45 EDT
since this particular fix is in SSE, I'm moving bug the (new) WTP Source Editing project! :) 
Comment 16 Neil Hauge CLA 2008-08-15 16:18:46 EDT
On the Dali side, we haven't seen any regressions or other issues as a result of the fix.  Approving.
Comment 17 David Williams CLA 2008-08-15 16:29:49 EDT
I've committed and released the patch to the 3.0.1 maintenance branch. 

Still need to do for HEAD. 
Comment 18 David Williams CLA 2008-08-15 19:33:18 EDT
committed and released to HEAD
Comment 19 Nitin Dahyabhai CLA 2008-08-18 13:08:37 EDT
Marking as fixed in both streams
Comment 20 Gary Karasiuk CLA 2008-08-26 08:40:25 EDT
Sorry I was away on vacation so I am just verifying this defect now.

I am still seeing a deadlock, but it has moved to a different place.

It looks like now, it is only the JPA Validator that is involved (but I haven't studied this very deeply).

I will attach another javacore. It looks like Worker-2 is blocked here:

3XMTHREADINFO      "Worker-2" TID:0x3210B200, j9thread_t:0x385EC750, state:B, prio=5
at org/eclipse/jpt/core/internal/JpaModelManager.synchronizeFiles(JpaModelManager.java:295)

on a lock that is held by Worker-5. And Worker-5 is waiting for a job semaphore:

3XMTHREADINFO      "Worker-5" TID:0x32119D00, j9thread_t:0x39E62C38, state:CW, prio=1
at java/lang/Object.wait(Native Method)
at java/lang/Object.wait(Object.java:196)
at org/eclipse/core/internal/jobs/Semaphore.acquire(Semaphore.java:38)

which is probably held by Worker-2.
Comment 21 Gary Karasiuk CLA 2008-08-26 08:43:13 EDT
Created attachment 110924 [details]
javacore for reopened defect
Comment 22 Gary Karasiuk CLA 2008-08-26 08:57:26 EDT
Created attachment 110925 [details]
next attempt

I killed the process, and tried to restart the workbench. The restart deadlocked after a few seconds. This javacore is for that deadlock. It seems quite similar to the previous javacore.

Only the JPA Validator seems involved, through threads Worker-2, Worker-4 and Worker-8.
Comment 23 Nitin Dahyabhai CLA 2008-08-28 12:45:31 EDT
Gary, you should open a separate defect for the non-SSE-related deadlock mentioned in comment 20 and comment 22.
Comment 24 Nitin Dahyabhai CLA 2008-08-28 12:45:49 EDT
.
Comment 25 Gary Karasiuk CLA 2008-08-28 14:22:06 EDT
(In reply to comment #23)
> Gary, you should open a separate defect for the non-SSE-related deadlock
> mentioned in comment 20 and comment 22.
> 
I think this defect should have been transferred to the JPA Validator, since the original problem wasn't fully fixed, and not closed, but I'll open a new defect anyways.

Comment 26 Gary Karasiuk CLA 2008-08-28 14:57:28 EDT
(In reply to comment #25)
I opened bug 245573 for this problem.