243547 – Deadlock with JPA Validator

Bug 243547 - Deadlock with JPA Validator

Summary: Deadlock with JPA Validator

Status:	CLOSED FIXED

Alias:	None

Product:	WTP Source Editing
Classification:	WebTools
Component:	wst.sse (show other bugs)
Version:	unspecified
Hardware:	PC Windows XP

Importance:	P2 critical (vote)
Target Milestone:	3.0.1
Assignee:	Nitin Dahyabhai
QA Contact:	Nitin Dahyabhai

URL:
Whiteboard:	PMC_approved
Keywords:

Depends on:
Blocks:

Reported:	2008-08-08 04:55 EDT by Gary Karasiuk
Modified:	2008-08-28 14:57 EDT (History)
CC List:	8 users (show)

See Also:

Flags:	david_williams: pmc_approved+ raghunathan.srinivasan: pmc_approved+ thatnitind: pmc_approved? (naci.dai) david_williams: pmc_approved? neil.hauge: pmc_approved+ thatnitind: pmc_approved? (kaloyan) thatnitind: review+ cbridgha: review+

Attachments
javacore showing deadlock (311.43 KB, application/octet-stream) 2008-08-08 04:55 EDT, Gary Karasiuk	no flags	Details
Condensed version of dump (9.14 KB, text/plain) 2008-08-12 13:37 EDT, Brian Vosburgh	no flags	Details
possible patch (2.58 KB, patch) 2008-08-13 16:17 EDT, Nitin Dahyabhai	no flags	Details \| Diff
javacore for reopened defect (219.86 KB, application/octet-stream) 2008-08-26 08:43 EDT, Gary Karasiuk	no flags	Details
next attempt (230.54 KB, application/octet-stream) 2008-08-26 08:57 EDT, Gary Karasiuk	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Gary Karasiuk

2008-08-08 04:55:23 EDT

Created attachment 109502 [details]
javacore showing deadlock

I applied the patch for bug 241713 and I was in the process of verifying it. The patch fixes the original deadlock, but the deadlock has moved to a different place.

I have attached a stack trace. The deadlock is between threads Worker-13 (JPA Validator) and Worker-18 (not sure, maybe the HTML validator). 

Worker-13 is waiting for a ModelManagerImpl lock as shown by this line:
at org/eclipse/wst/sse/core/internal/model/ModelManagerImpl.isShared(ModelManagerImpl.java:1518(Compiled Code))

Worker-18 has the MML lock based on this line in it's stack trace:
at org/eclipse/wst/sse/core/internal/model/ModelManagerImpl.releaseFromRead(ModelManagerImpl.java:1832(Compiled Code))

But Worker-18 is blocked on Worker-13:
2LKMONINUSE      sys_mon_t:0x38A4A828 infl_mon_t: 0x38A4A860:
3LKMONOBJECT       org/eclipse/jpt/core/internal/GenericJpaModel@04CD36F8/04CD3704: owner "Worker-13" (0x394B7A00), entry count 1
3LKWAITERQ            Waiting to enter:
3LKWAITER                "Worker-1" (0x320F1300)
3LKWAITER                "Worker-4" (0x32124400)
3LKWAITER                "Worker-11" (0x39499000)
3LKWAITER                "Worker-18" (0x3B888100)

Comment 1 Gary Karasiuk

2008-08-08 04:57:51 EDT

Hi Nitin, I'm adding you to this defect since sse is also involved in this deadlock.

Comment 2 Gary Karasiuk

2008-08-08 05:02:15 EDT

I think I originally opened this against the wrong component, I wanted the defect to be assigned to the JPA validator.

Comment 3 David Williams

2008-08-11 11:29:21 EDT

raising priority to P2 since this should be addressed/understood before we release.

Comment 4 Brian Vosburgh

2008-08-12 13:37:10 EDT

Created attachment 109808 [details]
Condensed version of dump

Contains only stack traces for Worker-13 and Worker-18.

Comment 5 Brian Vosburgh

2008-08-12 13:40:56 EDT

Added Chuck to CC list - maybe he can help.

Here is what Dali is doing:

Worker-13:
The WTP validation framework has started the JpaValidator in a Job
and passed it an IProject that, I'm assuming, has a Dali facet.
The JpaValidator requests the corresponding JpaProject from 
JptCorePlugin and, since this is the first time anything has
requested the JpaProject, the GenericJpaModel builds a new one.
While building the JpaProject, Dali encounters the persistence.xml
file and instantiates a JpaXmlResourceModel, which, in turn,
gets a PersistenceArtifactEdit and asks for its resource, a
PersistenceResource. The PersistenceResource is not loaded, so
the PersistenceArtifactEdit calls load(Map), which is implemented
by TranslatorResourceImpl. This method was changed just last
week, so I'm not sure what it should be doing, but it ends
up in ModelManagerImpl trying to acquire a lock.

Worker-18:
Something called by the ModelManagerImpl has triggered a Java
change event, which Dali is trying to handle (possibly
JSPTranslation.release()); but it cannot
handle this change until the JpaProject being built in Worker-13
has been built. Unfortunately, it looks like ModelManagerImpl
probably already has the lock that it is also waiting for in
Worker-13.

Summary:
  Worker-13
    locking GenericJpaModel, building a new JpaProject
    ModelManagerImpl is waiting on the lock acquired in Worker-18
  Worker-18
    ModelManagerImpl has acquired a lock, releasing something
    waiting for GenericJpaModel, which is locked in Worker-13

To figure out how to fix this, a better understanding
of what is causing the deadlock in ModelManagerImpl would be helpful.

Maybe Chuck can help with this?

Comment 6 Chuck Bridgham

2008-08-12 15:21:32 EDT

Has this been reproduced on the latest 301 driver containing my recent changes related to EMF resource loading?

Comment 7 Neil Hauge

2008-08-13 10:47:39 EDT

(In reply to comment #6)
> Has this been reproduced on the latest 301 driver containing my recent changes
> related to EMF resource loading?
> 

Gary...Can you verify this?

Comment 8 Chuck Bridgham

2008-08-13 11:07:55 EDT

I verified it - yes i does...

The ModelManager locks any access during load/release of a document

I'm worried about the resource change listener on the JpaModelManager
It seems this could happen from a number of threads during the 
JpaModelManager.getJpaProject() call.  Can this processing be deferred if a lock is detected? 

 at org/eclipse/jpt/core/internal/JpaModelManager.javaElementChanged(JpaModelManager.java:431)
4XESTACKTRACE          at org/eclipse/jpt/core/internal/JpaModelManager$JavaElementChangeListener.elementChanged(JpaModelManager.java:525)
4XESTACKTRACE          at org/eclipse/jdt/internal/core/DeltaProcessor$3.run(DeltaProcessor.java:1545)

Comment 9 Brian Vosburgh

2008-08-13 11:25:20 EDT

We are usually notified of Java changes asynchronously in standalone jobs. Can't ModelManagerImpl release its stuff asynchronously, in dispatched jobs?

Comment 10 Nitin Dahyabhai

2008-08-13 16:17:06 EDT

Created attachment 109929 [details]
possible patch

(In reply to comment #8)
> The ModelManager locks any access during load/release of a document

Usually it's only a problem when loading/releasing the same model, but in the case of discarding the model as is being done here, the ModelManager has to both manipulate the its internal record keeping for the model, the matching compilation unit, and the underlying text file buffer in a way that appears atomic.  If we absolutely had to, I could see putting the discarding ofthe compilation unit's working copy into a Job (see attached patch).  We'd never be able to do the same for creating the working copy, though.

The attached patch will not be applied unless everyone else tests it out successfully, though, and even then it is just a workaround for this one case. Adapter factories can be contributed by third parties, and their implementation could always require their own locks.

Comment 11 David Williams

2008-08-13 23:44:12 EDT

Nitin, I think your suggested patch makes some conceptual sense. Here's my take on the issues here. (I'm sure there's others).

Releasing a model _is_ done from a synchronized block .... two of them in fact, the SYNC of model manager (wich is 'global' to ModelManger, and then also on the sharedObject of ModelManger (which is specific to only the resource/model involved). [I am, btw, talking about the two points were discardModel is called, which eventually calls release on all the Adapter Factories and subsequently adapters (usually).

And .... the first rule of synchronized blocks? ... don't call foreign code which might acquire it's own locks, unbeknownst to you. So, this sequence in MM violates that rule a ton of times, as it calls all those adapter/factory release methods.

As far as I know, there's only two ways out of this dilemma ... one is to have a scheduling rule and require all clients acquire that rule before they call release (kind of late to require that at this point for this release!). The other is to simply spec that implementors of 'release' _can not_ lock or have their own synchronized block ... that if they are tempted to, they would need to run that work in a job, as you are suggesting for this JSP Translator case. Not the best API, but ... not sure there are other solutions.

Note, trying to run the ModelManger's release in it's own dispatched job would, I think, just continue to move the threading problem around, and someplace else will deadlock. (because, that release could not be "free" job, to be run at absolutely any 'ol time independent of anything else ... you'd have to keep track of pending releases, and not allow new models to be created if there was a pending release for that resource ... hence, another deadlock opportunity.

The only problem with my analysis is I have no idea of the implications of running the JSPTranslator's release in it's own job as your patch does ... will it matter to the Java Model Manager that there is a working copy pending release, when another request comes in to provide that working copy? I simply do not know. It would work, though, if it didn't matter to Java Models.
And would, at least, provide a pattern to recommend to other clients of ModelManager if their adapter/factories are implicated in deadlocks.

So, to resolve this bug ... how often does this happen? Sounds frequent?
If so, I think interested parties should try your patch as you say and see if solves the immediate problem ... then there'd be a few days to test in a final 3.0.1 build?

Or, what's the work-around? To disable JPA validation? Pretty bad, I know, but if there is any work-around, we might have to take it over these other "unknown" changes.

Comment 12 Nitin Dahyabhai

2008-08-15 11:56:16 EDT

Gary, Brian, Chuck, any news?

Comment 13 Chuck Bridgham

2008-08-15 14:04:11 EDT

I have gone through the scenarios described in the adopter product testing several times, and I can't reproduce with this patch applied.  looks good

Comment 14 Chuck Bridgham

2008-08-15 14:29:08 EDT

I agree we probably need to revisit these issues in the future, but since this
validation can happen asynchronously - this solution works fine.

I approve after testing.

Comment 15 David Williams

2008-08-15 14:56:45 EDT

since this particular fix is in SSE, I'm moving bug the (new) WTP Source Editing project! :)

Comment 16 Neil Hauge

2008-08-15 16:18:46 EDT

On the Dali side, we haven't seen any regressions or other issues as a result of the fix.  Approving.

Comment 17 David Williams

2008-08-15 16:29:49 EDT

I've committed and released the patch to the 3.0.1 maintenance branch. 

Still need to do for HEAD.

Comment 18 David Williams

2008-08-15 19:33:18 EDT

committed and released to HEAD

Comment 19 Nitin Dahyabhai

2008-08-18 13:08:37 EDT

Marking as fixed in both streams

Comment 20 Gary Karasiuk

2008-08-26 08:40:25 EDT

Sorry I was away on vacation so I am just verifying this defect now.

I am still seeing a deadlock, but it has moved to a different place.

It looks like now, it is only the JPA Validator that is involved (but I haven't studied this very deeply).

I will attach another javacore. It looks like Worker-2 is blocked here:

3XMTHREADINFO      "Worker-2" TID:0x3210B200, j9thread_t:0x385EC750, state:B, prio=5
at org/eclipse/jpt/core/internal/JpaModelManager.synchronizeFiles(JpaModelManager.java:295)

on a lock that is held by Worker-5. And Worker-5 is waiting for a job semaphore:

3XMTHREADINFO      "Worker-5" TID:0x32119D00, j9thread_t:0x39E62C38, state:CW, prio=1
at java/lang/Object.wait(Native Method)
at java/lang/Object.wait(Object.java:196)
at org/eclipse/core/internal/jobs/Semaphore.acquire(Semaphore.java:38)

which is probably held by Worker-2.

Comment 21 Gary Karasiuk

2008-08-26 08:43:13 EDT

Created attachment 110924 [details]
javacore for reopened defect

Comment 22 Gary Karasiuk

2008-08-26 08:57:26 EDT

Created attachment 110925 [details]
next attempt

I killed the process, and tried to restart the workbench. The restart deadlocked after a few seconds. This javacore is for that deadlock. It seems quite similar to the previous javacore.

Only the JPA Validator seems involved, through threads Worker-2, Worker-4 and Worker-8.

Comment 23 Nitin Dahyabhai

2008-08-28 12:45:31 EDT

Gary, you should open a separate defect for the non-SSE-related deadlock mentioned in comment 20 and comment 22.

Comment 24 Nitin Dahyabhai

2008-08-28 12:45:49 EDT

Comment 25 Gary Karasiuk

2008-08-28 14:22:06 EDT

(In reply to comment #23)
> Gary, you should open a separate defect for the non-SSE-related deadlock
> mentioned in comment 20 and comment 22.
> 
I think this defect should have been transferred to the JPA Validator, since the original problem wasn't fully fixed, and not closed, but I'll open a new defect anyways.

Comment 26 Gary Karasiuk

2008-08-28 14:57:28 EDT

(In reply to comment #25)
I opened bug 245573 for this problem.