307282 – ILock#acquire(int) broken after interrupt() => OrderedLocks become lost

Bug 307282 - ILock#acquire(int) broken after interrupt() => OrderedLocks become lost

Summary: ILock#acquire(int) broken after interrupt() => OrderedLocks become lost

Status:	RESOLVED FIXED

Alias:	None

Product:	Platform
Classification:	Eclipse Project
Component:	Runtime (show other bugs)
Version:	3.6
Hardware:	PC All

Importance:	P3 critical (vote)
Target Milestone:	3.6 M7
Assignee:	John Arthorne
QA Contact:

URL:
Whiteboard:
Keywords:	greatbug

Duplicates (1):	294894 (view as bug list)
Depends on:
Blocks:	307098
	Show dependency tree

Reported:	2010-03-27 08:52 EDT by James Blackburn
Modified:	2010-04-20 09:28 EDT (History)
CC List:	1 user (show)

See Also:

Attachments
test 1 (5.08 KB, patch) 2010-03-27 08:56 EDT, James Blackburn	no flags	Details \| Diff
patch 1 (2.59 KB, patch) 2010-03-27 08:59 EDT, James Blackburn	no flags	Details \| Diff
test 2 (5.11 KB, patch) 2010-03-28 11:41 EDT, James Blackburn	jamesblackburn+eclipse: iplog+	Details \| Diff
Alternate fix (2.80 KB, patch) 2010-03-29 13:57 EDT, John Arthorne	no flags	Details \| Diff
Show Obsolete (2) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description James Blackburn

2010-03-27 08:52:04 EDT

While working on modifying the locking in BuildManager I kept encountering issues where an OrderedLocks become unusable. There isn't deadlock; the lock is just not-owned and not acquire()able.

I could easily reproduce it in the runtime by creating lots of build jobs (all but one waiting for the build lock) and cancelling the long running build job.

With some breakpoints suspending the VM I believe I know what's going on:

Thread 1 = t1
Thread 2 = t2
Thread 3 = t3
Lock   1 = l1
Lock   2 = l2

1) t1 holds l1 && t2 holds l2
2) t2 tries to acquire l1
3) t2 is interrupted during the #acquire(int)
4) t1 releases l1
5) t3.acquire(l1) fails.

The issue is that acquire(int) can and does throw InterruptedException. Unfortunately when interrupt() occurs, the exception is thrown immediately up to the caller of #acquire(int) bypassing the cleanup that usually occurs on failure to acquire the lock.  The result is that the lock can't be acquired by subsequent threads.  [I suspect the blocking #acquire() doesn't suffer from this problem as it eats the interrupt, and doesn't return until the thread actually owns the lock.]

The interrupt() call can be generated, for example, as a result of a ProgressMonitor cancel, see InternalWorker#run()

The fix is trivial & I've got a simple test demonstrating the issue.  Will attach both when I know the bug number.

Comment 1 James Blackburn

2010-03-27 08:56:42 EDT

Created attachment 163146 [details]
test 1

test for the issue (also added another tests, not referenced by the testsuite to the suite).

Comment 2 James Blackburn

2010-03-27 08:59:28 EDT

Created attachment 163147 [details]
patch 1

A simple patch which simply wraps the doAcquire / acquire cleanup in a finally.

Comment 3 James Blackburn

2010-03-28 11:41:39 EDT

Created attachment 163179 [details]
test 2

Small tidy to test: JavaDoc, remove unused fields.

Comment 4 John Arthorne

2010-03-29 13:48:42 EDT

Thanks for finding this James, and doing such a fantastic job of creating a reproducible test case and a patch. I'm staggered that such a bug has been around so long undetected - this code is older than the hills (previously in WorkspaceLock class).

Comment 5 James Blackburn

2010-03-29 13:55:47 EDT

Thanks John, I thought you'd be pleased :)

As a result I've found all sorts of cool test functionality for testing multi-threaded issues -- how did I not know about TestBarrier before the weekend?!

Comment 6 John Arthorne

2010-03-29 13:57:54 EDT

Created attachment 163308 [details]
Alternate fix

I'm considering this alternate fix where I remember the thread interrupt state and rethrow later. While the finally block approach could theoretically help with other random errors that occur in that code, I find the extra finally blocks clutter the code, and add the risk of an exception thrown in the finally block causing original exceptions to be lost.

Comment 7 John Arthorne

2010-03-29 14:01:05 EDT

Yep, I created TestBarrier specifically for this kind of test. It sure beats the brute force approach of "run this test 1000 times and sometimes it will fail".

Comment 8 James Blackburn

2010-03-29 14:03:49 EDT

That's definitely tidier, looks good to me.

Comment 9 John Arthorne

2010-03-29 14:14:55 EDT

Fix #2 released in HEAD.

Comment 10 John Arthorne

2010-04-06 14:35:11 EDT

*** Bug 294894 has been marked as a duplicate of this bug. ***

Comment 11 Martin Oberhuber

2010-04-20 08:56:25 EDT

John, it looks like an iplog+ is missing on "patch 1" as well as "test 2"
since James is committer on CDT and e4 only.
http://www.eclipse.org/projects/lists.php?list=projectsforcommitter&param=jblackburn

I'd add it myself but I'm not exactly sure what was finally committed.

Comment 12 James Blackburn

2010-04-20 08:58:40 EDT

(In reply to comment #11)
> John, it looks like an iplog+ is missing on "patch 1" as well as "test 2"
> since James is committer on CDT and e4 only.
> http://www.eclipse.org/projects/lists.php?list=projectsforcommitter&param=jblackburn

John's fix was different from my submitted patch.

Comment 13 John Arthorne

2010-04-20 09:19:16 EDT

Yes, my fix and James' test was released. I removed the "iplog-" from my patch because I'm not sure what effect that flag might have.

Comment 14 James Blackburn

2010-04-20 09:23:48 EDT

(In reply to comment #13)
> Yes, my fix and James' test was released. I removed the "iplog-" from my patch
> because I'm not sure what effect that flag might have.

Ah, sorry about that. 
In CDT we use this flag for committed patches written entirely by committers which are therefore ip clean.  The person doing the pre-release IP review can then easily see that the patch has been classififed as clean, and needn't read the full bug to work out if the committer forgot to iplog+.  I guess this is different in different projects.

Comment 15 Martin Oberhuber

2010-04-20 09:28:04 EDT

Thanks for handling this so quickly.

To avoid such problems in the future, I have moved forward and nominated James for platform.core commit rights :)