Bug 307282 - ILock#acquire(int) broken after interrupt() => OrderedLocks become lost
Summary: ILock#acquire(int) broken after interrupt() => OrderedLocks become lost
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Runtime (show other bugs)
Version: 3.6   Edit
Hardware: PC All
: P3 critical (vote)
Target Milestone: 3.6 M7   Edit
Assignee: John Arthorne CLA
QA Contact:
URL:
Whiteboard:
Keywords: greatbug
: 294894 (view as bug list)
Depends on:
Blocks: 307098
  Show dependency tree
 
Reported: 2010-03-27 08:52 EDT by James Blackburn CLA
Modified: 2010-04-20 09:28 EDT (History)
1 user (show)

See Also:


Attachments
test 1 (5.08 KB, patch)
2010-03-27 08:56 EDT, James Blackburn CLA
no flags Details | Diff
patch 1 (2.59 KB, patch)
2010-03-27 08:59 EDT, James Blackburn CLA
no flags Details | Diff
test 2 (5.11 KB, patch)
2010-03-28 11:41 EDT, James Blackburn CLA
jamesblackburn+eclipse: iplog+
Details | Diff
Alternate fix (2.80 KB, patch)
2010-03-29 13:57 EDT, John Arthorne CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description James Blackburn CLA 2010-03-27 08:52:04 EDT
While working on modifying the locking in BuildManager I kept encountering issues where an OrderedLocks become unusable. There isn't deadlock; the lock is just not-owned and not acquire()able.

I could easily reproduce it in the runtime by creating lots of build jobs (all but one waiting for the build lock) and cancelling the long running build job.

With some breakpoints suspending the VM I believe I know what's going on:

Thread 1 = t1
Thread 2 = t2
Thread 3 = t3
Lock   1 = l1
Lock   2 = l2

1) t1 holds l1 && t2 holds l2
2) t2 tries to acquire l1
3) t2 is interrupted during the #acquire(int)
4) t1 releases l1
5) t3.acquire(l1) fails.

The issue is that acquire(int) can and does throw InterruptedException. Unfortunately when interrupt() occurs, the exception is thrown immediately up to the caller of #acquire(int) bypassing the cleanup that usually occurs on failure to acquire the lock.  The result is that the lock can't be acquired by subsequent threads.  [I suspect the blocking #acquire() doesn't suffer from this problem as it eats the interrupt, and doesn't return until the thread actually owns the lock.]

The interrupt() call can be generated, for example, as a result of a ProgressMonitor cancel, see InternalWorker#run()

The fix is trivial & I've got a simple test demonstrating the issue.  Will attach both when I know the bug number.
Comment 1 James Blackburn CLA 2010-03-27 08:56:42 EDT
Created attachment 163146 [details]
test 1

test for the issue (also added another tests, not referenced by the testsuite to the suite).
Comment 2 James Blackburn CLA 2010-03-27 08:59:28 EDT
Created attachment 163147 [details]
patch 1

A simple patch which simply wraps the doAcquire / acquire cleanup in a finally.
Comment 3 James Blackburn CLA 2010-03-28 11:41:39 EDT
Created attachment 163179 [details]
test 2

Small tidy to test: JavaDoc, remove unused fields.
Comment 4 John Arthorne CLA 2010-03-29 13:48:42 EDT
Thanks for finding this James, and doing such a fantastic job of creating a reproducible test case and a patch. I'm staggered that such a bug has been around so long undetected - this code is older than the hills (previously in WorkspaceLock class).
Comment 5 James Blackburn CLA 2010-03-29 13:55:47 EDT
Thanks John, I thought you'd be pleased :)

As a result I've found all sorts of cool test functionality for testing multi-threaded issues -- how did I not know about TestBarrier before the weekend?!
Comment 6 John Arthorne CLA 2010-03-29 13:57:54 EDT
Created attachment 163308 [details]
Alternate fix

I'm considering this alternate fix where I remember the thread interrupt state and rethrow later. While the finally block approach could theoretically help with other random errors that occur in that code, I find the extra finally blocks clutter the code, and add the risk of an exception thrown in the finally block causing original exceptions to be lost.
Comment 7 John Arthorne CLA 2010-03-29 14:01:05 EDT
Yep, I created TestBarrier specifically for this kind of test. It sure beats the brute force approach of "run this test 1000 times and sometimes it will fail".
Comment 8 James Blackburn CLA 2010-03-29 14:03:49 EDT
That's definitely tidier, looks good to me.
Comment 9 John Arthorne CLA 2010-03-29 14:14:55 EDT
Fix #2 released in HEAD.
Comment 10 John Arthorne CLA 2010-04-06 14:35:11 EDT
*** Bug 294894 has been marked as a duplicate of this bug. ***
Comment 11 Martin Oberhuber CLA 2010-04-20 08:56:25 EDT
John, it looks like an iplog+ is missing on "patch 1" as well as "test 2"
since James is committer on CDT and e4 only.
http://www.eclipse.org/projects/lists.php?list=projectsforcommitter&param=jblackburn

I'd add it myself but I'm not exactly sure what was finally committed.
Comment 12 James Blackburn CLA 2010-04-20 08:58:40 EDT
(In reply to comment #11)
> John, it looks like an iplog+ is missing on "patch 1" as well as "test 2"
> since James is committer on CDT and e4 only.
> http://www.eclipse.org/projects/lists.php?list=projectsforcommitter&param=jblackburn

John's fix was different from my submitted patch.
Comment 13 John Arthorne CLA 2010-04-20 09:19:16 EDT
Yes, my fix and James' test was released. I removed the "iplog-" from my patch because I'm not sure what effect that flag might have.
Comment 14 James Blackburn CLA 2010-04-20 09:23:48 EDT
(In reply to comment #13)
> Yes, my fix and James' test was released. I removed the "iplog-" from my patch
> because I'm not sure what effect that flag might have.

Ah, sorry about that. 
In CDT we use this flag for committed patches written entirely by committers which are therefore ip clean.  The person doing the pre-release IP review can then easily see that the patch has been classififed as clean, and needn't read the full bug to work out if the committer forgot to iplog+.  I guess this is different in different projects.
Comment 15 Martin Oberhuber CLA 2010-04-20 09:28:04 EDT
Thanks for handling this so quickly.

To avoid such problems in the future, I have moved forward and nominated James for platform.core commit rights :)