Bug 320329 - ThreadJob#waitForRun can spin when blockingJob is not RUNNING
Summary: ThreadJob#waitForRun can spin when blockingJob is not RUNNING
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Runtime
Version: 3.6
Hardware: PC Windows XP
Importance: P3 major
Target Milestone: 3.7 M6
Assignee: John Arthorne CLA
QA Contact:
URL:
Whiteboard: ORACLE_P1
Keywords: performance
Duplicates: 334632 340405
Depends on:
Blocks: 336644
 
Reported: 2010-07-19 17:29 EDT by Andrew McCulloch CLA
Modified: 2011-03-18 15:10 EDT
CC List: 15 users

See Also:
Flags: min123: review? (john.arthorne)


Attachments
Proposed Patch (1.65 KB, patch)
2010-07-19 17:34 EDT, Andrew McCulloch CLA
no flags Details | Diff
JUnit test case (8.33 KB, application/octet-stream)
2010-07-19 17:40 EDT, Andrew McCulloch CLA
no flags Details
New Patch v1 (1.36 KB, patch)
2010-08-16 15:56 EDT, Min Idzelis CLA
no flags Details | Diff
Old junit made to run in ETF (7.14 KB, patch)
2010-08-16 15:57 EDT, Min Idzelis CLA
no flags Details | Diff
New Junit Test (4.35 KB, patch)
2010-08-16 16:00 EDT, Min Idzelis CLA
no flags Details | Diff
Updated patch for fix and test (5.73 KB, patch)
2010-10-06 14:32 EDT, John Arthorne CLA
no flags Details | Diff
Thread dump (46.57 KB, text/plain)
2011-01-26 10:24 EST, Martin Oberhuber CLA
no flags Details
New fix (1.46 KB, patch)
2011-02-04 14:11 EST, John Arthorne CLA
no flags Details | Diff
Simple JUnit illustrating the problem (1.89 KB, text/plain)
2011-02-04 14:20 EST, John Arthorne CLA
no flags Details

Description Andrew McCulloch CLA 2010-07-19 17:29:50 EDT
Build Identifier: I20100608-0911

We have been seeing excessively long processing times in scenarios where we launch multiple Jobs.  In our application we will usually have 1 or 2 jobs that need a workspace lock, and 1 job per open project in the workspace that needs a project lock.  This is in addition to all of the jobs in base Eclipse and WTP.  When we import 5 or more projects at a time, we see job completion times that are about 800% slower than the same code on Galileo.  Additionally, with tracing added to this method, I see ~1 million iterations through the while(true) loop in ThreadJob#waitForRun() for a total of 15 jobs.  It appears this is caused by one of our jobs being blocked by a job in the WAITING state.  ThreadJob will only call blockingJob.jobStateLock.wait() when blockingJob is RUNNING.  Without the call to wait, the while(true) loop spins, which causes a starvation-like state for all other jobs.

This appears to be caused by a chain of at least 3 jobs.  The running job has a project rule, a waiting job needs a workspace rule, and another waiting job needs a project rule that does not conflict with the rule held by the running job.  In ThreadJob#waitForRun, the threadJob is the third job that needs a project rule, and the blockingJob is the second job that needs a workspace rule and is WAITING.  This also seems to depend on the third job in the chain managing its own scheduling rule via JobManager.beginRule/endRule, although I have not looked into why that is the case.

Reproducible: Always

Steps to Reproduce:
TestCase to be attached.
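
For illustration, a minimal sketch of the kind of three-job chain described above (the PathRule class, paths, and sleep times are assumptions made up for this sketch; they are not taken from the application or the attached test case):

import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.core.runtime.IStatus;
import org.eclipse.core.runtime.Status;
import org.eclipse.core.runtime.jobs.ISchedulingRule;
import org.eclipse.core.runtime.jobs.Job;

public class ThreeJobChainSketch {
	// Hierarchical rule: "/" (workspace) contains and conflicts with "/A", "/B" (projects).
	static class PathRule implements ISchedulingRule {
		final String path;
		PathRule(String path) { this.path = path; }
		public boolean contains(ISchedulingRule rule) {
			return rule instanceof PathRule && ((PathRule) rule).path.startsWith(path);
		}
		public boolean isConflicting(ISchedulingRule rule) {
			if (!(rule instanceof PathRule))
				return false;
			String other = ((PathRule) rule).path;
			return other.startsWith(path) || path.startsWith(other);
		}
	}

	public static void main(String[] args) throws Exception {
		// Job 1: RUNNING, holds a project rule for a while.
		Job running = new Job("holds /A") {
			protected IStatus run(IProgressMonitor monitor) {
				try { Thread.sleep(5000); } catch (InterruptedException e) { /* ignore */ }
				return Status.OK_STATUS;
			}
		};
		running.setRule(new PathRule("/A"));
		running.schedule();
		Thread.sleep(100); // let job 1 start

		// Job 2: WAITING, needs the workspace rule, which conflicts with /A.
		Job waiting = new Job("needs workspace") {
			protected IStatus run(IProgressMonitor monitor) { return Status.OK_STATUS; }
		};
		waiting.setRule(new PathRule("/"));
		waiting.schedule();

		// Thread 3: acquires a non-conflicting project rule via beginRule/endRule.
		// Its blocking job is job 2, which is WAITING rather than RUNNING -- the
		// case in which ThreadJob#waitForRun spins instead of sleeping.
		ISchedulingRule projectB = new PathRule("/B");
		Job.getJobManager().beginRule(projectB, null);
		Job.getJobManager().endRule(projectB);
	}
}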
Comment 1 Andrew McCulloch CLA 2010-07-19 17:34:04 EDT
Created attachment 174678 [details]
Proposed Patch
Comment 2 Andrew McCulloch CLA 2010-07-19 17:40:44 EDT
Created attachment 174680 [details]
JUnit test case

Test case will show longer processing time before patch is applied.  Additionally add tracing in ThreadJob#waitForRun() to count iterations of the while(true) loop during this test before and after the patch.  The test launches several jobs with various scheduling rules and has them do busy work for 5 to 60 seconds.
Comment 3 Raghunathan Srinivasan CLA 2010-07-21 18:27:48 EDT
Request a review of the bug and an update on when it will be addressed. I have updated the severity to major and also the version.
Comment 4 Raghunathan Srinivasan CLA 2010-08-04 14:12:30 EDT
We are waiting for feedback on this bug. We would like to see this issue addressed in the first maintenance release of Helios.
Comment 5 Min Idzelis CLA 2010-08-16 15:56:18 EDT
Created attachment 176718 [details]
New Patch v1
Comment 6 Min Idzelis CLA 2010-08-16 15:57:39 EDT
Created attachment 176719 [details]
Old junit made to run in ETF
Comment 7 Min Idzelis CLA 2010-08-16 16:00:03 EDT
Created attachment 176721 [details]
New Junit Test
Comment 8 Min Idzelis CLA 2010-08-16 16:14:13 EDT
The description was a bit confusing but I've managed to parse out what the issue was. 

This is in fact a regression that was introduced with yieldRule(). The yieldRule() patch also performed some code refactoring that caused this error.

Prior to the refactoring, the spin loop unconditionally rested 250ms on every iteration. The refactoring optimized this to rest only when the blocking job's state made it appropriate.

The condition is that when beginRule() is blocked on a job, it was assumed that if the job was not in the RUNNING state, the loop would soon terminate (the job had completed and is no longer blocking). The possibility that the job blocking the beginRule() was itself blocked was missed, so the loop would spin continuously in this state, stealing CPU time from all other threads. The fix is to check for WAITING (which implies YIELDING/BLOCKED/ABOUT_TO_SCHEDULE) or RUNNING, since both mean the job has not yet completed its execution.
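
In rough code terms, the check described here looks something like the following (a simplified paraphrase using the internal names mentioned in this bug, not the actual patch, and leaving out the canBlock handling discussed further down):

synchronized (blockingJob.jobStateLock) {
	int state = blockingJob.getState();
	// WAITING and RUNNING both mean the blocking job has not completed yet,
	// so rest on its state lock instead of spinning through the loop.
	if (state == Job.RUNNING || state == Job.WAITING)
		blockingJob.jobStateLock.wait(250);
}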

The original test case was a bit lengthy and hard to understand. I've improved it with a test case that can be run as a JUnit test, instead of a JUnit plugin test. 

Unfortunately, there is no good way to determine the pass/fail condition of this test, so it always passes (except when the time limit is exceeded).

There just isn't any 100% accurate way to properly test that the beginRule() method is spinning out of control.
Comment 9 Andrew McCulloch CLA 2010-08-16 17:12:21 EDT
Out of curiosity, why is the case where the blocking job in SLEEPING state still allowed to spin?  I have not come across a case where I have a job waiting on another job in state SLEEPING and I am not sure if it is a situation that is even possible to construct.
Comment 10 Min Idzelis CLA 2010-08-16 17:27:23 EDT
(In reply to comment #9)
> Out of curiosity, why is the case where the blocking job in SLEEPING state
> still allowed to spin?  I have not come across a case where I have a job
> waiting on another job in state SLEEPING and I am not sure if it is a situation
> that is even possible to construct.

That's right, it's not a valid state for a job to be in at that point in time. For it to be chosen as a blockingJob, it needs to be in the running jobs list (which contains only RUNNING|WAITING|BLOCKED jobs). Basically, only these jobs have the ability to block another. If one of these states is encountered, the waitForRun() method can safely rest until the state changes.

If it were somehow SLEEPING|NONE, it wouldn't make sense to wait at all, since a SLEEPING|NONE job cannot block any job (including the active one), and therefore waitForRun() should exit the loop as quickly as possible.
Comment 11 Raghunathan Srinivasan CLA 2010-08-18 13:45:33 EDT
We would like this bug to be fixed in the Helios SR1 release. Please let us know if this is not possible.
Comment 12 Min Idzelis CLA 2010-08-20 11:46:31 EDT
(In reply to comment #8)

> There just isn't any 100% accurate way to properly test that the beginRule()
> method is spinning out of control.

I think I've thought of a way that might work. Along the lines of the original JUnit, we should time a fixed, complex calculation (like Linux's BogoMips), one that takes several seconds to compute.

Then we save this number for later. 

Then run the calculation inside a job. If the result is within some threshold of the original, the calculation is running at full speed. If there is a difference, that means this bug was stealing CPU time away.

Not sure if this will work, but it's an idea.
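
Sketched out, the idea might look roughly like this (the workload, iteration count, and 2x threshold are arbitrary assumptions, and the rule/job chain that actually provokes the spin is omitted):

import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.core.runtime.IStatus;
import org.eclipse.core.runtime.Status;
import org.eclipse.core.runtime.jobs.Job;

public class SpinDetectionSketch {
	// Fixed busy computation; returns elapsed milliseconds.
	static long timeFixedWork() {
		long start = System.currentTimeMillis();
		long acc = 0;
		for (int i = 0; i < 200000000; i++)
			acc += i ^ (acc >>> 3);
		if (acc == 42)
			System.out.println(acc); // prevent the loop from being optimized away
		return System.currentTimeMillis() - start;
	}

	public static void main(String[] args) throws Exception {
		long baseline = timeFixedWork(); // measured before any jobs are scheduled
		final long[] timed = new long[1];
		Job job = new Job("timed work") {
			protected IStatus run(IProgressMonitor monitor) {
				timed[0] = timeFixedWork();
				return Status.OK_STATUS;
			}
		};
		// (Here the job/rule chain that triggers the busy wait would be set up.)
		job.schedule();
		job.join();
		// If waitForRun() were spinning, the in-job time would be far above the baseline.
		if (timed[0] > baseline * 2)
			throw new AssertionError("busy wait appears to be stealing CPU time");
	}
}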
Comment 13 John Arthorne CLA 2010-08-20 14:08:33 EDT
(In reply to comment #8)
> Previous to the refactoring, the spin loop unconditionally rested 250ms every
> loop iteration. This was optimized in the refactoring to only rest when the
> blocking job's state made was appropriate. 

It's not clear to me what case was being optimized here. If the job is blocking, it is either running, or blocked by something that is running. Therefore the blocking job should only be RUNNING or WAITING. Can't we just remove this conditional entirely (which was only added in 3.6):

	if (blockingJob.getState() == Job.RUNNING)
Comment 14 Min Idzelis CLA 2010-08-20 14:55:27 EDT
(In reply to comment #13)
> (In reply to comment #8)
> > Previous to the refactoring, the spin loop unconditionally rested 250ms every
> > loop iteration. This was optimized in the refactoring to only rest when the
> > blocking job's state made was appropriate. 
> 
> It's not clear to me what case was being optimized here. If the job is
> blocking, it is either running, or blocked by something that is running.
> Therefore the blocking job should only be RUNNING or WAITING. Can't we just
> remove this conditional entirely (which was only added in 3.6):
> 
>     if (blockingJob.getState() == Job.RUNNING)

It's an optimization when canBlock is false; if it is true, it is vital to know the state of the blockingJob accurately. The blockingJob was retrieved outside of the jobStateLock, so its state could have changed since then. That is why it is rechecked under the jobStateLock. If the blockingJob has finished running and canBlock is true, it would be invalid to do an indefinite wait (wait(0)).
Comment 15 John Arthorne CLA 2010-08-20 15:01:23 EDT
I am seeing several test failures when running the tests with Min's latest patch; the same tests pass without the patch. I don't understand the cause at this point:

org.eclipse.core.tests.runtime.jobs.YieldTest
testYieldJobToJobsInterleaved(org.eclipse.core.tests.runtime.jobs.YieldTest)
junit.framework.AssertionFailedError: Timeout waiting for job to complete

org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest
testRuleLockInteraction(org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest)
junit.framework.AssertionFailedError: Jobs not removed from graph.

testJobRuleLockInteraction(org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest)
junit.framework.AssertionFailedError: Jobs not removed from graph.

testRuleHierarchyWaitReplace(org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest)
junit.framework.AssertionFailedError: Jobs not removed from graph.

testDetectDeadlock(org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest)
junit.framework.AssertionFailedError: Jobs not removed from graph.

testMultipleColumnRemoval(org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest)
junit.framework.AssertionFailedError: Jobs not removed from graph.

testBeginRuleCancelAfterWait(org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest)
junit.framework.AssertionFailedError: Canceled rule not removed from graph.

testImplicitRules(org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest)
junit.framework.AssertionFailedError: Jobs not removed from graph.
Comment 16 John Arthorne CLA 2010-08-20 15:08:21 EDT
(In reply to comment #14)
> It's an optimization when canBlock is false, if it is true, it is vital to
> accurately know the state of the blockingJob. The blockingJob was retrieved
> outside of the jobStateLock, so it's state could have changed since then. That
> is why it rechecked under the jobStateLock. If it blockingJob has finished
> running, and canBlock is true, it would be invalid to do an indefinite wait.
> wait(0).

When canBlock is false, I don't see much value in it. Since we have a 250ms sleep, we already know we are not responding to a job state change immediately. The best we can do here is to wake up within 250ms of the blocking job completing. That small extra window of a few extra instructions between synchronized blocks won't make much difference. However, when canBlock==true I can see that we must be very precise about when we wait, so I see the value of the extra check there. I hadn't considered that case when I made my previous comment.
Comment 17 Raghunathan Srinivasan CLA 2010-09-02 15:58:56 EDT
Can we get this fix in for Helios SR1?
Comment 18 David Williams CLA 2010-09-02 16:11:40 EDT
Raghu, do you (or, Andrew, originator) have any opinion about the "right" fix here? Is it the case that any of them fix your 800% performance degradation?
Comment 19 Andrew McCulloch CLA 2010-09-02 16:22:04 EDT
The patch attached by Min Idzelis on 8/16 corrects the issue for us and is a bit cleaner than the patch I had initially submitted.  Although the discussion between Min and John is probably worth working through, I don't think the outcome will affect performance much beyond the drastic improvement the patch already gives us.  I am happy with the fix as it is attached now.
Comment 20 John Arthorne CLA 2010-09-02 16:28:30 EDT
(In reply to comment #17)
> Can we get this fix in for Helios SR1?

Unlikely. There are a couple of patches on this bug, but they introduce several test failures (see comment #15). I don't know what the correct fix is at this point and don't currently have time for further investigation. I am moving back to the inbox as I am not actively working on it. A patch that doesn't introduce test failures (or an explanation why the failures are invalid) would help.
Comment 21 Andrew McCulloch CLA 2010-09-02 16:45:55 EDT
(In reply to comment #16)
> (In reply to comment #14)
> > It's an optimization when canBlock is false, if it is true, it is vital to
> > accurately know the state of the blockingJob. The blockingJob was retrieved
> > outside of the jobStateLock, so it's state could have changed since then. That
> > is why it rechecked under the jobStateLock. If it blockingJob has finished
> > running, and canBlock is true, it would be invalid to do an indefinite wait.
> > wait(0).
> 
> When canBlock is false I don't see much value in it. Since we have a 250ms
> sleep, we already know we are not responding to a job state change immediately.
> The best we can do here is to wake up within 250ms of the blocking job
> completing. That small extra window of a few extra instructions between
> synchronized blocks won't make much difference. However, when canBlock==true I
> can see that we must be very accurate about when we wait so I see the value of
> the extra check here. I hadn't considered that case when I made my previous
> comment.

If I understand the question correctly, I see a slight optimization when canBlock is false.  In that case, blockingJob.jobStateLock.wait(250) will wait a maximum of 250ms, but will return immediately when another thread calls blockingJob.jobStateLock.notify[All]().  If the blocking job's state had changed since it was retrieved and we did not recheck before calling wait, we would be guaranteed to wait the full 250ms, since no notification would be generated.  So I may be misunderstanding the discussion, but the recheck of the blocking job's state seems to be an optimization regardless of the value of canBlock.
Comment 22 Andrew McCulloch CLA 2010-09-03 00:33:24 EDT
John, the tests pass for me both before and after the patch has been applied.  I am running org.eclipse.core.tests.runtime.jobs.AllTests and have 129 out of 129 passing tests.  Can you verify that I am running the correct test suite?
Comment 23 John Arthorne CLA 2010-10-06 14:32:13 EDT
Created attachment 180365 [details]
Updated patch for fix and test

The patch was stale due to recent changes in HEAD, so here is a new version of Min's patch that applies against HEAD. I am still seeing the same test failures when running with the patch applied. I am running the class AutomatedTests in org.eclipse.core.tests.runtime, which runs all the core runtime JUnit tests.
Comment 24 Manfred Klug CLA 2010-10-10 06:43:39 EDT
I believe I have found the cause of the failure in testYieldJobToJobsInterleaved(org.eclipse.core.tests.runtime.jobs.YieldTest).

There is a race condition where the blocked thread is not yet waiting on blockingJob.jobStateLock while the blocking thread is already in the waiting state. The result is an infinite wait for both threads.

As a workaround, you can use something like this:

blockingJob.jobStateLock.wait(canBlock && (blockingJobState == Job.RUNNING) ? 0 : 250);

This is not optimal but still much better than a looping thread.

BTW, the issue depends on the timing of the threads, so it is pure luck whether you get errors during the tests.
Comment 25 John Arthorne CLA 2011-01-21 13:28:00 EST
*** Bug 334632 has been marked as a duplicate of this bug. ***
Comment 26 Martin Oberhuber CLA 2011-01-22 10:03:52 EST
This also affects our startup performance in some rare conditions.
Comment 27 Martin Oberhuber CLA 2011-01-22 10:04:43 EST
(In reply to comment #25)
Note that bug 334632 has a reproducible unit test / test case for the issue we see.
Comment 28 Martin Oberhuber CLA 2011-01-26 10:24:09 EST
Created attachment 187644 [details]
Thread dump

I think I just hit this issue with Eclipse 3.7 M4 on my quad-core Win7 64-bit host (running a 32-bit JVM): just double-clicking a feature.xml file from a search result showed the hourglass for about 20 seconds. See the attached thread dump.

I also very frequently hit similar 20-second delays when opening the Launch Configuration dialog for the first time in a session (Debug > Edit).

These wait times are extremely annoying. It looks like the issue is a regression; what can I do to help fix it?
Comment 29 John Arthorne CLA 2011-01-26 10:56:32 EST
Martin, I don't see a connection between this bug and your stack dump. In your trace it is busy doing API analysis in Worker-6. The main thread is blocked on that, but it is sleeping, so it is not in a busy wait loop:

   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x0d0cd620> (a java.lang.Object)
	at org.eclipse.core.internal.jobs.ThreadJob.waitForRun(ThreadJob.java:272)
	- locked <0x0d0cd620> (a java.lang.Object)
	at org.eclipse.core.internal.jobs.ThreadJob.joinRun(ThreadJob.java:199)

From what I understand, this bug involves a fairly rare interaction of three jobs at once to produce the busy wait state.
Comment 30 Martin Oberhuber CLA 2011-01-26 11:05:32 EST
Comment on attachment 187644 [details]
Thread dump

(In reply to comment #29)

Ok, I see... I noticed the "waitForRun" in combination with an unexplained delay, and that triggered my imagination :) From your comment, I understand that the thread state on top of waitForRun() would be RUNNING if it were busy waiting.

Marking the attachment obsolete.
Comment 31 John Arthorne CLA 2011-02-04 13:56:46 EST
Comment on attachment 180365 [details]
Updated patch for fix and test

This fix deadlocks forever in the case where the "blockingJob" I am waiting on has already yielded to me.
Comment 32 John Arthorne CLA 2011-02-04 14:11:24 EST
Created attachment 188354 [details]
New fix

This new fix only blocks indefinitely if the blocking job is running. In this case, I am guaranteed to get a notify on the jobStateLock when the blocking job exits the RUNNING state. In all other cases, I use a timed wait of 250ms. This way, if the job is waiting (either waiting for me or waiting for someone else), we recompute the blocking job every 250ms to make sure we are still blocked (which in 3.5 and earlier we did in all cases).
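
In rough code terms, the shape of this new fix is approximately the following (a paraphrase of the ThreadJob#waitForRun internals using names from this bug, not the exact committed code, and omitting the canBlock handling discussed earlier):

synchronized (blockingJob.jobStateLock) {
	if (blockingJob.getState() == Job.RUNNING) {
		// a notify on jobStateLock is guaranteed when the job leaves RUNNING,
		// so an indefinite wait is safe here
		blockingJob.jobStateLock.wait();
	} else {
		// WAITING (or yielding/blocked): wake up every 250ms and recompute
		// the blocking job to confirm we are still actually blocked
		blockingJob.jobStateLock.wait(250);
	}
}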
Comment 33 John Arthorne CLA 2011-02-04 14:20:39 EST
Created attachment 188356 [details]
Simple JUnit illustrating the problem

This test doesn't have any asserts, but it is useful for illustrating the problem.
Comment 34 John Arthorne CLA 2011-02-04 14:21:10 EST
Fix and test released in HEAD.
Comment 35 Min Idzelis CLA 2011-02-08 10:45:02 EST
Fantastic work as always, John. I applaud your continued efforts at taming threads!

Not only does your latest patch fix the problems, but it makes the code more readable as well. Great job!

It still amazes me that tiny changes (such as this one) can have a huge impact.
Comment 36 John Arthorne CLA 2011-02-08 11:22:24 EST
I just wish I had tracked it down earlier. The simplified test case in bug 334632, and Manfred's insight in comment #24 were a huge help.
Comment 37 Martin Oberhuber CLA 2011-02-08 13:42:52 EST
Thanks John, from our side as well!

Could we actually have this backported to the 3.6.x stream? It doesn't have to be 3.6.2; an M-build afterwards would suffice (3.6.2+). I can create a defect for the backport.
Comment 38 John Arthorne CLA 2011-02-08 13:54:34 EST
(In reply to comment #37)
> Could we actually have this backported to the 3.6.x Stream? Doesn't have to be
> 3.6.2, an M-build afterwards would suffice (3.6.2+). I can create a defect for
> the backport.

Sure, go ahead (3.6.2+).
Comment 39 James Blackburn CLA 2011-03-18 15:10:10 EDT
*** Bug 340405 has been marked as a duplicate of this bug. ***