Bug 320329 - ThreadJob#waitForRun can spin when blockingJob is not RUNNING
Summary: ThreadJob#waitForRun can spin when blockingJob is not RUNNING
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Runtime
Version: 3.6
Hardware: PC Windows XP
Importance: P3 major
Target Milestone: 3.7 M6
Assignee: John Arthorne CLA
QA Contact:
URL:
Whiteboard: ORACLE_P1
Keywords: performance
Duplicates: 334632 340405
Depends on:
Blocks: 336644
 
Reported: 2010-07-19 17:29 EDT by Andrew McCulloch CLA
Modified: 2011-03-18 15:10 EDT
CC List: 15 users

See Also:
Flags: min123: review? (john.arthorne)


Attachments
Proposed Patch (1.65 KB, patch)
2010-07-19 17:34 EDT, Andrew McCulloch CLA
no flags Details | Diff
JUnit test case (8.33 KB, application/octet-stream)
2010-07-19 17:40 EDT, Andrew McCulloch CLA
no flags Details
New Patch v1 (1.36 KB, patch)
2010-08-16 15:56 EDT, Min Idzelis CLA
no flags Details | Diff
Old junit made to run in ETF (7.14 KB, patch)
2010-08-16 15:57 EDT, Min Idzelis CLA
no flags Details | Diff
New Junit Test (4.35 KB, patch)
2010-08-16 16:00 EDT, Min Idzelis CLA
no flags Details | Diff
Updated patch for fix and test (5.73 KB, patch)
2010-10-06 14:32 EDT, John Arthorne CLA
no flags Details | Diff
Thread dump (46.57 KB, text/plain)
2011-01-26 10:24 EST, Martin Oberhuber CLA
no flags Details
New fix (1.46 KB, patch)
2011-02-04 14:11 EST, John Arthorne CLA
no flags Details | Diff
Simple JUnit illustrating the problem (1.89 KB, text/plain)
2011-02-04 14:20 EST, John Arthorne CLA
no flags Details

Description Andrew McCulloch CLA 2010-07-19 17:29:50 EDT
Build Identifier: I20100608-0911

We have been seeing excessively long processing times in scenarios where we launch multiple Jobs.  In our application we will usually have 1 or 2 jobs that need a workspace lock, and 1 job per open project in the workspace that needs a project lock.  This is in addition to all of the jobs in base Eclipse and WTP.  When we import 5 or more projects at a time, we see job completion times that are about 800% slower than the same code on Galileo.  Additionally, with tracing added to this method, I see ~1 million iterations through the while(true) loop in ThreadJob#waitForRun() for a total of 15 jobs.  It appears this is caused by one of our jobs being blocked by a job in the WAITING state.  ThreadJob will only call blockingJob.jobStateLock.wait() when blockingJob is RUNNING.  Without the call to wait, the while(true) loop spins, which causes a starvation-like state for all other jobs.

This appears to be caused by a chain of at least 3 jobs.  The running job has a project rule, a waiting job needs a workspace rule, and another waiting job needs a project rule that does not conflict with the rule held by the running job.  In ThreadJob#waitForRun, the threadJob is the third job that needs a project rule, and the blockingJob is the second job that needs a workspace rule and is WAITING.  This also seems to depend on the third job in the chain managing its own scheduling rule via JobManager.beginRule/endRule, although I have not looked into why that is the case.

Reproducible: Always

Steps to Reproduce:
TestCase to be attached.
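
For illustration, a minimal sketch of the kind of three-job chain described above (the PathRule class, paths, and sleep times are assumptions made up for this sketch; they are not taken from the application or the attached test case):

import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.core.runtime.IStatus;
import org.eclipse.core.runtime.Status;
import org.eclipse.core.runtime.jobs.ISchedulingRule;
import org.eclipse.core.runtime.jobs.Job;

public class ThreeJobChainSketch {
	// Hierarchical rule: "/" (workspace) contains and conflicts with "/A", "/B" (projects).
	static class PathRule implements ISchedulingRule {
		final String path;
		PathRule(String path) { this.path = path; }
		public boolean contains(ISchedulingRule rule) {
			return rule instanceof PathRule && ((PathRule) rule).path.startsWith(path);
		}
		public boolean isConflicting(ISchedulingRule rule) {
			if (!(rule instanceof PathRule))
				return false;
			String other = ((PathRule) rule).path;
			return other.startsWith(path) || path.startsWith(other);
		}
	}

	public static void main(String[] args) throws Exception {
		// Job 1: RUNNING, holds a project rule for a while.
		Job running = new Job("holds /A") {
			protected IStatus run(IProgressMonitor monitor) {
				try { Thread.sleep(5000); } catch (InterruptedException e) { /* ignore */ }
				return Status.OK_STATUS;
			}
		};
		running.setRule(new PathRule("/A"));
		running.schedule();
		Thread.sleep(100); // let job 1 start

		// Job 2: WAITING, needs the workspace rule, which conflicts with /A.
		Job waiting = new Job("needs workspace") {
			protected IStatus run(IProgressMonitor monitor) { return Status.OK_STATUS; }
		};
		waiting.setRule(new PathRule("/"));
		waiting.schedule();

		// Thread 3: acquires a non-conflicting project rule via beginRule/endRule.
		// Its blocking job is job 2, which is WAITING rather than RUNNING -- the
		// case in which ThreadJob#waitForRun spins instead of sleeping.
		ISchedulingRule projectB = new PathRule("/B");
		Job.getJobManager().beginRule(projectB, null);
		Job.getJobManager().endRule(projectB);
	}
}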
Comment 1 Andrew McCulloch CLA 2010-07-19 17:34:04 EDT
Created attachment 174678 [details]
Proposed Patch
Comment 2 Andrew McCulloch CLA 2010-07-19 17:40:44 EDT
Created attachment 174680 [details]
JUnit test case

Test case will show longer processing time before patch is applied.  Additionally add tracing in ThreadJob#waitForRun() to count iterations of the while(true) loop during this test before and after the patch.  The test launches several jobs with various scheduling rules and has them do busy work for 5 to 60 seconds.
Comment 3 Raghunathan Srinivasan CLA 2010-07-21 18:27:48 EDT
Request a review of the bug and an update on when it will be addressed. I have updated the severity to major and also the version.
Comment 4 Raghunathan Srinivasan CLA 2010-08-04 14:12:30 EDT
We are waiting for feedback on this bug. We would like to see this issue addressed in the first maintenance release of Helios.
Comment 5 Min Idzelis CLA 2010-08-16 15:56:18 EDT
Created attachment 176718 [details]
New Patch v1
Comment 6 Min Idzelis CLA 2010-08-16 15:57:39 EDT
Created attachment 176719 [details]
Old junit made to run in ETF
Comment 7 Min Idzelis CLA 2010-08-16 16:00:03 EDT
Created attachment 176721 [details]
New Junit Test
Comment 8 Min Idzelis CLA 2010-08-16 16:14:13 EDT
The description was a bit confusing but I've managed to parse out what the issue was. 

This is in fact a regression that was introduced with yieldRule(). The yieldRule() patch also performed some code refactoring that caused this error.

Prior to the refactoring, the spin loop unconditionally rested 250ms on every iteration. The refactoring optimized this to rest only when the blocking job's state made it appropriate.

The condition is that when beginRule() is blocked on a job, it was assumed that if the job was not in the RUNNING state, the loop would soon terminate (the job had completed and is no longer blocking). The possibility that the job blocking the beginRule() was itself blocked was missed, so the loop would spin continuously in this state, stealing CPU time from all other threads. The fix is to check for WAITING (which implies YIELDING/BLOCKED/ABOUT_TO_SCHEDULE) or RUNNING, since both mean the job has not yet completed its execution.
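
In rough code terms, the check described here looks something like the following (a simplified paraphrase using the internal names mentioned in this bug, not the actual patch, and leaving out the canBlock handling discussed further down):

synchronized (blockingJob.jobStateLock) {
	int state = blockingJob.getState();
	// WAITING and RUNNING both mean the blocking job has not completed yet,
	// so rest on its state lock instead of spinning through the loop.
	if (state == Job.RUNNING || state == Job.WAITING)
		blockingJob.jobStateLock.wait(250);
}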

The original test case was a bit lengthy and hard to understand. I've improved it with a test case that can be run as a JUnit test, instead of a JUnit plugin test. 

Unfortunately, there is no good way to determine the pass/fail condition of this test, so it always passes (except when the time limit is exceeded).

There just isn't any 100% accurate way to properly test that the beginRule() method is spinning out of control.
Comment 9 Andrew McCulloch CLA 2010-08-16 17:12:21 EDT
Out of curiosity, why is the case where the blocking job in SLEEPING state still allowed to spin?  I have not come across a case where I have a job waiting on another job in state SLEEPING and I am not sure if it is a situation that is even possible to construct.
Comment 10 Min Idzelis CLA 2010-08-16 17:27:23 EDT
(In reply to comment #9)
> Out of curiosity, why is the case where the blocking job in SLEEPING state
> still allowed to spin?  I have not come across a case where I have a job
> waiting on another job in state SLEEPING and I am not sure if it is a situation
> that is even possible to construct.

That's right, it's not a valid state for a job to be in at that point in time. For it to be chosen as a blockingJob, it needs to be in the running jobs list (which contains only RUNNING|WAITING|BLOCKED jobs). Basically, only these jobs have the ability to block another. If one of these states is encountered, the waitForRun() method can safely rest until the state changes.

If it were somehow SLEEPING|NONE, it wouldn't make sense to wait at all, since a SLEEPING|NONE job cannot block any job (including the active one), and therefore waitForRun() should exit the loop as quickly as possible.
Comment 11 Raghunathan Srinivasan CLA 2010-08-18 13:45:33 EDT
We would like this bug to be fixed in the Helios SR1 release. Please let us know if this is not possible.
Comment 12 Min Idzelis CLA 2010-08-20 11:46:31 EDT
(In reply to comment #8)

> There just isn't any 100% accurate way to properly test that the beginRule()
> method is spinning out of control.

I think I've thought of a way that might work. Along the lines of the original JUnit, we should time a fixed, complex calculation (like Linux's BogoMips), one that takes several seconds to compute.

Then we save this number for later. 

Then run the calculation inside a job. If the result is within some threshold of the original, the calculation is running at full speed. If there is a difference, that means this bug was stealing CPU time away.

Not sure if this will work, but it's an idea.
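
Sketched out, the idea might look roughly like this (the workload, iteration count, and 2x threshold are arbitrary assumptions, and the rule/job chain that actually provokes the spin is omitted):

import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.core.runtime.IStatus;
import org.eclipse.core.runtime.Status;
import org.eclipse.core.runtime.jobs.Job;

public class SpinDetectionSketch {
	// Fixed busy computation; returns elapsed milliseconds.
	static long timeFixedWork() {
		long start = System.currentTimeMillis();
		long acc = 0;
		for (int i = 0; i < 200000000; i++)
			acc += i ^ (acc >>> 3);
		if (acc == 42)
			System.out.println(acc); // prevent the loop from being optimized away
		return System.currentTimeMillis() - start;
	}

	public static void main(String[] args) throws Exception {
		long baseline = timeFixedWork(); // measured before any jobs are scheduled
		final long[] timed = new long[1];
		Job job = new Job("timed work") {
			protected IStatus run(IProgressMonitor monitor) {
				timed[0] = timeFixedWork();
				return Status.OK_STATUS;
			}
		};
		// (Here the job/rule chain that triggers the busy wait would be set up.)
		job.schedule();
		job.join();
		// If waitForRun() were spinning, the in-job time would be far above the baseline.
		if (timed[0] > baseline * 2)
			throw new AssertionError("busy wait appears to be stealing CPU time");
	}
}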
Comment 13 John Arthorne CLA 2010-08-20 14:08:33 EDT
(In reply to comment #8)
> Previous to the refactoring, the spin loop unconditionally rested 250ms every
> loop iteration. This was optimized in the refactoring to only rest when the
> blocking job's state made was appropriate. 

It's not clear to me what case was being optimized here. If the job is blocking, it is either running, or blocked by something that is running. Therefore the blocking job should only be RUNNING or WAITING. Can't we just remove this conditional entirely (which was only added in 3.6):

	if (blockingJob.getState() == Job.RUNNING)
Comment 14 Min Idzelis CLA 2010-08-20 14:55:27 EDT
(In reply to comment #13)
> (In reply to comment #8)
> > Previous to the refactoring, the spin loop unconditionally rested 250ms every
> > loop iteration. This was optimized in the refactoring to only rest when the
> > blocking job's state made was appropriate. 
> 
> It's not clear to me what case was being optimized here. If the job is
> blocking, it is either running, or blocked by something that is running.
> Therefore the blocking job should only be RUNNING or WAITING. Can't we just
> remove this conditional entirely (which was only added in 3.6):
> 
>     if (blockingJob.getState() == Job.RUNNING)

It's an optimization when canBlock is false; if it is true, it is vital to know the state of the blockingJob accurately. The blockingJob was retrieved outside of the jobStateLock, so its state could have changed since then. That is why it is rechecked under the jobStateLock. If the blockingJob has finished running and canBlock is true, it would be invalid to do an indefinite wait (wait(0)).
Comment 15 John Arthorne CLA 2010-08-20 15:01:23 EDT
I am seeing several test failures when running the tests with Min's latest patch; the same tests pass without the patch. I don't understand the cause at this point:

org.eclipse.core.tests.runtime.jobs.YieldTest
testYieldJobToJobsInterleaved(org.eclipse.core.tests.runtime.jobs.YieldTest)
junit.framework.AssertionFailedError: Timeout waiting for job to complete

org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest
testRuleLockInteraction(org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest)
junit.framework.AssertionFailedError: Jobs not removed from graph.

testJobRuleLockInteraction(org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest)
junit.framework.AssertionFailedError: Jobs not removed from graph.

testRuleHierarchyWaitReplace(org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest)
junit.framework.AssertionFailedError: Jobs not removed from graph.

testDetectDeadlock(org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest)
junit.framework.AssertionFailedError: Jobs not removed from graph.

testMultipleColumnRemoval(org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest)
junit.framework.AssertionFailedError: Jobs not removed from graph.

testBeginRuleCancelAfterWait(org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest)
junit.framework.AssertionFailedError: Canceled rule not removed from graph.

testImplicitRules(org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest)
junit.framework.AssertionFailedError: Jobs not removed from graph.
Comment 16 John Arthorne CLA 2010-08-20 15:08:21 EDT
(In reply to comment #14)
> It's an optimization when canBlock is false, if it is true, it is vital to
> accurately know the state of the blockingJob. The blockingJob was retrieved
> outside of the jobStateLock, so it's state could have changed since then. That
> is why it rechecked under the jobStateLock. If it blockingJob has finished
> running, and canBlock is true, it would be invalid to do an indefinite wait.
> wait(0).

When canBlock is false, I don't see much value in it. Since we have a 250ms sleep, we already know we are not responding to a job state change immediately. The best we can do here is to wake up within 250ms of the blocking job completing. That small extra window of a few extra instructions between synchronized blocks won't make much difference. However, when canBlock==true I can see that we must be very precise about when we wait, so I see the value of the extra check there. I hadn't considered that case when I made my previous comment.
Comment 17 Raghunathan Srinivasan CLA 2010-09-02 15:58:56 EDT
Can we get this fix in for Helios SR1?
Comment 18 David Williams CLA 2010-09-02 16:11:40 EDT
Raghu, do you (or, Andrew, originator) have any opinion about the "right" fix here? Is it the case that any of them fix your 800% performance degradation?
Comment 19 Andrew McCulloch CLA 2010-09-02 16:22:04 EDT
The patch attached by Min Idzelis on 8/16 corrects the issue for us and is a bit cleaner than the patch I had initially submitted.  Although the discussion between Min and John is probably worth working through, I don't think the outcome will affect performance much beyond the drastic improvement the patch already gives us.  I am happy with the fix as it is attached now.
Comment 20 John Arthorne CLA 2010-09-02 16:28:30 EDT
(In reply to comment #17)
> Can we get this fix in for Helios SR1?

Unlikely. There are a couple of patches on this bug, but they introduce several test failures (see comment #15). I don't know what the correct fix is at this point and don't currently have time for further investigation. I am moving back to the inbox as I am not actively working on it. A patch that doesn't introduce test failures (or an explanation why the failures are invalid) would help.
Comment 21 Andrew McCulloch CLA 2010-09-02 16:45:55 EDT
(In reply to comment #16)
> (In reply to comment #14)
> > It's an optimization when canBlock is false, if it is true, it is vital to
> > accurately know the state of the blockingJob. The blockingJob was retrieved
> > outside of the jobStateLock, so it's state could have changed since then. That
> > is why it rechecked under the jobStateLock. If it blockingJob has finished
> > running, and canBlock is true, it would be invalid to do an indefinite wait.
> > wait(0).
> 
> When canBlock is false I don't see much value in it. Since we have a 250ms
> sleep, we already know we are not responding to a job state change immediately.
> The best we can do here is to wake up within 250ms of the blocking job
> completing. That small extra window of a few extra instructions between
> synchronized blocks won't make much difference. However, when canBlock==true I
> can see that we must be very accurate about when we wait so I see the value of
> the extra check here. I hadn't considered that case when I made my previous
> comment.

If I understand the question correctly, I see a slight optimization when canBlock is false.  In that case, blockingJob.jobStateLock.wait(250) will wait a maximum of 250ms, but will return immediately when another thread calls blockingJob.jobStateLock.notify[All]().  If the blocking job's state had changed since it was retrieved and we did not recheck before calling wait, we would be guaranteed to wait the full 250ms, since no notification would be generated.  So I may be misunderstanding the discussion, but the recheck of the blocking job's state seems to be an optimization regardless of the value of canBlock.
Comment 22 Andrew McCulloch CLA 2010-09-03 00:33:24 EDT
John, the tests pass for me both before and after the patch has been applied.  I am running org.eclipse.core.tests.runtime.jobs.AllTests and have 129 out of 129 passing tests.  Can you verify that I am running the correct test suite?
Comment 23 John Arthorne CLA 2010-10-06 14:32:13 EDT
Created attachment 180365 [details]
Updated patch for fix and test

The patch was stale due to recent changes in HEAD, so here is a new version of Min's patch that applies against HEAD. I am still seeing the same test failures when running with the patch applied. I am running the class AutomatedTests in org.eclipse.core.tests.runtime, which runs all the core runtime JUnit tests.
Comment 24 Manfred Klug CLA 2010-10-10 06:43:39 EDT
I believe I have found the cause of the failure in testYieldJobToJobsInterleaved(org.eclipse.core.tests.runtime.jobs.YieldTest).

There is a race condition where the blocked thread is not yet waiting on blockingJob.jobStateLock while the blocking thread is already in the waiting state. The result is an infinite wait for both threads.

As a workaround, you can use something like this:

blockingJob.jobStateLock.wait(canBlock && (blockingJobState == Job.RUNNING) ? 0 : 250);

This is not optimal but still much better than a looping thread.

BTW, the issue depends on the timing of the threads, so it is pure luck whether you get errors during the tests.
Comment 25 John Arthorne CLA 2011-01-21 13:28:00 EST
*** Bug 334632 has been marked as a duplicate of this bug. ***
Comment 26 Martin Oberhuber CLA 2011-01-22 10:03:52 EST
This also affects our startup performance in some rare conditions.
Comment 27 Martin Oberhuber CLA 2011-01-22 10:04:43 EST
(In reply to comment #25)
Note that bug 334632 has a reproducible unit test / test case for the issue we see.
Comment 28 Martin Oberhuber CLA 2011-01-26 10:24:09 EST
Created attachment 187644 [details]
Thread dump

I think I just hit this issue with Eclipse 3.7 M4 on my quad-core Win7 64-bit host (running a 32-bit JVM): just double-clicking a feature.xml file from a search result showed the hourglass for about 20 seconds. See the attached thread dump.

I also very frequently hit similar 20-second delays when opening the Launch Configuration dialog for the first time in a session (Debug > Edit).

These wait times are extremely annoying. It looks like the issue is a regression; what can I do to help fix it?
Comment 29 John Arthorne CLA 2011-01-26 10:56:32 EST
Martin, I don't see a connection between this bug and your stack dump. In your trace it is busy doing API analysis in Worker-6. The main thread is blocked on that, but it is sleeping, so it is not in a busy wait loop:

   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x0d0cd620> (a java.lang.Object)
	at org.eclipse.core.internal.jobs.ThreadJob.waitForRun(ThreadJob.java:272)
	- locked <0x0d0cd620> (a java.lang.Object)
	at org.eclipse.core.internal.jobs.ThreadJob.joinRun(ThreadJob.java:199)

From what I understand, this bug involves a fairly rare interaction of three jobs at once to produce the busy wait state.
Comment 30 Martin Oberhuber CLA 2011-01-26 11:05:32 EST
Comment on attachment 187644 [details]
Thread dump

(In reply to comment #29)

Ok, I see... I noticed the "waitForRun" in combination with an unexplained delay, and that triggered my imagination :) From your comment, I understand that the thread state on top of waitForRun() would be RUNNING if it were busy waiting.

Marking the attachment obsolete.
Comment 31 John Arthorne CLA 2011-02-04 13:56:46 EST
Comment on attachment 180365 [details]
Updated patch for fix and test

This fix deadlocks forever in the case where the "blockingJob" I am waiting on has already yielded to me.
Comment 32 John Arthorne CLA 2011-02-04 14:11:24 EST
Created attachment 188354 [details]
New fix

This new fix only blocks indefinitely if the blocking job is running. In this case, I am guaranteed to get a notify on the jobStateLock when the blocking job exits the RUNNING state. In all other cases, I use a timed wait of 250ms. This way, if the job is waiting (either waiting for me or waiting for someone else), we recompute the blocking job every 250ms to make sure we are still blocked (which in 3.5 and earlier we did in all cases).
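
In rough code terms, the shape of this new fix is approximately the following (a paraphrase of the ThreadJob#waitForRun internals using names from this bug, not the exact committed code, and omitting the canBlock handling discussed earlier):

synchronized (blockingJob.jobStateLock) {
	if (blockingJob.getState() == Job.RUNNING) {
		// a notify on jobStateLock is guaranteed when the job leaves RUNNING,
		// so an indefinite wait is safe here
		blockingJob.jobStateLock.wait();
	} else {
		// WAITING (or yielding/blocked): wake up every 250ms and recompute
		// the blocking job to confirm we are still actually blocked
		blockingJob.jobStateLock.wait(250);
	}
}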
Comment 33 John Arthorne CLA 2011-02-04 14:20:39 EST
Created attachment 188356 [details]
Simple JUnit illustrating the problem

This test doesn't have any asserts, but it is useful for illustrating the problem.
Comment 34 John Arthorne CLA 2011-02-04 14:21:10 EST
Fix and test released in HEAD.
Comment 35 Min Idzelis CLA 2011-02-08 10:45:02 EST
Fantastic work as always, John. I applaud your continued efforts at taming threads!

Not only does your latest patch fix the problems, but it makes the code more readable as well. Great job!

It still amazes me that tiny changes (such as this one) can have a huge impact.
Comment 36 John Arthorne CLA 2011-02-08 11:22:24 EST
I just wish I had tracked it down earlier. The simplified test case in bug 334632, and Manfred's insight in comment #24 were a huge help.
Comment 37 Martin Oberhuber CLA 2011-02-08 13:42:52 EST
Thanks John, from our side as well!

Could we actually have this backported to the 3.6.x stream? It doesn't have to be 3.6.2; an M-build afterwards would suffice (3.6.2+). I can create a defect for the backport.
Comment 38 John Arthorne CLA 2011-02-08 13:54:34 EST
(In reply to comment #37)
> Could we actually have this backported to the 3.6.x Stream? Doesn't have to be
> 3.6.2, an M-build afterwards would suffice (3.6.2+). I can create a defect for
> the backport.

Sure, go ahead (3.6.2+).
Comment 39 James Blackburn CLA 2011-03-18 15:10:10 EDT
*** Bug 340405 has been marked as a duplicate of this bug. ***