Build Identifier: I20100608-0911

We have been seeing excessively long processing times when we launch multiple jobs. Our application typically has one or two jobs that need a workspace lock, plus one job per open project that needs a project lock, in addition to all of the jobs in base Eclipse and WTP. When we import 5 or more projects at a time, we see job completion times that are about 800% slower than the same code on Galileo. With tracing added to ThreadJob#waitForRun(), I see ~1 million iterations through its while(true) loop for a total of 15 jobs. This appears to be caused by one of our jobs being blocked by a job in the WAITING state: ThreadJob only calls blockingJob.jobStateLock.wait() when blockingJob is RUNNING, so without the wait the while(true) loop spins, causing a starvation-like state for all other jobs. Triggering this appears to require a chain of at least three jobs: the running job holds a project rule, one waiting job needs the workspace rule, and another waiting job needs a project rule that does not conflict with the rule held by the running job. In ThreadJob#waitForRun, the threadJob is the third job (needing a project rule) and the blockingJob is the second job (needing the workspace rule, and WAITING). It also seems to depend on the third job in the chain managing its own scheduling rule via JobManager.beginRule/endRule, although I have not looked into why that is the case.

Reproducible: Always

Steps to Reproduce:
TestCase to be attached.
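For readers unfamiliar with scheduling rules, the conflict structure behind the three-job chain above can be sketched with a minimal model. The classes below are hypothetical stand-ins for illustration, not the real org.eclipse.core.resources rule implementations: the workspace rule conflicts with everything, while two distinct project rules do not conflict with each other.

```java
// Minimal illustrative model of the rule hierarchy described above.
// These classes are hypothetical, not the real Eclipse rule implementations.
interface Rule {
    boolean isConflicting(Rule other);
}

class WorkspaceRule implements Rule {
    // The workspace rule contains every project, so it conflicts with
    // everything, including itself.
    public boolean isConflicting(Rule other) {
        return true;
    }
}

class ProjectRule implements Rule {
    final String name;
    ProjectRule(String name) { this.name = name; }

    // A project rule conflicts with the workspace rule and with a rule
    // for the same project, but not with a rule for a different project.
    public boolean isConflicting(Rule other) {
        if (other instanceof WorkspaceRule)
            return true;
        return other instanceof ProjectRule && ((ProjectRule) other).name.equals(name);
    }
}
```

With this model, the chain is: a job RUNNING with a rule for project A, a WAITING job needing the workspace rule (blocked by the runner), and a third job needing a rule for project B, whose rule does not conflict with the runner's but which is nonetheless queued behind the waiting workspace job.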
Created attachment 174678 [details] Proposed Patch
Created attachment 174680 [details] JUnit test case Test case will show longer processing time before patch is applied. Additionally add tracing in ThreadJob#waitForRun() to count iterations of the while(true) loop during this test before and after the patch. The test launches several jobs with various scheduling rules and has them do busy work for 5 to 60 seconds.
Requesting a review of this bug and an update on when it will be addressed. I have updated the severity to major and also the version.
We are waiting for feedback on this bug. We would like to see this issue addressed in the first maintenance release of Helios.
Created attachment 176718 [details] New Patch v1
Created attachment 176719 [details] Old junit made to run in ETF
Created attachment 176721 [details] New Junit Test
The description was a bit confusing, but I've managed to parse out the issue. This is in fact a regression introduced with yieldRule(). The yieldRule() patch also performed some code refactoring that caused this error. Before the refactoring, the spin loop unconditionally rested 250ms on every iteration. The refactoring optimized this to rest only when the blocking job's state made it appropriate: when beginRule() is blocked on a job, it was assumed that if that job was not in the RUNNING state, the loop would soon terminate (the job had completed and was no longer blocking). The possibility that the job blocking beginRule() was itself blocked was missed, so in that state the loop spins continuously, stealing CPU time from all other threads. The fix is to check for WAITING (which implies YIELDING/BLOCKED/ABOUT_TO_SCHEDULE) as well as RUNNING, since both mean the job has not yet completed its execution. The original test case was lengthy and hard to understand; I've replaced it with one that can be run as a plain JUnit test instead of a JUnit plug-in test. Unfortunately, there is no good way to determine the pass/fail condition of this test, so it always passes (except when the time limit is exceeded). There simply isn't a 100% accurate way to test that the beginRule() method is spinning out of control.
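The corrected rest condition can be sketched as a pure predicate. The state constants below mirror the org.eclipse.core.runtime.jobs.Job values, but this is an illustrative sketch of the logic, not the actual ThreadJob source:

```java
// Illustrative sketch of the rest condition in waitForRun(), before and
// after the fix described above. Not the actual ThreadJob code.
class RestCondition {
    static final int NONE = 0, SLEEPING = 1, WAITING = 2, RUNNING = 4;

    // Before the fix: rest only when the blocking job was RUNNING, so a
    // WAITING blocker (itself blocked by a third job) made the loop spin.
    static boolean shouldRestBeforeFix(int blockingJobState) {
        return blockingJobState == RUNNING;
    }

    // After the fix: rest while the blocker is RUNNING or WAITING, since
    // either state means it has not yet completed and may still block us.
    static boolean shouldRestAfterFix(int blockingJobState) {
        return blockingJobState == RUNNING || blockingJobState == WAITING;
    }
}
```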
Out of curiosity, why is the case where the blocking job is in the SLEEPING state still allowed to spin? I have not come across a case where a job is waiting on another job in the SLEEPING state, and I am not sure such a situation is even possible to construct.
(In reply to comment #9) > Out of curiosity, why is the case where the blocking job in SLEEPING state > still allowed to spin? I have not come across a case where I have a job > waiting on another job in state SLEEPING and I am not sure if it is a situation > that is even possible to construct. That's right, it's not a valid state for a job at that point in time. For a job to be chosen as a blockingJob, it needs to be in the running jobs list (which contains only RUNNING|WAITING|BLOCKED jobs); only these jobs have the ability to block another. If one of these states is encountered, the waitForRun() method can safely rest until the state changes. If the job were somehow SLEEPING|NONE, it doesn't make sense to wait at all, since a SLEEPING|NONE job cannot block any job (including the active one), and therefore waitForRun() should exit the loop as quickly as possible.
We would like this bug to be fixed in the Helios SR1 release. Please let us know if this is not possible.
(In reply to comment #8) > There just isn't any 100% accurate way to properly test that the beginRule() > method is spinning out of control. I think I thought of a way that might work. Along the lines of the original JUnit, we could time a fixed complex calculation (like Linux's BogoMips), one that should take several seconds. We save this number for later, then run the calculation inside a job. If the result is within some threshold of the original, the calculation is running at full speed. If there is a significant difference, it means this bug was stealing CPU away. Not sure if this will work, but it's an idea.
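The detection idea above could be sketched as follows. The workload and slowdown factor here are arbitrary assumptions for illustration, and in practice the comparison would need a generous tolerance to avoid flaky failures on loaded machines:

```java
// Illustrative sketch of the proposed CPU-steal detection: time a fixed,
// deterministic calculation as a baseline, repeat it while the suspect
// code runs, and compare. Workload and threshold are arbitrary choices.
class SpinDetector {
    // A fixed amount of deterministic busy work (a simple LCG).
    static long busyWork(int rounds) {
        long acc = 1;
        for (int i = 0; i < rounds; i++)
            acc = acc * 6364136223846793005L + 1442695040888963407L;
        return acc;
    }

    // Wall-clock time for the fixed workload, in milliseconds.
    static long timeMillis(int rounds) {
        long start = System.nanoTime();
        busyWork(rounds);
        return (System.nanoTime() - start) / 1_000_000;
    }

    // True if the measured run is more than `factor` times slower than
    // the baseline, suggesting something (like a spin loop) stole CPU.
    static boolean looksStarved(long baselineMs, long measuredMs, double factor) {
        return measuredMs > baselineMs * factor;
    }
}
```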
(In reply to comment #8) > Previous to the refactoring, the spin loop unconditionally rested 250ms every > loop iteration. This was optimized in the refactoring to only rest when the > blocking job's state made was appropriate. It's not clear to me what case was being optimized here. If the job is blocking, it is either running, or blocked by something that is running. Therefore the blocking job should only be RUNNING or WAITING. Can't we just remove this conditional entirely (which was only added in 3.6): if (blockingJob.getState() == Job.RUNNING)
(In reply to comment #13) > (In reply to comment #8) > > Previous to the refactoring, the spin loop unconditionally rested 250ms every > > loop iteration. This was optimized in the refactoring to only rest when the > > blocking job's state made was appropriate. > > It's not clear to me what case was being optimized here. If the job is > blocking, it is either running, or blocked by something that is running. > Therefore the blocking job should only be RUNNING or WAITING. Can't we just > remove this conditional entirely (which was only added in 3.6): > > if (blockingJob.getState() == Job.RUNNING) It's an optimization when canBlock is false; when canBlock is true, it is vital to know the state of the blockingJob accurately. The blockingJob was retrieved outside of the jobStateLock, so its state could have changed since then. That is why it is rechecked under the jobStateLock. If the blockingJob has finished running and canBlock is true, it would be invalid to do an indefinite wait (wait(0)).
I am seeing several test failures when running the tests with Min's latest patch that are passing without the patch. I don't understand the cause at this point:

org.eclipse.core.tests.runtime.jobs.YieldTest
- testYieldJobToJobsInterleaved: junit.framework.AssertionFailedError: Timeout waiting for job to complete

org.eclipse.core.tests.runtime.jobs.DeadlockDetectionTest
- testRuleLockInteraction: junit.framework.AssertionFailedError: Jobs not removed from graph.
- testJobRuleLockInteraction: junit.framework.AssertionFailedError: Jobs not removed from graph.
- testRuleHierarchyWaitReplace: junit.framework.AssertionFailedError: Jobs not removed from graph.
- testDetectDeadlock: junit.framework.AssertionFailedError: Jobs not removed from graph.
- testMultipleColumnRemoval: junit.framework.AssertionFailedError: Jobs not removed from graph.
- testBeginRuleCancelAfterWait: junit.framework.AssertionFailedError: Canceled rule not removed from graph.
- testImplicitRules: junit.framework.AssertionFailedError: Jobs not removed from graph.
(In reply to comment #14) > It's an optimization when canBlock is false, if it is true, it is vital to > accurately know the state of the blockingJob. The blockingJob was retrieved > outside of the jobStateLock, so it's state could have changed since then. That > is why it rechecked under the jobStateLock. If it blockingJob has finished > running, and canBlock is true, it would be invalid to do an indefinite wait. > wait(0). When canBlock is false I don't see much value in it. Since we have a 250ms sleep, we already know we are not responding to a job state change immediately. The best we can do here is to wake up within 250ms of the blocking job completing. That small extra window of a few extra instructions between synchronized blocks won't make much difference. However, when canBlock==true I can see that we must be very accurate about when we wait so I see the value of the extra check here. I hadn't considered that case when I made my previous comment.
Can we get this fix in for Helios SR1?
Raghu, do you (or, Andrew, originator) have any opinion about the "right" fix here? Is it the case that any of them fix your 800% performance degradation?
The patch attached by Min Idzelis on 8/16 corrects the issue for us and is a bit cleaner than the patch I had initially submitted. Although the discussion between Min and John is probably worth working through, I don't think the outcome will affect performance much beyond the drastic improvement the patch already gives us. I am happy with the fix as it is attached now.
(In reply to comment #17) > Can we get this fix in for Helios SR1? Unlikely. There are a couple of patches on this bug, but they introduce several test failures (see comment #15). I don't know what the correct fix is at this point and don't currently have time for further investigation. I am moving back to the inbox as I am not actively working on it. A patch that doesn't introduce test failures (or an explanation why the failures are invalid) would help.
(In reply to comment #16) > (In reply to comment #14) > > It's an optimization when canBlock is false, if it is true, it is vital to > > accurately know the state of the blockingJob. The blockingJob was retrieved > > outside of the jobStateLock, so it's state could have changed since then. That > > is why it rechecked under the jobStateLock. If it blockingJob has finished > > running, and canBlock is true, it would be invalid to do an indefinite wait. > > wait(0). > > When canBlock is false I don't see much value in it. Since we have a 250ms > sleep, we already know we are not responding to a job state change immediately. > The best we can do here is to wake up within 250ms of the blocking job > completing. That small extra window of a few extra instructions between > synchronized blocks won't make much difference. However, when canBlock==true I > can see that we must be very accurate about when we wait so I see the value of > the extra check here. I hadn't considered that case when I made my previous > comment. If I understand the question correctly I see a slight optimization when canBlock is false. In the case where canBlock is false the blockingJob.jobStateLock.wait(250) will wait a maximum of 250ms, but will return immediately when another thread calls blockingJob.jobStateLock.notify[All](). If the blocking job state had changed since it was retrieved and we did not recheck before calling wait we would be guaranteed to wait the full 250ms since no event will be generated. So I may be misunderstanding the discussion but the recheck of the blocking job's state seems to be an optimization regardless of the value of canBlock.
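The point being made about rechecking the state under the lock is the standard check-then-wait pattern. A minimal sketch, with a hypothetical `blocked` flag standing in for "the blocking job is still RUNNING/WAITING" (this is not the ThreadJob code itself):

```java
// Illustrative check-then-wait sketch. `blocked` is a hypothetical stand-in
// for the blocking job's state; this is not ThreadJob code.
class GuardedWait {
    private final Object stateLock = new Object();
    private boolean blocked;

    void setBlocked(boolean b) {
        synchronized (stateLock) {
            blocked = b;
            // Wake any waiter so it can recheck the state immediately,
            // instead of sleeping out the full timed wait.
            stateLock.notifyAll();
        }
    }

    // Returns true if we actually waited. Rechecking `blocked` under the
    // lock avoids a guaranteed full-timeout sleep when the state changed
    // between the unsynchronized read and the wait() call, because a
    // notify sent before we entered wait() would otherwise be missed.
    boolean waitIfBlocked(long timeoutMs) {
        synchronized (stateLock) {
            if (!blocked)
                return false; // state already changed; don't sleep at all
            try {
                stateLock.wait(timeoutMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
            }
            return true;
        }
    }
}
```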
John, the tests pass for me both before and after the patch has been applied. I am running org.eclipse.core.tests.runtime.jobs.AllTests and have 129 out of 129 passing tests. Can you verify that I am running the correct test suite?
Created attachment 180365 [details] Updated patch for fix and test The patch was stale due to recent changes in HEAD, so here is a new version of Min's patch that applies against head. I am still seeing the same test failures when running with the patch applied. I am running the class AutomatedTests in org.eclipse.core.tests.runtime, which runs all the core runtime JUnit tests.
IMO I have found the issue for testYieldJobToJobsInterleaved (org.eclipse.core.tests.runtime.jobs.YieldTest). There is a race condition where the blocked thread is not yet waiting on blockingJob.jobStateLock while the blocking thread is already in the waiting state. The result is an infinite wait for both threads. As a workaround you can use something like this:

blockingJob.jobStateLock.wait(canBlock && (blockingJobState == Job.RUNNING) ? 0 : 250);

This is not optimal, but still much better than a spinning thread. BTW, the issue depends on the timing of the threads, so it is pure luck whether you get errors during the tests.
*** Bug 334632 has been marked as a duplicate of this bug. ***
This also affects our startup performance in some rare conditions.
(In reply to comment #25) Note that bug 334632 has a reproducible unit test / test case for the issue we see.
Created attachment 187644 [details] Thread dump I think I just hit this issue with Eclipse 3.7M4 on my quad-core Win7 64-bit host (running a 32-bit JVM): just double-clicking a feature.xml file from a search result showed the hourglass for about 20 seconds. See the attached thread dump. I also very frequently hit similar 20-second delays when opening the Launch Configuration dialog the first time in a session (Debug > Edit). These wait times are extremely annoying. It looks like this is a regression; what can I do to help fix the issue?
Martin, I don't see a connection between this bug and your stack dump. In your trace it is busy doing API analysis in Worker-6. The main thread is blocked on that, but it is sleeping; it is not in a busy-wait loop:

java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x0d0cd620> (a java.lang.Object)
	at org.eclipse.core.internal.jobs.ThreadJob.waitForRun(ThreadJob.java:272)
	- locked <0x0d0cd620> (a java.lang.Object)
	at org.eclipse.core.internal.jobs.ThreadJob.joinRun(ThreadJob.java:199)

From what I understand, this bug involves a fairly rare interaction of three jobs at once to produce the busy-wait state.
Comment on attachment 187644 [details] Thread dump (In reply to comment #29) OK, I see... I noticed the "waitForRun" in combination with an inexplicable delay, and that triggered my imagination :) From your comment, I understand that the thread state on top of waitForRun() would be RUNNABLE if it were busy waiting. Marking the attachment obsolete.
Comment on attachment 180365 [details] Updated patch for fix and test This fix deadlocks forever in the case where the "blockingJob" I am waiting on has already yielded to me.
Created attachment 188354 [details] New fix This new fix only blocks indefinitely if the blocking job is running. In that case I am guaranteed to get a notify on the jobStateLock when the blocking job exits the RUNNING state. For all other cases, I use the timed wait of 250ms. This way, if the job is waiting (either waiting for me or waiting for someone else), we recompute the blocking job every 250ms to make sure we are still blocked (which in 3.5 and earlier we did in all cases).
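The rule the fix implements can be captured as a small pure function. This is a sketch of the decision, not the released ThreadJob code: wait indefinitely only when the blocker is RUNNING, where a notify on jobStateLock is guaranteed when it stops; otherwise use the 250ms timed wait so the blocking job gets recomputed.

```java
// Sketch of the timeout choice in the released fix: 0 means "wait until
// notified" (safe only when the blocker is RUNNING), 250 means a bounded
// wait after which the blocking job is recomputed. Constants mirror the
// Job state values; this is not the actual ThreadJob source.
class WaitTimeout {
    static final int NONE = 0, SLEEPING = 1, WAITING = 2, RUNNING = 4;

    static long timeoutFor(int blockingJobState) {
        return blockingJobState == RUNNING ? 0L : 250L;
    }
}
```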
Created attachment 188356 [details] Simple JUnit illustrating the problem This test doesn't have any asserts, but is useful to illustrate the problem
Fix and test released in HEAD.
Fantastic work as always, John. I applaud your continued efforts at taming threads! Not only does your latest patch fix the problems, but it makes the code more readable as well. Great job! It still amazes me that tiny changes (such as this one) can have huge impact.
I just wish I had tracked it down earlier. The simplified test case in bug 334632, and Manfred's insight in comment #24 were a huge help.
Thanks John, from our side as well! Could we actually have this backported to the 3.6.x Stream? Doesn't have to be 3.6.2, an M-build afterwards would suffice (3.6.2+). I can create a defect for the backport.
(In reply to comment #37) > Could we actually have this backported to the 3.6.x Stream? Doesn't have to be > 3.6.2, an M-build afterwards would suffice (3.6.2+). I can create a defect for > the backport. Sure, go ahead (3.6.2+).
*** Bug 340405 has been marked as a duplicate of this bug. ***