Bug 478634 - Infinite Autobuild loop with a job never removed from waitingThreadJobs
Summary: Infinite Autobuild loop with a job never removed from waitingThreadJobs
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Runtime (show other bugs)
Version: 4.5   Edit
Hardware: All All
: P3 critical with 4 votes (vote)
Target Milestone: 4.7.3   Edit
Assignee: Andrey Loskutov CLA
QA Contact:
URL:
Whiteboard:
Keywords:
: 517725 527962 (view as bug list)
Depends on: 528227
Blocks:
  Show dependency tree
 
Reported: 2015-09-29 10:07 EDT by Rastislav Wagner CLA
Modified: 2018-01-16 07:24 EST (History)
14 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Rastislav Wagner CLA 2015-09-29 10:07:50 EDT
Sometimes I end up in infinite job loop after I delete project.

Here's my workspace if you want to reproduce:
https://drive.google.com/file/d/0B46utPjKw-LmckRmeWxaMDd5cE0/view?usp=sharing

Steps to reproduce:
1.expand project "a" in Project explorer
2.delete project
You might end up in infinite job loop - its not always reproducible

I have taken a look and here's what happens:
1. after project is deleted AutoBuildJob is run - but it conflicts with some ThreadJob(null,[P/a]) note: I deleted project "a"
2. because of that, AutoBuildJob is cancelled and rescheduled with 100ms delay. However ThreadJob(null,[P/a]) never ends so there's a loop
Comment 1 Rastislav Wagner CLA 2015-09-29 10:16:14 EDT
Patch: 
https://git.eclipse.org/r/#/c/56954/

I think that problem is in ThreadJob#waitForRun method. On line 211 we put this ThreadJob to waiting queue and what happens next is that we will get out of 'while (true)' loop (line 231) with case #3 (based on comment on line 228), but in 'finally' block we won't remove this ThreadJob from waiting queue (since #3 happend) - so it's stuck there forever.
Comment 2 Eclipse Genie CLA 2015-09-30 00:00:11 EDT
New Gerrit change created: https://git.eclipse.org/r/56954
Comment 3 Max Rydahl Andersen CLA 2015-09-30 06:44:09 EDT
related jboss tools jira: https://issues.jboss.org/browse/JBIDE-19536
Comment 4 John Arthorne CLA 2015-10-05 15:20:30 EDT
Rastislav I am trying to understand your fix. I don't have a way to reproduce the problem, but if the problem is not consistently reproducible it smells like a race condition or synchronization issue, and your fix doesn't seem like a race codition fix. In the failure case, what doe the fields "result" and "threadJob" contain? If we have obtained a lock different from the one we asked for, that sounds scary unless we have actually obtained a parent lock that includes the lock we needed. If it is the parent case, then maybe we need a lock containment check to replace the equality check.

Also the condition "threadJob == result" is checked twice in the finally block, and I would like to understand why it is ok to remove one test but not the other. If we call waitEnd(job, false, monitor) then we are also failing to clear a lock wait state.
Comment 5 Rastislav Wagner CLA 2015-10-06 06:41:07 EDT
Hi John,
status in failure is:

threadJob  = ThreadJob(null,[P/a])
blockingJob = ThreadJob(null,[R/])
result = ThreadJob(null,[R/,P/a])


Calling waitEnd(job, true, monitor) would in this case do nothing because you would get through 
- org.eclipse.core.internal.jobs.LockManager line 304
to org.eclipse.core.internal.jobs.DeadlockDetector which would return on line 431. 
So I kept that check. Im not very skilled in jobs part of eclipse so I did the smallest change which resolved my problem. However I'm not sure if its the right fix.

I have to say that with a workspace I've posted, the problem is 100% reproducible for me and also for other people that tried it.
Comment 6 Ondrej Dockal CLA 2016-08-24 11:33:40 EDT
I experienced this issue with job loop after deleting project. Is there any progress in this issue? 

Thanks.

Eclipse platform where I reproduced the bug:
Version: 4.6.0.v20160606-1342
Build id: I20160606-1100
Comment 7 Rastislav Wagner CLA 2016-12-07 10:29:27 EST
Im still facing this issue from time to time - new patch uploaded. 

This is a comment in ThreadJob.java file:

Ultimately, this loop will wait until the job "runs" (acquires the rule)
or is canceled. However, there are many ways for that to occur.
The exit conditions of this loop are:
1) This job no longer conflicts with any other running job.
2) The LockManager#aboutToWait allowed this thread to run.
This usually occurs during reentrant situations. i.e. This is a UI thread,
and a syncExec is performed from conflicting job/thread.
3) A rule is transfered to this thread. This can be invoked programmatically,
or commonly in JFace via ModalContext (for wizards/etc).
4) Monitor is canceled.


This lock that I am experiencing is condition 3). In this case returned "result" ThreadJob is not the same as the given thread job which results in not removing given thread job from waiting jobs.

John, can you please take a look ? It is 100% reproducible with attached workspace & steps to reproduce. Thank you.
Comment 8 Lars Vogel CLA 2017-01-23 11:48:52 EST
Please set a milestone target again, if you plan to work on this.
Comment 9 Rastislav Wagner CLA 2017-01-24 04:06:22 EST
pardon my ignorance but who is supposed to set target milestone ? I have rebased the fix. Is this going to be reviewed for M6 ?
Comment 10 Rastislav Wagner CLA 2017-03-13 08:17:15 EDT
any progress ?
Comment 11 Rastislav Wagner CLA 2017-04-05 08:06:02 EDT
any chance this will be reviewed for Oxygen ?
Comment 12 Andrey Loskutov CLA 2017-04-05 08:40:17 EDT
(In reply to Rastislav Wagner from comment #11)
> any chance this will be reviewed for Oxygen ?

Rastislav, small hint: the chance would be higher if there would be a junit test case along with the change.

@Stefan, are you familiar with the ThreadJob code?
Comment 13 Stefan Xenos CLA 2017-04-05 13:21:16 EDT
> @Stefan, are you familiar with the ThreadJob code?

I've written a couple patches for it, but the JDT index is currently on fire and my review backlog is severe. It's unlikely I can get to this for M7.
Comment 14 Stefan Xenos CLA 2017-04-05 21:40:22 EDT
Actually, the patch looks simple. I'll try to make time for it this milestone.
Comment 15 Stefan Xenos CLA 2017-04-05 22:20:22 EDT
Sorry, Rastislav. I suspect the reason nobody has reviewed this patch in so long is because nobody understands this code. It's entirely possible you're the only one who has figured it out, so bear with me.

The patch is nice and simple, but I had a number of questions:

- Why would we avoid passing true to waitEnd if resultRule == threadJobRule and there is no scheduling rule attached to it?

- Why is it important that one scheduling rule contains the other, and why is rule containment more important than job identity for the purposes of this wait stack? I'm concerned that the containment relationship may be a side-effect of the real invariant we need to preserve.

- From the point of view of the Jobs framework, which API is misbehaving and how, specifically?

- Which piece of data is being leaked that gets cleaned up by this patch? The entry in implicitJobs? The ThreadJob.isBlocked flag? Both?

So let's do this:

1. Please write a unit test that demonstrates the problem directly (something that just uses the Jobs framework and doesn't rely on deleting and expanding projects). That will make it a lot easier for me (and future maintainers) to understand the contract on Jobs and how this patch supports it.

2. Let me know your time zone and availability and I'll make myself available on Eclipse Mattermost. Alternatively, we can do a video chat on google hangouts. You can walk me through the problem and explain how the patch addresses it... and I can answer any questions you may have about writing the unit test.
Comment 16 Rastislav Wagner CLA 2017-04-06 10:05:45 EDT
Hi Stefan,


(In reply to Stefan Xenos from comment #15)
> Sorry, Rastislav. I suspect the reason nobody has reviewed this patch in so
> long is because nobody understands this code. It's entirely possible you're
> the only one who has figured it out, so bear with me.

thank you for finding a time to check the code! Maybe Im completely wrong :)


> - Why would we avoid passing true to waitEnd if resultRule == threadJobRule
> and there is no scheduling rule attached to it?

I think you are right, I should pass true in that case too.

> 
> - Why is it important that one scheduling rule contains the other, and why
> is rule containment more important than job identity for the purposes of
> this wait stack? I'm concerned that the containment relationship may be a
> side-effect of the real invariant we need to preserve.
> 
> - From the point of view of the Jobs framework, which API is misbehaving and
> how, specifically?
> 
> - Which piece of data is being leaked that gets cleaned up by this patch?
> The entry in implicitJobs? The ThreadJob.isBlocked flag? Both?

Looking at ThreadJob.waitForRun. threadJob is added to waiting thread jobs[1] and then we enter the while loop[2]. Comment above it states that there are 4 ways to break the loop.
If we break the loop with option #1, #2 or #4 then the threadJob is removed from waiting thread jobs[3]. If we break the loop with option 3 then it is not and I believe that is stays among waiting thread jobs forever. 

Now when I run another job with same or conflicting rule and call isBlocking() it returns true because threadjob is among waiting thread jobs[4].

This patch removes threadJob from waiting thread jobs even if we break the loop with option #3


> 
> So let's do this:
> 
> 1. Please write a unit test that demonstrates the problem directly
> (something that just uses the Jobs framework and doesn't rely on deleting
> and expanding projects). That will make it a lot easier for me (and future
> maintainers) to understand the contract on Jobs and how this patch supports
> it.

I created unit test that fails with unpatched ThreadJob class and passes after the fix is applied. Gerrit change updated.
> 
> 2. Let me know your time zone and availability and I'll make myself
> available on Eclipse Mattermost. Alternatively, we can do a video chat on
> google hangouts. You can walk me through the problem and explain how the
> patch addresses it... and I can answer any questions you may have about
> writing the unit test.


My timezone is CET. We can talk anytime between 8am-7pm CET. 

[1]https://github.com/eclipse/eclipse.platform.runtime/blob/master/bundles/org.eclipse.core.jobs/src/org/eclipse/core/internal/jobs/ThreadJob.java#L228

[2]https://github.com/eclipse/eclipse.platform.runtime/blob/master/bundles/org.eclipse.core.jobs/src/org/eclipse/core/internal/jobs/ThreadJob.java#L248

[3]https://github.com/eclipse/eclipse.platform.runtime/blob/master/bundles/org.eclipse.core.jobs/src/org/eclipse/core/internal/jobs/ThreadJob.java#L308

[4]https://github.com/eclipse/eclipse.platform.runtime/blob/master/bundles/org.eclipse.core.jobs/src/org/eclipse/core/internal/jobs/JobManager.java#L834
Comment 17 Dani Megert CLA 2017-05-10 12:37:16 EDT
Stefan, can you take a look for RC1?
Comment 18 Stefan Xenos CLA 2017-05-10 12:44:32 EDT
> Stefan, can you take a look for RC1?

No, sorry. Although this patch is simple, understanding it in context will take more time than I have available before I leave the Eclipse project... and having me figure it out would be wasteful since that knowledge would leave with me.

Since the hard part is figuring out the Jobs framework innards, I'd suggest that whoever looks at this also tackles bug 509557 since they'll touch the same code.
Comment 19 Josef Kopriva CLA 2017-08-04 04:55:44 EDT
Hi all, 

any progress on this issue?

Thanks
Josef
Comment 20 Vasili Gulevich CLA 2017-08-22 07:54:19 EDT
*** Bug 517725 has been marked as a duplicate of this bug. ***
Comment 21 Vasili Gulevich CLA 2017-08-22 07:57:09 EDT
Bug 517725 provides another test case using real components from Platform UI, but is a duplicate of this issue.
Comment 22 Andrey Loskutov CLA 2017-12-05 17:38:05 EST
*** Bug 527962 has been marked as a duplicate of this bug. ***
Comment 23 Andrey Loskutov CLA 2017-12-05 17:47:25 EST
Patch https://git.eclipse.org/r/#/c/112884/ from bug 527962 provides another test.

The bug is a side effect from the fix for bug 60964 (which seem to be correct, but just reveals this one).
Comment 24 Andrey Loskutov CLA 2017-12-05 18:15:48 EST
(In reply to Rastislav Wagner from comment #16)
> Looking at ThreadJob.waitForRun. threadJob is added to waiting thread
> jobs[1] and then we enter the while loop[2]. Comment above it states that
> there are 4 ways to break the loop.
> If we break the loop with option #1, #2 or #4 then the threadJob is removed
> from waiting thread jobs[3]. If we break the loop with option 3 then it is
> not and I believe that is stays among waiting thread jobs forever. 

Yep, this seem to be the case.
 
> Now when I run another job with same or conflicting rule and call
> isBlocking() it returns true because threadjob is among waiting thread
> jobs[4].
> 
> This patch removes threadJob from waiting thread jobs even if we break the
> loop with option #3

So far this is the best explanation why the bug happens. I'm still struggling with the patch, I believe that option #3 deserves some extra handling and should not be mixed in one condition with others, but I'm still trying to understand what all this ThreadJob code is doing.

So this is the original part which does not handle case #3:

waitEnd(threadJob, threadJob == result, monitor);
if (threadJob == result) {
	if (waiting) manager.implicitJobs.removeWaiting(threadJob);
}

and this is the proposal from https://git.eclipse.org/r/#/c/56954

ISchedulingRule resultRule = result.getRule();
ISchedulingRule threadJobRule = threadJob.getRule();
boolean containsRule = resultRule != null && threadJobRule != null && resultRule.contains(threadJobRule);
waitEnd(threadJob, containsRule, monitor);
if (containsRule) {
	if (waiting) manager.implicitJobs.removeWaiting(threadJob);
}

The change from (threadJob == result) to (containsRule) looks too dangerous for me. I would feel better if we mark the option #3 with an extra flag and consider this flag in an extra branch in the finally block.
Comment 25 Andrey Loskutov CLA 2017-12-06 18:24:42 EST
Rastislav, it would be cool if you could review updated patch and also patch for bug 528227 (https://git.eclipse.org/r/112973).

The updated patch is doing basically the same as yours, but is more conservative and uses the fact that the rule containment is already checked by push().
Comment 26 Rastislav Wagner CLA 2017-12-07 03:58:19 EST
@Andrey patch looks good to me!
Comment 28 Andrey Loskutov CLA 2017-12-08 10:57:28 EST
(In reply to Eclipse Genie from comment #27)
> Gerrit change https://git.eclipse.org/r/56954 was merged to [master].
> Commit:
> http://git.eclipse.org/c/platform/eclipse.platform.runtime.git/commit/
> ?id=d891f5524964c2901e4710e92389e6608cd21ff1

I've merged now the patch to master, it should be available in the next nightly build. I'm not closing this bug for now, I want to validate if the tests from other components will be affected by the change (I can see it only on next full platform build).
Comment 29 Andrey Loskutov CLA 2017-12-12 06:57:55 EST
I haven't seen any related test fails in recent SDK builds, so we can close this bug OR we can backport this to 4.7.3.

I propose to backport this, because the build loop is never a good idea.
Any objections?
Comment 30 Lars Vogel CLA 2017-12-12 09:11:37 EST
(In reply to Andrey Loskutov from comment #29)
> I propose to backport this, because the build loop is never a good idea.
> Any objections?

+1 for a downport
Comment 31 Eclipse Genie CLA 2017-12-13 02:36:30 EST
New Gerrit change created: https://git.eclipse.org/r/113299
Comment 32 Eclipse Genie CLA 2017-12-13 02:41:45 EST
New Gerrit change created: https://git.eclipse.org/r/113300
Comment 34 Dani Megert CLA 2018-01-15 12:42:56 EST
Did you check whether the bundle version needs to be increased?
Comment 35 Andrey Loskutov CLA 2018-01-15 13:00:41 EST
(In reply to Dani Megert from comment #34)
> Did you check whether the bundle version needs to be increased?

My bad, I saw commit with pom version increase but this was not the bundle versions increase. I will push bundle version increment today.
Comment 36 Eclipse Genie CLA 2018-01-15 13:22:10 EST
New Gerrit change created: https://git.eclipse.org/r/115409