Bug 558655 - Auto-build job shown in Progress view after its done
Summary: Auto-build job shown in Progress view after its done
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: UI (show other bugs)
Version: 4.14   Edit
Hardware: PC Linux
: P3 normal (vote)
Target Milestone: 4.15 M1   Edit
Assignee: Simeon Andreev CLA
QA Contact: Paul Pazderski CLA
URL:
Whiteboard:
Keywords: regression
Depends on:
Blocks: 559773
  Show dependency tree
 
Reported: 2019-12-27 14:18 EST by Simeon Andreev CLA
Modified: 2020-02-03 04:41 EST (History)
5 users (show)

See Also:


Attachments
Video showing the problem, when using breakpoints to cause a race condition. (2.90 MB, video/mp4)
2019-12-31 04:38 EST, Simeon Andreev CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Simeon Andreev CLA 2019-12-27 14:18:04 EST
See https://bugs.eclipse.org/bugs/show_bug.cgi?id=553430#c5, https://bugs.eclipse.org/bugs/show_bug.cgi?id=553430#c10 resp. https://bugs.eclipse.org/bugs/attachment.cgi?id=281341.

(In reply to Andrey Loskutov from comment #5)
> I fear we have a bigger issue with this, and the test fail shows the real
> regression in 4.14.
> 
> With changes from bug 552952 and bug 552958 backported to 4.12 I see in our
> product that sometimes one "Building workspace (finished) OK" job entry
> stays "forever" in the progress view, until one either clicks on "remove all
> finished operations" button or re-opens the view. Originally this should
> never happen, the info should always disappear after the workspace build.
> 
> I've now removed those commits from our build and started re-validation
> (will take time), but since I had no other related changes backported, it is
> with the high probability a regression from from bug 552952 and bug 552958
> changes.

When we base our product on one of the latest Eclipse SDK integration builds, we sometimes see a "hanging" auto-build job in the progress view.

So far we have not reproduced this with pure Eclipse SDK functionality.
Comment 1 Simeon Andreev CLA 2019-12-27 14:21:47 EST
I've added some system outs, since the problem is *very* rare during repeated automatic testing (we create a project specific to our product and trigger a clean-build job ad-nauseam).

I see that during "ProgressManager.notifyListeners()" the job that remains in the Progress view has no listeners. Sample system outs are:

Pending addition: Building(4)[org.eclipse.ui.internal.progress.JobInfo@7be82da]---> [org.eclipse.ui.internal.progress.AnimationManager$2@344ab465, org.eclipse.ui.internal.progress.FinishedJobs$1@5ce1b67b, org.eclipse.ui.internal.progress.ProgressViewUpdater@d05d8c3]

Pending removal: Building(4)[org.eclipse.ui.internal.progress.JobInfo@7be82da]---> []

I'll look into this next week, it seems like a good start.
Comment 2 Michael Keppler CLA 2019-12-29 03:09:28 EST
I once experienced jobs not being removed from the progress view due to not returning an IStatus (but instead throwing an exception). Not sure if that can happen with the auto build job, however (and if so, where to look for exceptions in your specific build participants).
Comment 3 Simeon Andreev CLA 2019-12-29 13:55:58 EST
(In reply to Michael Keppler from comment #2)
> I once experienced jobs not being removed from the progress view due to not
> returning an IStatus (but instead throwing an exception). Not sure if that
> can happen with the auto build job, however (and if so, where to look for
> exceptions in your specific build participants).

I noticed some code that marks jobs which return an error status as kept. This then leads to seeing such a job in the Progress view. In our case though, we have a successful auto-build job that is (very rarely) still seen in the view.
Comment 4 Simeon Andreev CLA 2019-12-30 07:30:20 EST
Maybe something like this occurs:

"Worker-13: Resources Changed" #82 prio=5 os_prio=0 tid=0x00007fffa0582800 nid=0x189dc at breakpoint[0x00007fff9d0f5000]
   java.lang.Thread.State: RUNNABLE
        at org.eclipse.core.internal.jobs.InternalJob.internalSetState(InternalJob.java:302)
        at org.eclipse.core.internal.jobs.JobManager.changeState(JobManager.java:422)
        - locked <0x00000004c2418b38> (a java.lang.Object)
        - locked <0x00000004c10a7430> (a java.lang.Object)
        at org.eclipse.core.internal.jobs.JobManager.doSchedule(JobManager.java:585)
        - locked <0x00000004c10a7430> (a java.lang.Object)
        at org.eclipse.core.internal.jobs.JobManager.schedule(JobManager.java:1267)
        at org.eclipse.core.internal.jobs.InternalJob.schedule(InternalJob.java:388)
        at org.eclipse.core.runtime.jobs.Job.schedule(Job.java:700)
        at org.eclipse.core.internal.events.AutoBuildJob.build(AutoBuildJob.java:106)
        - locked <0x00000004c2418788> (a org.eclipse.core.internal.events.AutoBuildJob)
        at org.eclipse.core.internal.events.BuildManager.endTopLevel(BuildManager.java:583)
        at org.eclipse.core.internal.resources.Workspace.endOperation(Workspace.java:1517)
        at org.eclipse.core.internal.resources.Workspace.run(Workspace.java:2306)
        at org.eclipse.core.internal.resources.Workspace.run(Workspace.java:2317)
        at org.eclipse.team.svn.core.utility.ProgressMonitorUtility$1.run(ProgressMonitorUtility.java:58)
        at org.eclipse.core.internal.jobs.Worker.run(Worker.java:63)

"Worker-16: Building" #315 prio=5 os_prio=0 tid=0x00000000026d2000 nid=0x194b3 at breakpoint[0x00007fffbbdc5000]                                                                                                                             
   java.lang.Thread.State: RUNNABLE                                                                                                                                                                                                          
        at org.eclipse.ui.internal.progress.ProgressManager.removeJobInfo(ProgressManager.java:678)                                                                                                                                          
        at org.eclipse.ui.internal.progress.ProgressManager$1.done(ProgressManager.java:468)                                                                                                                                                 
        at org.eclipse.core.internal.jobs.JobListeners$$Lambda$17/864632780.notify(Unknown Source)                                                                                                                                           
        at org.eclipse.core.internal.jobs.JobListeners.doNotify(JobListeners.java:73)                                                                                                                                                        
        at org.eclipse.core.internal.jobs.JobListeners.done(JobListeners.java:118)                                                                                                                                                           
        at org.eclipse.core.internal.jobs.JobManager.endJob(JobManager.java:701)                                                                                                                                                             
        at org.eclipse.core.internal.jobs.WorkerPool.endJob(WorkerPool.java:117)                                                                                                                                                             
        at org.eclipse.core.internal.jobs.Worker.run(Worker.java:83)

I.e. one thread is scheduling an auto-build, causing the auto-build job to momentarily have a sleeping state. At the same time, "ProgressManager.removeJobInfo(JobInfo)" runs and adds no listeners due to the job having a sleeping state (see "ProgressManager.isCurrentDisplaying(Job, boolean)"). This must occur directly after "ProgressManager.notifyListeners()", so that "ProgressManager.pendingJobRemoval" is empty ("removeJobInfo()" doesn't touch already added listeners).

From that point on I'm somewhat uncertain how the race condition can unfold. I would expect that the newly scheduled auto-build job finishes, which again triggers "ProgressManager.removeJobInfo(JobInfo)", filling the pending removal list with the expected listeners. But maybe the new auto-build job has a new "JobInfo" object (so far no idea how that works), though that isn't supported by the system outs I have from the fail:

[ProgressManager        ] adding job info: Building(4)[org.eclipse.ui.internal.progress.JobInfo@7be82da]
[ProgressManager        ] pending addition: Building(4)[org.eclipse.ui.internal.progress.JobInfo@7be82da]---> [org.eclipse.ui.internal.progress.AnimationManager$2@344ab465, org.eclipse.ui.internal.progress.FinishedJobs$1@5ce1b67b, org.eclipse.ui.internal.progress.ProgressViewUpdater@d05d8c3]
[DetailedProgressViewer ] found item for build job: Building(4)[org.eclipse.ui.internal.progress.JobInfo@7be82da]
removing job info: Building(4)[org.eclipse.ui.internal.progress.JobInfo@7be82da]
[ProgressManager        ] adding job info for build: Building(4)[org.eclipse.ui.internal.progress.JobInfo@7be82da]
[DetailedProgressViewer ] removing progress view elements:
    Setup dependency service task dispatcher(25), tasks: [], state: 0[org.eclipse.ui.internal.progress.JobInfo@41ddedce]
    Save testflow UI state.(61259)[org.eclipse.ui.internal.progress.JobInfo@4d0a34f1]
    test clean(61252)[ org.eclipse.ui.internal.progress.JobInfo@46e7128e]
[ProgressManager       ] pending removal: Building(4)[org.eclipse.ui.internal.progress.JobInfo@7be82da]---> []

The job info is added directly before the sleeping state is set on the auto-build job.

Setting breakpoints at "internalSetState.setInternalState()", "ProgressManager.notifyListeners()", "ProgressManager.addJobInfo()" and "ProgressManager.removeJobInfo()", I just reproduced the fail. But with so much going on at those breakpoints, I'll need more time to figure out what the exact sequence is (3 threads are involved). Considering the delayed "UI throttling", it might be the product of debugging the test and so slowing down some code...

As to why the problem would show after fixes for bug 552958 and bug 552952, maybe its the extra added calls to "throttledUpdate.throttledExec();". The above stack traces do seem like an old race condition, not a new one.
Comment 5 Andrey Loskutov CLA 2019-12-30 07:36:16 EST
Please note, all versions before 4.14 (or more precisely before the two commits I've mentioned earlier), never ever showed the problem. Even if we had a bug somewhere, it was uncovered now.
Comment 6 Simeon Andreev CLA 2019-12-31 04:38:28 EST
Created attachment 281359 [details]
Video showing the problem, when using breakpoints to cause a race condition.

I think I see another problem with "ProgressManager.progressFor(Job)". Its possible (again e.g. for the auto-build job) that "ProgressManager.addJobInfo(JobInfo)" reuses an existing JobInfo, during a parallel call to "ProgressManager.removeJobInfo(JobInfo)" (which will dump the JobInfo reused by addJobInfo()).

The actual removal is done based on JobInfo objects, see "DetailedProgressViewer.jobItemControls".

So once the auto-build job is done, the remove will be triggered for another JobInfo and the one used by the addJobInfo() call will be "stuck".

So the steps are more or less (1. and 2. could be simplified, no idea about the rest):

1. Have some project that takes a bit of time to compile. E.g. I used a simple generator (the path must be adjusted):
	public static void main(String[] args) throws IOException {
		String path = "/data/tmp/runtimeeclipse_local_ws/Test/src/test";
		for (int i = 0; i < 10_000; ++i) {
			String classContents = "package test; class Test" + i + " { public void foo() { System.out.println(); } }";
			Files.write(Paths.get(path, "Test" + i + ".java"), classContents.getBytes());
		}
	}

2. Add the following code, e.g. during the execution of a command:
public class SampleHandler extends AbstractHandler {

	@Override
	public Object execute(ExecutionEvent event) throws ExecutionException {
		new Job("top level") {
			@Override
			protected IStatus run(IProgressMonitor monitor) {
				for (int i = 0; i < 2; ++i) {
					WorkspaceJob j = new WorkspaceJob("some job") {
						@Override
						public IStatus runInWorkspace(IProgressMonitor monitor) throws CoreException {
							monitor.beginTask("some task", 5);
							IProject project = ResourcesPlugin.getWorkspace().getRoot().getProject("Test");
							project.build(IncrementalProjectBuilder.CLEAN_BUILD, monitor);
							return Status.OK_STATUS;
						}
					};
					j.setRule(ResourcesPlugin.getWorkspace().getRuleFactory().buildRule());
					j.schedule();
					try {
						j.join();
					} catch (InterruptedException e) {
						//
					}
				}
				return Status.OK_STATUS;
			}
			
		}.schedule();
		return null;
	}
}
3. Add breakpoints at:
    A) ProgressManager.addJobInfo(JobInfo), at first line of the method, with condition "info.getJob().getClass().getName().contains("AutoBuildJob")"
    B) ProgressManager.removeJobInfo(JobInfo), at the synchronized block (must be *after* the line "managedJobs.remove(job);"), with condition "info.getJob().getClass().getName().contains("AutoBuildJob")"
    C) at the last Job.schedule() call in the code from step 2.
4. Debug Eclipse, disable/resume breakpoints until invoking the code from step 2..
5. Resume breakpoints C) and A), until the breakpoint at B) is reached.
6. Resume breakpoint C), so that the breakpoint at A) is reached. At this point, there are parallel calls to addJobInfo() and removeJobInfo() in ProgressManager.
7. Resume breakpoints A), (there will be multiple hits) until only the breakpoint at B) remains.
8. Resume the remaining breakpoint hits. An item is now stuck at the progress view, until the view is refreshed (e.g. by changing tabs, closing/opening the view or interacting with the stuck item).
Comment 7 Simeon Andreev CLA 2019-12-31 04:42:40 EST
> 3. Add breakpoints at:
>    ...
>    C) at the last Job.schedule() call in the code from step 2.

Sorry, I meant the *first* schedule call. See also the attached video.
Comment 8 Simeon Andreev CLA 2019-12-31 05:24:37 EST
There is an asymmetry between ProgressManager.addJobInfo(JobInfo) and ProgressManager.removeJobInfo(JobInfo), which I think causes the problem. addJobInfo() will re-add the job at ProgressManager.managedJobs, but will not touch ProgressManager.runnableMonitors. Where as removeJobInfo() will remove the job from both sets.

So simplified, we have:

1. add
2. remove
3. add
4. remove

2. and 3. run in parallel, 2. will remove the auto-build job from runnableMonitors, 3. will not re-add it. 4. will therefore create a new JobInfo which was not used by 3.. We now have a "hanging" job item.

Maybe ProgressManager.runnableMonitors can replace ProgressManager.managedJobs altogether, I'm not sure.

Some more background for our use case, we use QF-Test to validate interactive performance with our product. Right now, our tests wait on the progress view to have no items after which the tests assume the build is done (QF-Test has access to SWT to some extent, but nothing else I think). So a small UI bug actually has quite an impact for us (we cannot update until we fix this). I'll look into a fix of course.

A while back I had problems with auto-build not showing in the progress view (or in the status bar), while auto-build was running. I'm not sure if this is the same problem or some similar race condition.
Comment 9 Paul Pazderski CLA 2019-12-31 05:53:46 EST
(In reply to Simeon Andreev from comment #6)
> 7. Resume breakpoints A), (there will be multiple hits) until only the
> breakpoint at B) remains.

I assume you twisted A and B at this point. If I didn't mixed it up myself it should be resume B) (the remove method) until only A) (the add method) remains.

(In reply to Simeon Andreev from comment #8)
> There is an asymmetry between ProgressManager.addJobInfo(JobInfo) and
> ProgressManager.removeJobInfo(JobInfo), which I think causes the problem.
> addJobInfo() will re-add the job at ProgressManager.managedJobs, but will
> not touch ProgressManager.runnableMonitors. Where as removeJobInfo() will
> remove the job from both sets.

It probably does not add to runnableMonitors because the progressFor method already did if necessary. (of cause works not always due to the race condition you found)

> I'll look into a fix of course.

I'll propose a fix as well. From my current understanding the worst part for your problem is the race condition between "get monitor to remove" and "remove monitor" where another thread can check for the monitor and find/get the 'not yet removed' job. Kind of the same problem as in bug 552958.
Comment 10 Eclipse Genie CLA 2019-12-31 05:58:17 EST
New Gerrit change created: https://git.eclipse.org/r/155118
Comment 11 Simeon Andreev CLA 2019-12-31 06:15:04 EST
(In reply to Paul Pazderski from comment #9)
> (In reply to Simeon Andreev from comment #6)
> > 7. Resume breakpoints A), (there will be multiple hits) until only the
> > breakpoint at B) remains.
> 
> I assume you twisted A and B at this point. If I didn't mixed it up myself
> it should be resume B) (the remove method) until only A) (the add method)
> remains.

Yes, sorry, you are correct.

> I'll propose a fix as well. From my current understanding the worst part for
> your problem is the race condition between "get monitor to remove" and
> "remove monitor" where another thread can check for the monitor and find/get
> the 'not yet removed' job. Kind of the same problem as in bug 552958.

https://git.eclipse.org/r/#/c/155118/

Looks good at first glance, I don't think the extra lock would be an issue. I'll try it out with my test. So far it took a couple of hours for a fail to occur, I'll let you know if there were any fails (probably first on 2. January). Thanks!
Comment 12 Paul Pazderski CLA 2019-12-31 06:20:13 EST
No problem. Thank you for investigating this nasty problem!

(In reply to Simeon Andreev from comment #11)
> I don't think the extra lock would be an issue.

I wouldn't even count it as extra lock since I dropped the 'Concurrent' from the hash map.
Comment 13 Simeon Andreev CLA 2020-01-01 02:45:50 EST
Still seeing a fail:

##### TEST ITERATION 192 STARTING #####
removing progress view elements:
to remove: Setup dependency service task dispatcher(25), tasks: [], state: 0[org.eclipse.ui.internal.progress.JobInfo@2a04d3d0]
to remove: Checking consistency of type history...(76)[org.eclipse.ui.internal.progress.JobInfo@543c10db]
to remove: Save testflow UI state.(2639)[org.eclipse.ui.internal.progress.JobInfo@5d212820]
adding job info: Building(4)[org.eclipse.ui.internal.progress.JobInfo@723c6181]
Auto-build job state set to SLEEP
Auto-build job state set to SLEEP
Pending addition: Building(4)[org.eclipse.ui.internal.progress.JobInfo@723c6181]---> [org.eclipse.ui.internal.progress.AnimationManager$2@228f15de, org.eclipse.ui.internal.progress.FinishedJobs$1@2f8a82e9, org.eclipse.ui.internal.progress.ProgressViewUpdater@688e2e8d]
found item for build job: Building(4)[org.eclipse.ui.internal.progress.JobInfo@723c6181]
Auto-build job state set to SLEEP
removing job info: Building(4)[org.eclipse.ui.internal.progress.JobInfo@723c6181]
adding job info for build: Building(4)[org.eclipse.ui.internal.progress.JobInfo@723c6181]
removing progress view elements:
to remove: Setup dependency service task dispatcher(25), tasks: [], state: 0[org.eclipse.ui.internal.progress.JobInfo@470c4142]
to remove: Save testflow UI state.(2644)[org.eclipse.ui.internal.progress.JobInfo@2346b057]
to remove: test clean(2637)[org.eclipse.ui.internal.progress.JobInfo@3948cc7f]
Incoming removal: Building(4)[org.eclipse.ui.internal.progress.JobInfo@723c6181]
Pending removal: Building(4)[org.eclipse.ui.internal.progress.JobInfo@723c6181]---> []
removing job info: Building(4)[org.eclipse.ui.internal.progress.JobInfo@4ee76fe6]
Incoming removal: Building(4)[org.eclipse.ui.internal.progress.JobInfo@4ee76fe6]
Pending removal: Building(4)[org.eclipse.ui.internal.progress.JobInfo@4ee76fe6]---> [org.eclipse.ui.internal.progress.AnimationManager$2@228f15de, org.eclipse.ui.internal.progress.FinishedJobs$1@2f8a82e9, org.eclipse.ui.internal.progress.ProgressViewUpdater@688e2e8d]
removing progress view elements:
to remove: Building(4)[org.eclipse.ui.internal.progress.JobInfo@4ee76fe6]
removing build job: Building(4)[org.eclipse.ui.internal.progress.JobInfo@4ee76fe6]
no widget found for build job: Building(4)[org.eclipse.ui.internal.progress.JobInfo@4ee76fe6]
jobItemControls element: Building(4)[org.eclipse.ui.internal.progress.JobInfo@723c6181]
job still visible in Progress view: Building(4)[org.eclipse.ui.internal.progress.JobInfo@723c6181]

I have system outs in ProgressManager, when JobInfo objects are added/removed as well as in notifyListeners() (the lines which point to which listeners will be notified). Also in InternalJob.setInternalState(), and in DetailedProgressViewer.remove().

So we might still have the problem I mentioned earlier, if a job went to sleep during ProgressManager.removeJobInfo(). I've made a minor change before, to not care about the sleeping state during the remove call (I reverted it as there still was a failure). I'll try this change again (to know if this is the only remaining problem).
Comment 14 Simeon Andreev CLA 2020-01-02 02:44:11 EST
On top of the suggested fix by Paul, ignoring the sleeping state of the job during the remove call seems to do it. My test is running for about 24 hours without failure.

I've updated the gerrit change, though I'm not too confident about this (I'm not familiar with the code).
Comment 15 Paul Pazderski CLA 2020-01-02 08:55:17 EST
Imo your addition is fine. It is likely related to bug 518189. There are probably more questionable code bits in progress handling and the job state volatility seem to be a frequent source of trouble.
Comment 17 Paul Pazderski CLA 2020-01-06 04:00:34 EST
Thanks for your help Simeon! Glad this is done, especially since I was blamed for it and could not reproduce it. :)
Comment 18 Lars Vogel CLA 2020-02-03 04:20:38 EST
See Bug 559773 for a deadlock potentially caused by this change.