Bug 501997 - Investigate asynchronous installation/uninstallation of IRefreshMonitor on workspace resources
Summary: Investigate asynchronous installation/uninstallation of IRefreshMonitor on wo...
Status: REOPENED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Resources (show other bugs)
Version: 4.6   Edit
Hardware: PC Mac OS X
: P3 normal (vote)
Target Milestone: ---   Edit
Assignee: Alex Blewitt CLA
QA Contact:
URL:
Whiteboard: performance
Keywords:
Depends on: 508419 514090 572128
Blocks: 507092 507812
  Show dependency tree
 
Reported: 2016-09-22 09:24 EDT by Mikaël Barbero CLA
Modified: 2021-03-21 06:48 EDT (History)
8 users (show)

See Also:


Attachments
Class loaded during Workspace init (63.83 KB, text/plain)
2016-09-29 11:09 EDT, Mikaël Barbero CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mikaël Barbero CLA 2016-09-22 09:24:59 EDT
When Eclipse is started and the "Refresh using native hooks or polling" option is activated, the o.e.core.resources plugin activation is stopped by the installation of native hooks or polling (see bug 501306 comment 4). If it ever takes longer than 5 seconds, OSGi class loader will time out and Eclipse would not be properly initialized. We should run the installation asynchronously and call org.eclipse.core.internal.refresh.PollingMonitor.runOnce() after the installation in order to be sure we are fully synchronized (otherwise, we can loose some change events during the monitor installation).
Comment 1 Mikaël Barbero CLA 2016-09-22 12:33:11 EDT
One interesting finding: 

We may not be able to "simply" install/uninstall refresh monitors asynchronously as it seems that AliasManager requires to be started after MonitorManager (see bug 94829). The bug does not explain all the details, but if the AliasManager requires the monitors to be installed then we will need to do both startup sequentially, in a single background task.
Comment 2 Eclipse Genie CLA 2016-09-23 08:37:46 EDT
New Gerrit change created: https://git.eclipse.org/r/81786
Comment 3 Mikaël Barbero CLA 2016-09-23 08:40:05 EDT
https://git.eclipse.org/r/81786 is a bold change: I've moved the whole workspace startup in a background process. I'm looking forward to seeing what you think about that.
Comment 4 Sergey Prigogin CLA 2016-09-23 20:24:06 EDT
(In reply to Mikaël Barbero from comment #3)
Bold indeed! God only knows what kinds of deadlocks and other problems could happen because of this change. What is your plan for testing it?
Comment 5 Mikaël Barbero CLA 2016-09-27 06:30:45 EDT
Deadlocks can happen if a thread is spawn during the background initialization of the workspace, if this thread calls ResourcesPlugin.getWorkspace() and the spawner joins on this later thread. 

AFAICT, all the workspace init code is internal and only the refresh manager delegates some processing to extensions (during startup). It means that apart from refresh manager, we have a hand on all the init code. 

Is it reasonable to add a constraint on ResourcesPlugin.getWorkspace() for internal init code and refresh providers? The constraint would be "do not call ResourcesPlugin.getWorkspace() from a different thread than your caller's and later join on this thread."

Also note that workspace jobs can still be scheduled during init because WorkManager locks the whole workspace during init.

For the testing, I simply made the init job very long to execute (sleeping for several seconds, at different locations in the init code) and checked that no error was raised. I also added a log message before joining the init job: 

getPlugin().getLog().log(new Status(IStatus.INFO, PI_RESOURCES, Thread.currentThread() + " blocked in #getWorkspace()", new Exception())); //$NON-NLS-1$

Only EMF showed up in this log:

Thread[main,6,main] blocked in #getWorkspace()
	at org.eclipse.core.resources.ResourcesPlugin.getWorkspace(ResourcesPlugin.java:425)
	at org.eclipse.emf.ecore.plugin.EcorePlugin.getWorkspaceRoot(EcorePlugin.java:1131)
	at org.eclipse.emf.ecore.resource.impl.ExtensibleURIConverterImpl.<clinit>(ExtensibleURIConverterImpl.java:393)
	at org.eclipse.emf.ecore.resource.impl.ResourceSetImpl.getURIConverter(ResourceSetImpl.java:499)
	at org.eclipse.emf.ecore.resource.impl.ResourceSetImpl.getResource(ResourceSetImpl.java:369)
	at org.eclipse.e4.ui.internal.workbench.ResourceHandler.getResource(ResourceHandler.java:286)
	at org.eclipse.e4.ui.internal.workbench.ResourceHandler.loadResource(ResourceHandler.java:262)
	at org.eclipse.e4.ui.internal.workbench.ResourceHandler.loadMostRecentModel(ResourceHandler.java:169)
	at org.eclipse.e4.ui.internal.workbench.swt.E4Application.loadApplicationModel(E4Application.java:378)
	at org.eclipse.e4.ui.internal.workbench.swt.E4Application.createE4Workbench(E4Application.java:253)
	at org.eclipse.ui.internal.Workbench$5.run(Workbench.java:626)
	[...]
	at org.eclipse.equinox.launcher.Main.main(Main.java:1492)
Comment 6 Sergey Prigogin CLA 2016-09-27 13:58:08 EDT
(In reply to Mikaël Barbero from comment #5)
This testing looks like a good start.

> AFAICT, all the workspace init code is internal and only the refresh manager
> delegates some processing to extensions (during startup). It means that apart
> from refresh manager, we have a hand on all the init code.

This is not quite true since visible editors and views trigger activation of their host plugins and a lot of other things. Experience shows that initialization scenario varies widely depending on the set of open editors and views.  We definitely need more real life testing.
Comment 7 Mikaël Barbero CLA 2016-09-29 11:09:43 EDT
Created attachment 264496 [details]
Class loaded during Workspace init

I've investigated a bit more. With the current master of o.e.c.resources, I've modified the ResourcesPlugin#start method to sysout markers and the beginning and the end of the ResourcesPlugin#start() method. Then, I've started a runtime with verbose:class and identified which classes are loaded between the beginning and the end of the workspace init (see attachment for the list). From this list, it is easy to extract which plugins are potentially activated. Here is the list of plugins:

org.eclipse.core.contenttype_3.6.0.v20160919-1839.jar
org.eclipse.core.filesystem
org.eclipse.core.jobs_3.8.0.v20160901-1938.jar
org.eclipse.core.resources
org.eclipse.core.runtime_3.12.0.v20160919-2241.jar
org.eclipse.equinox.common_3.9.0.v20160831-0037.jar
org.eclipse.equinox.preferences_3.6.100.v20160815-1421.jar

(o.e.c.resources and o.e.c.filesystem were in my workspace, hence the name)

So the risk of deadlocking is mainly in the activator of these plugins. Here is a summary of what the plugin activators do:

org.eclipse.core.contenttype => register service
org.eclipse.core.filesystem => do nothing
org.eclipse.core.jobs => register services
org.eclipse.core.runtime => most probably already started, and anyway it just get the application args
org.eclipse.equinox.common => register services (also most probably already started)
org.eclipse.equinox.preferences => register services

It looks safe. 

(In reply to Sergey Prigogin from comment #6)
> We definitely need more real life testing.

I'm running my eclipses with this patch for a week now, and experienced no deadlock. What do you suggest for more real life testing? What about integrating a patch with a timeout (20secs?) on the join and we throw an exception, with info about a potential deadlock between the job init thread and the current thread that has reached the timeout (both with stacktraces). I'm also willing to add a property (-Ddisable.concurrent.workspace.init=true) to deactivate the background init of the workspace in case it blocks people (and add this info in the exception message). WDYT?
Comment 8 Sergey Prigogin CLA 2016-09-30 00:38:10 EDT
(In reply to Mikaël Barbero from comment #7)
The amount of testing you've done already is sufficient in my view for submitting this change to master even without the disable.concurrent.workspace.init red button.
Comment 9 Mikaël Barbero CLA 2016-10-03 07:19:47 EDT
(In reply to Sergey Prigogin from comment #8)
> (In reply to Mikaël Barbero from comment #7)
> The amount of testing you've done already is sufficient in my view for
> submitting this change to master even without the
> disable.concurrent.workspace.init red button.

Thanks. I've updated the patch with a timed join that logs a warning when the workspace init takes too long.
Comment 11 Sergey Prigogin CLA 2016-10-07 16:39:55 EDT
Crossing my fingers.
Comment 12 Andrey Loskutov CLA 2016-11-05 04:59:50 EDT
(In reply to Sergey Prigogin from comment #11)
> Crossing my fingers.

I'm sorry, crossing fingers didn't help, see bug 507092 :-(
Comment 13 Sergey Prigogin CLA 2017-03-22 20:31:55 EDT
Reopened due to the issue described in https://bugs.eclipse.org/bugs/show_bug.cgi?id=514090#c4
Comment 14 Eclipse Genie CLA 2017-03-22 20:38:47 EDT
New Gerrit change created: https://git.eclipse.org/r/93664
Comment 16 Andrey Loskutov CLA 2017-03-23 04:27:01 EDT
(In reply to Sergey Prigogin from comment #13)
> Reopened due to the issue described in
> https://bugs.eclipse.org/bugs/show_bug.cgi?id=514090#c4

Doe this mean, we don't plan to init ResourcesPlugin asynchronously anymore? If yes, we can close bug 508419.
Comment 17 Sergey Prigogin CLA 2017-03-23 14:20:38 EDT
> Doe this mean, we don't plan to init ResourcesPlugin asynchronously anymore?
> If yes, we can close bug 508419.

Asynchronous initialization of ResourcesPlugin is still desirable but it cannot be done safely unless suspension of the job manager is removed.
Comment 18 Jens Kuebler CLA 2017-05-24 04:11:16 EDT
I'm wondering why not use a osgi component for the workspace. It is registered as service anyways. Providing a configuration for the storage locations seems to be straight forward. If configuration should be non persistent the component factory comes in handy as well.

I can't see that for creating the service a scheduling rule is required so the job manager should not be necessary at all and the osgi scr handles startup with its threading model.
Although not the best solution the static getWorkspace could do a service lookup and return null if the service is not yet registered.
Comment 19 Eclipse Genie CLA 2017-05-24 07:12:07 EDT
New Gerrit change created: https://git.eclipse.org/r/97851
Comment 20 Jens Kuebler CLA 2017-05-24 07:18:51 EDT
I don't wanna tamper with with the initial approach. I just think that due to the numerous cycle dependencies in this plugin more needs to be refactored. Most of the time passing the arguments seems totally fine to me.

The service lookup is certainly not safe at the moment but adding more time to the lookup or using an event based method seems totally fine to me.

The presented patch is a stub for further discussion and not production ready.
Comment 21 Mikaël Barbero CLA 2017-05-26 05:02:09 EDT
(In reply to Jens  Kuebler from comment #18)
> Although not the best solution the static getWorkspace could do a service
> lookup and return null if the service is not yet registered.

The issue is that *many* plugins expect the workspace to be registered/initialized when the they call getWorskpace() (or throw an IllegalStateException).
Comment 22 Jens Kuebler CLA 2017-05-29 01:16:44 EDT
I think the design problem that the old implementation had, was that the lifecycle of the bundle and the associated class loading for the bundle was coupled with initialization of the workbench. The initialization may only happen after it has been configured (either by the application, the command line or a user interaction). However this forbids loading classes from that bundle until configuration has happened. Using the Jobs API allows for using scheduling rules and doing things asynchronously. However it does not guarantee that the workbench location has been configured unless done programmatically. Currently this seems like choosing between the lesser of two evils, since the workbench crashes if classes from the resource bundle are loaded too early. On the other hand consumers receive NPEs if the workbench has not been initialized yet. Switching to an event bases system seems reasonable for me and since the OSGi service seems to be a natural fit I would go for this, even if clients relied on runtime behavior. Maybe a combination with a job / scheduling rule during initialization is reasonable since it needs to lock the workspace anyways then.
Clients would have a clean migration path since they all must wait on the workbench to become available until they may proceed or an error will occur.
Maybe providing a legacy mode is another option here.
Comment 23 Alex Blewitt CLA 2020-07-02 11:49:27 EDT
There hasn't been any work on this recently. I'm happy to take it over if there aren't any objections?
Comment 24 Mikaël Barbero CLA 2020-07-02 11:50:48 EDT
(In reply to Alex Blewitt from comment #23)
> There hasn't been any work on this recently. I'm happy to take it over if
> there aren't any objections?

Glad you have a look.