Bug 571568 - Move publication of the IWorkspace service to a separate thread, outside the activator
Status: NEW
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Resources
Version: 4.20
Hardware: All All
Importance: P3 enhancement
Target Milestone: ---
Assignee: Alex Blewitt CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on: 572128
Blocks:
 
Reported: 2021-02-27 18:05 EST by Alex Blewitt CLA
Modified: 2021-03-19 15:37 EDT (History)
3 users (show)

See Also:


Attachments
Zip of sample project (5.38 KB, application/zip)
2021-02-27 18:05 EST, Alex Blewitt CLA

Description Alex Blewitt CLA 2021-02-27 18:05:26 EST
Created attachment 285686 [details]
Zip of sample project

The resources plugin registers an IWorkspace service (see bug 264064) in the Resources activator.

https://github.com/eclipse/eclipse.platform.resources/blob/434556bcae52a8bc1d9476d2574cc65821f22841/bundles/org.eclipse.core.resources/src/org/eclipse/core/resources/ResourcesPlugin.java#L492

When published, any DS components that are listening for the IWorkspace service will be activated.

The problem is that they are activated on the same thread, and in series. So if there are tens or hundreds of services waiting for the IWorkspace (and the number will grow as more components move to DS), the apparent delay attributed to the Resources plugin grows with them. There are already issues where moving the EGit setup to DS components causes start-up delays that appear to be due to the Resources plugin when, in fact, it's not the Resources plugin's fault.

To demonstrate, I have published an example, and will add a zip to this bug as well:

https://github.com/alblue/ResourceHog

When cloned and run in an Eclipse start-up, it will delay the launch of Eclipse by 5s. This is simulated with a Thread.sleep() but the reality is it's a placeholder for an unbounded amount of work that could happen in a component that the Resources plugin has no visibility over.
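
The blocking behaviour described above can be modelled without OSGi. The following is a minimal sketch, assuming a hypothetical `Registry` class that delivers service events synchronously on the caller's thread; it is not the Equinox or SCR implementation, and all names in it are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class SyncRegistryDemo {

    /** Hypothetical stand-in for a registry that delivers service events synchronously. */
    static class Registry {
        private final List<Consumer<Object>> listeners = new ArrayList<>();

        void addListener(Consumer<Object> listener) {
            listeners.add(listener);
        }

        void registerService(Object service) {
            // Each listener runs to completion on the caller's thread, one after another.
            for (Consumer<Object> listener : listeners) {
                listener.accept(service);
            }
        }
    }

    /** Registers a placeholder service and returns the thread name each listener ran on. */
    static List<String> publish(int listenerCount, long workMillis) {
        Registry registry = new Registry();
        List<String> ranOn = new ArrayList<>();
        for (int i = 0; i < listenerCount; i++) {
            registry.addListener(service -> {
                simulateWork(workMillis); // placeholder for unbounded component work
                ranOn.add(Thread.currentThread().getName());
            });
        }
        registry.registerService("IWorkspace placeholder");
        return ranOn;
    }

    static void simulateWork(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        List<String> ranOn = publish(3, 200);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("listeners ran on " + ranOn
                + ", publisher blocked for about " + elapsedMillis + " ms");
    }
}
```

Every listener reports the publisher's own thread, so the time the publisher spends inside `registerService` is the sum of the listeners' work.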

There are a number of approaches:

1. Do nothing, live with the delay
2. Move the call to 'context.registerService' into a new Thread(this::methodref).start() so it explicitly goes to a background thread
3. Move the work done in the DS component's start method, so that the Thread.sleep happens in a background thread
4. Work with DS so that the startup of the component is on an asynchronous, not a synchronous listener
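
For approach 2, the work only leaves the calling thread if the new thread is started with `Thread.start()`; calling `run()` directly is an ordinary method call on the current thread. A minimal sketch, assuming a hypothetical `registerWorkspaceService` method that stands in for the real `context.registerService` call:

```java
import java.util.concurrent.atomic.AtomicReference;

class BackgroundPublishDemo {

    /** Hypothetical placeholder for context.registerService(...); records the calling thread. */
    static void registerWorkspaceService(AtomicReference<String> ranOn) {
        ranOn.set(Thread.currentThread().getName());
    }

    /** Returns the name of the thread on which the (simulated) registration ran. */
    static String publishWith(boolean useStart) {
        AtomicReference<String> ranOn = new AtomicReference<>();
        Thread publisher = new Thread(() -> registerWorkspaceService(ranOn), "workspace-publisher");
        if (useStart) {
            publisher.start(); // runs on the new background thread
            try {
                publisher.join(); // joined here only so the demo can read the result
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        } else {
            publisher.run(); // plain method call: still on the current thread
        }
        return ranOn.get();
    }

    public static void main(String[] args) {
        System.out.println("start(): " + publishWith(true));
        System.out.println("run():   " + publishWith(false));
    }
}
```

A real activator would not join the publisher thread, since that would reintroduce the wait this approach is meant to avoid.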

Note that it's not always possible for 3. to be done correctly; even if a Job is used, the Job itself may have references to classes which trigger classloading activities. In fact, experience so far suggests that the majority of the delays in DS-launched components are not due to code complexity, but rather to a tree of classes that is implicitly loaded during the Job's construction (i.e. with new Job(), not job.run()). You can double-job it (have a Job whose only purpose when it runs is to start a second Job), but that's not a sensible solution for the long term.
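
The construction-time cost can be illustrated in plain Java. This is a minimal sketch that uses static initializers as an observable proxy for class loading; `HeavyA`, `HeavyB`, `EagerTask` and `LazyTask` are hypothetical names, and no Eclipse Job API is involved.

```java
import java.util.ArrayList;
import java.util.List;

class EagerLoadingDemo {

    static final List<String> EVENTS = new ArrayList<>();

    /** Stand-ins for an expensive tree of classes; the static block runs on first use. */
    static class HeavyA {
        static { EVENTS.add("HeavyA initialized"); }
    }

    static class HeavyB {
        static { EVENTS.add("HeavyB initialized"); }
    }

    /** Touches its dependency in a field initializer, i.e. at construction time. */
    static class EagerTask implements Runnable {
        private final HeavyA dependency = new HeavyA();

        @Override
        public void run() {
            EVENTS.add("eager task ran with " + dependency.getClass().getSimpleName());
        }
    }

    /** Touches its dependency only when it actually runs. */
    static class LazyTask implements Runnable {
        @Override
        public void run() {
            new HeavyB();
            EVENTS.add("lazy task ran");
        }
    }

    /** Builds both tasks once and returns the recorded order of events. */
    static synchronized List<String> trace() {
        if (EVENTS.isEmpty()) {
            Runnable eager = new EagerTask(); // HeavyA is initialized here, on the constructing thread
            EVENTS.add("eager task constructed");
            Runnable lazy = new LazyTask();   // HeavyB is not touched yet
            EVENTS.add("lazy task constructed");
            lazy.run();                       // HeavyB is initialized only now
            eager.run();
        }
        return new ArrayList<>(EVENTS);
    }

    public static void main(String[] args) {
        trace().forEach(System.out::println);
    }
}
```

In the eager case the dependency is initialized before the constructor returns, so handing the task to a background thread afterwards does not move that cost off the constructing thread.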
Comment 1 Alex Blewitt CLA 2021-02-27 18:28:14 EST
I should add that if we move the publication of the IWorkspace out to a separate thread, then while we will trigger any outstanding DS components off the activation thread, we will still trigger them serially. So if we have 10 components waiting on the IWorkspace publication, we will move from having the main thread do resources + 10xDS components to having 2 threads: one for the resources and one for the remaining 10xDS components, which it will still do serially.

If DS were able to use a bounded thread pool to do startup, then each DS component could stay in its own thread or thread % pool size.

Having lots of components should be an embarrassingly parallel problem, but we are embarrassingly serial at the moment.
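
As an illustration of the bounded-pool idea, here is a minimal sketch using a standard `java.util.concurrent` fixed thread pool. The "components" are just hypothetical sleeping tasks; DS does not offer this as an option in the form shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BoundedPoolActivationDemo {

    /** Activates hypothetical components on a bounded pool; returns the thread names used. */
    static Set<String> activateAll(int componentCount, int poolSize) {
        Set<String> threadsUsed = ConcurrentHashMap.newKeySet();
        List<Callable<Void>> activations = new ArrayList<>();
        for (int i = 0; i < componentCount; i++) {
            activations.add(() -> {
                threadsUsed.add(Thread.currentThread().getName());
                Thread.sleep(20); // placeholder for the component's activation work
                return null;
            });
        }
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            pool.invokeAll(activations); // blocks until every activation has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return threadsUsed;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        Set<String> threads = activateAll(10, 4);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("10 components on " + threads.size()
                + " threads took about " + elapsedMillis + " ms");
    }
}
```

With 10 components and a pool of 4, the work is spread over at most 4 threads, so the wall-clock time approaches the serial time divided by the pool size rather than the full serial sum.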
Comment 2 Alexander Fedorov CLA 2021-02-28 02:42:20 EST
(In reply to Alex Blewitt from comment #1)

It sounds like "the optimization should be applied to DS first". In that case, IWorkspace publication could have minimal impact. Am I right?
Comment 3 Lars Vogel CLA 2021-02-28 06:55:19 EST
Maybe do both? Use a separate thread and allow DS components to be (optionally?) activated in parallel in Felix with a bounded thread pool. I think Equinox already supports optional parallel bundle activation; adding this for DS components as well would be great.
Comment 4 Thomas Watson CLA 2021-03-02 17:39:38 EST
(In reply to Alex Blewitt from comment #1)

The SCR implementation just steals time on the thread that publishes the service event.  Changing this has the potential to make start-levels meaningless if you simply move the activation of immediate components to be asynchronous.  Right now when bundles are activated we know that SCR has processed all the components for the bundle.  By the exit of Bundle.start SCR will have published and enabled all components it should in reaction to the bundle being started.

Equinox did enhance the start-level implementation such that it can activate bundles in parallel.  This gives SCR more threads to do the work on for many bundles.  But still the thread that is starting a bundle will be used to fire the service event and SCR "fully processes" that event in the bundle starting thread.  Equinox then makes sure all the parallel threads activating bundles for a particular start-level are done before moving onto the next start-level. This ensures the components from a previous start-level are "ready" before moving to the next start-level.
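
The start-level behaviour described here (parallel within a level, a barrier between levels) can be sketched as follows. This is an illustrative model only, not the Equinox implementation; the bundle names and the `startAll` method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class StartLevelBarrierDemo {

    /** Starts hypothetical bundles level by level; returns "level:bundle" in completion order. */
    static List<String> startAll(Map<Integer, List<String>> bundlesByLevel, int threads) {
        List<String> started = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // TreeMap iterates the start-levels in ascending order.
            for (Map.Entry<Integer, List<String>> level : new TreeMap<>(bundlesByLevel).entrySet()) {
                List<Callable<Void>> starts = new ArrayList<>();
                for (String bundle : level.getValue()) {
                    starts.add(() -> {
                        // Placeholder for Bundle.start plus the synchronous event processing it triggers.
                        started.add(level.getKey() + ":" + bundle);
                        return null;
                    });
                }
                // Barrier: do not move to the next start-level until this one is fully done.
                pool.invokeAll(starts);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return started;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> bundles = Map.of(
                1, List.of("resources", "runtime"),
                2, List.of("ui"));
        System.out.println(startAll(bundles, 4));
    }
}
```

Bundles within one level may complete in any order, but no entry for a later level can appear before all entries of the earlier level.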

SCR cannot simply push that work to the background and allow Bundle.start to exit while it is still doing work.  This would enable the framework to blast through all the start-levels and activate all the bundles while the SCR worker threads could still be processing events from start-level 1. It would need to "join" with the activating thread that published the event after it has done all the processing of the event in parallel threads.  This really would only make sense to do for the activation of immediate components.  Components that are non-immediate should only get activated upon their first get from the service registry (lazily). This has to happen synchronously with the BundleContext.getService call.

Alex, is this what you are suggesting SCR should do?  As each immediate component gets enabled, queue its activation for parallel work.  Once it is determined that all the immediate components have been queued for activation, wait for them to complete before the SCR service listener implementation returns control back to the framework.
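
The queue-then-join scheme outlined in this question could look roughly like the sketch below. It assumes a hypothetical `Component` descriptor and `activateImmediate` method, and it ignores the per-bundle service listener coordination that SCR would actually need.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class JoinBeforeReturnDemo {

    /** Hypothetical component descriptor: a name plus whether the component is immediate. */
    static class Component {
        final String name;
        final boolean immediate;

        Component(String name, boolean immediate) {
            this.name = name;
            this.immediate = immediate;
        }
    }

    /** Models a listener that activates immediate components in parallel, then joins. */
    static Set<String> activateImmediate(List<Component> satisfied, int poolSize) {
        Set<String> activated = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (Component component : satisfied) {
                if (component.immediate) {
                    // Queue the activation; delayed components are left for a later getService.
                    pending.add(CompletableFuture.runAsync(() -> activated.add(component.name), pool));
                }
            }
            // Join: control returns to the "framework" only after all queued activations finish.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture<?>[0])).join();
        } finally {
            pool.shutdown();
        }
        return activated;
    }

    public static void main(String[] args) {
        List<Component> components = List.of(
                new Component("B1", true),
                new Component("B2", false),
                new Component("B3", true));
        System.out.println("activated before returning: " + activateImmediate(components, 2));
    }
}
```

Because the method returns only after `join()`, the caller can still rely on all immediate components being active when the event handler exits, which is the property start-levels depend on.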

This may be possible, but it may be a large effort to get it right.  Especially given that each component bundle has its own service listener tracking services (implemented and registered by SCR on behalf of the component bundle). Analysis will be needed on the activate code for immediate components to determine how difficult it would be to coordinate that.

I think this corresponds to this code in SCR:

https://github.com/apache/felix-dev/blob/org.apache.felix.scr-2.1.26/scr/src/main/java/org/apache/felix/scr/impl/manager/AbstractComponentManager.java#L758-L787

Someone could investigate in a prototype there to see if this is worthwhile.
Comment 5 Thomas Watson CLA 2021-03-02 17:42:31 EST
(In reply to Thomas Watson from comment #4)
> Someone could investigate in a prototype there to see if this is worthwhile.

Maybe a good GSOC project?
Comment 6 Alex Blewitt CLA 2021-03-19 15:33:46 EDT
(In reply to Thomas Watson from comment #4)

There are really two sorts of issues here.

Firstly, when a service gets published, it triggers synchronous setting of that service with other bundles. Typically this involves a simple 'set' call with a property setter on another class, but it may trigger a class load which auto-triggers a bundle.start.

Secondly, once a component is available to start (i.e. it's immediate and it has all of its dependencies satisfied), it will be started serially. If publication of service A causes components B1, B2, B3 to start, then even if all of them are independent, they'll still start up in sequence.

On an embedded system with a single core, this makes sense; it's the most efficient way of doing it. However, on modern systems we have multi-cores and we could have B1/2/3 starting up in parallel. We have a parallel classloader, but it doesn't help if the things that are initiating class loading are launched in series :)

I've put up a demo workspace at https://github.com/alblue/ResourceHog/archive/refs/heads/ds-slow.zip which has a bundle that creates a service that triggers 5 (otherwise identical) components to start. You can see that the 'set' calls for the service are serialised, as are the activation calls.