Bug 232426 - OutOfMemoryError during Project Refresh due to cyclic symbolic links
Summary: OutOfMemoryError during Project Refresh due to cyclic symbolic links
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Resources (show other bugs)
Version: 3.3.1   Edit
Hardware: All Linux-GTK
: P3 critical (vote)
Target Milestone: 3.4.1   Edit
Assignee: Szymon Brandys CLA
QA Contact:
URL:
Whiteboard:
Keywords: helpwanted
: 185509 (view as bug list)
Depends on: 105554 185247
Blocks: 251159
  Show dependency tree
 
Reported: 2008-05-15 18:31 EDT by Martin Oberhuber CLA
Modified: 2008-10-16 19:16 EDT (History)
19 users (show)

See Also:


Attachments
Filesystem strucutre causing endless recursion in refresh (430 bytes, application/x-compressed)
2008-05-15 18:31 EDT, Martin Oberhuber CLA
no flags Details
Drawing showing the problematic symlink structure (33.57 KB, image/jpeg)
2008-05-15 18:38 EDT, Martin Oberhuber CLA
no flags Details
.project file with linked resources (450 bytes, text/plain)
2008-05-15 22:17 EDT, Matt McCutchen CLA
no flags Details
Picture to illustrate the comment (50.17 KB, image/png)
2008-05-27 10:40 EDT, Szymon Brandys CLA
no flags Details
Patch adding a Unittest to core.tests.resources (10.74 KB, patch)
2008-08-11 06:07 EDT, Martin Oberhuber CLA
Szymon.Brandys: iplog+
Details | Diff
Patch fixing the issue (6.55 KB, patch)
2008-08-12 17:14 EDT, Martin Oberhuber CLA
john.arthorne: iplog+
Details | Diff
Alternate patch (8.34 KB, patch)
2008-08-20 12:15 EDT, John Arthorne CLA
no flags Details | Diff
Patch adding a suppressLinkedDuplicates hidden Preference (5.08 KB, patch)
2008-10-14 07:43 EDT, Martin Oberhuber CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Oberhuber CLA 2008-05-15 18:31:36 EDT
Created attachment 100568 [details]
Filesystem strucutre causing endless recursion in refresh

+++ This bug was initially created as a clone of Bug #105554 +++

Build ID: Eclipse-3.3.1

A customer found a problem with creating an Eclipse Project on a file system structure that contains a particular structure of cyclic symbolic links, on Linux. The problem occurs in spite of the improvements made with bug 105554, and is 100% reproducable with Eclipse 3.3.1 as well as the latest Ecilpse 3.4 builds.

Attached is an extremely stripped-down version of the file system structure that produces the problem: just extract attached tgz archive, and create an eclipse project on it. On this very small example, Eclipse does not run into the OutOfMemoryError, but it takes a very long time to import the project with just 4 folders and 4 symlinks. If the example is made just a little bit larger (by adding some directories), the OutOfMemoryError will occur.

This example comes from an existing real-world project and poses a major limitation on Eclipse, because it's unusable on this file system structure and the file system structure cannot be changed because it's required by the build system.
Comment 1 Martin Oberhuber CLA 2008-05-15 18:38:44 EDT
Created attachment 100570 [details]
Drawing showing the problematic symlink structure

Attached drawing shows how the problematic structure of symbolic links in the file system leads to a "double-eight" kind of path through the file system. In this special case, the algorithm for detecting symbolic link cycles from bug 105554 fails.

But it turns out that there is a relatively simple solution for the problem: If the existing patch from bug 105554 attachment 65778 [details] is applied, the problem is solved. This patch still applies without merge issue on Eclipse 3.4 HEAD, and it adds a Preference that gets rid of duplicate resource system nodes which are introduced by symbolic links.

Unless we find a better solution for the problem, I would strongly request considering this patch (or a UI-less variant of it) in order to make sure that products can tune the Resource System such as to ignore duplicate resources brought in by symbolic links. The patch is proven to work on this particular example, our customer is happy with the patch, and the patch is guaranteed to affect performance only by making Eclipse faster.

The patch's only drawback is that it introduces a change of behavior of Eclipse in the case where a symbolic link brings in a file system node as a duplicate, e.g.

  foo
    bar1 --> ../bar
    bar2 --> ../bar

In this example, with the Preference from the patch switched ON, node bar2 will be ignored because it just brings in a duplicate of file system node ../bar.

As an alternative to the patch, we might try finding an improved algorithm for detecting symbolic link cycles that can deal with the concrete problematic case; but that's a lot of effort, and might be a lot more risky than adding the concrete, proven Preference from the patch.
Comment 2 Martin Oberhuber CLA 2008-05-15 18:49:06 EDT
CQ:WIND00121843
Comment 3 Matt McCutchen CLA 2008-05-15 22:13:09 EDT
The diagram does not match the tarball: the "openssh" symlinks should point to the "openssh" directory, not its children.
Comment 4 Matt McCutchen CLA 2008-05-15 22:17:26 EDT
Created attachment 100582 [details]
.project file with linked resources

One workaround is to mask the problematic resource subtrees with linked resources that point to an empty (or nonexistent) directory.  For example, if I replace the .project from the tarball with the attached one, I can import just fine.
Comment 5 Martin Oberhuber CLA 2008-05-16 03:56:03 EDT
Thanks for your ideas Matt, but your workaround is not acceptable for our customers because the problem appears already when creating the project (New > Project > Use default location:NO, specify location of problematic file tree).

We do not have the possibility here to override the .project file with any linked resources to mask the symlinks, since the .project file is just being created. One needs to be a deep Eclipse insider to apply this workaround, and our users are not. I'm actually marking this issue as Critical now, since a non-expert user may easily fall into this issue doing a normal operation, which makes the entire Workbench crash eventually.

You are right that the "openssl" links just point to ../openssl; but the drawing is correct, since it depicts exactly these links from the left and right children upwards to the middle opensl folder. From there, the normal subdirectory walks down into the child level again. The problem here is, that the Eclipse Resource Tree walks the left "lib" and the right "ssl" subfolders in an alternating manner.

Here is the full TAR listing from the problematic testcase for easy review:

$ tar tfvz ssh5.tgz
drwxrwxr-x mober/users       0 2008-05-06 21:12 ssh5/
drwxrwxr-x mober/users       0 2008-05-06 21:10 ssh5/openssl/
lrwxrwxrwx mober/users       0 2008-05-06 21:09 ssh5/openssl/ssh -> ../lib
lrwxrwxrwx mober/users       0 2008-05-06 21:10 ssh5/openssl/ssl -> ../ssl
-rw-rw-r-- mober/users     213 2008-05-06 21:12 ssh5/.project
drwxrwxr-x mober/users       0 2008-05-06 21:09 ssh5/lib/
lrwxrwxrwx mober/users       0 2008-05-06 21:09 ssh5/lib/openssl -> ../openssl
drwxrwxr-x mober/users       0 2008-05-06 21:10 ssh5/ssl/
lrwxrwxrwx mober/users       0 2008-05-06 21:10 ssh5/ssl/openssl -> ../openssl
Comment 6 Philipe Mulet CLA 2008-05-16 04:39:55 EDT
Is there anything planned for 3.4 ?
Comment 7 Szymon Brandys CLA 2008-05-16 08:45:27 EDT
(In reply to comment #6)
> Is there anything planned for 3.4 ?

I'm on it. I will mark it for 3.4, with the proviso that it can be postponed to 3.4.1.


Comment 8 Szymon Brandys CLA 2008-05-27 10:40:38 EDT
Created attachment 102156 [details]
Picture to illustrate the comment
Comment 9 Szymon Brandys CLA 2008-05-27 10:49:22 EDT
(In reply to comment #0)
> Attached is an extremely stripped-down version of the file system structure
> that produces the problem: just extract attached tgz archive, and create an
> eclipse project on it. On this very small example, Eclipse does not run into
> the OutOfMemoryError, but it takes a very long time to import the project with
> just 4 folders and 4 symlinks. If the example is made just a little bit larger
> (by adding some directories), the OutOfMemoryError will occur.

I noticed that this is rather problem with import, not refresh. 

See Nortel_symlinks_ssh5 project on the attached picture. Import creates infinitely nested structure and this hangs Eclipse.

If we just copy the file system structure from tgz, create a project using the structure and refresh the project, it will do it quite fast. However the structure in the explorer doesn't reflect the structure underneath.
Comment 10 Martin Oberhuber CLA 2008-05-27 13:24:27 EDT
(In reply to comment #9)

It is a refresh issue: the point here is that our customer needs to work on a "live" version of their source code, which does happen to have the unfortunate recursive symbolic link structure. 

They do need to have their "live" project in the Eclipse Resource tree. Once it is in the resource tree, the Refresh operation (UnifiedTree algorithm) fails. Copying stuff is not an option for them, since the Copy operation gets rid of symbolic links and thus makes their build system (which is outside Eclipse) break.

> If we just copy the file system structure from tgz, create a project using the
> structure and refresh the project, it will do it quite fast.

That's no surprise since it seems that the "Copy" operation converts symbolic links into folders, thus breaking the recursive structure. As said before, this is not an option for our customer since they need to work on the real file system structure.

Comment 11 Matt McCutchen CLA 2008-05-27 13:34:09 EDT
(In reply to comment #10)

No, if I understand Szymon correctly, he is talking about creating an empty project and then extracting the tarball into it, which preserves the symlinks.  I can reproduce the behavior from comment #8 with Eclipse 3.4M7: if I import "Existing Projects Into Workspace" or create a "New Project" on an already-extracted directory, then Eclipse recurses, but if I extract the structure (including symlinks) into an already-created project, then Eclipse correctly breaks the recursion.  In each case Eclipse is performing a refresh, but for some reason one refresh recurses and the other doesn't; we should try to figure out why.
Comment 12 Martin Oberhuber CLA 2008-05-27 14:17:38 EDT
Hm... that's interesting. Anyways, based on the reports that I've got I think that my attached 4-folder/4-symlink example is just the top of the iceberg. 

It's likely by chance that Refresh works on this mini-example but fails due to the same reasons as Import as soon as either (a) the structures are slightly larger, or (b) the symbolic links point from under the project root to some outside location.

Thinking about this again, and about the way my original algorithm from bug 105554 works, it's likely that the Import just fails because the entire structure is still outside any project root; this would mean that the refresh issue can be triggered from below a project root easily, just by sym-linking the "outside" structure into the project e.g.

project
  foo -> /outside/symlinks_ssh_5

I hope to have time later this week to write a unittest and debug the issue. But even if we should be able to come up with an improved algorithm that fixes this particular issue, I'd be in favor of adding a non-UI Preference for the "Suppress Duplicate Resources" switch from attachment 65778 [details] since this example could just be the tip of an iceberg and other similar problems might arise later. The "Suppress Duplicate Resources" switch safely allows ISV's to configure their products as needed.
Comment 13 Matt McCutchen CLA 2008-05-27 15:25:06 EDT
(In reply to comment #12)
> It's likely by chance that Refresh works on this mini-example but fails due to
> the same reasons as Import as soon as either (a) the structures are slightly
> larger, or (b) the symbolic links point from under the project root to some
> outside location.

Or the difference might have to do with the refresh starting in a different place, as you mentioned in bug 105554 comment 41.

> I hope to have time later this week to write a unittest and debug the issue.
> But even if we should be able to come up with an improved algorithm that fixes
> this particular issue, I'd be in favor of adding a non-UI Preference for the
> "Suppress Duplicate Resources" switch from attachment 65778 [details] since this example
> could just be the tip of an iceberg and other similar problems might arise
> later. The "Suppress Duplicate Resources" switch safely allows ISV's to
> configure their products as needed.

An option to suppress duplicates would be helpful, but as you pointed out yourself in attachment 65205 [details], suppressing duplicates may break projects that expect resources to exist in certain places, e.g., for include paths.  IMHO, it would be better to improve the algorithm so it breaks more symlink cycles without having to suppress duplicates.
Comment 14 Martin Oberhuber CLA 2008-05-27 15:47:17 EDT
(In reply to comment #13)

> IMHO, it would be better to improve the algorithm so it breaks more symlink
> cycles without having to suppress duplicates.

Agreed, but I'd still like to have a non-UI Preference as a Safeguard such that Solution Providers can help themselves if needed. In our case, having the "suppressDuplicates" workaround available from last year saved us LOTS of costs engaging with the customer, since we could provide a (probably temporary) solution quickly.

Comment 15 Szymon Brandys CLA 2008-05-29 06:50:33 EDT
It doesn't look like a fix for 3.4. We have only one day to RC3 and I don't think that we have something stable to release.

I have to move it to 3.4.1.





Comment 16 Martin Oberhuber CLA 2008-08-08 12:05:56 EDT
Is this still realistic for 3.4.1? Is anybody working on this? - 
I'll be trying to add a unit test for this specific case, to get started.
Comment 17 Martin Oberhuber CLA 2008-08-11 06:07:25 EDT
Created attachment 109640 [details]
Patch adding a Unittest to core.tests.resources

Attached patch adds a unittest for this issue ("SymlinkResourceTest" in bundle
org.eclipse.core.tests.resources). The patch also pushes up a method for symbolic link creation into class CoreTest in org.eclipse.core.tests.harness.

It turns out that the problem only appears with BACKGROUND_REFRESH in

   project.open(IResource.BACKGROUND_REFRESH, getMonitor());

While the test is running as a PDE Unittest, one can nicely watch how the Refresh in background counts up to 890 resources due to the cyclic symbolic link structure. Since the test considerably slows down overall test execution, I have not yet added it to
   org.eclipse.core.tests.internal.localstore.AllTests

I'm now going to try and fix the issue until the test completes. Please let me know if anybody else has been working on this already.
Comment 18 Martin Oberhuber CLA 2008-08-11 06:41:55 EDT
The problem is, as John has mentioned in bug 105554 comment 53 already, that the RefreshJob which performs refreshes in the background, refreshes 2 levels of the tree only when it starts (and then adapts to varying number of levels, depending on the time how long the RefreshJob runs) -- see RefreshJob.java line 141:

   toRefresh.refreshLocal(1000 + depth, null);

So, in the initial run, the Refresh Job finds the 4 symbolic links at depth 2 of the tree and forks off 4 more refresh jobs, each of which start at a symbolic link (which should be detected as recursive already).

As John has mentioned, the only possible fix for this seems to be persisting the PrefixTable in UnifiedTree, which is used to detect symbolic links between invocations of the UnifiedTree#accept() method. 

I'm not exactly sure, though when we would need to throw away the Prefix table and create a new one:
  (1) Since the UnifiedTree which is used to walk over the resources is
      re-created below the individual refreshLocal() requests, we'll have to
      persist the Prefix table over multiple instances of UnifiedTree. 
  (2) We'll need to throw away the table when actual file system or resource 
      tree modifications have been made, so the table is probably not current
      any more.

Perhaps it would work to assume the table to always be bound to a single project at a time only, and register a Resource Change Listener to throw it away when any modifications occurs to that project (might be OK since the RefreshJob runs as a WorkspaceJob, so no modifications can occur during one refresh).

In order to account for file system changes which have not yet been picked up by the resource tree, we might also need to age the prefix table and throw it away after some given time has elapsed.

Given that the main reason for persisting the table is to keep it living while the RefreshJob is going on (which is a WorkspaceJob), perhaps the best solution is this:
  1. At the time the PrefixTable is created initially, also create a 
     WorkspaceJob to delete it.
  2. While current WorkspaceJob is still going on, fill the table (repeatedly).
  3. The "PrefixKiller" WorkspaceJob will be allowed to run only after the
     RefreshJob has completed, and it is guaranteed to always run.
So, the PrefixTable will always be deleted in time, but only after Refreshes
are complete.

Please comment on this proposal before I try to start implementing it.
Comment 19 Martin Oberhuber CLA 2008-08-12 17:14:08 EDT
Created attachment 109837 [details]
Patch fixing the issue

Attached patch fixes the issue. It works as follows:

* No changes if no symbolic links are involved.
* If symbolic links are involved, then at the time the pathPrefixHistory is 
  lazily created, the algorighm now checks whether it runs in context of the 
  RefreshJob:
  Job.getJobManager().currentJob().belongsTo(ResourcesPlugin.FAMILY_AUTO_REFRESH)
* If no, then the local pathPrefixHistory works exactly the same as before.
* If yes, then the pathPrefixHistory gets access to a single shared prefix
  history (guarded by a mutex). This shared prefix history is globally unique
  and remains persistent across invocations.
* A JobChangeLIstener is registered with the RefreshJob to clear the shared 
  history when the RefreshJob is done.
* Mutex guard to the shared prefix history is released in a ...finally...
  clause in UnifiedTree#accept().

This fix makes the Unittest run OK, and seems safe because only the behavior in the problematic case is ever changed. I also ran all org.eclipse.core.tests.resources.AutomatedTests and got a green result from all 909 of them.

On second thought the ...synchronized... Mutex for protecting the shared prefix history is probably not necessary, because the shared prefix history can always be accessed from a single Thread (the single RefreshJob) only. This holds true as long as there can ever only be one RefreshJob at any one time. The Guardians of the Workspace Team might want to decide whether they want to keep the Mutex in there.

Legal Message: I, Martin Oberhuber, declare that I developed attached code from scratch, without referencing any 3rd party materials except material licensed under the EPL. {I am authorized by my employer to make this contribution under the EPL.}

We would appreciate seeing this patch applied to 3.4.1.
Comment 20 John Arthorne CLA 2008-08-20 12:15:59 EDT
Created attachment 110464 [details]
Alternate patch

The approach of storing the shared prefix caches on UnifiedTree is a bit awkward since the lifecycle of these caches is really tied to the refresh job. This alternative places the caches directly on the refresh job and lets the refresh job take care of managing it. This avoids locking and contention issues, and overall seems cleaner and simpler. Note I haven't tested this as I'm currently running on an XP machine. Martin what do you think?
Comment 21 Szymon Brandys CLA 2008-08-20 12:34:17 EDT
It works on Linux.
Comment 22 Szymon Brandys CLA 2008-08-22 04:04:59 EDT
Martin's patch for the tests and John's alternate patch, both slightly modified released to R3_4_maintenance and HEAD.
Comment 23 John Arthorne CLA 2008-08-25 11:56:15 EDT
Comment on attachment 109837 [details]
Patch fixing the issue

Adding the iplog flag here to acknowledge Martin's contribution. My patch was released but it was an evolution of this patch.
Comment 24 Martin Oberhuber CLA 2008-08-25 13:04:50 EDT
Thanks for reviewing, improving and applying the patch.

The alternate patch does improve lifecycle management and gives the theoretical possibility to have multiple RefreshJobs run in parallel (on disjoint parts of the workspace). It seems a bit odd in terms of dependency, though, to expose the PrefixPool (which is an implementation detail in UnifiedTree) into the RefreshJob.

I think that in an ideal world, there would be a method 

  ITreeState UnifiedTree#accept(IUnifiedTreeVisitor visitor, 
         int depth, ITreeState previousState)

for preserving state information right in the caller in case multiple consecutive calls are desired to perform a single operation. The state object would get disposed automatically when the caller no longer needs it. Since the ITreeState should be opaque to the caller, an Object could also be used instead of an (empty) ITreeState markup interface.

Semantics of this could actually be extended to other Workspace related operations: If an operation is to be limited to some depth only, then state information could optionally be carried over to subsequent invocations.

Anyways, if you are happy with the patch, I'm too. I'm going to test it now on some real-world scenarios which are more complex than the test cases we've worked on so far.
Comment 25 Martin Oberhuber CLA 2008-08-27 05:43:05 EDT
I tested this out of the Workspace on the most complex file tree structures I have (mostly from attachment 65204 [details]), and didn't find any excessive need of CPU time or memory. All structures could be created as project, and imported from either file system or archive.

Although still not perfect (see bug 185509, and bug 105554 comment 41 for specific known shortcomings) I think that this is likely the best we can do for now. Further improvements will likley need to make symbolic links first class citizens in the Resource model, and algorithms specifically designed for them. We are going to look at this in the context of http://wiki.eclipse.org/E4/Resources (and bug 209190 in the shorter run).

I'll mark this verified when I get hold of an actually released build with the fix.
Comment 26 Martin Oberhuber CLA 2008-08-27 05:51:32 EDT
*** Bug 185509 has been marked as a duplicate of this bug. ***
Comment 27 Martin Oberhuber CLA 2008-10-14 07:43:03 EDT
Created attachment 115023 [details]
Patch adding a suppressLinkedDuplicates hidden Preference

It turns out that the new patch committed to Eclipse 3.4.1 is not optimal in some cases -- in extreme cases, it takes too long until the symbolic link cycle is detected.

There was a discussion in bug 105554 comment 55 and onwards about an option of the old patch, which would allow suppressing symbolic links completely as soon as a symbolic link doesn't introduce any new resources. This option creates behavior which is not entirely backward compatible, so it should only be enabled on demand (projects with extremely nested networks of symbolic links).

Attached patch adds a Hidden Preference, by which Eclipse experts can modify the plugin_customization.ini in an Eclipse-based product to enable this behavior if desired.
Comment 28 Martin Oberhuber CLA 2008-10-16 19:16:19 EDT
Bug 251159 now follows up with the work from here.