Bug 233939 - Cannot find file using IWorkspaceRoot.findFilesForLocation()
Summary: Cannot find file using IWorkspaceRoot.findFilesForLocation()
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Resources (show other bugs)
Version: 3.4   Edit
Hardware: PC Linux
: P3 normal (vote)
Target Milestone: 3.6 M6   Edit
Assignee: Martin Oberhuber CLA
QA Contact:
URL:
Whiteboard:
Keywords: helpwanted
: 276787 (view as bug list)
Depends on:
Blocks: 211160 244434 260516
  Show dependency tree
 
Reported: 2008-05-26 06:58 EDT by Markus Schorn CLA
Modified: 2010-02-11 03:51 EST (History)
15 users (show)

See Also:


Attachments
Test v01 (3.44 KB, patch)
2008-05-27 09:04 EDT, Szymon Brandys CLA
no flags Details | Diff
diffs between current 3.4 version and changed version (4.20 KB, text/plain)
2008-07-03 10:51 EDT, John Pruitt CLA
no flags Details
zip file with new source and diffs (20.87 KB, application/octet-stream)
2008-07-03 11:03 EDT, John Pruitt CLA
no flags Details
Complete patch including regression test (6.24 KB, patch)
2010-02-03 04:07 EST, Martin Oberhuber CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Markus Schorn CLA 2008-05-26 06:58:47 EDT
Steps to reproduce:
1) create a project '<wsroot>/prj'.
2) create a file outside the project '/tmp/file.txt
3) create a symbolic link '<wsroot>/prj/file.txt -> /tmp/file.txt'

Now !!both!!
   wsRoot.findFilesForLocation("<wsroot>/prj/file.txt") and
   wsRoot.findFilesForLocation("/tmp/file.txt") 
return an empty array.
Comment 1 Martin Oberhuber CLA 2008-05-26 07:04:26 EDT
Bug 198291 seems related

To me, it seems odd that in FileSystemResourceManager#allPathsForLocation() an Util.canonicalURI() is called right away. It looks like it might make sense to first try and find a prefix for the location in the list of non-canonicalized project roots, and only if that fails, try to canonicalize the URI.

Writing a unit test for this issue should be quite simple.
Comment 2 Szymon Brandys CLA 2008-05-27 09:04:15 EDT
Created attachment 102141 [details]
Test v01
Comment 3 John Pruitt CLA 2008-06-27 10:49:36 EDT
I tried something like Martin's suggestion by changing allPathsForLocation. I took a slightly different approach than Martin's but it is the same basic idea. I assumed there was a good reason for the call to canonicalURI and let it try that path first. If that did not produce any results then I tried to relativize the inputLocation. This seems to work for the case I was interested in but I am not sure if there are any side-effects.

After some more testing I will try posting the changes made.
Comment 4 John Arthorne CLA 2008-06-27 11:17:46 EDT
The reason for using a canonical URI was mainly to handle case differences on Windows and mac (c:\tmp vs. c:\TMP). It also handles the case of a sym-link pointing into the workspace (as opposed to this case of a sym-link pointing out of the workspace). When projects and linked resources are created, their locations are immediately converted to canonical form. So, comparison using the canonical form usually makes sense.  

The approach of falling back to the non-canonical parameter location sounds reasonable. Of course it should short-circuit in the case where the canonical form is the same as the non-canonical form, to avoid doing the same search twice.
Comment 5 Gerhard Schaber CLA 2008-06-27 14:55:44 EDT
In the proposed solution, what would happen, if a folder is linked into projects with different paths?

Sample folder structure:
<wsroot>
   projectA
      folderA -> c:/temp
   projectB
      folderB -> c:/temp

Does it always return both resources (folderA and folderB), no matter if I lookup path <wsroot>/projectA, <wsroot>/projectB, or c:/temp?
Comment 6 John Pruitt CLA 2008-07-03 10:51:58 EDT
Created attachment 106447 [details]
diffs between current 3.4 version and changed version

diffs between current 3.4 version of org.eclipse.core.resources/src/org/eclipse/core/internal/localstore/FileSystemResourceManager.java and a proposed changed version.
Comment 7 John Pruitt CLA 2008-07-03 11:03:42 EDT
Created attachment 106449 [details]
zip file with new source and diffs

zip file containing:
FileSystemResourceManager_3.3.2.java - changed file for Eclipse 3.3.2 - tested in 3.3.2
FileSystemResourceManager.java - proposed file for Eclipse 3.4 - untested
FileSystemResourceManager.diffs - diffs between the 3.4 proposed version and 3.4 base
Comment 8 John Pruitt CLA 2008-07-03 11:06:42 EDT
I added a zip file containing changes to the org/eclipse/core/internal/localstore/FileSystemResourceManager.java file. It has been tested in Eclipse 3.3.2 but not Eclipse 3.4. With regard to comment #5, I believe that in all cases, it will return both resources. It tries the canonical path first and that should be found in both projects.
Comment 9 Martin Oberhuber CLA 2008-10-16 12:36:46 EDT
The proposed patch addresses the very important case where a file originally referenced with its resource path cannot be tracked back from its location. We might consider applying this patch since it doesn't cost much and does give some benefit.

I have some ideas for a more complete solution though. It could basically go like this:

Right now, each project is assumed to span a (canonical) file system tree below its root. This assumption is not correct in the case where symbolic links point into other trees from below that root: in this case, there are some discontinuities in the file system tree being spanned.

I think what we need is a data structure, that tracks all those (canonical) file system roots being spanned. Creating that data structure is relatively easy: whenever we encounter a symbolic link when traversing the file system, we need to check whether its link target (canonical path) spans a new tree, and add it to the data structure. From that data structure, we need a reference back to the folder spanning the tree.

Once we have such a data structure of (canonical) file system roots being spanned, we can easily take the canonical path of any location, and check under which file system root(s) it is found. Tracking back from the roots into the projects/symlinks spanning those roots, we can compute all the resource paths which produce that location.

I think that the problem (and hence the proposed data structure) is likely similar to what the Eclipse AliasManager already does for linked resources, so that work for enhancement might go under bug 198291.

It's interesting that a really correct implementation of findFilesForLocation() can also fix the problem of cyclic symbolic links easily (which is still not perfectly solved (see bug 232426 comment 25, bug 185509, and bug 105554 comment 41): just take the result of findFilesForLocation() for a symbolic link target, and if the result contains an entry that is an ancestor of the current resource, you have a cycle (i.e. its resource path is a prefix of the current resource path).

I still need to sort out some of the details, like life cycle (what happens in case the file system changes). I'm currently thinking that we should do all computations in terms of the "snapshot" of the file system that's stored in the Workspace Tree (ResourceProxy). If we do that, then a prerequisite for making such an algorithm work, is that we store additional data with the ResourceProxy nodes (potentially using subclasses of ResourceProxy, for memory efficiency):
  * type of resource -- extend to "symlink to file", "symlink to folder"
  * canonical path of link target -- only in case it's a symbolic link
Notably, storing the canonical path of the link target in ResourceProxy (which gets persisted AFAIK) makes the workspace not movable to a different machine.

I'm willing to invest in that work (assigning the bug to me), and I'm posting the proposed strategy here for comments. Any thoughts?
Comment 10 John Arthorne CLA 2008-10-16 17:53:19 EDT
comment #0 lists two paths:

 wsRoot.findFilesForLocation("<wsroot>/prj/file.txt") and
 wsRoot.findFilesForLocation("/tmp/file.txt")

Which of these do we expect to return a non-empty list? It seems like a bug to me that we don't handle the first one, but I can see us returning an empty array for the second one. The file system entity we care about is at "<wsroot>/prj/file.txt" the fact that today it points to /tmp/file.txt feels like a file system level implementation detail. The problem with trying to solve this second one is that it requires keeping up to date when symlinks change their target, which I don't think we can do efficiently.
Comment 11 Martin Oberhuber CLA 2008-10-16 18:13:18 EDT
(In reply to comment #10)
The patch currently attched would fix the first one, but not the 2nd one.

The more advanced solution that I have in mind would also fix the 2nd one. The problem here is, that although it's a file system detail that the file physically resides at /tmp/file.txt today, we can end up seeing that path: for instance when a compiler builds an EXE from that file and embeds its location as debug info. Debugging the EXE, we get presented /tmp/file.txt and would like to find the resource linked to that entity.

> The problem with trying to solve this second one is that it requires keeping
> up to date when symlinks change their target

That's basically right, but my approach would be that changed symlink targets would get updated during a workspace refresh only. That is, the symlink target would be stored as part of persistent IResource state, such that we can detect changes of link targets during refresh, and act as needed This could arguably lead to inconsistencies in our model of the file system, but that's expected with resources not in-sync with the file system.

The good point of using canonical path for this (rather than trying to follow link targets and resolving paths ourselves) is that we only need to care about those IResource nodes that happen to be a symbolic link, and read their canonical path. What we get is a mapping between our contiguous model of the file system, and the reality which may not be contiguous.

Even if a full add/remove lifecycle for symlink targets should end up being too difficult in the data structure (which I don't think would be the case), we could always re-generate the data structure from scratch (like during a full refresh). If we end up not persisting the link targets (which might be an option), we could still persist the flag what IResource nodes are symbolic links -- and then, re-read the canonical path of those nodes during workspace startup / refresh.
Comment 12 John Arthorne CLA 2008-10-20 09:50:50 EDT
Ok. It would definitely be interesting if it also addressed bug 198291 at the same time. One implementation note: AliasManager doesn't store any state across sessions - it is rebuilt at the start of each session. I could imagine doing the same here, just storing a bit in the ResourceInfo that the resource is a symlink, and then rebuilding the location map each session (and on a full refresh). This has a nice self-correcting attribute that corruption or staleness in the location data can be cleaned up by restarting or refreshing.

Two other interesting optimizations in AliasManager - it is optimized for the case of a workspace with no links, and for projects with no links, so that users without links don't pay any cost. The nonDefaultResourceCount and aliasedProjects fields are used to optimize these cases. I think it's important to have these kinds of optimizations for symlinks too, so that a workspace/project with no symlinks pays minimal extra cost.
Comment 13 Martin Oberhuber CLA 2008-10-20 10:19:51 EDT
Thanks John. I did have bug 198291 in mind, and I also thought about the other two points already (not storing state across sessions, and optimizing for the non-(sym)link case.

I don't think, however, that one bit in the ResourceInfo is sufficient. If we store just one bit in the ResourceInfo, then on startup, we'll need to walk over all nodes marked as "file system alias" and read the link target (canonical path) from the file system.

This has two disadvantages: (1) performance degradation on startup, and (2) the risk of inconsistencies, because the 1-bit-info (symlink or not) would be as old as the last refresh operation whereas the link target (canonical path) would be new at the time it is read. I think it is better if we keep the system consistent such that the 1-bit is read and stored at the same time as the link target.

This means that for file system aliases (aka symbolic links) we'll not only need to store and persist the 1 bit in ResourceInfo, but also the link target (canonical path). The internal state of our file system information would always be a snapshot taken at the time the last refreshLocal was made.

Apparently, adding this information into the workspace tree (ResourceInfo) is the first thing that needs to be done, so I'd like to get consensus on this.
Comment 14 John Arthorne CLA 2008-10-20 16:09:01 EDT
Good point about the synchronization issue if we didn't write out the complete link target. It would make the solution a bit more complex, but it's certainly possible to write out this information. At least there is still the option to do a full workspace refresh if this data became out of date.

If the data was written directly in the .tree file along with the rest of ResourceInfo, it would require the file format version to be incremented, making it impossible to "go back" to an older build or release once written in the new format. We've done this before, but not since before the 1.0 release (See WorkspaceTreeReader and subclasses).

Another option is to write the data as a persistent property on the IResource. This has the advantage of not altering the workspace tree file format.
Comment 15 Martin Oberhuber CLA 2008-10-20 17:51:08 EDT
Storing as a persistent Resource Attribute sounds like a good idea, if unpersisting all of them can be done with good performance. Otherwise, a single separate file in addition to the .tree file (e.g. .tree2) just for the link targets might be an option.

One thing to note about persisting the link targets is that it may make the Workspace unmovable to a different location (when we persist absolute canonical paths) -- this can already happen when opening an NFS shared workspace on a different machine.

I'm not exactly sure how to address this. Some options:

1- Persist link targets as relative path when the original link was specified
   as a relative path? -- might be problematic with chains of symbolic links
   (a -> ../b -> /foo/bar)
2- Like (1), but keep an exact model of the FS in memory and persisted: remember
   all nodes in a chain of symbolic links, some might be relative others
   absolute. On restart, check existence of each node, or perhaps only those
   nodes that contain link targets specified as absolute path
3- Persist link targets as absolute canonical path, and check their existence
   when re-opening persisted data on startup. If not existent, mark resource
   as dirty (not-in-sync with the file system) or do a forced re-read from the
   file system.
4- Relativize all link targets that are inside the workspace
5- Dont persist link targets after all, but re-read them from the file system
   on startup. Might pose the inconsistency problem mentioned before.

My personal feeling is that we should persist link targets; this is in line with the general way how the workspace works, especially for slow remote locations we have a cached local copy of file stat. I think we should model the file system as accurately as we can, so I think I'm in favor of (2). Doing this might mean that we'll want to try and compute a canonical path ourselves at times.

Is the matter of having the workspace "portable" to a different root/location even an issue? What do others think? I think that I've come across this issue before, related to e.g. ClearCase where users would like to re-use the same workspace on /view/view1/vobs/x and /view/view2/vobs/x.
Comment 16 John Arthorne CLA 2008-10-23 15:51:12 EDT
Workspace portability is important. In disk storage we typically deal with project relative paths where possible. This enables workspace move, and also moving of projects without having to rewrite files on disk.

Having said that, this case is a bit different, because we are getting back an absolute path from File.getCanonicalPath() and we don't even know the relative path the symlink is using. The best I can think of is to convert this canonical path to a project-relative path where possible. If not below the project, persist an absolute path.  My fear though is that this will not handle all moves properly, since we would be persisting a different relative path, relative to a different base location, from the symlink itself.
Comment 17 John Arthorne CLA 2009-05-18 22:36:45 EDT
*** Bug 276787 has been marked as a duplicate of this bug. ***
Comment 18 Martin Oberhuber CLA 2010-02-03 04:07:24 EST
Created attachment 158010 [details]
Complete patch including regression test

Attached patch fixes the original issue case 1 from the description. This is a much simplified code for the algorithm proposed and tested by John Pruitt in comment 3, and approved by John A in comment 4 and comment 10.

Rather than waiting for a more general solution which would also fix case 2, I'm going to commit this since it satisfies the submitter's original needs as well as Szymon's regression test. I slightly modified the regression test to re-use existing createSymbolicLink() and avoid running the test when symbolic links are not supported.

For fixing case 2, it's going to be necessary to track all symbolic links in the workspace in some data structure and keep them up-to-date. This matches the description of bug 198291, so I'm going to track it there.

Two more special cases should be covered by the Unittest:
  (a) 2 symbolic links to the same location should return 2 files (comment 5)
  (b) find files when the project location itself is on a symbolic link, i.e.
      the canonical path of the project location != the original path of it.
Comment 19 Martin Oberhuber CLA 2010-02-04 12:49:25 EST
(In reply to comment #18)
I just committed improvements to Szymon's regression test, to ensure that the test cleans up its temporary files when done. I also added a testcase for
comment 5, but found out that this also needs to be addressed by bug 198291, so that additional test is currently disabled.

I'm marking this FIXED since the original problem is fixed, and I do not have time now to write the additional test case for

>   (b) find files when the project location itself is on a symbolic link, i.e.
>       the canonical path of the project location != the original path of it.

If that turns out being an issue, then it'll need to be addressed by storing the project location in both its canonical form as well as its originally specified form. But that can also be tracked with a separate bug if the issue arises.