Bug 301563 - Fast project import from snapshot data
Summary: Fast project import from snapshot data
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Resources (show other bugs)
Version: 3.5   Edit
Hardware: All All
: P3 enhancement (vote)
Target Milestone: 3.6 M6   Edit
Assignee: Martin Oberhuber CLA
QA Contact:
URL:
Whiteboard:
Keywords: api
Depends on:
Blocks: 263671 309218 303751 305716 305718 306573 306575
  Show dependency tree
 
Reported: 2010-02-02 13:04 EST by Martin Oberhuber CLA
Modified: 2019-08-19 08:56 EDT (History)
10 users (show)

See Also:


Attachments
Patch to org.eclipse.core.resources plugin (12.02 KB, patch)
2010-02-15 17:14 EST, Francis Lynch CLA
no flags Details | Diff
Patch to org.eclipse.core.tests.resources plugin (5.91 KB, patch)
2010-02-15 17:15 EST, Francis Lynch CLA
no flags Details | Diff
Patch to org.eclipse.ui.ide plugin (21.92 KB, patch)
2010-02-18 01:14 EST, Francis Lynch CLA
no flags Details | Diff
Patch v2 to org.eclipse.core.resources plugin (12.15 KB, patch)
2010-02-20 18:09 EST, Francis Lynch CLA
no flags Details | Diff
Patch v3 to org.eclipse.core.resources plugin (2.29 KB, patch)
2010-02-26 12:39 EST, Francis Lynch CLA
no flags Details | Diff
Patch v2 to org.eclipse.core.tests.resources plugin (6.36 KB, patch)
2010-02-26 12:41 EST, Francis Lynch CLA
no flags Details | Diff
Patch v4 to org.eclipse.core.resources plugin (19.98 KB, patch)
2010-03-01 08:58 EST, Francis Lynch CLA
no flags Details | Diff
Patch v3 to org.eclipse.core.tests.resources plugin (12.74 KB, patch)
2010-03-02 11:44 EST, Francis Lynch CLA
no flags Details | Diff
Patch v5 to org.clipse.core.resources plugin (22.16 KB, patch)
2010-03-04 01:12 EST, Francis Lynch CLA
mober.at+eclipse: iplog+
Details | Diff
Amendments to core.resources patch v5 (11.10 KB, patch)
2010-03-04 18:30 EST, Martin Oberhuber CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Oberhuber CLA 2010-02-02 13:04:25 EST
+++ This bug was initially created as a clone of Bug #263671 +++

This bug is a subset of the proposal on bug 263671. We intend to be contributing an implementation for Eclipse 3.6, pending committer's 
general acceptance of the proposal below.

Overview:
---------
Today, importing a project into a workspace forces a deep project refresh, which may be very slow depending on the file systems being used (especially if there are some remote resources linked in through EFS).

For very large projects with content that doesn't change often, it should be possible to create the refresh snapshot once and then share this snapshot data among users. This should allow fast project import without forcing the refresh - basically importing a project database. Conceptually, this makes the "import-project" operation work the same as an "open-project" operation on a workspace where the project had been open before and the snapshot data is therefore available.


Background / User Story:
------------------------
As a developer working on a large CDT project in a ClearCase dynamic view, I want to quickly import the project into my workspace without waiting for Workspace Refresh, such that I can start using a team shared CDT index and editing files immediately.

Today, my project consist of around 65000 files, and just doing the project refresh which is compulsory on initial project import takes around 10 minutes during which I cannot use Eclipse. I have some build machines which re-create a CDT index every night, so when I import the project on some branch the next morning, I expect to be working immediately (even if the stored index data may not be 100% accurate).


Implementation Proposal:
------------------------
1. Create a new UI Wizard for "Export > Team > Project Snapshot" which is 
   capable of creating a snapshot file from current resource state. Code from
   the attachments on bug 263671 can be used for this. Potentially, allow
   additional contributors (like CDT shared index) to take part in creating
   the snapshot.
2. ONLY when importing a project into a workspace, AND the snapshot file is 
   found in a known (but configurable) location, load the snapshot file to
   initialize refresh data rather than performing a project refresh.

Since the snapshot data is only loaded on project import, there is no issue of merge conflicts. Since there is no UI or option for loading or not loading the file, this happens transparently to any clients above core/resources. For clients, it appears as if all the resources get refreshed/imported extremely quickly.

Like for the CDT team-shared index, the default index location would be
   .settings/resource-index.zip
so when a project is imported, the importer would look here for the file. It should be possible to configure a different index location using Eclipse Variables, in this case the custom index location would be dumped into project preferences in .settings/org.eclipse.core.resources.prefs
  #Tue Feb 02 18:59:13 CET 2010
  eclipse.preferences.version=1
  snapshotImportLocation=${someVar}/resource-index.zip
Comment 1 Remy Suen CLA 2010-02-02 16:36:44 EST
The zodiacs of my keyboard and the scroll wheel must've aligned. Pardon the erroneous.
Comment 2 Dani Megert CLA 2010-02-03 11:10:50 EST
I discussed this with Martin at our PMC call: we think the win is big but we
first want to have a patch at hand to see whether it is too big or too risky to
put into M6 (note that M5 was feature freeze for major features or features
that impact other components). Martin confirmed that they will invest the
required time to polish the feature when requested. Martin will also talk to
Sharon about the IP review which has its deadline for new CQs on February 5 -
we don't want to invest time if it later fails to pass the IP review for 3.6.
Comment 3 Martin Oberhuber CLA 2010-02-04 21:37:21 EST
I have initiated the IP due diligence process, and we'll be working on an initial patch ported to HEAD for review by Tuesday Feb 17. Getting the original patch reviewed through https://dev.eclipse.org/ipzilla/show_bug.cgi?id=3773 will be sufficient to continue work here without further review, since this will be "under the supervision of the PMC". We are going to open a CVS branch for continued review and polishing once the initial contribution is in.

For reference, here are some acceptance criteria that we discussed on the PMC (and some more that I think I can commit to):

* There must be proof that the contribution does in fact improve performance.
* The contribution must work and be valuable as-is, without putting pressure on 
  add-ons like CDT or JDT to actually implement extensions to make it valuable.
* There must not be any performance degradation when the contribution is
  not used (no "import-from-snapshot").
  
I'm adding an "api" keyword since we will require API to 
1. Allow exporting the project snapshot data from headless automated builds
2. The file format of the exported snapshot data constitutes API to be 
   maintained moving forward (workspace compatibility)
3. There is a desire to allow extenders like CDT contribute snapshot data for
   indexing on export/import. We may want an extension point to support this,
   although this may not be strictly required (the contribution will likely be 
   valuable without the extension point).
Comment 4 Francis Lynch CLA 2010-02-15 17:14:19 EST
Created attachment 159129 [details]
Patch to org.eclipse.core.resources plugin
Comment 5 Francis Lynch CLA 2010-02-15 17:15:15 EST
Created attachment 159130 [details]
Patch to org.eclipse.core.tests.resources plugin
Comment 6 Francis Lynch CLA 2010-02-15 17:30:05 EST
I have attached first drafts of patches to core.resources (changes to support the new snapshot refresh feature) and core.tests.resources (a Junit test to test the feature). The unittest creates a project from a set of files at d:\bigsite, timing the intial open which would include a project refresh; it then creates a project refresh snapshot, closes and deletes the project, and opens the project again, this time getting the refresh information from the snapshot file. Finally, it closes and deletes the project and reopens it with a standard refresh again. I tested this with a set of 9799 files and got timings as follows: Original open: 1701, Snapshot open: 234, and Second refresh open: 905. The performance improvement seems to be substantial even on the fast file system I was using; on one with a lot of overhead (such as a ClearCase view) the benefits should be even greater.

Open questions remaining in the design include how to configure the location of the refresh snapshot file, currently hardcoded at .settings/resource-index.zip. Also, the possibility of a project on a remote file system needs to be handled properly. Please provide comments about other things that may have been overlooked.
Comment 7 Martin Oberhuber CLA 2010-02-16 13:01:31 EST
Thanks for the patch, Francis!

From my initial review, I believe that this is ready to be committed into a branch if this is what other reviewers prefer. I'd personally think that with just 12K the patch is small enough to be reviewed efficiently as a patch, so I'd not check it in yet unless somebody wanted me to.

IMO the next steps should be:

1.) Create a simple UI wizard to dump the snapshot, such that it's easy to
    do some exploratory testing on real world projects. I'd especially be
    interested in behavior related to large CDT projects, CDT team-shared
    index, and complex setups with linked resources.

2.) Discuss whether the hardcoded snapshot location 
      $PRJ_ROOT/.settings/resource-info.zip
    is good enough to be released, or we need some means to configure the
    location. My personal opinion is that the hardcoded location may be good
    enough, because
      - the snapshot should be "close" to the file system it describes,
      - in case the location is not writable, it's easy to put the .project
        file elsewhere and use a linked resource to access the data.
    I am aware though that the CDT's team-shared index functionality does
    allow changing the snapshot location so we could do something similar here.
    In case it is configurable, how could the configured location be stored?

3.) Add Unittests for "interesting" scenarios, to ensure that no API
    contract is violated by the feature.

I'd also like to see initial comments from Szymon and other interested parties.
Comment 8 Francis Lynch CLA 2010-02-18 01:14:34 EST
Created attachment 159396 [details]
Patch to org.eclipse.ui.ide plugin

This patch to org.eclipse.ui.ide adds an export wizard (under General) to create refresh snapshots for selected projects. This can be used to test the feature very simply by taking any project and exporting it with this wizard, then deleting that project (but not its content on disk) and then importing the project. The following limitations of this wizard should be noted: (1) the wizard icon needs improvement; (2) F1 context help to explain the feaure in detail is not provided; (3) after exporting the snapshot, you need to refresh the project in order to see the snapshot file itself in the project. The last item will be addressed in a future version of Project#writeRefreshSnapshot.
Comment 9 Szymon Brandys CLA 2010-02-18 08:00:06 EST
Some comments.

The changes in Project#open break tests in core.resources Automated Tests suite. Tests reports

org.eclipse.core.tests.harness.FussyProgressMonitor$FussyProgressAssertionFailed: worked 76.21000000000001 more than totalWork
	at org.eclipse.core.tests.harness.FussyProgressMonitor.assertTrue(FussyProgressMonitor.java:75)
	at org.eclipse.core.tests.harness.FussyProgressMonitor.internalWorked(FussyProgressMonitor.java:123)
	at org.eclipse.core.runtime.ProgressMonitorWrapper.internalWorked(ProgressMonitorWrapper.java:94)
	
It is cause by wrong amount of 'ticks' in line 929 

refreshed = workspace.getSaveManager().restoreFromRefreshSnapshot(
								this, Policy.subMonitorFor(monitor, Policy.opWork * 80 / 100));
								causes
								
I share Martin's concerns about the snapshot location, but we can discuss it later. For now we could just make REFRESH_SNAPSHOT_FILE_LOCATION non-API. 

I would also suggest to configure API Tooling, to show any API breakage/additions. The new REFRESH_SNAPSHOT_FILE_LOCATION field reports 'missing @since tag'. When you make it internal, it won't be an issue anymore.
Comment 10 Martin Oberhuber CLA 2010-02-18 08:22:07 EST
(In reply to comment #9)
Thanks for the comments!

Fixing the progress monitor issue should be easy; we'll make sure that we run the AutomatedTests suite and configure API tooling before making the next contribution.

At the moment, though, I am more interested whether the concept and direction is acceptable. 

Regarding the snapshot file location, what if we introduced
   URI IProject#getSnapshotLocation()
instead of the constant. Then the concrete location remains an implementation detail which we can hide (and for now would always return the same constant value). Note that clients must know the snapshot location, because they will be responsible for transferring the snapshot from workspace A to workspace B after making the snapshot.
Comment 11 Martin Oberhuber CLA 2010-02-18 10:30:53 EST
(In reply to comment #10)
> Regarding the snapshot file location, what if we introduced

Actually, I think it should probably be 
   URI IProjectDescription#getSnapshotLocationURI()
because
 - The snapshot location must be "known" to the project on open/import,
   so it must be stored somewhere, likely the .project file, but some
   compeltely different mechanism could be used

Note that the concept of a snapshot location can be applied more general than just the context described here: In general, any project consists of persisted data (files under the .project location, roughly spoken), and the metadata (what the workspace knows about the project).

Metadata can be contributed and used by many add-ons. The general concept of having a snapshot location related to a project would allow any add-on to persist/unpersist their metadata when cloning a project from workspace A into workspace B. The resource snapshot discussed here would be one example of such metadata to be snapshotted. In the end, project + snapshot would work like a database to import / recreate the project in another workspace.
Comment 12 Szymon Brandys CLA 2010-02-18 11:32:12 EST
We could use a project scope preference to store the snapshot file location.

I was thinking about another approach. We could have IProject#saveProjectShapshot(OutputStream) or IWorkspace#saveProjectSnapshot(IProject, OutputStream). A caller of this API would decide where to store the output.

Then we could have IWorkspace#loadProjectSnapshot(IProject, InputStream) and this would be called on a newly created project. This method would store the snapshot in the same place, where IProject#close stores its snapshot. When the project is opened, it would behave like for a closed project.
Comment 13 Martin Oberhuber CLA 2010-02-18 12:48:04 EST
Project scope Preference is what CDT does. The little problem with applying this to refresh snapshots is that when opening the project, we need to read the snapshot location before the project is open. We could not use normal project preference API for this before the project is open... that's not an unsolveable problem, since project prefs are also just properties files which we could parse ourselves, but still worth noting.

Advantage of a project scope Preference is that clients can easily read/modify the snapshot location in the Project Prefs when needed (no new API needed other than the well-known String key of the Preference slot to use). 

Disadvantage is that if in the future, somebody wants to invent a global policy for where all snapshots are, that could happen under the hood behind a getSnapshotLocationURI() method, but not if clients are used to reading a project Pref.

I do not quite understand your comment about Streams. In the writeSnapshot(OutputStream s) scenario, how would the location of the snapshot be known to the project? Note that I want the normal "project open" operation to just use a snapshot if it's in the expected location, without any extra API call. How would it know where to look? 

Also, I am not sure who would benefit from a loadProjectSnapshot(IProject, InputStream) API. When the project is already open, it is too late since it would have been refreshed already. When it is still closed, would you expect that method to open the project as a side-effect? Or are you thinking about a new option constant like IProject#open(IResource#NO_REFRESH) similar to the current IResource#BACKGROUND_REFRESH, thus also satisfying bug 246565 and allowing the client to load the snapshot in a separate step? - That sounds like a good and flexible idea to me but may lead to some tricky complications since even NO_REFRESH would at least need to refresh the .project file itself, thus leading to multiple resource deltas...

I guess I like the idea of adding flexibility around where the snapshot Stream comes from, but I'm a little confused about the details.
Comment 14 Szymon Brandys CLA 2010-02-19 06:10:57 EST
(In reply to comment #13)
> Note that I want the normal "project open" operation to
> just use a snapshot if it's in the expected location, without any extra API
> call. How would it know where to look?

My idea is to reuse an existing mechanism in #open method.

What #open really does now? When there is a newly created project, it opens the project and performs a refresh to build the project structure. If a project is used and closed, it opens the project and restores its tree from the tree snapshot file.

What I am suggesting, is to be able to replace this snapshot file. Then #open would open the project and build its tree based on this 'replaced' file.
To replace the tree snapshot file, we could use API like IWorkspace#loadProjectSnapshot(IProject, InputStream).

> Also, I am not sure who would benefit from a loadProjectSnapshot(IProject,
> InputStream) API. When the project is already open, it is too late since it
> would have been refreshed already. When it is still closed, would you expect
> that method to open the project as a side-effect? 

We don't have to open a project right when it is created. We can 'create', 'set the snapshot', 'open'. Moreover I think that we would be able to replace this snapshot file each time a project is closed. Then when reopened again, it would pick up the replaced snapshot.

> Or are you thinking about a
> new option constant like IProject#open(IResource#NO_REFRESH) similar to the

We would not need any changes here. As I said, when the snapshot is available, #open just restores the project without refreshing it.
I hope it makes sense now.
Comment 15 Martin Oberhuber CLA 2010-02-19 06:42:08 EST
(In reply to comment #14)
> What I am suggesting, is to be able to replace this snapshot file. Then #open
> would open the project and build its tree based on this 'replaced' file.
> To replace the tree snapshot file, we could use API like
> IWorkspace#loadProjectSnapshot(IProject, InputStream).

Perhaps I'm misunderstanding, but my problem with this approach is that any legacy clients who just know how to #open() will not do the #loadProjectSnapshot() so they will not benefit from the functionality.

I also thought more about Streams, but I'd like to stick with URI because
- It's easy to pass around and persist
- It provides a folder/child semantics that allows additional clients add 
  their snapshots to the location provided by the project/platform
- If a Stream is really needed in the future, one can write an EFS provider
  that interprets the URI to return the proper IFileStore and Stream.

> We can 'create', 'set the snapshot', 'open'.

Ok, if you call it 'set the snapshot' rather then 'load', I can better understand your idea. Essentially, you are making the feature more flexible by decomposing the steps: convert an "unused" project to look like a "used" one by providing a snapshot before opening it. I think this is interesting, but not strictly needed for an initial version; if the first version only loads snapshots from an internally well-known location, we can add API to set the location/decompose steps also later. The less API we publish now, the less we'll have to maintain :)

> Moreover I think that we would be able to replace this
> snapshot file each time a project is closed.

I do not think that this is desirable. Automatic snapshotting on open/close is done in your workspace metadata. Explicit snapshotting is like an export for others, and it should only be done as an explicit action.

I'll post my new API proposal in the next comment.
Thanks!
Comment 16 Martin Oberhuber CLA 2010-02-19 06:56:32 EST
Here is my new take on API to be added. I think that just 2 methods are sufficient for what we need to day, and easy to extend in the future if the need should arise. I added an option constant for #writeSnapshot() since I could envision different kinds of snapshot to export in the future (e.g. bookmarks, or persistent properties) and option constant have proven valuable for evolving API:


   /** Return a folder location, where clients can write snapshots of
    *  project metadata. When importing this project into a workspace
    *  for the first time, clients are expected to look for any snapshot
    *  that they can process in this location and load it if it exists.
    *  @see IProject#saveSnapshot(int,URI)
    *  @since 3.6
    */
   URI IProjectDescription#getSnapshotLocationURI();

   /** Write a snapshot of project meta-data into the given location
    *  URI. The option constant controls what kind of snapshot information
    *  to write. Valid option values include:<ul>
    *    <li>@link{IResource#SNAPSHOT_REFRESH} - snapshot refresh info
    *  </ul>
    *  @see IProjectDescription#getSnapshotLocationURI()
    *  @see #loadSnapshot(int,URI)
    *  @since 3.6
    */
   IProject#saveSnapshot(int options, URI snapshotLocation);

   /** Option constant (value 1) for refresh snapshot information.
    *  @see IProject#saveSnapshot(int,URI)
    *  @since 3.6
    */
   int IResource#SNAPSHOT_REFRESH = 1;


If we do want to add Szymon's decomposition (reading a snapshot from a different URI), I would add the following API which is not strictly needed in a first version:


   /** Load a snapshot of meta-data from the given location, and set the 
    *  location for {@link IProjectDescription#getSnapshotLocationURI()} to
    *  the new specified one. The option constant controls what kind of
    *  snapshot information to load. Valid option values include:<ul>
    *    <li>@link{IResource#SNAPSHOT_REFRESH} - refresh status
    *  </ul>
    *  @see #saveSnapshot(int,URI)
    *  @since 3.6
    */
   IProject#loadSnapshot(int options, URI snapshotLocation);


I did not quite understand the benefit of Szymon's proposal to tie the new API to IWorkspace rather than IProject.
Comment 17 Szymon Brandys CLA 2010-02-19 08:30:12 EST
IProject#saveSnapshot(int options, URI snapshotLocation);
IProject#loadSnapshot(int options, URI snapshotLocation);
are fine.

However I'm not sure about IProjectDescription#getSnapshotLocationURI(). The location of snapshots should be known by callers of new API and the Resources layer doesn't have to know that.

If you want to restore project from a snapshot, wherever it is, you would have to call #loadSnapshot(int, <location of your snapshot file>) before a project is open. Resources don't have to keep any project snapshots other than those created on #close.

UI to restore from snapshot would be a wizard, let's call it 'Import from a snapshot'. This wizard would get the snapshot location. On finish it would create a project, load the snapshot from the given location and then open the project.

What is the reason of IResource#SNAPSHOT_REFRESH? I assume that calling writeSnapshot() with this flag will just update the existing snapshot instead of overriding it. Are we going to implement it?

The reason for adding new methods to IWorkspace is that we already have a similar #loadProjectDescription method there. But I'm not attached to this idea.

One more thing is that I already am able to copy snapshots (saved in .metadata) between closed projects. When they are reopened they recreate the structure based on 'replaced' snapshots. However I have to do it manually by copying files in .metadata.
Comment 18 Martin Oberhuber CLA 2010-02-19 09:20:27 EST
(In reply to comment #17)
> However I'm not sure about IProjectDescription#getSnapshotLocationURI(). The
> location of snapshots should be known by callers of new API and the Resources
> layer doesn't have to know that.

Well, this is an option, and I could live with it for our needs, but I think that my other proposal is more powerful. In my model, when a user exports a snapshot to the well-known default location, that snapshot becomes part of the data it describes. Other users, who import that project, do not need to know whether a describing snapshot is there or not; if it's there it gets used.

The benefit of this approach is, that we don't need a new special wizard to perform the "import from snapshot". It's just a normal import operation, which happens to import project + some exported info describing the project -- essentially similar to the .settings project scope preferences. And especially for immutable data, the snapshot does accurately describe the data, so there is no risk involved.

Workflows for importing remain the same as today (they are just optimized when something was exported); Products on top of Eclipse which don't use our wizards but create/import projects by different means can leverage the new functionality out of the box.

> What is the reason of IResource#SNAPSHOT_REFRESH? I assume that calling
> writeSnapshot() with this flag will just update the existing snapshot instead
> of overriding it.

No, it's not meant to refresh the snapshot. 
It's meant to snapshot refresh information :) 
And it would be the default operation, which is of course implemented. In the future, we could have SNAPSHOT_BOOKMARKS, SNAPSHOT_PERSISTENT_PROPERTIES etc.

> The reason for adding new methods to IWorkspace is that we already have a
> similar #loadProjectDescription method there. But I'm not attached to this
> idea.

I think these are different because they unpersist the project description independent of any project's existence. On the contrary, the snapshotting would always be tied to a project.

I could see some value in having the snapshotting in IWorkspace, if we knew that alternative implementations of IProject are possible (since the snapshotting doesn't need to know any implementation detail in IProject, but it does need to know implementation details of the Workspace Tree). But IProject is @noimplement, so I do not think that's relevant here?

> One more thing is that I already am able to copy snapshots (saved in 
> .metadata) between closed projects. When they are reopened they recreate
> the structure based on 'replaced' snapshots. However I have to do it
> manually by copying files in .metadata.

Well that's the point... it's not documented, the mechanisms inside .metadata are subject to change at any moment. We are after a documented stable API that we can leverage to create and share snapshots in a team. As you see in the patch, we don't need much code to implement this API (because we leverage what you describe).

Thanks for all your input. This is a great discussion to have, and I'm glad that we are digesting the API thoughtfully even if time is short.
Comment 19 Martin Oberhuber CLA 2010-02-19 10:19:49 EST
(In reply to comment #17)
> The reason for adding to IWorkspace is that we already have a similar

When looking again, I noticed IWorkspace#save() which is in fact related. Since an IProject is a resource, and resources are handles, it may in fact be more appropriate to put the new kind of save/load operations into IWorkspace. I'll think a bit more about this.

I also noticed the contract around save numbers/rollback and save participants. We'll need to make sure that the save which we perform into the "external" URI outside the metadata is not assigned a number which is expected to be available for rollback later; and that we behave properly in the event of a crash or other failure while our "external save" is ongoing.

One simple way of ensuring this would be to always save into the normal metadata, and once that is succesful just copy / zip up the relevant files into the user specified URI location.
Comment 20 Francis Lynch CLA 2010-02-20 18:09:35 EST
Created attachment 159678 [details]
Patch v2 to org.eclipse.core.resources plugin

Here is a new version of the core.resources patch that fixes the progress monitor and missing '@since' tag problems reported earlier. The Automated Tests for core.resources all run successfully and API tooling reports no errors on this version. No other changes were made, pending completion of the discussion of possible changes to the API and basic concept.
Comment 21 Martin Oberhuber CLA 2010-02-23 08:02:23 EST
(In reply to comment #20)
> pending completion of the discussion of possible changes to the API

I discussed this with Szymon, and we think that for the completion of this bug, we are going to restrict ourselves to the most basic API needed, see also
comment 16 for some Javadocs:

   static final int IWorkspace#SNAPSHOT_TREE = 1; //option to save refresh info
   IWorkspace#save(IProject proj, int options, URI snapshotLocation);
   IWorkspace#load(IProject proj, int options, URI snapshotLocation);

That is, we remain "close to" the existing IWorkspace#save operation. The client will need to pass in the URI where to save/load from: this is sufficient for clients to write domain-specific wizards and applications to leverage the feature. We are not going to define a "canonical known" API location for snapshots in this bug. The "load" operation will behave well-defined only when loading before the project is opened (that is, it is more like a "set snapshot" operation than a real load).

We are not going to address the UI in this bug, since it is hard to generalize the feature in a way that makes sense for everyone. I am therefore going to file new bugs for providing generic UI/Wizards in the future. Today, I can think of two scenarios: (a) Export to a place "known" to a project, such that any existing open operation can use the snapshot, and (b) UI that supports importing a project from location A while associating it with a snapshot from location B.

Since we are not going to provide generally visible UI for this bug, we'll need some "internal" way of testing the feature. I'm not yet sure how to best do this. One option is keeping the wizard just as attachment on this bug, and having the project-open operation load snapshots from an only internally known, hardcoded location at .settings/resource-info.zip.

Francis, is this sufficient for you to move forward or are there open questions? - If having the API in IWorkspace is problematic, then having it in IProject instead would also be OK.
Comment 22 John Arthorne CLA 2010-02-23 11:28:06 EST
(In reply to comment #21)
>    static final int IWorkspace#SNAPSHOT_TREE = 1; //option to save refresh info
>    IWorkspace#save(IProject proj, int options, URI snapshotLocation);
>    IWorkspace#load(IProject proj, int options, URI snapshotLocation);

Since this operation only operates on a single project, it would make more sense to me for the methods to be on IProject:

>    IProject#save(int options, URI snapshotLocation);
>    IProject#load(int options, URI snapshotLocation);

No other API methods on IWorkspace are specific to a single resource. I understand the reasoning to keep it "near" IWorkspace#save, but to me since the subject of the method is different it should be on IProject (much like we have IWorkspace#build and IProject#build). 

We can add @see references between IWorkspace#save and IProject#save so clients are aware of the two variants. I think I would have chosen import/export here for method names but I can live with save/load.
Comment 23 Martin Oberhuber CLA 2010-02-23 12:35:40 EST
Thanks John, that simplifies matters :) We'll continue going with IProject (and I suggest moving the constant into IProject too). I also suggest keeping the name "saveSnapshot" to prompt the relation to the SNAPSHOT_* constants, and adding a progress monitor for cancellation:

  static final int IProject#SNAPSHOT_TREE = 1; //option to save refresh info
  IProject#saveSnapshot(int options, URI snapshotLocation, IProgressMonitor m);
  IProject#loadSnapshot(int options, URI snapshotLocation, IProgressMonitor m);

For the records, another reason for having it in Workspace was my understanding that IProject is the "model" while Workspace is the "Controller", but that would only be relevant in case there were alternative implementations of IProject (which we don't have).
Comment 24 Szymon Brandys CLA 2010-02-23 12:49:40 EST
(In reply to comment #23)
> I also suggest keeping the
> name "saveSnapshot" to prompt the relation to the SNAPSHOT_* constants, and
> adding a progress monitor for cancellation:

#saveSnapshot and #loadSnapshot were suggested already in previous comments and I like those names more than just #save and #load.
Right, monitors are good here.
Comment 25 Martin Oberhuber CLA 2010-02-24 08:59:33 EST
Since we decided to focus this bug on the API only, I have filed bug 303751 to discuss our options for UI that make the "fast import from snapshot" functionality generally available to Eclipse users.

While I think it's not easy to come up with a UI that "merges" a project location A with a snapshot location B on project-import, I would like to come back to the proposal of a Project-Level Preference:

(In reply to comment #12)
> We could use a project scope preference to store the snapshot file location.

I thus propose adding such a Preference setting. The name of the Preference Slot (String constant) could even be internal non-API for now, although I'd prefer making it API. 

While this would not resolve all scenarios where a project snapshot may be desired, it would at least address the scenario where the "owner" of project data is also responsible for providing a snapshot. Essentially, such a Preference would give the project creator the chance to associate additional meta-info with his project, which can then be used on project-open automatically.

The value of the Preference slot could use Eclipse Variables in order to reference locations "outside" the project -- that's exactly what the CDT team-shared index facility does.

Thoughts?
Comment 26 Francis Lynch CLA 2010-02-26 12:39:37 EST
Created attachment 160342 [details]
Patch v3 to org.eclipse.core.resources plugin

The attached patch implements the refresh snapshot using the loadSnapshot and saveSnapshot methods described in a previous comment. The automated tests all pass with this patch, and I will attach a unittest for the feature next.
Comment 27 Francis Lynch CLA 2010-02-26 12:41:11 EST
Created attachment 160343 [details]
Patch v2 to org.eclipse.core.tests.resources plugin

Attached is a patch to the resources.test plugin with a new version of SnapshotImportPerformanceTest which uses the new loadSnapshot/saveSnapshot API.
Comment 28 John Arthorne CLA 2010-02-28 22:59:34 EST
(In reply to comment #26)
> Created an attachment (id=160342) [details]
> Patch v3 to org.eclipse.core.resources plugin

This looks like the wrong patch - it just adds a translated message.
Comment 29 Szymon Brandys CLA 2010-03-01 08:20:17 EST
Francis, could you attach a fixed patch for core.resources please?
Comment 30 Francis Lynch CLA 2010-03-01 08:58:48 EST
Created attachment 160478 [details]
Patch v4 to org.eclipse.core.resources plugin

Sorry about this, I should have checked the patch before attaching it. For some reason I was not getting a complete patch from the Team|Create Patch command; I had to exit and restart Eclipse before I was  able to get the complete patch. Anyway, here it is.
Comment 31 Szymon Brandys CLA 2010-03-01 11:52:20 EST
I think that loadSnapshot(int options, URI snapshotLocation, IProgressMonitor monitor) may just copy the tree snapshot the workspace metadata area i.e. the place where the project tree is stored on project close. We would just do snapshot file validation before. I think this would simplify the code.
Comment 32 Martin Oberhuber CLA 2010-03-01 12:15:49 EST
(In reply to comment #31)
> I think that loadSnapshot may just copy the tree into the workspace metadata

I'm afraid that I don't understand your idea... if we have to validate the file first, I don't really see how this would simplify the code? - At any rate, the suggestion seems to be related to implementation only, and not to API.

Szymon/John, I'm wondering what your acceptance criteria are before the code can be committed / released to CVS. We have just 1 week to go before the API freeze, so I'd like to make sure that we invest this time in the right work such that this can go into 3.6m6. 

Is your proposal for implementation improvement part of the acceptance criteria for the 3.6m6 API freeze? Or could that also be done later?
Comment 33 Szymon Brandys CLA 2010-03-01 12:46:34 EST
(In reply to comment #32)
> (In reply to comment #31)
> > I think that loadSnapshot may just copy the tree into the workspace metadata
> 
> I'm afraid that I don't understand your idea... if we have to validate the file
> first, I don't really see how this would simplify the code? - At any rate, the
> suggestion seems to be related to implementation only, and not to API.

We may also just copy the snapshot file to .metadata. Then on project open it will be read like for closed projects.
We don't need to do extra validation. However I think that SaveManager#restore (or Project#open) should be modified to handle problems with the snapshot file
differently for NEW and USED projects.

> Szymon/John, I'm wondering what your acceptance criteria are before the code can
> be committed / released to CVS. 

I think we already have an agreement on API and the patch seems to follow the agreement.

> Is your proposal for implementation improvement part of the acceptance criteria
> for the 3.6m6 API freeze? Or could that also be done later?

It was just a suggestion. If there is still time we could try to have both API and the implementation in the final shape.
If it is not possible by M6, I think we could continue work in M7 on implementation.
Comment 34 Szymon Brandys CLA 2010-03-01 12:49:22 EST
(In reply to comment #33)
> If there is still time we could try to have both API and the implementation in the final shape.
I should have said "There is still time, so we can try to have both API and the implementation in the final shape by M6."
Comment 35 Martin Oberhuber CLA 2010-03-02 10:59:00 EST
(In reply to comment #33)
> We may also just copy the snapshot file to .metadata. Then on project open it
> will be read like for closed projects.

Ok, I think I do see your point, but I'm not so much in favor of this for the following reasons:

1. I would still like to see a possibility for marking up project information
   (e.g. project-scope preference) such that the normal open operation 
   loads the snapshot rather than doing a refresh. I still think that for the
   case of distributing a read-only "project database" as data+snapshot,
   this is the right way to go since no extra work is needed on import.
   In this case, the actual snapshot loading needs to be in open() just like
   today but the URI would come from the Preference slot rather than load.

2. When loadSnapshot() copies into the metadata, we need to transfer the data
   twice (origin->metadata, and then metadata->memory). Perhaps not much of
   an issue in terms of performance, but not very pretty.
  
Since we do have agreement on API, what about committing the current patch as attached. This would make it easier to talk about modifications / additions and attach these as smaller separate patches.

(In reply to comment #34)
> we can try to have both API and the implementation in the final shape by M6

Yes I agree, so the question is how to do best get things in now. What I would personally still love to see is at least some kind of UI for trying the feature on real-world projects. Even if it's not the final kind of UI, having this "preliminary UI" by M6 would help us to (a) highlight the feature in New&Noteworthy and thus (b) get a lot more community feedback.

At the moment, the simplest way I see for getting this UI is using the export wizard that's currently attached along with a project-scope preference for importing in the normal open() operation. If we get the current API patch into this week's I-build, I guess it would be easier to get in the UI as well.
Comment 36 Francis Lynch CLA 2010-03-02 11:44:20 EST
Created attachment 160641 [details]
Patch v3 to org.eclipse.core.tests.resources plugin

Here is a new patch to the core.tests.resources plugin that defines two additional unit tests for the refresh snapshot feature.
Comment 37 Szymon Brandys CLA 2010-03-02 12:35:12 EST
(In reply to comment #35)

1. The simplest UI we can suggest at the moment for M6 is Import/Export wizards for snapshots. If we think that project-scope preference is the right way to go, we can introduce it in M7 with PMC approval. It is easier to add new API than remove it.

2. Regarding #loadSnapshot method. I think that 'load' in the name suggests that origin->metadata was done. Otherwise one could modify 'origin' on disc and this could affect IProject#open operation, while changing .metadata is unexpected and unlikely.

3. A comment for tests. Regular tests move to org.eclipse.core.tests.resources package please. The performance test could be added to the Perf Test suite. However your test has to be refactored to use PerformanceTestRunner or WorkspacePertormanceTest class.

To sum up. For M6 I would like to have 2) fixed. You may commit the latest patch and fix 2) by the end of the week. Or wait till you have a complete fix and commit it then. The tests can be released when 3) is fixed. For M6 we may have the UI suggested in 1)

Other ideas should be calmly and cautiously discussed during M7.
Comment 38 Martin Oberhuber CLA 2010-03-02 13:17:17 EST
(In reply to comment #37)
> for snapshots. If we think that project-scope preference is the right way to
> go, we can introduce it in M7 with PMC approval.

I had thought about doing the project-scope preference as internal non-API (a String only known to the Wizard and us) for now. Does that make sense to you?

> 2. Regarding #loadSnapshot method. I think that 'load' in the name suggests
> that origin->metadata was done. Otherwise one could modify 'origin' on disc
> and this could affect IProject#open operation, while changing .metadata is
> unexpected and unlikely.

Good point. I had always felt a bit uneasy about a "load..." method doing really a "setSnapshot" operation. The method felt misnamed.

I don't really buy the argument about changing the snapshot location's data though. First, setting the snapshot is designed to be "close to" opening the project; and second, if I associate a project with a location (URI) to get the snapshot from, and I open the project some days later, then it might even be desired that I get a newer snapshot from the location reference. Essentially, the "setSnapshotLocationURI()" method would associate a snapshot reference with the project reference, regardless of what the content in that reference is. 

We could even go as far as persisting the snapshot reference (like the proposed project properties slot, but set from the outside without changing the prefs which are probably read-only). On multiple open/close/open project cycles, each open operation could then check whether the snapshot in the reference URI is newer than the snapshot made during the last close, and pick up the newer of the two.

Looks like we have two options at this point:
 - either (a) rename loadSnapshot() -> setSnapshotLocationURI()
 - or (b) change semantics of loadSnapshot() to copy into the metadata.
Thoughts?

IMO another benefit of setting a snapshot location URI is that in the future, extenders of project meta-info (i.e. ISaveParticipant) can opt-in to copy additional data from the URI as they see fit. Whereas if we copy the data, we can only copy what we know now and future participants won't get to their data.

One scenario where we want more ISaveParticipants to take the snapshot is when I clone my local workspace, in order to work on a new branch:
  - existing workspace on HEAD with 10 projects
  - File > Switch Workspace > Other > (new ws location)
  - In "Copy Settings", have a new settingsTransfer to copy projects
    --> projects can be cloned into the new workspace with all persistent
        properties etc as needed by simply referencing the original 
        workspace metadata in setSnapshotLocationURI
Or, when I want to clone my local workspace including persistent and contributed properties for another user.

> 3. A comment for tests. Regular tests move to org.eclipse.core.tests.resources
> package please. The performance test could be added to the Perf Test suite.

The current performance test requires data to be provided external to the test. It's more a "try-me-manuall / proof-of-concept" thing. In order to run as part of the perftest suite, the test would need to be self-contained, i.e. generate some fake data. Is this what you have in mind?

> To sum up. For M6 I would like to have 2) fixed. You may commit the latest
> patch and fix 2) by the end of the week. Or wait till you have a complete fix
> and commit it then. The tests can be released when 3) is fixed. For M6 we may
> have the UI suggested in 1)

Ok. Thanks for giving such clear criteria.

For moving forward on (2), I need your opinion regarding loadSnapshot() vs setSnapshotLocationURI(). I think I could live with either of the two, since what we need today is possible with both. It's more a question of future extensibility versus simplicity and clearness of semantics.

For (1) I still feel easier about an export wizard only, with implicit import as specified in project prefs. Perhaps that's because I know this from CDT. If you feel strongly about separate export/import wizards, that should be doable.
Comment 39 Martin Oberhuber CLA 2010-03-02 13:26:41 EST
(In reply to comment #38)
> For moving forward on (2), I need your opinion regarding loadSnapshot() vs
> setSnapshotLocationURI(). 

Perhaps the best argument is this: with setSnapshotLocationURI(), a client who does want the snapshot copied, can copy the snapshot himself to a safe local place and then setSnapshotLocationURI().

On the other hand, a client who wants to set a reference which can change moving forward, can not do that with the load semantics.

But again, I don't have a very strong feeling about this. I can definitely live with the load semantics too. Since it's just a matter of naming, one could even introduce an option constant (SNAPSHOT_OPEN_DEFERRED) in the future, in order to use the existing load API with semantics like setSnapshotLocationURI().
Comment 40 Szymon Brandys CLA 2010-03-03 09:50:22 EST
(In reply to comment #38)
> Looks like we have two options at this point:
> - either (a) rename loadSnapshot() -> setSnapshotLocationURI()
> - or (b) change semantics of loadSnapshot() to copy into the metadata.
> Thoughts?

b) for M6

> > 3. A comment for tests. Regular tests move to org.eclipse.core.tests.resources
> > package please. The performance test could be added to the Perf Test suite.
> 
> The current performance test requires data to be provided external to the test.
> It's more a "try-me-manuall / proof-of-concept" thing. In order to run as part
> of the perftest suite, the test would need to be self-contained, i.e. generate
> some fake data. Is this what you have in mind?

Yes. This should be pretty simple.

> For moving forward on (2), I need your opinion regarding loadSnapshot() vs
> setSnapshotLocationURI(). I think I could live with either of the two, since
> what we need today is possible with both. It's more a question of future
> extensibility versus simplicity and clearness of semantics.

loadSnapshot() in M6.
 
> For (1) I still feel easier about an export wizard only, with implicit import as
> specified in project prefs. Perhaps that's because I know this from CDT. If you
> feel strongly about separate export/import wizards, that should be doable.

We can make both Export/Import wizard for M6.

Martin I looked through you ideas in the previous comment, however I would like people to be more involved in this work. This may happen when snapshots are visible in M6. Even if we need API changes, it is easier to add than remove them after API freeze.

One comment about project scope preference. I think that this is domain-specific whether we want to keep a snapshot location with a project or not. If an application wants to use project scope preferences though, it may define its own preference and change Open Project Action to load snapshot using this preference and then open a project. But this is yet another thought showing that we need further discussion in M7.
Comment 41 Martin Oberhuber CLA 2010-03-03 11:06:48 EST
Accepted all, this is a good path to proceed. Thanks Szymon for all your time. We'll work on getting agreed-on API + refactored tests + im/ex wizards in
by end of the week.
Comment 42 Francis Lynch CLA 2010-03-04 01:12:11 EST
Created attachment 160898 [details]
Patch v5 to org.clipse.core.resources plugin

Attached is a new patch to the core.resources plugin with a possible implementation of Project#loadSnapshot that copies the refresh snapshot into the workspace metadata. Then if that refresh snapshot metadata is present when a new project is opened, it is used instead of performing the refresh. On the assumption that the file can be rewritten at any time to provide updated refresh information, the metadata refresh snapshot is deleted and rewritten whenever Project#saveSnapshot is called. Once used by Project#open, however, it is deleted. I was not able to come up with a way to make the refresh snapshot part of the standard metadata snapshots but it may be possible. Since the project is being opened for the first time, it seemed easiest to look for a predefined snapshot file, in a way similar to previous implementations.
Comment 43 Szymon Brandys CLA 2010-03-04 09:11:26 EST
(In reply to comment #42)
One question I have so far is why not to use project .tree location for the snapshot? I will look at the patch carefully when UI and tests are ready too.
Comment 44 Martin Oberhuber CLA 2010-03-04 10:31:20 EST
(In reply to comment #43)
> One question I have so far is why not to use project .tree location for the
> snapshot? 

With all the sequence numbers in the master tree, we just couldn't figure out what the proper file name was. Plus, we found out that Performance of reading from a .zip file is *much* better than reading from the .tree file (since much less data is transferred from disk).

Essentially, if we end up extracting .zip into .tree and then reading from .tree we are loosing a good deal of the performance value of the feature. That's also why I was not so much in favor of copying into .metadata before, but I can live with it.

Note that we won't have UI done before end of day tomorrow, so in order to make tomorrow's warmup build for the API (which I would love to), I'd love to get your review on the core code. In case you have more comments on the core code, I can work on them.
Comment 45 Szymon Brandys CLA 2010-03-04 12:41:28 EST
(In reply to comment #44)
> With all the sequence numbers in the master tree, we just couldn't figure out
> what the proper file name was. 

1.tree? Projects start from 1. So when they are closed for the first time, 1.tree is created.

> Plus, we found out that Performance of reading
> from a .zip file is *much* better than reading from the .tree file (since much
> less data is transferred from disk).

That's why perf tests would help. 
If you are saying the performance is better, maybe we should consider keeping closed projects snapshots in zip files?
Comment 46 Szymon Brandys CLA 2010-03-04 12:50:32 EST
Patch v05 looks good. LocalMetaArea copyright just need to be updated. I think you can commit it for the warmup build.

(In reply to comment #45)
My previous comment concerned some implementation details, however we can discuss it later.
Comment 47 Martin Oberhuber CLA 2010-03-04 18:30:45 EST
Created attachment 161049 [details]
Amendments to core.resources patch v5

I committed core.resources patch v5 with some minor amendments:

- Fixed LocalMetaArea copyright header

- Unified the contribution comment to read the same on all headers

- IProject - Some minor clarifications on API Javadocs, e.g. SNAPSHOT_REFRESH
  was still referenced from some docs while the constant is now SNAPSHOT_TREE

- Project#loadSnapshot() - IFileStore#copy() was used with incorrect options,
  IFileStore#delete() is unnecessary with EFS.OVERWRITE option on copy. 

- Project#saveSnapshot() - Removed getWorkManager().operationCanceled() since
  this is not in the context of a workspace operation, and it's not necessary
  since we only read the workspace (which is thread-safe). There is still a
  possible race condition if multiple Threads try saveSnapshot to the same
  URI, but I think this is natural and expected. Concurrent snapshots to
  different URIs should work just fine.

- SaveManager#saveRefreshSnapshot() - Fixed a potential case where the 
  OutputStream on the URI would not be closed in case saving the Tmp tree
  runs into a problem (such as out of disk space).

All tests are green, functionality looks good to me. I also reviewed the code for potential race conditions and found two, but I do not think that these need to be addressed:
- loadSnapshot(), multiple Threads doing this at the same time would trample
  on each other's data. Likely the last one would win, losers could run into 
  a "cannot write" CoreException.
- saveSnapshot(), multiple Threads saving to the same URI would trample on 
  each other, but saving to different URI's at the same time should work 
  just fine.
Comment 48 Martin Oberhuber CLA 2010-03-04 18:38:39 EST
(In reply to comment #45)
> maybe we should consider keeping closed projects snapshots in zip files?

Yes, saving .tree snapshots in ZIP format may be worthwile. On an org.eclipse.swt project, the tree is 309KB unzipped but 59KB zipped. Saving the tree (on project close) might get slightly slower, but opening the tree should get faster. Perhaps that's something to consider for the Performance milestone.
Comment 49 Szymon Brandys CLA 2010-03-12 10:01:46 EST
Can we mark it fixed and raise separate bugs for other issues?
Comment 50 Martin Oberhuber CLA 2010-03-12 13:38:31 EST
Marking FIXED since the API is in 3.6m6 and works as per the original acceptance criteria. I have created following bugs for follow-up:

 Bug 303751 -  Need a UI for fast project import from snapshot data
 Bug 305716 -  Add Unittests for fast project import from snapshot data
 Bug 305717 -  Investigate storing resource tree snapshots in ZIP format
 Bug 305718 -  Cannot rename a project on fast project import from snapshot

I'm going to mark patches obsolete on this bug that are now being tracked by other bugs. Francis please attach your refactored patch for unittests to 
bug 305716.
Comment 51 Martin Oberhuber CLA 2010-03-19 15:34:08 EDT
I have created two more work items to follow-up on this new functionality,
making up for a total of 6 now:

>  Bug 303751 -  Need a UI for fast project import from snapshot data
>  Bug 305716 -  Add Unittests for fast project import from snapshot data
>  Bug 305717 -  Investigate storing resource tree snapshots in ZIP format
>  Bug 305718 -  Cannot rename a project on fast project import from snapshot

Bug 306573 - Add Performance tests for fast project import from snapshot data
Bug 306575 - Project-scope preference for refresh snapshot with standard import
Comment 52 Martin Oberhuber CLA 2010-04-14 16:44:49 EDT
(In reply to comment #51)
I have created one more work item to follow-up, making up for a total of 7:

> Bug 303751 - Need a UI for fast project import from snapshot data
> Bug 305716 - Add Unittests for fast project import from snapshot data
> Bug 305717 - Investigate storing resource tree snapshots in ZIP format
> Bug 305718 - Cannot rename a project on fast project import from snapshot
> Bug 306573 - Add Performance tests for fast project import from snapshot data
> Bug 306575 - Project-scope preference for refresh snapshot with standard import

Bug 309218 - Add tests for linked resources in fast project import