Bug 207178 - [api] Optimize FileServiceSubSystem.listFoldersAndFiles method
Summary: [api] Optimize FileServiceSubSystem.listFoldersAndFiles method
Status: RESOLVED FIXED
Alias: None
Product: Target Management
Classification: Tools
Component: RSE (show other bugs)
Version: 2.0   Edit
Hardware: All All
: P3 enhancement (vote)
Target Milestone: 3.0 M3   Edit
Assignee: David McKnight CLA
QA Contact: Martin Oberhuber CLA
URL:
Whiteboard:
Keywords: api
Depends on:
Blocks: 209552 221211
  Show dependency tree
 
Reported: 2007-10-23 11:08 EDT by Ankit Pasricha CLA
Modified: 2008-03-04 09:37 EST (History)
3 users (show)

See Also:


Attachments
patch for file service and subsystem API changes (95.63 KB, patch)
2007-10-31 16:59 EDT, David McKnight CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Ankit Pasricha CLA 2007-10-23 11:08:52 EDT
Currently, FileServiceSubSystem.listFoldersAndFiles resolves one filter string at a time. There is no way to pass a bunch of filter strings to the method at once. 
FileServiceSubSystem.listFoldersAndFiles eventually calls IFileService.getFilesAndFolders. If it was to take a list of filter strings, then the file service can perform a batch filter resolution instead of doing it one by one. This will improve performance.
Comment 1 David McKnight CLA 2007-10-23 17:06:59 EDT
This would involve new API(s) to the file subsystem and possibly new API(s) for the file services.  

The history of this one is that, in the IBM RSE, we had a special method listFoldersAndFiles(List) that allowed a query of a number of things in one shot instead of in sequence.  For dstore, this is faster than sequential because we could send all the queries at once and only have to wait until the last query is done (i.e. no waiting in between queries).  I'm not sure whether this would provide any improvement for SSH or FTP.  One of the IBM products relies heavily on this API so we'll need an equivalent here.  

I know something like this has been requested before.  Martin, I think you may have had some reservations about this because you felt that the same thing could be acheived by running a series of concurrent queries in jobs.  I still think that this would be a useful API for consumers and we may want to consider others.  We do already have a getRemoteFileObjects(List folderOrFileNames).

For consistency, it may be useful to have:

listFiles(List parentFolderNames,...);
listFolders(List parentFolderNames,...);
listFilesAndFolders(List folderNames,...);

We may be better off using IRemoteFile[] instead of the List.  We may also be better providing a filterString or an array of filterStrings (for each folder).

Any opinions on this?



  

Comment 2 Martin Oberhuber CLA 2007-10-24 11:03:15 EDT
I'm in favor of new API that allows for optimized implementations at the server.

I'm against adding new methods just for consistency or convenience; we have too many methods already. I'd rather have the listFolders() and listFiles() methods removed completely when the same result can be achieved by listFoldersAndFiles().

One question I have is, that when querying a list of items gives a list of results, how would the client know what results belong to what input queries?

One concern I have is that when querying a list of items (or a list of filters), and anything goes wrong, we should think about proper error reporting. Assume that you query all files from folders A,B,C and folder B turns out to not have proper permissions, what should the query do?
  * Fail completely with a message that B does not have permissions?
  * Still return results for A and C?

Similarly, if there are failures accessing A (does not exist) and B (no permission), what should the query do?
  * Return a SystemMessageException for the first error encountered?
  * Return kind of a MultiStatus?

A "fault tolerant" strategy would query what it can and return a MultiStatus for any errors encountered. Such a strategy needs less client/server traffic.
A "simple" strategy would fail at the first error and only return failure info for that error encountered.
Comment 3 David McKnight CLA 2007-10-24 11:39:46 EDT
In the IBM RSE version, when the requirement came up, I wasn't comfortable adding new API at the RemoteFileSubSystem level.  Instead, because subsystems were subclassed for customization, the file subsystem for dstore got the additional method and consumers of that method were forced to cast to the dstore subsystem.

Now, in Open RSE, we have the final subsystem, FileServiceSubSystem, to take care of most file subsystem needs.  Customizations for this are at the file service layer rather than at the subsystem layer.  This problem here is that the consumer wants to deal with IRemoteFiles (not IHostFiles) so the remote file subsystem layer is where the calls are being made. 

I understand that we don't want to bloat the APIs.  Along with this defect, we have defects bug 162195 and bug 162949 that are requesting similar function in the sense that they want to be able to work in batches of resources in order to be more optimal.  Despite this maybe complicating the APIs, such things would make it easier for consumers to do more powerful things with RSE.

If we did do it, the file service layer would need to implement the new APIs.  I would imagine that we could provide a default implementation in AbstractFileService that just iterates through the single calls.  Services that have nothing to gain from mass queries, wouldn't need to change anything, while services like dstore could provide optimizations.  At the subsystem layer, we would need to address the issues you mention (i.e. what to do when one folder query fails out of many?).






	


	


Comment 4 Martin Oberhuber CLA 2007-10-24 11:46:22 EDT
The questions I raised about error reporting and data returning are not to be addressed at a particular layer, but to be defined in the API. All layers will need to be aware of that definition when working with the new API.

As I stated, I'm not against adding API for mass queries. I'm just against bloating the API unnecessarily by convenience methods.

Isn't the new API you propose at the heart just an ISubSystem.resolveFilter() with a complex filter? I cannot quite see why we need separate methods for listFolders(), listFiles() and listFoldersAndFiles(). The very same can be accomplished with a single methods plus filter. We could provide default filters that clients can just instantiate if they want "all files only" or "all folders only". Or am I missing something?

I agree that the AbstractFileService can have default implementations of the new API which just iterate over the List, such that the API change is even backward compatible for clients which only provide a service.
Comment 5 David McKnight CLA 2007-10-24 12:07:26 EDT
(In reply to comment #4)
> The questions I raised about error reporting and data returning are not to be
> addressed at a particular layer, but to be defined in the API. All layers will
> need to be aware of that definition when working with the new API.

Agreed.

> Isn't the new API you propose at the heart just an ISubSystem.resolveFilter()
> with a complex filter? 

I'm not sure resolveFilter() would be the right place for this.  In a lot of cases, the list*() APIs are used directly by consumers and the need to deal with filters are avoided.

>I cannot quite see why we need separate methods for
> listFolders(), listFiles() and listFoldersAndFiles(). The very same can be
> accomplished with a single methods plus filter.

I don't like having listFolders, listFiles and listFoldersAndFiles either.  Now is probably the time to consolidate them all into listFoldersAndFiles and do away with the others.

What do you think?


Comment 6 Martin Oberhuber CLA 2007-10-25 08:48:03 EDT
Hmm, API changes need to be well thought out.

For the Service layer, I find the current situation not too bad: clients can call either getFiles() getFolders() getFilesAndFolders(), but implementors only need to implement internalFetch() because the others are dealt with in AbstractFileService.

Even here, the situation could be changed to have less API - like java.io.File which only has a single list() method; or EFS which only has IFileStore#childStores() and does not do filtering. What we'll need to do is weigh ease-of-use for clients against ease-of-implementation.

On the IRemoteFileSubsystem, we currently have 9 (!) convenience variants of list* which I find too much. 

Can you make a suggestion going forward?
How would you propose adding the multi-queries on the Service layer?
Comment 7 David McKnight CLA 2007-10-25 15:43:25 EDT
At the service layer, if we don't considate the current APIs into one, we might have something like this (mind the naming convention i.e. "EnMasse" as I couldn't think of a better name):
		
	/**
	 * @param remoteParents - the list of remote parents
	 * @param names - the list of file names
	 * @param monitor the monitor for this potentially long running operation
	 * @return the host files given the parent paths and file names.  This is basically a batch version of getFile().
	 *     Must not return <code>null</code>, non-existing files should be
	 *     reported with an IHostFile object where {@link IHostFile#exists()}
	 *     returns <code>false</code>.
	 * @throws SystemMessageException if an error occurs. 
	 * Typically this would be one of those in the RemoteFileException family.
	 */
	public IHostFile[] getFileEnMasse(String remoteParents[], String names[], IProgressMonitor monitor) throws SystemMessageException;
	
	/**
	 * @param remoteParents - the names of the parent directories on the remote file 
	 * system from which to retrieve the collective child list.
	 * @param fileFilters - a set of strings that can be used to filter the children.  Only
	 * those files matching the filter corresponding to it's remoteParent make it into the list.  The interface 
	 * does not dictate where the filtering occurs.
	 * @param monitor the monitor for this potentially long running operation
	 * @return the collective list of host files that reside in each of the remoteParents with it's corresponding filter. 
	 * @throws SystemMessageException if an error occurs. 
	 * Typically this would be one of those in the RemoteFileException family.
	 */
	public IHostFile[] getFilesAndFoldersEnMasse(String[] remoteParents, String[] fileFilters, IProgressMonitor monitor) throws SystemMessageException;
  
	/**
	 * @param remoteParents - the names of the parent directories on the remote file 
	 * system from which to retrieve the collective child list.
	 * @param fileFilters - a set of strings that can be used to filter the children.  Only
	 * those files matching the filter corresponding to it's remoteParent make it into the list.  The interface 
	 * does not dictate where the filtering occurs.
	 * @param monitor the monitor for this potentially long running operation
	 * @return the collective list of host files that reside in each of the remoteParents with it's corresponding filter 
	 * @throws SystemMessageException if an error occurs. 
	 * Typically this would be one of those in the RemoteFileException family.
	 */
	public IHostFile[] getFilesEnMasse(String[] remoteParents, String[] fileFilters, IProgressMonitor monitor) throws SystemMessageException;
  
	/**
	 * @param remoteParents - the names of the parent directories on the remote file 
	 * system from which to retrieve the child list.
	 * @param fileFilters - a set of string that can be used to filter the children.  Only
	 * those files matching the filter corresponding to it's remoteParent make it into the list.  The interface 
	 * does not dictate where the filtering occurs.
	 * @param monitor the monitor for this potentially long running operation
	 * @return the collective list of host files that reside in each of the remoteParents with it's corresponding filter. 
	 * @throws SystemMessageException if an error occurs. 
	 * Typically this would be one of those in the RemoteFileException family.
	 */
	public IHostFile[] getFoldersEnMasse(String[] remoteParents, String[] fileFilters, IProgressMonitor monitor) throws SystemMessageException;	

What do you think of that?

Comment 8 David McKnight CLA 2007-10-25 17:41:43 EDT
It would be nice if we could get rid of some APIs.  

One approach would be to change things to always return files and folders (but just filter them out above the subsystem layer).  That would allow us to have this:

At the file service layer (where files and folders are just called "files")
---------------------------
getFile(String remoteParent, String name, IProgressMonitor monitor)
getFiles(String remoteParent, String fileFilter, IProgressMonitor monitor)

..and with the additional methods:

getFileEnMasse(String[] remoteParents, Stringp[ names, IProgressMonitor monitor)
getFilesEnMasse(String[] remoteParents, String[] fileFilters, IProgressMonitor monitor)
  
At the file subsystem layer
----------------------------
getRemoteFileObject(...);
getRemoteFileObjectEnMasse(...);


If we really want to distinguish between files and folders at these levels, then I suppose we could pass a flag down to the service layer like this:

FILES_ONLY = 0 
FOLDERS_ONLY = 1
FILES_AND_FOLDERS = 2



What do you think?


Comment 9 David McKnight CLA 2007-10-31 12:04:35 EDT
From the meeting discussion today, we'll probably do something like this:

IFileService:

  getFile(String remoteParent, String name,...)
  getFileMulti(String remoteParents[], String names[], ...)

  list(String remoteParent, String filter, int fileType, ...)
  listMulti(String remoteParents[], String filters[], int fileType, ...)

And something similar will follow for the subsystem.
Comment 10 David McKnight CLA 2007-10-31 16:59:23 EDT
Created attachment 81784 [details]
patch for file service and subsystem API changes
Comment 11 David McKnight CLA 2007-10-31 17:05:41 EDT
I've committed the patch.  I still need to put in the code for optimizing the queries in the dstore file service.
Comment 12 David McKnight CLA 2007-11-01 16:19:47 EDT
I've added optimizations for the dstore file service such that getFileMulti and listMulti take advantage of sending a batch commands then waiting, instead of waiting for them one at a time.  

I've also added junit tests to the FileSubsystemConsistencyTestCase to exercise the different services using both regular queries and multi queries.

 
Comment 13 Ankit Pasricha CLA 2007-11-03 17:25:18 EDT
Dave,

I took a look at the changes that you made. At the moment, the listMulti methods takes a 'fileType' parameter. However I can only pass in 1 fileType parameter for the entire list of filter strings. I think it would be better for the listMulti methods to take an array of fileTypes, one for each filter string. 
What do you think?
Comment 14 David McKnight CLA 2007-11-05 09:56:50 EST
I guess it makes it more consistent and flexible for us to pass in a fileType per parent/filter.  The only annoyance may be that users of the API might have to take an extra step before each call to construct the array like this:

	int[] types = new int[remoteFiles.length];
	for (int t = 0; t < remoteFiles.length; t++)
	{
	   types[t] = IFileServiceConstants.FILE_TYPE_FILES_AND_FOLDERS;
	}

I'll change the API accordingly.


				
Comment 15 Martin Oberhuber CLA 2007-11-05 11:20:25 EST
If you're afraid of the extra step of creating the argument, the API could also expose predefined constants for those kinds of filters which are the same for all arguments:

interface IFileServiceConstants {
    final int[] FILE_TYPE_FILES_AND_FOLDERS_ALL = new int[0];
    final int[] FILE_TYPE_FILES_ALL = new int[0];
    final int[] FILE_TYPE_FOLDERS_ALL = new int[0];
}

these int[] objects can be passed into the parameter,
the AbstractFileService#listMultiple() method can test for them easily:

  if (typeArray.length==0) {
     int realType = IFileServiceConstants.FILE_TYPE_FILES_AND_FOLDERS;
     if (typeArray == IFileServiceConstants.FILE_TYPE_FILES_ALL) {
         realType = IFileServiceConstants.FILE_TYPE_FILES;
     } else if (typeArray == IFileServiceConstants.FILE_TYPE_FOLDERS_ALL) {
         realType = IFileServiceConstants.FILE_TYPE_FOLDERS;
     }
     realTypeArray = new int[parentDirs.length];
     for(int i=0; i<parentDirs.length; i++) {
        realTypeArray[i] = realType;
     }
     typeArray = realTypeArray;
  }

the other option would be to allow a variant of the listMultiple() method which takes a single int instead of an int[]. Which would mean one additional API method more, but 3 additional API constants less.

I guess it depends on expected API usage. Will those clients which we think intend to use the listMultiple() API be happy with having to fill the array themselves?
Comment 16 Martin Oberhuber CLA 2007-11-05 11:21:57 EST
The other thing is, I'm not yet quite happy with the naming (getFile / getFileMulti). I think I'd prefer either

getFile() / getFiles() / listFolder() / listFolders()

or

getFile() / getFileMultiple() / list() / listMultiple()
Comment 17 David McKnight CLA 2007-11-05 11:39:07 EST
The problelm with:

getFile() / getFiles() / listFolder() / listFolders()

is that there is some ambiguity to the meaning.  A name like getFiles() may imply getting a list of a results similar to that of the list methods while a name like listFolders() may imply listing only the folders that are within a folder.

I'm like the suggested naming for these though:

getFile() / getFileMultiple() / list() / listMultiple()

As for the fileType(s) arguments I think I prefer the idea of just adding another API that isn't the array - it would seem more natural to program against that.

Comment 18 Martin Oberhuber CLA 2007-11-05 11:52:17 EST
> As for the fileType(s) arguments I think I prefer the idea of just adding
> another API that isn't the array - it would seem more natural to program
> against that.

I agree, but what sort of clients programs against the ...Multiple() API anyways? They would typically need to go to some lengths of computing the proper arguments for the Multiple() calls anyways, and also demultiplex the results.

Such a client is inherently complex, such that enforcing the additional complexity of computing the int[] array is comparatively small.

An additional API method or API constants, on the other hand, hurt all clients since it makes the API more complex.

I'm thus in favor of keeping the current int[] API only. No additional constants, no additional API methods.
Comment 19 David McKnight CLA 2007-11-05 12:13:33 EST
(In reply to comment #18)
> > As for the fileType(s) arguments I think I prefer the idea of just adding
> > another API that isn't the array - it would seem more natural to program
> > against that.
> I agree, but what sort of clients programs against the ...Multiple() API
> anyways? 

My answer would be those clients that want a performance boost.  For example, with my junit tests, I've found a 25% improvement in performance for dstore multi queries vs the equivalent multi calls to regular queries.  The benefit of the added APIs (i.e. int fileType as well as int[] fileTypes) are that it would make consumers less resistent to try them.     

> They would typically need to go to some lengths of computing the
> proper arguments for the Multiple() calls anyways, and also demultiplex the
> results.

For some clients, demultiplexing may not be necessary - I believe the originator for this enhancement just needs the raw results.  For other consumers using FileServiceSubSystem.listMultiple() the task of demultiplexing the results is already solved in that each parent IRemoteFile stores it's corresponding results in it's cache via IRemoteFile.setContents().


Comment 20 Martin Oberhuber CLA 2007-11-14 08:50:46 EST
This is not a breaking API change for clients that extend AbstractFileService rather than implementing IFileService directly.

However, the deprecated methods
  getFiles()
  getFolders()
  getFilesAndFolders()
may be removed soon in favor of the new list() API.
Comment 21 Martin Oberhuber CLA 2007-11-23 05:09:02 EST
Bug 209552 deals with removal of the old deprecated API for listFiles(), listFolders() and listFoldersAndFiles(). 

Bug 204307 deals with the required chang in API semantics for listFolders().