Bug 256355 - [metadata] query() forces returning the argument
Summary: [metadata] query() forces returning the argument
Status: RESOLVED FIXED
Alias: None
Product: Equinox
Classification: Eclipse Project
Component: p2 (show other bugs)
Version: 3.5   Edit
Hardware: PC Windows Vista
: P3 enhancement (vote)
Target Milestone: 3.6 M5   Edit
Assignee: Ian Bull CLA
QA Contact:
URL:
Whiteboard:
Keywords: api
Depends on:
Blocks: 256418
  Show dependency tree
 
Reported: 2008-11-24 18:14 EST by Jeff McAffer CLA
Modified: 2009-12-08 02:10 EST (History)
5 users (show)

See Also:


Attachments
org.eclipse.equinox.p2.metadata.query.patch (17.98 KB, patch)
2009-10-30 16:04 EDT, Ian Bull CLA
no flags Details | Diff
mylyn/context/zip (56.35 KB, application/octet-stream)
2009-10-30 16:04 EDT, Ian Bull CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jeff McAffer CLA 2008-11-24 18:14:42 EST
The current JavaDoc on IQueryable.query() says
	 * @return The collector argument

This seems overly restrictive as it prevents people from providing their own implementation of Collector that is better tuned for their queryable.
Comment 1 John Arthorne CLA 2008-11-24 21:01:02 EST
Hmm, the idea here is that the client performing the query could pass in some customized Collector subclass to do additional filtering, etc. For example a client could pass in HasMatchCollector which short-circuits on the first match, or LatestIUVersionCollector, which only collects the IU with the highest version. 

I'm not sure how we can support both custom collectors for clients to pass in, as well as the queryable being able to return a different collector from the one passed in. I suppose it could work if the queryable returned a wrapper around the collector parameter, so that the client's collector eventually receives the query result.
Comment 2 Jeff McAffer CLA 2008-11-24 22:01:59 EST
Yes, the wrapper idea feels like a reasonable way to go.  in the end the collector /query stuff was to allow for optimized repo implementations.  As it is, with everyone using Collector and Collector being ArrayList based, things are going to get stashed in memory.  Allowing query() to return a different Collector is a start down the path to more flexibility.
Comment 3 Ian Bull CLA 2008-11-25 01:28:56 EST
John,

Wouldn't the client need to pass a special query (hasMatchQuery) to short circuit on this, not a special collector?  For example, in the standard query class, the perform operation is:

	public Collector perform(Iterator iterator, Collector result) {
		while (iterator.hasNext()) {
			Object candidate = iterator.next();
			if (isMatch(candidate))
				result.accept(candidate);
		}
		return result;
	}

Which completely ignores the java doc that says if accept returns false, we should short circuit.  (Since the result of accept is never checked, this will keep adding candidates until they are all checked).    
Comment 4 Pascal Rapicault CLA 2008-11-25 08:51:17 EST
See also bug #256411
Comment 5 John Arthorne CLA 2008-11-25 11:48:52 EST
Re comment #3: that's a bug
Comment 6 Ian Bull CLA 2008-11-25 12:06:33 EST
(In reply to comment #5)
> Re comment #3: that's a bug
> 
Yap, it's clearly a bug with the current JavaDoc, but what about the idea that the caller of Repo uses a Query to control intelligence (getFirstMatchQuery), and the Repo itself uses a custom collector to control access to the items?  

Although this will introduce the same issues you brought up in But #209992:

<snip>
- Returning an Iterator result is problematic. This makes it very difficult for
an IQueryable implementation to make future optimizations because the client
may hang onto the Iterator for unbounded periods of time. Say for example an
IQueryable wanted to stream IUs from a network connection or from disk - it
would have no way of knowing when the client is done with the iterator and when
the sockets/files can be safely closed. Also, it is difficult to make a method
returning Iterator thread-safe. What happens if the IQueryable changes while a
client is iterating? Another minor inconvenience is that an iterator cannot be
reset. The client must perform the query again to obtain a fresh iterator, or
copy the results into a collection.
</snip>

But if we want to support a lazy collection (call it an iterator if want), it will need to address these specific issues.

Comment 7 Ian Bull CLA 2008-11-28 19:01:29 EST
I would like to take a stab at this bug.

Here are the parameters I see:
- The Collector returned must wrap (or at least call accept) on the collector provided by the caller

- The collector must provide a lazy way to access elements (using something like hasNext() and getNext())

- The collector must support getCollection and getArray. These methods are discouraged, as they don't take advantage of the lazy load, they are required!

- The collector should (must?) support some sort of dispose / close / finish operation.  This method should be called by the client when they no longer need the object (to allow the collector to close streams, etc...).  

- I would also like to propose we remove the size() method.  Of course this is up-for-debate, but I want callers to either use hasNext() / next() or if they really need all the results, then they get a collection.  (It would be a shame if someone created a loop and used size(), but then used next() to get the items).

Thoughts?
Comment 8 Pascal Rapicault CLA 2008-12-03 11:44:45 EST
Ian, feel free to ahead on that.
Comment 9 John Arthorne CLA 2008-12-03 14:40:42 EST
I suspect requiring clients to close/dispose will be the most difficult change here.
Comment 10 Pascal Rapicault CLA 2008-12-03 16:40:03 EST
I don't like the model of explicit dispose / close esp since collectors can be passed around and noone really has a good visibility on that. I would rather use a softreference cache approach handled by the repository.
Comment 11 Pascal Rapicault CLA 2008-12-03 21:54:35 EST
Ian, re-reading the discussions it seems that the change will be pretty broad. Please describe the set of changes that will be made before going forward with them and make sure that these also address the concerns of bug #256411. Thx.
Comment 12 Ian Bull CLA 2008-12-03 22:24:48 EST
I'm not a fan of the explicit dispose either.  However, if we want to support fetching from across the wire, in a lazy way, with the ability to restart the iteration and then pass the collector around, we will need some sort of lifecycle model (so the repository knows when the collector is done, and the stream can be closed).  I guess the finalize method could be used, but I've been discouraged from using this before.  You probably have more knowledge about that then me.

(In reply to comment #11)
> Ian, re-reading the discussions it seems that the change will be pretty broad.
> Please describe the set of changes that will be made before going forward with
> them and make sure that these also address the concerns of bug #256411. Thx.
> 

Yap.  We're still pretty early in the investigation.  I just got my custom repository working, so now I can look and see how broad this change will be.  And yep, I'm watching bug #256411. 
Comment 13 Pascal Rapicault CLA 2008-12-05 15:39:18 EST
The soft reference approach is something that we are using in other places in eclipse (eg extension registry, repo mgr) and works relatively well. With it you have the ability to have a callback when reference is being disposed (not using the finalize) which gives you the ability to do the cleanup.
Comment 14 Ian Bull CLA 2008-12-05 16:58:08 EST
(In reply to comment #13)
> The soft reference approach is something that we are using in other places in
> eclipse (eg extension registry, repo mgr) and works relatively well. With it
> you have the ability to have a callback when reference is being disposed (not
> using the finalize) which gives you the ability to do the cleanup.
> 
Sorry Pascal if this is an obvious question (maybe you can just point me at the existing code), but how does this give me the ability to do cleanup?  i.e. how do I know the object is being disposed if we don't use some sort of dispose / i'm done with this object, method?

I thought the soft reference just allows the GC to collect the object if there are no hard references (and it needs the memory).
Comment 15 John Arthorne CLA 2008-12-05 17:06:57 EST
I think Pascal meant weak references. You would maintain a weak reference to the collector object returned to the client. Once the collector is only weakly reachable (the client no longer refers to it) it will be added to the ReferenceQueue associated with the weak reference. You would have a thread processing the reference queue and performing any associated cleanup for the object that is no longer used by the client.
Comment 16 Ian Bull CLA 2008-12-05 17:37:38 EST
(In reply to comment #15)
> I think Pascal meant weak references. You would maintain a weak reference to
> the collector object returned to the client. Once the collector is only weakly
> reachable (the client no longer refers to it) it will be added to the
> ReferenceQueue associated with the weak reference. You would have a thread
> processing the reference queue and performing any associated cleanup for the
> object that is no longer used by the client.
> 

Wow!  I've never actually use the weak/soft reference stuff before, this is pretty cool.  I just put together a small example to see how it works. It still assumes that the GC will run.  Is this something we can always expect? Or do we also put some clean-up code in the bundle shutdown methods?


Comment 17 Pascal Rapicault CLA 2008-12-05 22:07:42 EST
By knowing the memory consumption of eclipse in general I think we'd better assume that there is a GC that runs :) Note though that I would not expect this mechanism to leave in p2 itself but instead leave in the DB backed repository.
Comment 18 John Arthorne CLA 2008-12-08 08:59:49 EST
Clean-up during bundle shutdown is good practice. A client could be still holding a reference, which would prevent the reference from becoming weakly reachable, so explicit cleanup during shutdown makes sense.
Comment 19 Ian Bull CLA 2008-12-21 23:56:03 EST
I spent part of the weekend playing with this code, and I understand the collector problem now.  There are a few collectors that require the entire collection to be processed before anything can be returned.  There are even a few "renege" collectors, that is, collectors that "accept" an element, and then later remove them.  This cannot be used in a lazy fashion!

I have a proposal that should help maintain backwards compatibility, and support lazy collection (where possible).

1. Have collector implement iterator (or maybe enumerator).
2. Add addAsynchronousQuery(Iterator data, Query query) to the Collector.  
3. Add asynchronousPerform to Query:  this defers the perform to the collector

<code>
public final Collector asynchronousPerform(Iterator iterator, 
                                           Collector result) {
  return result.addLazyQuery(iterator, this);
}
</code>

4. Implement hasNext() / next() as follows:
  a. Check if anything is left in the collection (those added by accept)
  b. If there is an Asynchronous Query check hasNext() / next() for the next element
  c. if toCollection / toArray is called, process the entire asynchronous query to collecto everything

also, since there are a few specialized collectors that do not support "asynchronous collection", I added a flag (supportsLazy) which default is true.  In this cases, the specialized versions sets this to false.  If a subclass tries to access the collection (to add or remove from it manually) they must have supportLazy == false.

This proposal allows repositories or other IQueryables to seed the collector with an iterator, which of course can be used in an lazy manner.  I have added a postPerform callback that can be used to close streams, etc...

The API contract for this iterator is as follows:  If hasNext() returns true, then next() will return a valid element.  this means that all the work actually happens in the hasNext().  

If it is easier, I can post the code for this.


Comment 20 Susan McCourt CLA 2008-12-23 11:18:35 EST
(In reply to comment #19)
> I spent part of the weekend playing with this code, and I understand the
> collector problem now.  There are a few collectors that require the entire
> collection to be processed before anything can be returned.  There are even a
> few "renege" collectors, that is, collectors that "accept" an element, and then
> later remove them.  This cannot be used in a lazy fashion!

I think that most collectors that access the list are working around the lack of some more intelligent queries.  The attempt is to perform "query-ish" things while already iterating the candidates so that the returned results don't have to be processed a second time.

I think if we make a list of the queries that behave in this manner and implement them on the receiver side, then you could probably get away with a pluggable acceptor.  You could even argue that the wrapping stuff done in accept methods could be done back on the client side, but then you subvert optimizations that could be made when the query is not lazy (by forcing the receiver to iterate on the results again).  

I would prefer a solution that also looks at these queries rather than having a modal collector.  

The queries that the UI needs that rely on collector processing today:

- latest version of a particular IU
- uncategorized elements  (this one is particularly brutal because first you have to collect everything that's categorized and then iterate again for the ones that you didn't see before).


Note this has been discussed in other bugs, needed for other reasons, so I think the real solution lies in doing this first and then figuring out what collectors need.  See bug 256412 and bug 227675 comment 12.
Comment 21 Ian Bull CLA 2009-10-30 14:53:00 EDT
We are almost in a position to completely remove the collector argument to the IQueryable.  Collectors are used for 3 purposes:

1. Logic of what to accept
2. Short Circuit 
3. Control how the results are stored (how duplicates are handled).

I would argue that #1 should not happen and instead a query should be used. As for #2 and #3, I propose we change the IQueryable interface to be

Collector IQueryable#query(IQuery query, int numberOfResults, boolean suppressDuplicates)

The queryable would then be free to construct it's own "collector" or whatever if wants to return the results. This will allow Queryables to provide more specific query mechanisms for their particular data store. (Using the non opaque queries)

We are very very close to being able to do this today.
Comment 22 Ian Bull CLA 2009-10-30 16:04:05 EDT
Created attachment 150977 [details]
org.eclipse.equinox.p2.metadata.query.patch

If we choose to remove the collector argument from the IQueryable#query and replace it with the proposed API, we need an easy way for implementors of IQueryable to create standard collectors.

This patch provides a CollectorFactor for creating collectors.
Comment 23 Ian Bull CLA 2009-10-30 16:04:10 EDT
Created attachment 150978 [details]
mylyn/context/zip
Comment 24 John Arthorne CLA 2009-10-30 19:25:28 EDT
(In reply to comment #21)
> Collector IQueryable#query(IQuery query, int numberOfResults, boolean
> suppressDuplicates)

I don't really like this signature, because it replaces something very generic and flexible with something that supports only very narrow use cases. This supports the form of short-circuit that says "Give me the first N elements", but not more generic short-circuits like "Give the first blue element", or "Keep giving me elements until you hit a blue one". Another option would be some kind of simple visitor object that could "visit" each result as it becomes available and have the option to short-circuit the query. Null could be passed in simple cases where it wasn't needed.

Also, I noticed you dropped the progress monitor - don't you view queries as something potentially long running? If there is any possibility of contacting a server it would need a monitor for reporting progress and to support cancelation.
Comment 25 Ian Bull CLA 2009-10-30 22:07:33 EDT
(In reply to comment #24)
> (In reply to comment #21)
> > Collector IQueryable#query(IQuery query, int numberOfResults, boolean
> > suppressDuplicates)
> 
> I don't really like this signature, because it replaces something very generic
> and flexible with something that supports only very narrow use cases. This
> supports the form of short-circuit that says "Give me the first N elements",
> but not more generic short-circuits like "Give the first blue element", or
> "Keep giving me elements until you hit a blue one". Another option would be
> some kind of simple visitor object that could "visit" each result as it becomes
> available and have the option to short-circuit the query. Null could be passed
> in simple cases where it wasn't needed.

Thanks for the feedback John.  This is definitely a first shot at this API, so the feedback is much appreciated.  

The whole idea behind removing the collectors is to move the "query logic" into the query.  For example, if you wanted to "get elements until you hit a blue one", you would write this as a query.  The big advantage is that those implementing IQueryable are free to create their own queries that do the same logic, and don't have to worry about running each element in their set through a visitor. (In a large DB this might be infeasible).  

I could argue that both numberOfResults and suppressDuplicates could be removed from signature, and the queries should implement these things. The reason I put these two arguments in the signature is because they appear to be the two reasons people use custom collectors, and they are likely the types of things IQueryable implementors could optimize for.


> 
> Also, I noticed you dropped the progress monitor - don't you view queries as
> something potentially long running? If there is any possibility of contacting a
> server it would need a monitor for reporting progress and to support
> cancelation.

I certainly did not mean to remove the progress monitor.  Thanks for picking up on this.
Comment 26 John Arthorne CLA 2009-11-02 10:17:42 EST
I went through a few iterations of this in my head this weekend (not in HEAD ;)

First I realized as you mention that 'find the first blue element', etc, are really just queries. I then thought maybe PipedQuery could implement this logic by passing in two separate queries ("select blue items", and "give me the first one"). I thought PipedQuery could pass the items through the pipe lazily so that if the Nth query stopped asking for items we would no longer provide items to query N-1. This didn't work because certain kinds of queries need to see everything before they can return the correct answer (like "find greatest").

Next I thought, perhaps the "result size" and "allow duplicates" could be generic properties that could be added to any query. I.e., I could create an InstallableUnitQuery, then call query.setProperty(MAX_RESULTS, "5"). This would be quite powerful, but would require all IQueryable implementations to know about these properties and their meaning. So, we would never be able to add more properties in the future because existing IQueryables wouldn't process them correctly.

My current line of thinking is that we could create a new query type, something like "LimitQuery", much like we have the MatchQuery. PipedQuery could optimize for the case where there is a MatchQuery followed by a LimitQuery (stop passing inputs to the MatchQuery once the LimitQuery says it is done). This fullfills the general Query contract so all existing IQueryables would continue to work. And, it allows the IQueryable API to simply become:

  query(Query query, IProgressMonitor monitor);

I haven't implemented this, but LimitQuery would look something like this:

public abstract class LimitQuery implements Query {

	public Collector perform(Iterator iterator, Collector result) {
		while (iterator.hasNext()) {
			Object next = iterator.next();
			result.accept(next);
			if (isDone(next))
				return result;
		}
		return result;
	}

	/**
	 * Returns whether the given candidate should be the final result
	 */
	public abstract boolean isDone(Object candidate);

}

Alternatively, instead of being abstract, LimitQuery could by default take an integer limit on the size of the result, and subclasses could override to perform more complicated short-circuit conditions.

Does that make sense? What I'm trying to find is a solution that not only satisfies our existing usage, but is general and flexible enough that it might also handle tomorrow's usage as well. Plan B is that we just switch over to SQL ;)
Comment 27 Ian Bull CLA 2009-11-02 13:48:06 EST
Plan B it is then :)

Yes, the idea of a limit query would work, or basically a query that includes the "when to stop visitor".  Currently this is part of the collector, but it could be designed like you have it.

The problem with this particular use case is that there is no guarantee (that I know of) on the order in which the items will be processed by a query.  So the idea of using a collector / query to continue finding items until the first blue one is found, would not necessarily produce consistent results.   The first time you run it, the blue item might be last, but the second time it might be first.  What would be more likely would be:

Order the elements as they appear in a rainbow, and continue until the first blue one is found.  But this is really a query.

While this example is getting a little abstract, I think it actually demonstrates some of the problems with the current approach too.
Comment 28 John Arthorne CLA 2009-11-02 14:47:55 EST
I don't think clients can ever rely on query order. If they want the first "N" elements that match a particular query, they won't necessarily get a consistent answer across multiple queries, and that's just something they have to live with. If they really care which "N" elements they get, the only option is to query over everything and then process the result (not short-circuiting at all).
Comment 29 Ian Bull CLA 2009-11-09 22:32:26 EST
I explored moving these two parameters, number of results and set vs. list, to the query, and while it can be done I'm not sure it's the best approach for two reasons.

1. In a few places we reuse queries (like the InstallableUnitQuery.ALL), and adding parameters like number of results means we either have to do this at instantiation time, or make these parameters mutable.  The other option is we combine the queries (ALL + get first N), however this leads into problem #2.

2. Combining queries that define different cardinalities leads to all sorts of headaches.  For example, AND the results of queries where some want red balls, some want white balls, some are collecting all elements, while others only want the first 2.  Some want to be collected as sets while others lists.

Cardinality and Set/List are really parameters on a particular run of a query. These really need to be fed into the query at execution time. This time collect white or red balls, return as a set and stop when you get 1.

If adding these things to the IQueryable#query API is too restricting, we could do something like

IQueryable#query(IQuery, Map, IProgressMonitor).

The map would define these execution time properties.  For now, NUMBER_OF_RESULTS, and SUPPRESS_DUPLICATES would be defined.

This works well with the transparent queries we are also taking about, since these properties can easily be inspected and optimized for.
Comment 30 Ian Bull CLA 2009-12-08 02:01:19 EST
Fixed in the branch.  We now have the following API for IQueryable

IQueryable#query(IQuery query, IProgressMonitor monitor).
Comment 31 Ian Bull CLA 2009-12-08 02:10:59 EST
Marking 3.6 M5.