Bug 373044 - [transport] Implement support for HTTP caching shared by all installations and instances on a system
Status: NEW
Alias: None
Product: Equinox
Classification: Eclipse Project
Component: p2
Version: 3.8.0 Juno
Hardware: All All
Importance: P3 enhancement
Target Milestone: ---
Assignee: P2 Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords: helpwanted
Depends on:
Blocks: 381598
Reported: 2012-03-02 02:17 EST by Gunnar Wagenknecht CLA
Modified: 2015-10-24 14:26 EDT
CC List: 11 users

Description Gunnar Wagenknecht CLA 2012-03-02 02:17:08 EST
Today p2 relies on timestamps to detect if a metadata file needs to be downloaded again or not ([composite] artifact/content xml/jar, p2.index).

In addition to the timestamp logic, the p2 HTTP transport should also respect the HTTP protocol. HTTP lets a server signal cacheability, which avoids further client requests until the cache lifetime actually expires.

For the benefit of maximum cache hits, the p2 HTTP transport should use a system-wide internal cache location. The HttpClient already provides in-memory storage, which does not survive application restarts, but it would be a first step in avoiding unnecessary server round-trips while Eclipse is running.

This single cache should be used for all p2 HTTP operations (no matter which profile is involved). Thus, PDE with its target platforms will also re-use it.
Comment 1 Gunnar Wagenknecht CLA 2012-03-02 02:19:03 EST
AFAIK the HTTP transport is currently implemented by ECF. I wonder if ECF already provides support for this out of the box and just needs to be configured. Scott, can you comment?
Comment 2 Scott Lewis CLA 2012-03-02 10:32:22 EST
p2 does use the ECF transport, and the current provider impl is based upon apache httpclient 3.1. 

It's currently possible to get/set request headers for the http request...so if this is all that's necessary to do what you wish then it could be done without further modification to ECF.
Comment 3 David Williams CLA 2012-03-02 10:39:01 EST
Did you mean to set "security advisory" flag?
Comment 4 Gunnar Wagenknecht CLA 2012-03-02 13:48:11 EST
Scott, it's not just HTTP headers. The HttpClient needs to be wrapped into a CachingHttpClient and a suitable cache storage needs to be added.

BTW, I reverted the committer-only group.
Comment 5 Scott Lewis CLA 2012-03-02 13:53:45 EST
(In reply to comment #4)
> Scott, it's not just HTTP headers. The HttpClient needs to be wrapped into a
> CachingHttpClient and a suitable cache storage needs to be added.

Ok...then it would/will take an enhancement to the existing provider, or the creation of a new provider.  Probably not technically difficult...if httpclient already has support for it...but resources and/or contributions required.
Comment 6 Matthew Piggott CLA 2012-03-02 14:15:12 EST
Would the in-memory cache provide any benefit?  It's been a while but I don't remember repositories being unloaded.
Comment 7 Gunnar Wagenknecht CLA 2012-03-02 14:28:27 EST
(In reply to comment #6)
> Would the in-memory cache provide any benefit?  It's been a while but I don't
> remember repositories being unloaded.

p2 still does an HTTP request to verify timestamps. With the in-memory cache those requests could be avoided too (depending on proper cache headers sent by the server).
Comment 8 Matthew Piggott CLA 2012-03-02 14:33:55 EST
Right, but I believe that is only the first time the repository is accessed until shutdown which would still be the case if http client is using an in memory cache.
Comment 9 Gunnar Wagenknecht CLA 2012-03-02 17:33:51 EST
(In reply to comment #8)
> Right, but I believe that is only the first time the repository is accessed
> until shutdown ...

According to bug 347448 comment 23 this is not the case:

> BTW -- for how long does p2 cache these results?  I ran a Check for Software
> updates 3 hours ago, and again just now, and it returned to the server for all
> the goodness above.
Comment 10 Pascal Rapicault CLA 2012-03-02 17:49:27 EST
(In reply to comment #9)
> (In reply to comment #8)
> > Right, but I believe that is only the first time the repository is accessed
> > until shutdown ...
> 
> According to bug 347448 comment 23 this is not the case:
> 
> > BTW -- for how long does p2 cache these results?  I ran a Check for Software
> > updates 3 hours ago, and again just now, and it returned to the server for all
> > the goodness above.

If you read as far as bug 347448 comment 24 you will see that the assertion you refer to is incorrect and that p2 does cache the files, and then perform HEAD requests to get more information.
Comment 11 Pascal Rapicault CLA 2012-03-02 18:05:33 EST
Now back to the original request, could you please describe in detail the situation that will be helped by using http client caching?
Comment 12 Ian Bull CLA 2012-03-02 23:33:51 EST
I was under the impression (and like usual, I could be wrong), that checking timestamps and checking if the HTTP Cache is up-to-date, is the same amount of work. That is, a single request-response cycle is required.  Unless of course we want to maintain a minimum interval between update checks.  However, if we do this, we need to be careful since this might affect fetching of build artifacts in ways we don't expect.
Comment 13 Gunnar Wagenknecht CLA 2012-03-03 03:07:07 EST
(In reply to comment #10)
> If you read as far as bug 347448 comment 24 you will see that the assertion you
> refer to is incorrect and that p2 does cache the files, and then perform HEAD
> requests to get more information.

Pascal, thanks for confirming the additional HEAD requests. Those requests can be avoided by using the caching capabilities offered by the HTTP protocol and that's what this bug report is about.


(In reply to comment #12)
> I was under the impression (and like usual, I could be wrong), that checking
> timestamps and checking if the HTTP Cache is up-to-date, is the same amount of
> work. That is, a single request-response cycle is required.

No. If the response includes an 'Expires' or 'Cache-Control: max-age' header the client can calculate a duration during which further requests to that resource can be avoided. Even once a resource is 'stale', clients can do a conditional GET, which only downloads the file again if it really changed.

http://code.google.com/intl/en/speed/articles/caching.html
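
To make the freshness calculation described in this comment concrete, here is a minimal Python sketch. It is not p2 or ECF code: the function names `freshness_lifetime` and `needs_request` are illustrative assumptions, and header names are assumed to use the exact capitalisation shown.

```python
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def freshness_lifetime(headers: Mapping[str, str]) -> Optional[int]:
    """Seconds a response may be reused without contacting the server.

    Cache-Control: max-age takes precedence over Expires.
    None means the server gave no explicit lifetime.
    """
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        value = value.strip('"')
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    if "Expires" in headers and "Date" in headers:
        try:
            expires = parsedate_to_datetime(headers["Expires"])
            date = parsedate_to_datetime(headers["Date"])
        except (TypeError, ValueError):
            return 0  # an unparsable date is treated as already expired
        return max(0, int((expires - date).total_seconds()))
    return None


def needs_request(headers: Mapping[str, str], age_seconds: int) -> bool:
    """True if the cached copy is stale (or has no lifetime) and must be revalidated."""
    lifetime = freshness_lifetime(headers)
    return lifetime is None or age_seconds >= lifetime
```

With a 48-hour `max-age` (172800 seconds), a copy that is one hour old would not trigger any request at all; once it is stale, a conditional GET is still cheaper than a full download.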
Comment 14 Gunnar Wagenknecht CLA 2012-03-03 03:18:10 EST
(In reply to comment #11)
> Now back to the original request, could you please describe in detail the
> situation that will be helped by using http client caching?

One situation is the many p2.index file requests. Those files change rarely. They can be cached on the client with a very long Expires header. No further requests will be made to fetch them.

Same is true for the many p2 repositories out there. For example, for the release train composite repos it's possible to pick a reasonable Expires header. We do a lot of planning and we know well ahead of time when the repo will change again (eg. SR1, SR2).

Many consumed repos (eg. repos of a specific build) will never change again. That basically allows caching metadata (p2.index, content/artifact) "forever" (eg. years or months). Ideally, when resolving a target platform a second time no requests will ever hit the wire again. The metadata comes out of the HTTP cache.

With Git I can happily work disconnected and off-line. PDE has some issues with resolving target platforms sometimes when disconnected. Such a cache will help here as well. 

Another really big benefit of using a system wide cache location would be to re-use already downloaded metadata between multiple workspaces and/or installations.

There may be a few more benefits/scenarios.
Comment 15 Ian Bull CLA 2012-03-03 20:06:08 EST
(In reply to comment #13)
> (In reply to comment #12)
> > I was under the impression (and like usual, I could be wrong), that checking
> > timestamps and checking if the HTTP Cache is up-to-date, is the same amount of
> > work. That is, a single request-response cycle is required.
> 
> No. If the response includes an 'Expires' or 'Cache-Control: max-age' header the
> client can calculate a duration where further requests to that resource can be
> avoided. Even if a resource is then 'stale' clients can do a conditional GET
> which only downloads the file again if it really changed.
> 
> http://code.google.com/intl/en/speed/articles/caching.html

Thanks Gunnar,

Right now we essentially do the 'conditional get' (we check the timestamp with a single request-response cycle, and if the timestamp has changed, we D/L the new file). What we're missing is the 'Expires' or 'Cache-Control' headers. I do think those are interesting, and would reduce the number of up-to-date checks we do. On the other hand, we need to 1. ensure we can set this per repository, and 2. figure out what the 'right' values are.

Regarding number 1, do you think this would be something set in the content/artifacts.jar file?  Or is there a cache-control file that would live in the repository?

Regarding number 2, this is tricky.  While we only update our main repos 3 times per year (and thus we could set the Expires value to about 4 months), what if we need to push out an emergency fix?  I would caution against setting this to anything longer than about 1 week.
Comment 16 Gunnar Wagenknecht CLA 2012-03-04 01:19:47 EST
(In reply to comment #15)
> Regarding number 1, do you think this would be something set in the
> content/artifacts.jar file?  Or is there a cache-control file that would live in
> the repository.

It's something Denis and his team would have to set for us in the web server configuration.

> I would caution setting this anything
> longer than about 1 week.

A week is fine. My expectation is that even 48 hours would already bring a noticeable effect for Eclipse servers.

-Gunnar
Comment 17 Thomas Hallgren CLA 2012-03-04 02:25:48 EST
(In reply to comment #16)
> > I would caution setting this anything
> > longer than about 1 week.
> 
> A week is fine. My expectation is that even 48 hours would already bring a
> noticeable effect for Eclipse servers.
> 
Why not set the Expires header for the p2.index relatively short (say 24 hours or so) and then go from there? If the timestamp of the p2.index file has changed, then p2 invalidates the HTTP cache for the rest before reading. That would mean that the Expires header for the rest of the repository could be infinite.
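
The scheme proposed in this comment could be sketched roughly as follows. This is only an illustration in Python, not existing p2 behaviour: the `RepositoryCache` class and its `on_index_checked` method are hypothetical names, and the p2.index timestamp is assumed to be an integer obtained from the server.

```python
from typing import Dict, Optional


class RepositoryCache:
    """Hypothetical per-repository cache keyed by file name."""

    def __init__(self) -> None:
        self.index_timestamp: Optional[int] = None
        self.files: Dict[str, bytes] = {}

    def on_index_checked(self, server_timestamp: int) -> bool:
        """Record the p2.index timestamp; drop cached files if it changed.

        Returns True when the cached metadata was invalidated.
        """
        changed = (
            self.index_timestamp is not None
            and server_timestamp != self.index_timestamp
        )
        if changed:
            # Everything else in the repository is only trusted while
            # p2.index is unchanged, so it can carry a very long lifetime.
            self.files.clear()
        self.index_timestamp = server_timestamp
        return changed
```

The point of the design is that only the small p2.index file needs a short lifetime; the larger metadata files are revalidated indirectly through it.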
Comment 18 Gunnar Wagenknecht CLA 2012-03-04 03:53:28 EST
(In reply to comment #17)
> That
> would mean that the expire header for the rest of the repository could be
> infinite.

The Expires header is also understood by proxies in between. Thus, any corporate proxy might then cache the repo indefinitely and never go back to the origin.
Comment 19 Thomas Hallgren CLA 2012-03-04 04:25:46 EST
(In reply to comment #18)
> The Expires header is also understood by proxies in between. Thus, any corporate
> proxy might cache the repo infinitely then and never go back to the origin.

And there's no way to send an HTTP request that will force a get from the origin?
Comment 20 Thomas Hallgren CLA 2012-03-04 04:48:45 EST
There seem to be ways to force the proxy to revalidate. See

http://www.faqs.org/rfcs/rfc2616.html section "14.9.4 Cache Revalidation and Reload Controls"
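
As a rough illustration of the request headers that section describes, the Python sketch below builds (but does not send) a request that asks intermediate caches either to revalidate with the origin (`Cache-Control: max-age=0`) or to reload end-to-end (`Cache-Control: no-cache`, plus `Pragma: no-cache` for HTTP/1.0 caches). The helper name `origin_request` and the example URL are assumptions made for this sketch.

```python
import urllib.request


def origin_request(url: str, reload: bool = False) -> urllib.request.Request:
    """Build a request that bypasses stale proxy copies.

    reload=False: end-to-end revalidation (a 304 is still possible).
    reload=True:  end-to-end reload (caches must not answer from storage).
    """
    if reload:
        headers = {"Cache-Control": "no-cache", "Pragma": "no-cache"}
    else:
        headers = {"Cache-Control": "max-age=0"}
    return urllib.request.Request(url, headers=headers)


# Hypothetical repository URL, used only to show the call.
request = origin_request("http://example.org/repo/p2.index")
```

Whether a given corporate proxy honours these directives is a separate question that would need testing.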
Comment 21 Scott Lewis CLA 2012-03-04 13:50:09 EST
(In reply to comment #13)
> (In reply to comment #10)
> > If you read as far as bug 347448 comment 24 you will see that the assertion you
> > refer to is incorrect and that p2 does cache the files, and then perform HEAD
> > requests to get more information.
> 
> Pascal, thanks for confirming the additional HEAD requests. Those requests can
> be avoided by using the caching capabilities offered by the HTTP protocol and
> that's what this bug report is about.

Is it really just about avoiding the HEAD requests...or is there more to it than that?  Because it doesn't seem likely to me that these requests are responsible for a huge amount of traffic...or is there evidence that in practice there is a lot of traffic generated by these (i.e. client delay and/or server load)?

What I'm getting at is...given development, testing, configuration, tuning costs...do you have some info about what using the http cache will actually accomplish? ...and are there other alternatives that might have same/more 'bang for the buck'?
Comment 22 Gunnar Wagenknecht CLA 2012-03-04 16:55:28 EST
(In reply to comment #21)
> Is it really just about avoiding the HEAD requests...or is there more to it
> than that?

It's not just HEAD. It's also the p2.index file. And just to quote Denis's post to cross-project:

> 250 bytes is small, but when you scale it up to eclipse.org proportions,
> things start getting real.

In this example he was talking about 250 bytes of comments in p2.index files, which translates to 1.4GB of internet transfer for eclipse.org PER DAY.
Comment 23 Thomas Hallgren CLA 2012-03-04 17:34:30 EST
(In reply to comment #22)
> > 250 bytes is small, but when you scale it up to eclipse.org proportions,
> > things start getting real.
> 
> In this example he was talking about 250 bytes of comments in p2.index files
> which translates to 1.4GB of internet transfer for eclipse.org PER DAY.

If it does, then that amounts to 5.6 million repository requests per day since there's only one request for the p2.index file each time the repository is downloaded.

The Indigo repository, with its three versions of > 2MB data is roughly about 7 MB (UI metadata only). 250 bytes of that is roughly 0.0035 percent. Also, the 250 byte comment is extremely likely to go into the same TCP packet as the rest of the very small file and thus, cause no overhead at all. So why are we even discussing this?

I agree with Scott when he requests evidence. I'm fairly certain that no test case will prove that the 250 bytes of comment in the header will make a noticeable difference.

What I'm fairly certain would make a real difference is efficient HTTP caching and to use a different approach when publishing. The vast majority of users will only need the latest version, not a composite where everything is tripled.
Comment 24 Gunnar Wagenknecht CLA 2012-03-05 01:14:48 EST
(In reply to comment #23)
> If it does, then that amounts to 5.6 million repository requests per day since
> there's only one request for the p2.index file each time the repository is
> downloaded.

According to Denis, Eclipse.org received 6,017,804 requests for various p2.index files on March 1st. Out of which 1,765,385 hit /releases (or some folder below).

> What I'm fairly certain would make a real difference is efficient HTTP caching

Ok, sounds like we are in agreement here. That's exactly the point of this bug report.
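
For reference, the figures quoted in the last few comments can be checked with a few lines of Python. The sketch assumes exactly 250 bytes of comment per p2.index response and that the "1.4GB" figure was meant in binary units (GiB); both are assumptions, not statements from the thread.

```python
REQUESTS_PER_DAY = 6_017_804  # p2.index requests on March 1st, per the comment above
COMMENT_BYTES = 250           # size of the comment block under discussion


def daily_overhead_bytes(requests: int, bytes_per_request: int) -> int:
    """Total bytes per day spent on the comment block alone."""
    return requests * bytes_per_request


total = daily_overhead_bytes(REQUESTS_PER_DAY, COMMENT_BYTES)
print(total)                    # 1504451000
print(round(total / 2**30, 2))  # 1.4

# The reverse estimate: 1.4 GB (decimal) at 250 bytes each.
print(1_400_000_000 // COMMENT_BYTES)  # 5600000
```

So the 1.4GB figure and the roughly 6 million daily p2.index requests are consistent with each other.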
Comment 25 Scott Lewis CLA 2012-06-06 12:05:06 EDT
Just to be clear:  

1) I would like more evidence (i.e. p2 test and measurements) about where the real problems are before I personally spend significant effort on this.  As an example of where evidence can show surprising things when it comes to efficiency and transport...please see https://bugs.eclipse.org/bugs/show_bug.cgi?id=297742#c49

2) If we come to agreement that http caching will make a real difference for p2, then from my chair as project lead, resources for doing this are the only real problem.  Someone (else) has to step up here.  If corp members and the foundation really want it, then please make it happen.  We (ECF) cannot shoulder this one ourselves simply because we would like to.
Comment 26 Denis Roy CLA 2012-06-11 09:58:52 EDT
> Just to be clear:  
[snip]
> 2) If we come to agreement that http caching will make a real difference for
> p2, then from my chair as project lead, resources for doing this is the only
> real problem.  Someone (else) has to step up here.  If corp member and
> foundation really want it, then please make it happen.  We (ECF) cannot
> shoulder this one ourselves simply because we would like to.


In other words, a key piece of technology, which is used by the entire Eclipse.org community and whose functionality can potentially affect the way Eclipse.org operates, was put into Eclipse, and now there are little to no resources available for improving or maintaining it?
Comment 27 Scott Lewis CLA 2012-06-11 10:26:44 EDT
(In reply to comment #26)
> 
> In other words, a key piece of technology, which is used by the entire
> Eclipse.org community and whose functionality can potentially affect the way
> Eclipse.org operates, was put into Eclipse, and now there are little to no
> resources available for improving or maintaining it?

What we're discussing on this bug is an enhancement, so I wouldn't call this maintenance.  There are and have been resources for maintaining (and actually enhancing) the filetransfer part of p2...that ECF is responsible for...and I know this has been and is true for p2 as well.  WRT ECF, I know this for sure, as I'm basically speaking of myself and other ECF contributors/committers.  If you would like evidence, then please consult

https://bugs.eclipse.org/bugs/buglist.cgi?list_id=1913139;classification=RT;query_format=advanced;bug_status=RESOLVED;bug_status=VERIFIED;bug_status=CLOSED;component=ecf.filetransfer;product=ECF

However, I am in essential sympathy for your point...that p2 (and by extension ECF) represents a key piece of technology for the community...and both p2 and ECF are starved for resources to address this and other issues that may affect everyone (since everyone uses install/update and eclipse.org).  

If you, p2, contributors, the Eclipse Foundation, interested parties on this bug, or any of the member companies can provide such resources then there's no problem at all (we're not talking man-years here, btw).  For this enhancement, I personally can't currently commit to providing the resources by myself (I will, of course, participate and assist)...and myself is all that I can currently speak for.

Believe me, I don't like that situation...but other than marshaling as many resources from the community as I can...by myself...as was done for bug 297742 and others...I don't know what to do about it.  I'm open to ideas.
Comment 28 Denis Roy CLA 2012-06-11 11:02:04 EDT
Thanks, Scott.  I understand the situation.

If nothing else, at least there's no shortage of sympathy anywhere  :-)
Comment 29 Scott Lewis CLA 2012-06-11 14:36:27 EDT
(In reply to comment #28)
> Thanks, Scott.  I understand the situation.
> 
> If nothing else, at least there's no shortage of sympathy anywhere  :-)

Yes...my/our position is sadly not uncommon.  But unfortunately sympathy doesn't pay for bandwidth...or for committer time.

What's needed, I believe...is a sympathy-to-contribution conversion utility :).  Seriously...it seems to me that bugzilla could help out there.
Comment 30 Scott Lewis CLA 2013-03-12 16:39:24 EDT
For everyone's info.

For p2 Kepler M6, ECF has contributed a new filetransfer provider based upon Apache httpclient 4  ...see bug 337449.   

Apache Httpclient 4 has built in support for http caching and what they call 'conditional compliance' with RFC-2616:

http://hc.apache.org/httpcomponents-client-ga/tutorial/html/caching.html

The new ECF provider that uses httpclient4 is here:

http://git.eclipse.org/c/ecf/org.eclipse.ecf.git/tree/providers/bundles/org.eclipse.ecf.provider.filetransfer.httpclient4

As yet, this provider does not use the built-in caching.  Before working on new features I want to be very certain that the new provider meets all the existing Eclipse/p2 use cases (e.g. proxies, credentials handling, etc).

If this testing goes reasonably well over the next few months...and we get some assistance with testing and debugging this new provider, we would be in a position to look at adding support for http caching.  We will, however, have to ask for some development (and testing) contributions...probably from people cc'd on this bug.  I will commit to supporting any effort, but I still can't commit to lead it (given the commitment I communicated above).

I would like to see a community effort started at EclipseCon.  I am not able to attend the conference this year, so I would appreciate someone taking on marshaling resources for this effort and communicating about it on this bug.
Comment 31 Pascal Rapicault CLA 2013-03-12 21:49:11 EDT
Scott before you and others start to invest some time in adding caching to ECF httpclient4, the impact of such change on p2 needs to be understood. Is it transparent for p2, does the cache need to be flushed, would all downloads be cached, etc. 

Until we have a general understanding of this, I think it is too premature to start any coding.
Comment 32 Scott Lewis CLA 2013-03-12 21:59:37 EDT
(In reply to comment #31)
> Scott before you and others start to invest some time in adding caching to
> ECF httpclient4, the impact of such change on p2 needs to be understood. Is
> it transparent for p2, does the cache need to be flushed, would all
> downloads be cached, etc. 
> 
> Until we have a general understanding of this, I think it is too premature to
> start any coding.

Hi Pascal.  Sure...I agree.  I'm not going to start any coding until we coordinate with the p2 team.  

My own reading of the httpclient 4 cache docs (link) is that it aims to be a transparent cache...which does have to be understood relative to the existing p2 mechanisms before being introduced.  So don't worry :)...I won't start work on it or anything.  I just wanted to let all know on this bug...given that EclipseCon is approaching it might be a good time to rally resources and/or discuss how it could be used.

Also...I expect that if the cache mechanisms in httpclient 4 are eventually used, we (ECF) will want to at least provide some new API for configuring and controlling the cache...as needed for p2...but also possibly other clients.
Comment 33 Scott Lewis CLA 2014-01-29 14:57:59 EST
(In reply to Scott Lewis from comment #32)

> Also...I expect that if the cache mechanisms in httpclient 4 are eventually
> used, we (ECF) will want to at least provide some new API for configuring
> and controlling the cache...as needed for p2...but also possibly other
> clients.

Just revisiting this...ECF's httpclient4 (with http caching) is now the primary provider for p2...and we/ECF would be quite willing to add any desired cache config/management API to access the httpclient4 cache control at runtime/filetransfer time.

If we are going to do this, however, I would like some use case input...and/or discuss with the p2 folks to see how it might benefit p2/Eclipse.

For everyone's reference, here are the docs I've found for the HttpComponents caching:   https://hc.apache.org/httpcomponents-client-ga/tutorial/html/caching.html
Comment 34 Alex Blewitt CLA 2014-05-02 06:56:51 EDT
Note that ECF intentionally disables HTTP caching as noted in bug 410813 (caused by a 'fix' to disable caching in bug 249990).
Comment 35 Scott Lewis CLA 2015-03-06 10:31:17 EST
Gunnar:  Does your changing the subject text on this bug, as well as the recent activity on bug 461501 and bug 381598 convey that there is some renewed move afoot to use httpclient4-based caching for p2?

As per my previous comments on this bug, from my chair supported dev time remains the only barrier to this work.  There will have to be resources from both ECF and p2 technical perspectives, however, since comment 31 is still accurate/correct.
Comment 36 Scott Lewis CLA 2015-10-24 14:26:34 EDT
FWIW:   In bug 477524 there has also recently been an enhancement request to add an OkHttp-based provider.   Although I'm not certain, my guess is that OkHttp would not...out of the box...support http caching as httpclient4 does, but perhaps there could be some coordinated effort to create a small adapter API for cache mgmt/control, in addition to implementing via the httpclient4-based provider.

Just to be clear, as I see it this would consist of:

1) Figuring out what kind of cache control would be needed/best usable by p2 and Eclipse.
1a) Looking at the existing httpclient4 cache control APIs
2) Designing a small cache control API based upon 1 and implementing it as an adapter to the existing filetransfer API.
3) Implementing 2 on httpclient4 (1st) and any other providers (urlconnection, OkHttp?)
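
Purely to illustrate what a "small cache control API" from step 2 might look like, here is a hypothetical sketch in Python. None of these names exist in ECF or p2: `CacheControl`, `InMemoryCacheControl`, and the methods `set_enabled`, `evict` and `flush` are assumptions chosen for the example, and a real design would come out of step 1.

```python
from abc import ABC, abstractmethod
from typing import Dict


class CacheControl(ABC):
    """Hypothetical adapter a filetransfer provider could expose."""

    @abstractmethod
    def set_enabled(self, enabled: bool) -> None:
        """Turn caching on or off for subsequent transfers."""

    @abstractmethod
    def evict(self, uri: str) -> bool:
        """Drop one cached entry; True if something was removed."""

    @abstractmethod
    def flush(self) -> int:
        """Drop all cached entries; returns how many were removed."""


class InMemoryCacheControl(CacheControl):
    """Toy provider-side implementation backed by a dict, for illustration only."""

    def __init__(self) -> None:
        self.enabled = True
        self.entries: Dict[str, bytes] = {}

    def set_enabled(self, enabled: bool) -> None:
        self.enabled = enabled

    def evict(self, uri: str) -> bool:
        return self.entries.pop(uri, None) is not None

    def flush(self) -> int:
        count = len(self.entries)
        self.entries.clear()
        return count
```

Each provider (step 3) would supply its own implementation behind the same interface, so p2 would not need to know which HTTP library is in use.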