Bug 239668 - [net] Support for download stats
Status: RESOLVED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: Cross-Project
Version: unspecified
Hardware: PC Windows XP
Importance: P3 enhancement with 1 vote
Target Milestone: ---
Assignee: Cross-Project issues CLA
QA Contact:
URL:
Whiteboard:
Keywords: helpwanted
Duplicates: 251907
Depends on:
Blocks: 187968
Reported: 2008-07-04 16:53 EDT by Pascal Rapicault CLA
Modified: 2010-06-14 03:04 EDT
Description Pascal Rapicault CLA 2008-07-04 16:53:51 EDT
> Joel Cayne: Do you know of any way for us to obtain any download stats through p2 usage or if there will be some method for this in future releases?

Whether p2 or UM based, the statistics of what is installed "after the initial download" have always been hard to gather. Here are a few reasons why:
 - given that a user could end up installing only a single plug-in, stats have to be gathered at all levels
 - the decision of what needs to be installed is completely made on the client, and once the user has selected something to install, the server where the metadata originated is never contacted again.
 - when the download is performed, the plug-ins are downloaded from multiple repositories (for load balancing and robustness) thus making it hard to know what is going on. That one could be worked around by having each download go through *the* main server which would then do a redirect, but that could be a bottleneck and would break the robustness of mirroring.

That said, I can think of two solutions: 
 - ask all the known mirrors to provide access to their server log and build stats from this. This is what requires the least work for me :)
 - develop a download monitoring bundle that would look at what is getting installed as it is being installed and send that information to a stat server. The way p2 is designed that could be implemented fairly easily on top of the current infrastructure.
Comment 1 Martin Oberhuber CLA 2008-07-07 09:31:15 EDT
Interesting approaches.

While the first one looks more like something an IT expert could do, the second one seems somehow related to the EPP Usagedata project and more versatile, since it could potentially also produce an instance count of plugins installed through 3rd party distros or commercial offerings based on Eclipse.

It even looks like the two approaches are somehow complementary:

  - "What is installed from the known Open Source Channels" vs.
  - "What is installed by people who agree to upload their install stats".

I'm wondering to what extent such download stats / installed instance stats would be interesting to the Eclipse Ecosystem in general? - Adding Wayne on CC.
Comment 2 Pascal Rapicault CLA 2008-10-23 19:15:58 EDT
*** Bug 251907 has been marked as a duplicate of this bug. ***
Comment 3 Alexei Goncharov CLA 2008-10-24 10:20:38 EDT
Hello all.

I beg your pardon, but I do have my own opinion on the situation, and as a part
of the Eclipse Community I won't be shy about sharing it with you.

I think that the problem lies much deeper and the solutions proposed won't help
at all. Let's have a close look at the problem that brought us to this bug,
closing two more as duplicates. Really it relates to the mirroring
infrastructure, and if we do want to solve it we do not need to write more and
more plug-ins. The fact is that the solutions provided by Eclipse know nothing
about the content on the mirrors, and that is the problem that must be solved. It
relates not only to gathering statistics. Even the old update manager proposed
(and even selected automatically) broken mirrors (mirrors with old content,
mirrors without any content, etc.), and the user always had a chance of having
problems with plug-in installation, caused by the broken mirrors he selected.
The only way to solve it is to improve the infrastructure, allowing the
solutions to know the state of the mirrors, so that the update manager won't
even try to connect to a broken mirror, rather than connecting to one, then to
another, another, another, another.....and failing gracefully.

>  - develop a download monitoring bundle that would look at what is getting
> installed as it is being installed and send that information to a stat server.
> The way p2 is designed that could be implemented fairly easily on top of the
> current infrastructure.

Develop another plug-in? Great. And what will you do with ALL the clients if
there is some mistake in this plug-in? Or should each user have to
download nightly builds? Mm?
We have already got p2 (please, don't feel hurt, p2 developers) in Eclipse
3.4. So? It had been under development with a lot of bugs in it even in M7!
Even now I sometimes cannot install plug-in updates with it, because it always
shows me some cached info instead of the update site's real contents. Just today
I could not upgrade Mylyn at all, because p2 told me that the repo is broken
after it SHOWED me the features to update, proposed the update and I clicked
"Install". So to upgrade it I had to download the archived update site. But
the Mylyn 3.0.3 release was not today. It's a well-known fact that if you want to
install something you must wait for it to be replicated on the mirrors.
Developing yet another plug-in won't solve the problem at all, but might bring
more, because the problem discussed is not only related to statistics - it is
wider.

I know a lot of people who are using Europa, because it is (and even was at its
release time) much more stable than Ganymede. So my proposal is to solve the
problems that persist first, and only then provide some new features.

Best regards, Alexei Goncharov.
Comment 4 Mike Milinkovich CLA 2008-10-24 11:14:57 EDT
I believe that the p2 team has aspirations to see p2 be used by the Eclipse Foundation as the infrastructure for shipping future Eclipse releases. I just want to make it clear that we will not be doing that until we get this issue resolved one way or another. There is no way that I could tell the Board and the membership that we are going to use a new technology that resulted in losing the ability to track and report download statistics. That just simply can not happen.
Comment 5 John Arthorne CLA 2008-10-24 11:33:41 EDT
> The only way to solve it is to improve the infrastructure, allowing the
> solutions to know the state of the mirrors, so that the update manager won't
> even try to connect to a broken mirror, rather than connecting to one, then to
> another, another, another, another.....and failing gracefully.

If I understand this sentence correctly, you are recommending exactly what p2 does now. p2 maintains information about each mirror it contacts (MirrorSelector.MirrorInfo class). It remembers the failure count, and the throughput on each mirror. Mirrors are sorted so that mirrors with zero failures are always preferred over failing mirrors. The secondary sort criterion is mirror throughput. It then runs a loop, picking the best mirror and retrying until success or until there are no more mirrors with a failure count of zero (across all downloads). Is this what you were suggesting? If not, please clarify.
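
The selection loop described above can be sketched roughly as follows. This is only an illustration of the stated ordering (zero failures first, then throughput), not the actual p2 code: the Python `MirrorInfo` fields, `pick_best`, `download` and the `fetch` callback are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class MirrorInfo:
    """Hypothetical stand-in for the per-mirror bookkeeping described above."""
    url: str
    failure_count: int = 0
    bytes_per_second: float = 0.0


def pick_best(mirrors: Iterable[MirrorInfo]) -> Optional[MirrorInfo]:
    """Return the fastest mirror among those with zero failures, or None."""
    healthy = [m for m in mirrors if m.failure_count == 0]
    if not healthy:
        return None
    return max(healthy, key=lambda m: m.bytes_per_second)


def download(mirrors: List[MirrorInfo], fetch: Callable[[str], bool]) -> Optional[str]:
    """Retry on the best remaining mirror until one succeeds or none are left."""
    while (mirror := pick_best(mirrors)) is not None:
        if fetch(mirror.url):
            return mirror.url
        mirror.failure_count += 1  # a failed mirror is no longer a candidate
    return None
```

Note that a loop of this shape only reacts to failed transfers. It has no way of noticing a mirror that responds quickly but serves stale content.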
Comment 6 John Arthorne CLA 2008-10-24 13:41:41 EDT
Re comment #4: I thought the aspirations came from elsewhere (bug 243523), but it's understood that this is an important request.

On a technical level, this is fairly easy to implement. P2 fires an event at the end of an install, and the event describes exactly what was installed. This enables us to give more accurate download stats than what we had in the past (which were based on *start* of a download and therefore didn't account for retries, failures, etc).
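
As a rough illustration of the "count on install complete" idea, the sketch below wires a counting listener to an event source. The real p2 event API is not shown in this thread, so `InstallEventBus`, `install_complete` and `InstallStats` are hypothetical names used only to show the shape of the approach.

```python
from collections import Counter
from typing import Callable, Iterable, List


class InstallEventBus:
    """Hypothetical event source; stands in for whatever p2 actually fires."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[List[str]], None]] = []

    def add_listener(self, listener: Callable[[List[str]], None]) -> None:
        self._listeners.append(listener)

    def install_complete(self, installed_units: Iterable[str]) -> None:
        """Notify every listener once an install has finished."""
        units = list(installed_units)
        for listener in self._listeners:
            listener(units)


class InstallStats:
    """Counts units per completed install, so retries and failures are ignored."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def on_install_complete(self, installed_units: List[str]) -> None:
        self.counts.update(installed_units)
```

Because the listener only sees completed installs, a download that was started and then abandoned never reaches the counter, which is the accuracy gain described above.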

The more complicated aspect is privacy, and policies around when and where the download stats should be sent. Clearly if a user in company X downloads an application from their own company's internal server, it's nobody else's business. We were wondering if UDC might be part of the solution here, since UDC has already worked out some of these privacy and collection policy issues. I believe it would be quite easy to add an extra event to the UDC that captures the p2 "install complete" event and sends the data along with all the other data it captures. What I'm not sure about is whether this would be sufficient for everyone's needs (notably not everything contains UDC, and you wouldn't be able to gather stats on the install operation in which UDC was added to the system).
Comment 7 Denis Roy CLA 2008-10-24 13:54:35 EDT
> The more complicated aspect is privacy

If the tracking policy and location is defined within the p2 repository, doesn't that solve the problem?  ie, in the Ganymede repo (the Ganymede Update Site), we can safely track all its related downloads to www.eclipse.org, no?

The Foundation is willing to write new scripts that will accept the tracking payload from p2, or tweak the existing download.php.
Comment 8 John Arthorne CLA 2008-10-24 14:08:38 EDT
> If the tracking policy and location is defined within the p2 repository, doesn't that solve the problem?

Many companies make private replicas of the eclipse.org repositories for running internal builds or serving their employees or customers. The tracking URL could potentially be stored in the repository if there was an obvious way to remove/change it when making a private or commercial mirror.
Comment 9 Denis Roy CLA 2008-10-24 14:20:26 EDT
I recommend placing the tracking URL in the same file or location the mirrors URL is specified. I'd assume a private or commercial mirror doesn't use the Eclipse.org mirrors, so the download tracking could be changed in a similar manner.
Comment 10 Alexei Goncharov CLA 2008-10-27 05:02:31 EDT
(In reply to comment #5)
 
> If I understand this sentence correctly, you are recommending exactly what p2
> does now. p2 maintains information about each mirror it contacts
> (MirrorSelector.MirrorInfo class). It remembers the failure count, and the
> throughput on each mirror. Mirrors are sorted so that mirrors with zero
> failures are always preferred over failing mirrors. The secondary sort
> criterion is mirror throughput. It then runs a loop, picking the best mirror
> and retrying until success or until there are no more mirrors with a failure
> count of zero (across all downloads). Is this what you were suggesting? If not,
> please clarify.
> 

So you are saying that all the mirrors are always contacted, but the most reliable ones are contacted first. Am I right?
The question is, if that is so, why did I get a message about a broken repository when trying to update Mylyn within 3.4? In 3.3 I could do it really easily by selecting the Canadian main server as the mirror to download from.
So, if all the mirrors should be contacted in a loop, why was the main server not contacted by p2? If the loop runs until success, that should not happen.
Also I have another question about this "loop" over mirrors. Does this loop only wait for a server response, or does it check whether the data stored there is valid? The server can respond but have old content that differs from the main download area. Say I have a mirror in Ukraine: it is close to me, I have a great ping to it, there were no connection errors at all, but it stores the content with a week's latency, for example. What will happen with p2?
Comment 11 Alexei Goncharov CLA 2008-10-27 05:05:56 EDT
(In reply to comment #6)
> On a technical level, this is fairly easy to implement. P2 fires an event at
> the end of an install, and the event describes exactly what was installed. This
> enables us to give more accurate download stats than what we had in the past
> (which were based on *start* of a download and therefore didn't account for
> retries, failures, etc).

Yeah, it is really good that p2 architecture is quite flexible, but can you please answer the question about a probable mistake in the code of that plug-in from comment #3...
Comment 12 Martin Oberhuber CLA 2008-10-27 07:15:07 EDT
(In reply to comment #4)
> resolved one way or another. There is no way that I could tell the Board and
> the membership that we are going to use a new technology that resulted in
> losing the ability to track and report download statistics. That just simply
> can not happen.

Mike, have there been any discussions about how to interpret download statistics? Note that already now, the statistics are covering only a small part of the many ways to get Eclipse:

   * Eclipse pre-installed on Linux Distros
   * Member distros such as Yoxos, ...
   * Any updates acquired through P2

are all not covered by download stats. This might perhaps be less of an issue for Core Platform which you can't get via Update Manager; but for all the other projects, it is very much of an issue. For my own project, I noticed with Europa that almost half the downloads were done via the Europa coordinated update site. In Ganymede, I could not count these any more (because of this issue), though I'd assume that the fraction of users acquiring stuff via the coordinated update site is growing since it's more convenient.

The download stats that I do currently have are not very valuable to me because I'm not sure how to interpret them. Going back to the two possible approaches (see comment #1), we should perhaps first define what we really want to achieve with the download stats, and what channels we're ok with not counting.
Comment 13 Ed Merks CLA 2008-10-27 08:07:53 EDT
Martin,

I agree. In the old days, it used to be easy to track the number of EMF downloads, but this has gotten more and more difficult over the years.  An example of something you didn't cover is things like WTP bundling EMF and the platform, i.e., all-in-one zips; such downloads impact the stats for both EMF and the platform.  

It seems to be a very big challenge to get truly meaningful statistics given all the different ways folks can get at the various plugins.  That makes me curious about what exactly is being used to track even just the downloads for the platform, given that all-in-ones might well be skewing that negatively...
Comment 14 Jeff McAffer CLA 2008-10-27 22:37:41 EDT
First we have to define what we want to track.  Knowing that a particular bundle was downloaded N times could be interesting, but it would be more interesting to know that "some high level element of project X was installed N times" or even better, as John suggests, "function from project X was *used* N times (by N people/...)".

The current stats are interesting because they are based on concrete packagings of Eclipse that likely cover some huge percentage of the consumption (at least for tooling scenarios).  Losing that high-level packaging abstraction will leave us with a whole whack of numbers to analyze and no apparent way to carry out the analysis.

In short, once we know what we want to count, we can design a system to count it.
Comment 15 Martin Oberhuber CLA 2008-10-28 15:22:57 EDT
When it comes to my project, I would actually already be happy if I just knew the number of downloads for each of my bundles via the P2 Repositories hosted at Eclipse or its official mirrors. Because for each of my "high level features" I can typically name a single bundle that would be chosen if and only if that high level feature is chosen. That's in fact the kind of number that Update Manager used to give me (with NickB's mirroring trick), and just re-instating that number would be helpful already.

This also has a pretty clear interpretation: How many people turn to Eclipse.org in order to get certain stuff produced at Eclipse. This number certainly plays a role in bandwidth and infrastructure decisions, or download site advertisements, even if we don't manage to get a number for Eclipse distribution via other channels such as member distros, in-house closed mirrors or Linux pre-installed.

I'm not sure how far we could ever go beyond that. I'm not a statistics expert, but here is an idea how the UDC data and download data could perhaps be combined, since they are orthogonal in some sense: 

If each UDC upload included a field describing the channel through which its bundle was installed (P2-from-Eclipse / P2-from-unknown-mirror / downloaded-dropins / DistroX / Linux-preinstalled / BitTorrents / ...) we should be getting a random sample of install scenarios based on those folks who agree to do UDC upload. With some interpretation and statistical knowledge, that incomplete sample could perhaps be extrapolated based on the exact number of downloads for the channels that we can count... perhaps modulo the ratio of downloads-with-UDC versus downloads-without-UDC... in order to get some idea of the total distribution of Eclipse.
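
The arithmetic behind this idea might look like the sketch below. It assumes, and this is a strong assumption, that UDC uploaders are a representative sample of all installs. The `extrapolate` function and the channel names are hypothetical, and any numbers fed to it would be made up for illustration, not real data.

```python
from typing import Dict


def extrapolate(sample_counts: Dict[str, int],
                counted_channel: str,
                exact_count: float) -> Dict[str, float]:
    """Scale UDC sample shares by one channel whose true count is known.

    sample_counts   -- installs per channel seen in the UDC sample
    counted_channel -- the channel we can count exactly (e.g. eclipse.org)
    exact_count     -- the exact download count for that channel
    """
    sample_total = sum(sample_counts.values())
    if sample_counts.get(counted_channel, 0) == 0:
        raise ValueError("the counted channel must appear in the sample")
    share = sample_counts[counted_channel] / sample_total
    estimated_total = exact_count / share
    # Each channel's estimate is its sample share of the estimated total.
    return {
        channel: estimated_total * count / sample_total
        for channel, count in sample_counts.items()
    }
```

For example, if half of the sampled installs came through the counted channel, the estimated total is twice the exact count for that channel, and the other channels split the remainder according to their sample shares.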

I'm wondering if / how it would be possible to include data about the provenance of an installation into what the UDC collects, but at the same time make sure that people can still check the digital integrity of a download by doing some checksum or comparing against an official distribution...
Comment 16 Nick Boldt CLA 2009-05-13 17:57:22 EDT
Any chance this might see the light of day before Galileo lets the beat.... drop?
Comment 17 Pascal Rapicault CLA 2009-05-13 21:46:26 EDT
No. 
Comment 18 Nick Boldt CLA 2009-05-13 23:31:48 EDT
I guess when you're required by everyone (ie., you're in the platform) it's less interesting to know who's downloading you. :(

Still, this would allow metrics re: who's upgrading from platform to platform vs. simply downloading a new tar.gz or zip and unpacking a fresh Eclipse instance.

You might also be able to track http: vs. https: vs jar: vs. file: usage to get a better sense of how people are using p2 to manage their installs. Do people actually care about zipped p2 repos? Or just the unpacked http: accessible ones? Do people still use dropins and link files, or just set up .eclipseextension folders and use p2 to install from there?

Surely this is valuable information to collect - it's a shame it won't make it for your 1.1 / 3.5 release, and we'll have to wait another whole year to start properly studying user activity.
Comment 19 Ian Bull CLA 2009-05-14 00:02:50 EDT
I don't think anybody is debating the value of this request, we are debating how to accomplish it.  The two proposals Pascal outlined in the bug description seem like the alternatives. Proposal one put the onus on the foundation (and mirrors) to track D/L.  John and others presented a number of challenges with this approach.

The second proposal put the tracking right in p2.  While obviously very accurate, it is just not feasible 4 weeks before a release. Assuming we had the resources to implement this right now, and assuming we got PMC approval to release it, IMHO, I would want legal / AC approval before putting a "Phone Home" device into all Eclipse installs.  

Martin's suggestion in Comment #15 (using the existing data collector, combining the two approaches and extrapolating the numbers via some statistical analysis), may be the most feasible. However, at this point nobody has stepped up with the statistical background we will need to make that work (or the necessary patch to the UDC).

Nick, I think this is a good thing to keep on people's radar.  
Comment 20 Nick Boldt CLA 2009-06-03 11:30:40 EDT
(In reply to comment #19)
> The second proposal put the tracking right in p2.  While obviously very
> accurate, it is just not feasible 4 weeks before a release.

How about for SR1 in September? This isn't really a "new function" - it's "infrastructure", so I don't think it violates the "no new features in a maintenance release" guideline. Besides, it benefits *everyone* immediately so I'd argue it's allowable. 
 
> Martin's suggestion in Comment #15 (using the existing data collector,
> combining the two approaches and extrapolating the numbers via some statistical
> analysis), may be the most feasible. However, at this point nobody has stepped
> up with statistical background we will need to make that work (or the necessary
> patch to the UDC).

Statistical background? All you need to do is record that org.eclipse.foo_version.jar was requested from mirror X a total of Y times. Then submit that back via the UDC to Eclipse. No analysis required, just data collection. Let the UDC people handle the analysis - the important thing is to get the data so there's something to analyse.

Or, if you want to get fancier, collect tabular data under these columns ...

  FULL_URL  JAR_NAME  VERSION  QUALIFIER  MIRROR_URL  NUM_REQUESTS

Since requested URLs include both mirror URL and /path/to/jarname_qualifier.jar, you could offload any string processing to the downstream UDC... in which case all you need is:

  URL  NUM_REQUESTS
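
A minimal sketch of that collection step, assuming requested URLs end in a file name of the form name_major.minor.micro[.qualifier].jar (optionally followed by .pack.gz). The regular expression, the `tabulate` function and the example host are hypothetical, and the column names follow the table above.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List
from urllib.parse import urlsplit

# Assumed file-name layout: <name>_<major>.<minor>.<micro>[.<qualifier>].jar[.pack.gz]
ARTIFACT_RE = re.compile(
    r"^(?P<name>.+)_(?P<version>\d+\.\d+\.\d+)"
    r"(?:\.(?P<qualifier>[^.]+))?\.jar(?:\.pack\.gz)?$"
)


def tabulate(requested_urls: Iterable[str]) -> List[Dict[str, object]]:
    """Count requests per URL and split each URL into the proposed columns."""
    rows = []
    for url, count in Counter(requested_urls).items():
        parts = urlsplit(url)
        file_name = parts.path.rsplit("/", 1)[-1]
        match = ARTIFACT_RE.match(file_name)
        if match is None:
            continue  # not an artifact jar; skip it
        rows.append({
            "FULL_URL": url,
            "JAR_NAME": match.group("name"),
            "VERSION": match.group("version"),
            "QUALIFIER": match.group("qualifier"),
            "MIRROR_URL": f"{parts.scheme}://{parts.netloc}",
            "NUM_REQUESTS": count,
        })
    return rows
```

If the string processing is pushed downstream as suggested, only the `Counter` over URLs is needed on the client side.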

Ian, Pascal: does that work?

Unless I'm mistaken, that's exactly the form of data that the UDC already tracks, except that there it's classes and number of requests. So as long as the UDC can be tweaked to handle URLs instead, we'd be pretty much done. Wayne, does that make sense?
Comment 21 Martin Oberhuber CLA 2009-06-03 11:36:40 EDT
Hi Nick, my concern was that looking at mirror / download stats only we won't get any count of features distributed via other channels, such as

  * Linux Distros
  * 3rd party distros such as Yoxos, Pulse, MyEclipse, ...
  * Vendor products
  * Copying around a single person's download inside a company
   ... 

I'm not exactly sure how relevant those additional distribution channels are in addition to what people download from Eclipse.org and its mirrors; it's a different number and different statistical data.

My hope was that UDC statistics can be correlated with other known data to gain meaningful measures.

Now that being said, I fully agree that counting downloads from eclipse.org and its mirrors is definitely better than what we do today (no counting at all), so my comment about UDC and statistics should not be blocking that obviously required counting of downloads.
Comment 22 Nick Boldt CLA 2009-06-03 12:01:26 EDT
(In reply to comment #21)
> Hi Nick, my concern was that looking at mirror / download stats only we won't
> get any count of features distributed via other channels, such as
>   * Linux Distros
>   * 3rd party distros such as Yoxos, Pulse, MyEclipse, ...
>   * Vendor products
>   * Copying around a single person's download inside a company
> Now that being said, I fully agree that counting downloads from eclipse.org and
> its mirrors is definitely better than what we do today (no counting at all), so
> my comment about UDC and statistics should not be blocking that obviously
> required counting of downloads.

I'm not trying to solve the question of "100% planetwide tracking" - I just want to get back some metrics like we had in Eclipse 3.3, and lost thx to p2. This is a regression which needs to be repaired. 

Without metrics, how can we say how much people are using p2, vs. unpacking zips into dropins? We need these numbers to assess market penetration and to be able to start looking at how to change behaviour patterns, if we discover that there's still a ton of traction in the "I hate p2, I just do it the old way" camp.
Comment 23 Ian Bull CLA 2009-06-03 12:45:58 EDT
(In reply to comment #20)
> 
> Ian, Pascal: does that work?
> 
> Unless I'm mistaken, that's exactly the form of data that the UDC already
> tracks, except that there it's classes and number of requests. So as long as
> the UDC can be tweaked to handle URLs instead, we'd be pretty much done. Wayne,
> does that make sense?
> 

I'm not trying to pass the buck, (I am just not familiar with the UDC) but does this request become a UDC bug then?  If we rely on the UDC for counting D/L is there anything in p2 that needs to be done, or does the work live entirely in the UDC?
Comment 24 Denis Roy CLA 2009-06-03 13:41:03 EDT
Because the UDC is opt-in, relying on it for download stats is pointless.

IANAL, but for Eclipse to 'call home' and tell us what the user has downloaded is a potential privacy violation.  In previous versions, Eclipse simply fetched a core JAR file from eclipse.org, which is clickstream data similar to that of the data we collect when our users download ZIP files from our website.

Now, if p2 were to perform a checksum of the downloaded content and compare it to a hash found only on the master site (for security reasons), well, ....
Comment 25 Nick Boldt CLA 2009-06-03 15:21:28 EDT
(In reply to comment #23)
> (In reply to comment #20)
> I'm not trying to pass the buck, (I am just not familiar with the UDC) but does
> this request become a UDC bug then?  If we rely on the UDC for counting D/L is
> there anything in p2 that needs to be done, or does the work live entirely in
> the UDC?

P2 has to collect data in a form that can be transmitted back to the foundation.

This could be in the form of a csv file (url,count), or through some more complex API/handshake like submitting a bug via the Mylyn bugzilla editor. Or XMLRPC. I don't care what you use, but you have to:

a) collect data
b) submit data

Then, at the receiving end, UDC or a foundation server has to store the data (mysql database? like we use now for downloads processed by download.php), and later we can write tooling to query it and turn it into pretty reports and charts.

But if UDC sucks because it's opt in, and p2 doesn't want to store data due to some misguided desire to protect privacy, then we need another approach.

----------------

Could p2 fetch jars using the download.php URL instead of the direct mirror URL?

Thus, instead of fetching 

http://ftp.osuosl.org/pub/eclipse/modeling/emf/updates/milestones/features/org.eclipse.emf.all_2.5.0.v200906010115.jar.pack.gz

you would fetch 

http://www.eclipse.org/downloads/download.php?file=/modeling/emf/updates/milestones/features/org.eclipse.emf.all_2.5.0.v200906010115.jar.pack.gz&url=http://ftp.osuosl.org/pub/eclipse/modeling/emf/updates/milestones/features/org.eclipse.emf.all_2.5.0.v200906010115.jar.pack.gz&mirror_id=272

and the request would be tracked by the Eclipse Foundation servers.

Or, if you don't care what mirror you're accessing (eg., "give me the closest/fastest/best one"), then:

http://www.eclipse.org/downloads/download.php?file=/modeling/emf/updates/milestones/features/org.eclipse.emf.all_2.5.0.v200906010115.jar.pack.gz&r=1

would also get the job done and track the request (from whatever mirror the script chose).

Assuming you're determining the list of mirrors to use when fetching jars based on this:

http://www.eclipse.org/downloads/download.php?file=/modeling/emf/updates/milestones/features/org.eclipse.emf.all_2.5.0.v200906010115.jar.pack.gz&format=xml

then you're already using the same interface to fetch the mirror list - why not use that same API for fetching each jar too?
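
The URL rewriting involved could be sketched as below. The query parameters (`file`, `url`, `mirror_id`, `r`) are taken from the example URLs above. The helper name `tracked_url` is hypothetical, and how download.php actually interprets these parameters is assumed from those examples rather than from its documentation.

```python
from typing import Optional
from urllib.parse import urlencode

DOWNLOAD_SCRIPT = "http://www.eclipse.org/downloads/download.php"


def tracked_url(file_path: str,
                mirror_url: Optional[str] = None,
                mirror_id: Optional[int] = None) -> str:
    """Build a download.php URL for an artifact instead of a direct mirror URL.

    With no mirror given, ask the script to redirect to a mirror of its choice
    (r=1); otherwise name the mirror explicitly.
    """
    params = {"file": file_path}
    if mirror_url is None:
        params["r"] = "1"
    else:
        params["url"] = mirror_url
        if mirror_id is not None:
            params["mirror_id"] = str(mirror_id)
    # Keep '/' and ':' readable so the result matches the URLs quoted above.
    return DOWNLOAD_SCRIPT + "?" + urlencode(params, safe="/:")
```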

This solution has a number of benefits:

a) no privacy issues
b) no opt-in requirement
c) no locally stored data
d) reuses existing download.php interface to track data in existing database
e) requires no effort from UDC or Foundation webmasters
f) requires minimal (?) effort in p2

One caveat... I would make sure that this tracking is only turned on if the metadata references the download.php URL, eg., in content.xml or artifacts.xml, as it currently does for the EMF Milestones site in this example:

<property name='p2.mirrorsURL' value='http://www.eclipse.org/modeling/download.php?file=/modeling/emf/updates/milestones/&amp;format=xml'/>

That way p2 won't try to fetch jars on sourceforge.net from the eclipse.org download.php script. That'd be bad, mmkay? :)
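
A sketch of that guard, assuming the repository metadata is available as XML text and that the property sits somewhere under the root element. The `tracking_enabled` helper and the exact host/path test are hypothetical; only the `p2.mirrorsURL` property name comes from the example above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def tracking_enabled(repository_xml: str) -> bool:
    """Return True only if p2.mirrorsURL points at eclipse.org's download.php."""
    root = ET.fromstring(repository_xml)
    for prop in root.iter("property"):
        if prop.get("name") == "p2.mirrorsURL":
            parts = urlsplit(prop.get("value", ""))
            return (parts.hostname == "www.eclipse.org"
                    and parts.path.endswith("/download.php"))
    return False  # no mirrors URL at all: never route through download.php
```

A repository hosted elsewhere, or one with no mirrors URL, would therefore keep fetching jars directly.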
Comment 26 Wayne Beaton CLA 2009-06-04 10:55:45 EDT
(In reply to comment #23)
> I'm not trying to pass the buck, (I am just not familiar with the UDC) but does
> this request become a UDC bug then?  If we rely on the UDC for counting D/L is
> there anything in p2 that needs to be done, or does the work live entirely in
> the UDC?
> 

Just for my own understanding, where is the hook that p2 uses to inform observers of its activity? Given this, it should be simple to extend UDC.

But... Denis is correct. UDC is opt-in and so any stats it collects will be from a relatively small subset of our actual user base.
Comment 27 Paul Clenahan CLA 2009-06-12 19:36:10 EDT
(In reply to comment #22)

I don't have any input on the technical solution, unfortunately, but I do want to add a "+1" to Nick's comments on the requirements.

While it is true that the tracking numbers we have historically collected do not include many of the distribution channels for Eclipse, they are a significant portion of the overall downloads.  More importantly, they provide enough to do reasonable trend analysis on market acceptance and penetration.  This worked well for the BIRT team through Eclipse 3.3 and has progressively become a problem as we have moved to the current p2 approach.

Comment 28 Wenfeng Li CLA 2009-06-14 02:37:36 EDT
+1 for Nick's solution in comment #25.  It is elegant.  Besides the benefits that Nick already listed, it also addresses the "chicken and egg" issue John mentioned in comment #3 (https://bugs.eclipse.org/bugs/show_bug.cgi?id=234515#c3).  Could it also help prevent an outdated mirror list in the client's content.jar cache?  The ability to use the closest/fastest/best mirror automatically would be cool.
Comment 29 Pascal Rapicault CLA 2009-06-14 21:39:58 EDT
>Could it also help prevent outdated mirror list in client's content.jar cache.
  The mirror list is not directly stored in the content.jar
>The ability to use the closest/fastest/best mirror automatically would be cool.
  This is already the case.


Comment 30 Wayne Beaton CLA 2009-06-15 14:42:36 EDT
(In reply to comment #24)
> Now, if p2 were to perform a checksum of the downloaded content and compare it
> to a hash found only on the master site (for security reasons), well, ....

I find the lack of response to this suggestion curious.

It seems to me that we can automate the server side of things. p2 will have to be modified to do this. It feels like something that can reasonably be rolled out as part of SR1. Of course, I state this without having any real knowledge of the inner workings of p2 or the kind of effort that will really be required.

It seems to me that it is in the downloader's best interest for us to validate that what they've downloaded from the mirror is what it claims to be. You could argue that access to eclipse.org would then be required to install anything, but you could put some logic in p2 that presents the user with a warning along the lines of "we can't connect to the eclipse.org server to verify the integrity of the downloaded components" should eclipse.org be inaccessible for any reason.

I wonder if there is anything "morally sketchy" about recording that we've been asked to verify the integrity of a particular file.
Comment 31 Wenfeng Li CLA 2009-06-15 15:12:36 EDT
(In reply to comment #29)
> >Could it also help prevent outdated mirror list in client's content.jar cache.
>   The mirror list is not directly stored in the content.jar
> >The ability to use the closest/fastest/best mirror automatically would be cool.
>   This is already the case.

Thanks for the clarification.  Excuse me for some more followup questions:

1. Does P2 fetch the mirror list (or the best mirror) separatly from the
artifact.jar/content.jar each time a user wants to install a new feature(s)?   

I poke around in "preference" and "software update", and did not see any option
for user to select a mirror site,  does P2 automatically select a mirror for
the user?

The reason I ask this is that if P2 is getting the mirror info every time a
user try to install updates, maybe another solution is to add an URL parameter 
to indicate what top level feature catagory that user is installing. This way
the stats database can track project downloads by feature catagory.  

While I like Nick's solution better since it selects the best mirror at the jar
file level, piggybacking a feature parameter on the mirror request might avoid
the per-jar redirects that some have expressed concern about.  If not all
mirrors host all Eclipse project bits, this solution selects the best mirror
for each project that the user wants to install.
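
To make the piggyback idea concrete, here is a rough sketch of appending a category parameter to a mirror-list request URL. The parameter name `category` and the example.org URL are made up for illustration; nothing here reflects an existing p2 or eclipse.org interface.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def add_category(mirrors_url: str, category: str) -> str:
    """Return mirrors_url with a hypothetical 'category' query parameter added."""
    parts = urlsplit(mirrors_url)
    # Keep the existing query parameters and append the new one at the end.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("category", category))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(add_category(
        "http://www.example.org/mirrors.php?file=/releases/site&format=xml",
        "modeling",
    ))
```

The server would then see one extra value on a request that is made anyway, so no additional round trip is needed for each jar.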

  
Comment 32 Wenfeng Li CLA 2009-06-15 15:18:55 EDT
+1 to Wayne's point about the need to verify the integrity of the components downloaded from mirrors.
Comment 33 John Arthorne CLA 2009-06-15 15:39:51 EDT
>> Now, if p2 were to perform a checksum of the downloaded content and compare it
>> to a hash found only on the master site (for security reasons), well, ....

p2 already does this by comparing each downloaded artifact's MD5 with the MD5 value stored in the artifacts.jar. The artifacts.jar is only fetched once per client and cached.  It sounds like the suggestion is to change how we implement this so that it generates stats as a side-effect ;)
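
For readers unfamiliar with this kind of check, a minimal sketch of comparing a download against a recorded MD5 follows. It is plain Python for illustration, not p2's actual code; the function name and sample bytes are assumptions.

```python
import hashlib


def md5_matches(data: bytes, expected_hex: str) -> bool:
    """Compare the MD5 digest of downloaded bytes with the expected hex digest."""
    actual = hashlib.md5(data).hexdigest()
    # Hex digests are case-insensitive, so normalise the recorded value.
    return actual == expected_hex.strip().lower()


if __name__ == "__main__":
    payload = b"example artifact bytes"
    # Stands in for the value that would be read from the cached artifacts.jar.
    recorded = hashlib.md5(payload).hexdigest()
    print(md5_matches(payload, recorded))         # intact download
    print(md5_matches(payload + b"x", recorded))  # corrupted download
```

Because the recorded value comes from the client's cached copy, the comparison itself involves no server contact, which is why it produces no stats today.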

It sounds like the most interesting question is the legal/privacy problem. We're batting around possible implementations without really knowing what would fly from a legal perspective. We could introduce a "stats URL" on artifact repositories similar to how we currently have a "mirrors URL". The p2 download manager could do an HTTP GET on that URL after downloading any artifacts from that repository, with information like the install roots embedded in the URL. If the stats server wasn't available we would just carry on, so it wouldn't add any new point of failure. Since all servers track HTTP GET requests it doesn't seem any different from the stats that web servers routinely gather. These suggestions about using the mirror lookup or MD5 check to pass exactly the same information back to the server seem strange to me - either we are allowed to transmit that data back, or we aren't....
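
A rough sketch of that "stats URL" idea, written in Python purely for illustration: the query parameter `roots`, the example URL, and both function names are assumptions, not anything p2 provides. The point is only that a failed request is swallowed so the install carries on.

```python
import urllib.request
from urllib.parse import urlencode


def build_stats_url(stats_base: str, install_roots: list) -> str:
    """Embed the install roots in the query string of a hypothetical stats URL.

    Assumes stats_base carries no query string of its own.
    """
    return stats_base + "?" + urlencode({"roots": ",".join(install_roots)})


def report_download(stats_url: str, opener=urllib.request.urlopen,
                    timeout: float = 5.0) -> bool:
    """Issue a GET on stats_url; return False instead of raising on failure."""
    try:
        with opener(stats_url, timeout=timeout):
            return True
    except Exception:
        # A stats outage must never block or fail the installation itself.
        return False
```

The `opener` argument exists only so the request can be replaced in tests; a caller would normally leave it at its default.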
Comment 34 John Arthorne CLA 2009-06-15 15:43:20 EDT
> 1. Does P2 fetch the mirror list (or the best mirror) separately from the
> artifact.jar/content.jar each time a user wants to install a new feature(s)?   

No, it happens roughly once per session. It's held in a cache in memory, so it's not quite that simple, but the mirror list is fetched at least once per session and possibly again if the cache is flushed.

> I poke around in "preference" and "software update", and did not see any option
> for user to select a mirror site,  does P2 automatically select a mirror for
> the user?

Yes.
Comment 35 Ian Bull CLA 2009-06-15 15:44:03 EDT
(In reply to comment #30)
> (In reply to comment #24)
> > Now, if p2 were to perform a checksum of the downloaded content and compare it
> > to a hash found only on the master site (for security reasons), well, ....
> 
> I find the lack of response to this suggestion curious.
> 
> It seems to me that we can automate the server side of things. p2 will have to
> be modified to do this. It feels like something that can reasonably be rolled
> out as part of SR1. Of course, I state this without having any real knowledge
> of the inner workings of p2 or the kind of effort that will really be required.
> 

We talked briefly about this on the p2 call today:

Tom pointed out that if this was introduced in SR1, then people would update to
SR1 (using Galileo) and the code to track the stats would be installed then. 
Then you could start to track stats at SR2.

However, this approach is unlikely to gain much traction since the checksums
are actually in the artifacts.jar file, so users already have them on their
machine.

John mentioned that we could possibly hit a URL after the artifacts were D/L,
and this URL could be used to track stats.  If the URL was unavailable, then
this would silently fail, but oh well.  The question is: is there anything
legally wrong with this?

1. Hit http://somemirror.com/somefile.jar (to get the file)
2. Hit http://eclipse.org/tracker/file=somefile.jar (to track the D/L)
(You might have to be mindful of bots)

(As software engineers, we often talk anecdotally about these things, but I
think we need some advice / direction from those who really know (and
understand the different laws that apply in different countries regarding web
stats)).  

Note:  We could store the tracker URL as a property of the artifact descriptor.
 If someone wanted "internal" mirrors with no callback, the property would be
removed.  This is important since we obviously don't want to hard-code
eclipse.org in p2.
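
To illustrate the opt-out, here is a small sketch in which the callback is driven entirely by a descriptor property. The property name `tracker.url` is hypothetical and the descriptor is modelled as a plain dictionary; this is not p2's data model.

```python
from typing import Mapping, Optional

TRACKER_PROPERTY = "tracker.url"  # hypothetical property name, not a real p2 key


def tracker_url(descriptor_properties: Mapping[str, str]) -> Optional[str]:
    """Return the tracker URL declared on an artifact descriptor, or None."""
    value = descriptor_properties.get(TRACKER_PROPERTY, "").strip()
    return value or None


def urls_to_hit(mirror_url: str, descriptor_properties: Mapping[str, str]) -> list:
    """List the requests for one artifact: the download, then the optional callback."""
    urls = [mirror_url]
    tracker = tracker_url(descriptor_properties)
    if tracker is not None:
        urls.append(tracker)
    return urls
```

An internal mirror that strips the property gets a one-element list back, so no request ever leaves for the tracker.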
Comment 36 Nick Boldt CLA 2009-06-15 15:59:10 EDT
> (As software engineers, we often talk anecdotally about these things, but I
> think we need some advice / direction from those who really know (and
> understand the different laws that apply in different countries regarding web
> stats)).  

If it's legal (and violates no privacy laws) to track stats for zips downloaded from eclipse.org mirrors via the http://www.eclipse.org/downloads/download.php?file=/path/on/mirror/to/file.zip -- which we already do, and wrap that with Google Analytics tracking to boot -- then why should there be any more legal or privacy issues around tracking *other* downloaded archives, namely .jar and .jar.pack.gz files?

   To paraphrase an oft-recited mantra in Modeling... an archive is
   an archive is an archive... or, a download stat tracker entry is
   a download stat tracker entry is a download stat tracker entry. :) 

The bottom line here is that we already do this, just not widely enough for valid statistical analysis, like we had two years ago for Eclipse 3.3 zips and jars. 

So I see two technical questions to solve:

a) can p2 use http://www.eclipse.org/downloads/download.php?file= to fetch jars instead of getting them directly from the mirrors? 

p2 already does this to fetch the mirrors list for a given update site, then fetches files based on that list directly from the mirrors. All that would be required is a patch to say "if you got the mirrors list from Eclipse, use download.php to fetch content.jar, artifacts.jar, and all the actual feature/plugin jars; else, fetch directly from the mirror since the bits you want are not hosted @ eclipse.org"
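
The rule in that patch could look roughly like the sketch below. It is Python for illustration only; the function name, the host test, and the example mirror are assumptions, and only the download.php prefix comes from this discussion.

```python
from urllib.parse import urlsplit

DOWNLOAD_PHP = "http://www.eclipse.org/downloads/download.php?file="


def fetch_url(mirrors_list_url: str, mirror_base: str, path: str) -> str:
    """Pick the URL for one file; path is the site-relative path, starting with '/'."""
    host = (urlsplit(mirrors_list_url).hostname or "").lower()
    if host == "eclipse.org" or host.endswith(".eclipse.org"):
        # Mirror list came from Eclipse: go through download.php so the hit is counted.
        return DOWNLOAD_PHP + path
    # Bits are not hosted at eclipse.org: fetch directly from the mirror.
    return mirror_base.rstrip("/") + path
```

With this split, third-party update sites keep their current behaviour, and only Eclipse-hosted content takes the extra redirect.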

b) will the added server traffic cause problems @ eclipse.org in terms of bandwidth or mysql database inserts? 

Comment 37 Pascal Rapicault CLA 2009-06-15 16:19:53 EDT
Over private emails and phone conversations, I have suggested an approach to Wayne and Denis to address this problem that requires no p2 code change. It is an adaptation to p2 of the old UM trick, where the artifacts.xml knows to go to the foundation to get a particular file.
Moving to the simultaneous bucket. If there is anything to be done in p2, please open a specific bug.
Comment 38 Wenfeng Li CLA 2009-06-15 18:17:45 EDT
(In reply to comment #34)
> > 1. Does P2 fetch the mirror list (or the best mirror) separately from the
> > artifact.jar/content.jar each time a user wants to install a new feature(s)?   
> No, it happens roughly once per session. It's held in a cache in memory, so
> it's not quite that simple, but the mirror list is fetched at least once per
> session and possibly again if the cache is flushed.

I usually have my Eclipse up all the time for weeks if not months until after I install an update that requires re-start.  Am I at risk of using a stale mirror cache?     


Comment 39 Pascal Rapicault CLA 2009-06-15 21:24:20 EDT
>I usually have my Eclipse up all the time for weeks if not months until after I install an update that requires re-start.  Am I at risk of using a stale mirror cache?     
 No
Comment 40 David Williams CLA 2009-10-09 13:33:11 EDT
Sorry for the slow, untimely triage. This is a mass update of around 10 open bugs in the "Simultaneous Release" category. This was done "blindly" without reading the bugs or giving them proper triage. 

I am closing as 'invalid' only because this "Simultaneous Release" product category was for something other than normal bugs. If this issue is still a problem for you, please reopen and/or move to another specific category if possible. If a specific component is not possible, please use "Eclipse Foundation", "Community", "Cross-Project".

Thank you, and again, sorry for not being more timely.
Comment 41 David Williams CLA 2010-02-08 10:56:22 EST
Looks like this was a lost bug, due to being in SimRel component. Reopening on cross-project component since this issue has come back up on cross-project list.
Comment 42 Denis Roy CLA 2010-02-08 11:13:32 EST
(In reply to comment #37)
> Over private emails and phone conversations, I have suggested an approach to
> Wayne and Denis to address this problem that requires no p2 code change. It is
> an adaptation of the old UM trick adapted to p2 where the artifacts.xml knows
> to go to the foundation to get a particular file.

For the life of me, I cannot remember the actual implementation of this.  What needs to be done to artifacts.xml?

David, thanks for digging this up.
Comment 43 John Arthorne CLA 2010-02-08 11:41:48 EST
See also bug 187968, which appears to be a duplicate. I suggest marking that bug as a duplicate of this one, since this bug has longer discussion history and CC list.
Comment 44 Denis Roy CLA 2010-02-08 11:47:30 EST
> See also bug 187968, which appears to be a duplicate. 

They are different.  This one asks for a way to put p2 activity in our downloads database so that committers can run download stat queries.

Bug 187968 is requesting publicly available download statistics (since running download queries is only available to committers).
Comment 45 David Williams CLA 2010-06-14 03:04:18 EDT
I think this is fixed with Helios's download.stats ... right? Please reopen if I'm missing something.