Re: [ecf-dev] E-intro [Was Efficient downloads]

Hi Scott, comments inside.

- resume from a different location (e.g. different mirror)

Hmm. Don't know how you are going to accomplish that without something quite different from normal http, but sounds interesting.

I'm not sure which protocols we could implement this for. To do it, we must be able to start downloading at a particular offset and, at the end, check the file's consistency, e.g. against a digest file if one is available. We also need a list of mirrors containing the same artifact (let's assume we've obtained it somewhere). This should be possible with http. There could be an API supporting this feature. Protocols which can't support it would either implement a workaround or throw an exception.
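
To make the idea concrete, here is a rough, protocol-neutral sketch in Python (chosen only for brevity; nothing here is ECF API). The `fetch` hook, the function name and the use of SHA-256 are all assumptions made for illustration. Over http, such a hook would presumably send a `Range: bytes=<offset>-` request header and expect a 206 Partial Content response.

```python
import hashlib
from typing import Callable, Iterable

# Hypothetical transport hook (not an ECF interface): fetch(mirror_url, offset)
# returns the artifact's bytes from `offset` to the end, or raises OSError.
Fetch = Callable[[str, int], bytes]


def resume_download(mirrors: Iterable[str], partial: bytes,
                    expected_sha256: str, fetch: Fetch) -> bytes:
    """Continue a partial download, moving to the next mirror on failure."""
    data = bytearray(partial)
    for mirror in mirrors:
        try:
            # Resume at the current offset instead of starting over.
            data += fetch(mirror, len(data))
        except OSError:
            continue  # this mirror failed; try the next one
        # Final consistency check against the published digest.
        if hashlib.sha256(data).hexdigest() != expected_sha256:
            raise ValueError("digest mismatch after resuming from " + mirror)
        return bytes(data)
    raise OSError("no mirror could complete the download")
```

A protocol that cannot start at an offset could implement `fetch` by reading and discarding the first `offset` bytes (the workaround case), or raise an exception.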

- retrieving information from special headers (like Content-Disposition)
- detecting URL redirections to final mirrors

I'm not sure what you are going to use to implement this, but would be curious to find out.

If you download a file from a URL, you have to discover the filename if the user doesn't specify it explicitly. The most precise solution is to parse the Content-Disposition header if it's available (browsers use it to determine the name of the file to save). Unlike most other http headers, Content-Disposition has a very complex syntax. We should be able to parse it properly.
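
As an illustration of the lookup order (header first, URL as a fallback), here is a small Python sketch. It leans on the standard library's email parser, which understands quoted and RFC 2231 encoded parameters; the function name `filename_for` and the `"download"` default are made up for this example.

```python
import posixpath
from email.message import Message
from typing import Optional
from urllib.parse import unquote, urlsplit


def filename_for(url: str, content_disposition: Optional[str]) -> str:
    """Pick a local filename: Content-Disposition first, then the URL path."""
    name = None
    if content_disposition:
        # Let the stdlib MIME parser handle quoting and encoded parameters.
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        name = msg.get_filename()
    if not name:
        # Fallback: the (percent-decoded) path of the URL.
        name = unquote(urlsplit(url).path)
    # Keep only the last path component so that a server-supplied name
    # cannot point outside the download directory.
    name = posixpath.basename(name.replace("\\", "/"))
    return name or "download"  # assumed default when nothing usable is found
```

For example, `filename_for("http://example.org/dl?id=5", 'attachment; filename="ecf-sdk.zip"')` yields `ecf-sdk.zip`, although the URL itself carries no usable name.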

Detecting URL redirections would help us collect statistics. It would be wrong to attribute statistics belonging to different mirrors to the one URL that covers all of them. This is why we should detect that reading from the covering URL leads to different mirrors on different retrieval attempts. Eventually we could automatically stop using black-listed mirrors to avoid speed or timeout problems.
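
A possible shape for the redirect detection, again as a Python sketch under assumptions: `probe` is a hypothetical hook that issues one request without following redirects and returns the status code and the `Location` header. The resolved host is what the statistics would be keyed on.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin, urlsplit

# Hypothetical hook: probe(url) -> (http_status, Location header or None).
Probe = Callable[[str], Tuple[int, Optional[str]]]

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def final_mirror(url: str, probe: Probe, max_hops: int = 5) -> str:
    """Follow redirects by hand and return the URL that serves the file."""
    for _ in range(max_hops):
        status, location = probe(url)
        if status not in REDIRECT_STATUSES or not location:
            return url
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")


def mirror_key(url: str) -> str:
    """Host (and port, if any) used to attribute statistics to a mirror."""
    return urlsplit(url).netloc
```

With Python's own `urllib.request`, which follows redirects automatically, the final URL is available from the response's `geturl()`; the manual loop above only spells out what has to be recorded.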

I think you would need to describe what statistics are desired here. We can easily add adapter interfaces for collecting statistics associated with a given file retrieval/all to ecf or individual providers, but would need to know what stats are of interest.

The most interesting statistics:
- average download speed (related to concrete mirrors, geographical provider/consumer location, time of day etc.)
- number of bytes downloaded from a particular location / during a particular time period
- frequency of timeouts including timeout values
- etc.
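
The list above could map onto a per-mirror record roughly like the following Python sketch. All names and the black-listing thresholds are assumptions for illustration; this is not a proposal for the actual adapter interface.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MirrorStats:
    bytes_downloaded: int = 0
    seconds: float = 0.0
    attempts: int = 0
    timeouts: int = 0

    def average_speed(self) -> float:
        """Bytes per second over all recorded attempts."""
        return self.bytes_downloaded / self.seconds if self.seconds else 0.0

    def timeout_rate(self) -> float:
        return self.timeouts / self.attempts if self.attempts else 0.0


class StatsCollector:
    def __init__(self) -> None:
        self.by_mirror: Dict[str, MirrorStats] = defaultdict(MirrorStats)

    def record(self, mirror: str, nbytes: int, seconds: float,
               timed_out: bool = False) -> None:
        """Account one retrieval attempt to the mirror that served it."""
        s = self.by_mirror[mirror]
        s.bytes_downloaded += nbytes
        s.seconds += seconds
        s.attempts += 1
        s.timeouts += int(timed_out)

    def blacklisted(self, max_timeout_rate: float = 0.5,
                    min_attempts: int = 3) -> List[str]:
        """Mirrors that timed out too often (thresholds are arbitrary)."""
        return sorted(m for m, s in self.by_mirror.items()
                      if s.attempts >= min_attempts
                      and s.timeout_rate() > max_timeout_rate)
```

Keying the records by the resolved mirror rather than by the covering URL is what keeps the numbers meaningful.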

We could share the statistics among the users of an application by storing them on a server (the downloader would send the statistics to the server automatically). This would keep users from trying to access corrupted/slow repositories.

 Filip Hrbek