[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
[provisioning-dev] Improved update generation/distribution algorithm

The current feature-based update sites can get huge quickly when small numbers of changes are scattered across a large number of features.  This kind of situation happens easily when releasing a dot-level or a patchlevel upgrade--ie: 1.0.0 to 1.0.1--where lots of small bugs are fixed.

The purpose of this email is to open a dialog and to outline our solution.  Apologies in advance for those who are less comfortable with expressing algorithms in formal math.  Time constraints have dictated communication using the most compact and efficient means possible.

The algorithm we developed worked as follows:

The build server maintains copies of all production versions of the application.  In the general case, then, we need an algorithm to upgrade a given user from a version V(n) to a version V(n+delta) where delta is the number of versions that have elapsed since the user last upgraded.  We want this algorithm to have the following properties:
You will notice some conflicts in the requirements above.  In particular, the requirement to send the fewest number of bits over the wire conflicts with the requirements to be as simple as possible and to be verifiably correct.  The result was that our algorithm sacrifices some efficiency in the number of bits it sends over the wire in order to be simpler and more verifiably correct.

The result: In our application, going from version 1.0.0 to version 1.0.1 using update site features resulted in about a 10 meg download.  After applying our algorithm, the download size shrunk to about 3 megs.  This is still large for some of our users who utilize a dial-up connection, but it is doable, whereas the 10 meg download would simply be prohibitive.

Here is the algorithm:

First, we simplify the problem.  Instead of upgrading users from an arbitrary version V(n) to V(n+delta), we only upgrade them from V(n) to V(n+1).  Going to V(n+delta) is simply an iteration of the algorithm over delta steps.  This simplification works fine for our user population.  If it would be acceptable for the general Eclipse population is an open question.

Second, we notice that sending JAR'd plugins is very expensive.  Suppose only two classes in a large plugin change: the whole plugin must still be shipped.

Our algorithm then is simple.  We dynamically generate a function F where F(P(V(n))) = P(V(n+1)).  F is really just a zipped collection of files that encodes just the files that changed between the two plugin versions.

This is accomplished using the following algorithm.  For a given plugin P in two versions V(n) and V(n+1):
  1. Unpack P(V(n)) and P(V(n+1)) into temporary directories.
  2. For files that correspond to each other in the two plugins, MD5 hash them, and if the hashes are different, put the file from P(V(n+1)) into F.
  3. Include new files from P(V(n+1)) in F.
  4. Create a text file manifest listing files in P(V(n)) that are no longer included in P(V(n+1)) and include that text file in F.
  5. Package all of the files in F in a zip file.
I believe that the algorithm for applying this function F to P(V(n)) to get a P(V(n+1)) is self-evident.

This approach has advantages over binary-diffing the two plugin jars P(V(n)) and P(V(n+1)): If the order of the files contained in the plugin changes between versions, a naive binary diff algorithm will not achieve nearly the same amount of "compression" that we achieve.

So we have described a simple algorithm that is able to dramatically reduce download bandwidth requirements over using update sites.  However, we have not described how we achieved the "fail-fast" requirement.  If a flawed update site is ever generated, we want to detect this automatically at build time rather than after 50,000 clients have already downloaded it and broken their installs.

Here is the algorithm we use:

After a build has completed, we have the following artifacts on our build server:

P(V(n)) - the old version
P(V(n+1)) - the original new version generated by PDEBuild
F - the function (basically an intelligent binary diff) we generated that we can apply to P(V(n)) to get P(V(n+1))

The approach we take to validate F is the following:
  1. Apply F to P(V(n)) on the build server using the same code that is running on the clients.
  2. Let's call the result of applying F to P(V(n)), P(V(n+1))'  (notice the prime there).  So formally, F(P(V(n))) = P(V(n+1))'
  3. We need to prove that P(V(n+1))' is the same as P(V(n+1)), the version originally generated by PDEBuild.
  4. To do this, we simply unpack P(V(n+1))' and P(V(n+1)) into temporary directories and perform an MD5 hash on all files, making sure that all files present in one are also present in the other and that their hashes are equal.
  5. If applying F to P(V(n)) on the build server using the same code as is running on the clients produces a P(V(n+1))' identical to P(V(n+1)) on the server, then it should work on the clients as well.
In our implementation, all update site building on the build server was done in Ruby.  The effort to implement the server side of this algorithm in Ruby was about three days.  The effort to implement the client-side (fully unit-tested, pair-programmed, one pair programming team working on it full-time) in Java was about a month.  This was due to the hassle of working with Java input and output streams.  However, the result was that the Java code worked entirely in memory, piping input streams into output streams and so on until only the final results were written to disk.  Thus, the Java solution would be immune to breakage resulting from something on the file system being changed in a way we did not expect.

Let me know if you have any questions.


Dave Orme