Wow, what a painful release this was (is?)
If you were able to upgrade to Galileo SR1/Eclipse 3.5.1 in a timely manner, you were probably just lucky. Even today, 5 full days after the release, our servers are still crawling.
What happened?
In late August, Karl and I had installed our new Cisco load balancer and firewall. Unbeknownst to us, we were dropping connections. A few committers noted that CVS connections were causing broken builds, and we had early reports from our mirrors that their RSYNC connections were being terminated. We didn’t pay too much attention to the RSYNC issue in favour of resolving CVS, since RSYNC is one of those robust protocols that is essentially bomb-proof.
Mistake 1.
Fast forward to Friday, Sept. 25. I did a real quick mirror check and everything checked out. We’re good to go.
Mistake 2.
I mean, this is just a point release, and I’ve done millions of these. Business as usual, right?
Mistake 3.
At around 3:00pm ET on Friday, I was getting reports that the ZIP files were missing on most of the mirrors, despite the fact that they were considered in sync. Uh oh. Since Karl had found (and fixed) some short timeouts that may have caused the dropped connections, I went on to assume that mirrors were simply not yet fully up-to-date with Galileo SR1, and that they would be in sync sometime during the weekend.
Mistake 4.
As it turns out, since late August, our mirrors would begin syncing, but would never finish. They were all badly out of date, but still considered in sync because they checked in regularly. So they spent most of the weekend simply catching up, without actually getting the new SR1/3.5.1 files.
On Monday, the above became painfully apparent when we were caught serving p2 updates for most of the planet from a single 100 megabit internet connection. At this point, mirrors were having a difficult time pulling updates from us. I then brilliantly re-routed most of our downloads to our Amazon AWS account, after making sure it was in sync.
Wrong again, hero.
My uploads to AWS were also not completing. Apparently, when you update Eclipse, there are content/artifact jar files everywhere in our tree that need to be fetched. Some of those were not on AWS yet, causing the updates to fail.
“Epic Fail.” What have you learned?
When you think it’s business as usual, you’re probably wrong. Plenty of lessons learned here.
What happens now?
Most of our mirrors are now in sync, and so is our Amazon AWS. p2 probably got burned by many broken mirrors and now only trusts the home site. It will eventually learn to trust its mirrors again. Until then, updates may be a bit slow, but they should succeed.
Posted September 30th, 2009 by Denis Roy in category: Uncategorized
10 Responses to “Wow, what a painful release this was (is?)”

Chris Aniszczyk Says:
September 30th, 2009 at 2:26 pm
Thanks for being on top of this!
Denis Roy Says:
September 30th, 2009 at 3:27 pm
Thanks, but I wish I had REALLY been on top of it last Friday!
Ian Bull Says:
September 30th, 2009 at 4:05 pm
Thanks Guys! Like always, when things go well nobody notices… after all this was a point release and you’ve done millions of these ;-). But when things go wrong for you — all hell breaks loose. You guys do an excellent job! Good work tracking this down… I’m looking forward to March when we can all sit in the Hyatt lobby, have a few beers, and a few good laughs about this.
Nick Boldt Says:
September 30th, 2009 at 4:38 pm
Yet another year where it’s assumed that “it’s just a maintenance release, what could go wrong?”
*sigh*
For the third year in a row, let me sing the same song: maintenance releases are just like GA releases — EVERYONE WANTS THEM IMMEDIATELY. Thus, they must be treated with the same forethought, planning, panic and apprehension as in June. Capisce?
Kim Moir Says:
September 30th, 2009 at 5:05 pm
Perhaps there should be a JUnit test or similar to assess the readiness of the mirrors for the release (or in general). Check the mirrors to see how many have the EPP packages, the Galileo repo and the repos referenced by the Galileo repo. Check the number of mirrors with current content against a preset threshold that are needed for release without bringing the eclipse.org servers to their virtual knees. Try a test update and see what mirror the bundles are downloaded from. The p2 tests have some scenarios that test updating from 3.5 -> 3.6 etc that might be able to be used for this work. I’d be willing to help out with this if you’re interested.
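A minimal sketch of the kind of readiness check suggested above, written in Python rather than JUnit purely for brevity. It assumes we can somehow obtain a file listing for each mirror; the mirror names, file paths, and the `assess_mirrors` helper are all hypothetical, not part of any existing Eclipse tooling. The point is to judge a mirror by whether it actually holds the release files, not by whether it checked in recently, and to compare the count of current mirrors against a preset threshold.

```python
from dataclasses import dataclass


@dataclass
class ReadinessReport:
    current_mirrors: list   # mirrors holding every required file
    stale_mirrors: dict     # mirror name -> sorted list of missing files
    ready: bool             # True when enough mirrors are fully current


def assess_mirrors(mirror_files, required_files, min_current):
    """Compare each mirror's file listing against the release's required files.

    mirror_files:   dict mapping a mirror name to the set of paths it serves
    required_files: paths that must be present before release (hypothetical)
    min_current:    preset threshold of fully current mirrors needed
    """
    required = set(required_files)
    current, stale = [], {}
    for mirror, files in sorted(mirror_files.items()):
        missing = required - set(files)
        if missing:
            stale[mirror] = sorted(missing)
        else:
            current.append(mirror)
    return ReadinessReport(current, stale, len(current) >= min_current)


# Hypothetical data: paths and mirror names are made up for illustration.
required = [
    "epp/galileo-sr1.zip",
    "releases/galileo/content.jar",
    "releases/galileo/artifacts.jar",
]
listings = {
    "mirror-a": set(required),
    "mirror-b": {"releases/galileo/content.jar"},  # checked in, but incomplete
    "mirror-c": set(required),
}
report = assess_mirrors(listings, required, min_current=3)
print(report.ready, report.stale_mirrors)
```

With a threshold of 3, this sample data would block the release because only two mirrors hold everything. A real check would still need a way to gather the listings and would probably also want the test update described above.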
Denis Roy Says:
September 30th, 2009 at 7:59 pm
@Ian: Amen to that!
@Nick: Oh papa Nick, you are so learned!
@Kim: Absolutely. Shall I open a FR or shall you?
Kim Moir Says:
September 30th, 2009 at 8:07 pm
@Denis - What’s a FR - I assume you mean a bug?
Denis Roy Says:
September 30th, 2009 at 9:04 pm
Yeah… Feature request? After 5 years I thought I knew the lingo…
ekkehard Says:
October 1st, 2009 at 12:51 am
Thanks for making it run again.
I was frustrated these last days, but I always thought “how much worse must this be for you”. I think we all learned a lot (and p2 should be more friendly to users if access to servers fails).
ekke
Kim Moir Says:
October 1st, 2009 at 2:21 pm
@Denis I opened a bug. https://bugs.eclipse.org/bugs/show_bug.cgi?id=291087
@ekke: How is p2 supposed to be more friendly if none of the content was available on the list of mirrors generated by the foundation? When all mirrors fail, the fallback is to the original server (download.eclipse.org), which was overwhelmed with traffic and therefore unresponsive. Should p2 pop up a dialog saying “Please try upgrading again in two days”?