I very much appreciate the sympathy and
the support. In the end, the Infra team can do better than
this. We'll lick our wounds and go back to the drawing board to
make sure we don't repeat the same mistakes.
Postmortem is written, pending review
with my team.
Denis
On 2021-08-09 2:00 p.m., Jonah Graham wrote:
Thank you Denis for the clarification and the
apology. It is really useful for all of us to know and
understand what the reasonable expectations are, and that none
of the strategic members had reached out (perhaps the community
suffered a little from the bystander effect).
I am sorry that your vacation, and those of many others on the
Webmaster team, were disrupted. Thank you for bringing
everything back online again and we all look forward to the
post mortem. I also hope you can reschedule some of the lost
vacation time so that the whole community can benefit from
you and your team being refreshed.
On 2021-08-04 4:26 a.m., Sebastian Zarnekow wrote:
I suspect that people are really not
reachable even in case of an infrastructure disaster.
This is incorrect; EF Strategic members can reach out to
Infra staff using SMS text. Not one has done that. I was
alerted (by Mikael) of an outage at 4:09am and was on a
computer at 4:14am to begin assessing the issue.
Despite the shutdowns & vacations, everyone available
on the infra team worked to restore service according to
our SLA:
Some items remained "broken" or unavailable for an
extended time, but they are not Tier I items.
We did our best to communicate the current state, but
clearly we could have done better. I'll be authoring a
postmortem shortly, and will put forth recommendations for
future events.
I understand apologies do not fix lost productivity, but
I do apologize this happened. It was just about the worst
possible type of outage at the worst possible time.
Denis
I think the best we can do right now is to learn
from the post-mortem and implement mitigations
afterwards.
I can only speculate about what's going on but
speculation will get us nowhere. Suffice it to say
that I am deeply and fundamentally concerned
both by the state of our infrastructure and even
more so by the complete silence from the
Foundation.
I tried to reach out yesterday to gain
more information, and to suggest that
information be posted to the community, but so
far without success...
It would appear to me that we will not be
able to get m2 completed today because I'm not
sure any of us can do a build and promote the
results right now. Certainly I cannot, at this
point, do any of my usual activities for m2, no
new version of Oomph, no new installers, no
product catalog updates...
Regards,
Ed
On 04.08.2021 08:52, Christoph Läubrich wrote:
> but I guess there is still a major problem with eclipse.org infrastructure.
It would be good to have at least some more
information. The status page says 'A fix has
been implemented and we are monitoring the
results', but this is two days old and we still
see massive outages: update sites are broken, CI
builds are not even running, and we get bug reports
about broken functionality. That's really
annoying and frustrating.
Some basic website functionality seems to be
restored, but mailing lists / message sending
still seem to be broken across different
services.
I received a few random (I guess not all) mails
from Bugzilla, but I guess there is still a
major problem with eclipse.org
infrastructure.
All builds we started this morning (5 hours
ago CEST) have failed with issues trying to
reach download.eclipse.org in one way or
another, two examples of which are: