David M Williams wrote:
I must say I am very disappointed that WTP is not higher on the list of
suspects ... we'll try harder in the future. :)
Well, I see you log into the build server exactly every minute, but I
usually pin the world's problems on WTP, so I figured I'd cut you some
slack this time.
This time, I'm simply blaming it on Nick Boldt :)
But, seriously ... I appreciate the list of tips.
And, as someone who does his best work between midnight and 2 AM, I am
constantly amazed at how many people think that is a good time to
schedule something on the server, under the assumption that everyone's
asleep :)
Folks,
In investigating *why* the build server keeled over at 1:52am ET, I found
these artifacts interesting:
1. CRON: (pkimlach) CMD /opt/public/technology/higgins/site-build/finder.sh
-- that job runs every minute, and apparently looks for (and launches) a build
2. 01:33:05 build: Accepted publickey for nickb from 209.217.126.109
3. 01:40:08 build: (mknauer) CMD (/bin/bash /shared/technology/epp/epp_build/34/org.eclipse.epp/releng/org.eclipse.epp.config/startEPP34.sh
Although none of those killed the server, the combined efforts of many
probably led to death by a thousand cuts. At that time, the server
was likely also busy signing, running a build for WTP, serving a few web
pages, NNTP news articles, updating PlanetEclipse.org, etc.
To ensure the build server provides adequate service at all times, please
consider the following:
1. Make sure your cron jobs detect the presence of an unfinished job!
This is especially true of those jobs that run every minute, 5 minutes,
etc. Although your job may only take seconds to run, when the server
is very busy it could take minutes. By then, you have 6 jobs running,
which slows the server even more, spawning more jobs, until 300 jobs are
running and the server explodes.
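#!/bin/bash
LOCKFILE=/tmp/technology.babel.minutejob
if [ ! -f "$LOCKFILE" ]; then
    touch "$LOCKFILE"
    # remove the lock when this job exits, even if it fails partway
    trap 'rm -f "$LOCKFILE"' EXIT
    # do the stuff
else
    # leave the lock alone: it belongs to the job that is still running
    echo "Another job is running, and I'm so confused. Aborting this one."
fi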
2. The motd specifically states to not run builds between 00:00
and 2:00am local time. Everyone assumes that our servers are perfectly
idle at night, which is not true.
3. Before setting a cron job at some random time, observe the servers'
load average over a 24-hour period, and choose your time accordingly.
Looking at the 24-hour graph below, it seems 6am - 8am local time, and
6:00pm - midnight are quite good.
https://dev.eclipse.org/committers/loadstats/showmonthstats.php?server=/home/data/common/monitor/loadstats/build&year=2008&month=6&day=4
4. Our servers are bored on Saturdays and Sundays! Perfect
time to run those CVS cleanup tasks, weekly builds, etc.
5. When running continuous builds, be considerate to others -- set your
jobs as lower priority. Props to AspectJ for doing this already by
launching :
nohup nice ../cc271/cruisecontrol.sh
6. Ask the server if now is a good time to do something:
while [ $(awk -F. '{print $1}' /proc/loadavg) -gt 8 ]; do
    echo "Too busy to build. Going to sleep."
    sleep 60
done
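For what it's worth, tips 1, 5 and 6 can be combined into one small wrapper.
The following is only a rough sketch, not something installed on the build
server: the function names, the LOCKDIR path and the MAX_LOAD variable are
made up for illustration. It swaps the lock file for a lock directory, since
mkdir creates-or-fails in a single step and so closes the small window
between testing for a lock file and touching it.

```shell
#!/bin/bash
# Sketch of a polite cron wrapper. All names and paths are hypothetical.
LOCKDIR="${LOCKDIR:-/tmp/technology.babel.minutejob.lock}"
MAX_LOAD="${MAX_LOAD:-8}"                      # same threshold as tip 6
LOADAVG_FILE="${LOADAVG_FILE:-/proc/loadavg}"

# mkdir either creates the directory or fails, atomically, so two jobs
# starting at the same moment cannot both take the lock.
acquire_lock() {
    mkdir "$LOCKDIR" 2>/dev/null
}

release_lock() {
    rmdir "$LOCKDIR" 2>/dev/null
}

# Succeeds when the integer part of the 1-minute load average
# is greater than MAX_LOAD.
too_busy() {
    local load
    load=$(awk -F. '{print $1}' "$LOADAVG_FILE")
    [ "$load" -gt "$MAX_LOAD" ]
}

# Take the lock, wait for the server to calm down, then run the
# given command at reduced priority.
run_politely() {
    if ! acquire_lock; then
        echo "Another job is running. Aborting this one."
        return 1
    fi
    trap release_lock EXIT          # clean up if we are killed while waiting
    while too_busy; do
        echo "Too busy to build. Going to sleep."
        sleep 60
    done
    nice "$@"
    local status=$?
    release_lock
    trap - EXIT
    return "$status"
}

# Example (build.sh is a placeholder): run_politely ./build.sh
```

A job killed with SIGKILL would still leave the lock directory behind, so a
stale lock may need to be removed by hand.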
If you have any questions, or if you'd like more tips on avoiding server
demolition, please don't hesitate to ask. We're here to help.
--
Denis Roy
Manager, IT Infrastructure
Eclipse Foundation, Inc. -- http://www.eclipse.org/
_______________________________________________
cross-project-issues-dev mailing list
cross-project-issues-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/cross-project-issues-dev
--
Denis Roy
Manager, IT Infrastructure
Eclipse Foundation, Inc. -- http://www.eclipse.org/
Office: 613.224.9461 x224 (Eastern time)
Cell: 819.210.6481
denis.roy@xxxxxxxxxxx