Folks,
In investigating *why* the build server keeled over at 1:52am ET, I found
these artifacts interesting:
1. CRON: (pkimlach) CMD
/opt/public/technology/higgins/site-build/finder.sh
-- that job runs every minute, and apparently looks for (and launches)
a build
2. 01:33:05 build: Accepted publickey for nickb from 209.217.126.109
3. 01:40:08 build: (mknauer) CMD (/bin/bash
/shared/technology/epp/epp_build/34/org.eclipse.epp/releng/org.eclipse.epp.config/startEPP34.sh
Although none of those killed the server, the combined efforts of many
probably led to death by a thousand cuts. At that time, the server was
likely also busy signing, running a build for WTP, serving a few web
pages, NNTP news articles, updating PlanetEclipse.org, etc.
To ensure the build server provides adequate service at all times,
please consider the following:
1. Make sure your cron jobs detect the presence of an unfinished
job! This is especially true of those jobs that run every minute,
5 minutes, etc. Although your job may only take seconds to run, when
the server is very busy it could take minutes. By then, you have 6
jobs running, which slows the server even more, spawning more jobs,
until 300 jobs are running and the server explodes.
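#!/bin/bash
LOCKFILE=/tmp/technology.babel.minutejob
if [ ! -f "$LOCKFILE" ]; then
    touch "$LOCKFILE"
    # Remove the lock when this job exits, even if it fails part-way
    trap 'rm -f "$LOCKFILE"' EXIT
    # do the stuff
else
    # Leave the lock alone: it belongs to the job that is still running
    echo "Another job is still running. Aborting this one."
    exit 1
fi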
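The lock file check above still leaves a small window between the test and the touch in which two jobs can both decide they are alone. One possible way to close it, sketched here under assumptions, is to use mkdir, which either creates the directory or fails in a single step. The names LOCKDIR, acquire_lock, release_lock and run_locked are made up for this illustration and are not part of any existing job.

```shell
#!/bin/bash
# Sketch only: atomic lock via mkdir. LOCKDIR and the function names
# below are hypothetical, chosen for this example.
LOCKDIR="${LOCKDIR:-/tmp/technology.babel.minutejob.lock}"

# acquire_lock: returns 0 if we created the lock directory,
# non-zero if it already exists (another job holds the lock).
acquire_lock() {
    mkdir "$LOCKDIR" 2>/dev/null
}

# release_lock: removes the lock directory.
# Only call this after acquire_lock succeeded.
release_lock() {
    rmdir "$LOCKDIR" 2>/dev/null
}

# run_locked CMD [ARGS...]
# Runs CMD while holding the lock and returns CMD's exit status,
# or returns 1 without running CMD if the lock is already taken.
run_locked() {
    if acquire_lock; then
        "$@"
        local status=$?
        release_lock
        return $status
    else
        echo "Another job is still running. Skipping this run." >&2
        return 1
    fi
}
```

A cron script would then call something like "run_locked /path/to/the-real-job.sh" (path is a placeholder). If a job is killed hard, the directory stays behind and has to be removed by hand, so this is a starting point rather than a complete solution.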
2. The motd specifically states not to run builds between midnight
and 2:00am local time. Everyone assumes that our servers are perfectly
idle at night, which is not true.
3. Before setting a cron job at some random time, observe the servers'
load average over a 24-hour period, and choose your time accordingly.
Looking at the 24-hour graph below, it seems 6am - 8am local time, and
6:00pm - midnight are quite good.
https://dev.eclipse.org/committers/loadstats/showmonthstats.php?server=/home/data/common/monitor/loadstats/build&year=2008&month=6&day=4
4. Our servers are bored on Saturdays and Sundays! Perfect
time to run those CVS cleanup tasks, weekly builds, etc.
5. When running continuous builds, be considerate to others -- run your
jobs at a lower priority. Props to AspectJ for doing this already by
launching:
nohup nice ../cc271/cruisecontrol.sh
6. Ask the server if now is a good time to do something:
while [ $(awk -F. '{print $1}' /proc/loadavg) -gt 8 ]; do
    echo "Too busy to build. Going to sleep."
    sleep 60
done
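The loop above waits for as long as the server stays busy. If a job should give up after a while instead, the same load check could be wrapped in a function with a retry limit. This is only a sketch: current_load, wait_for_quiet and the LOADAVG_FILE override are names invented for this example, and the threshold of 8 is simply the one used in the loop above.

```shell
#!/bin/bash
# Sketch only: function names and LOADAVG_FILE are hypothetical.

# current_load: prints the integer part of the 1-minute load average.
# LOADAVG_FILE can be overridden so the function can be tried on a fake file.
current_load() {
    awk -F. '{print $1}' "${LOADAVG_FILE:-/proc/loadavg}"
}

# wait_for_quiet MAX_LOAD MAX_TRIES SLEEP_SECS
# Returns 0 as soon as the load is at or below MAX_LOAD.
# Returns 1 if the load was still too high after MAX_TRIES checks.
wait_for_quiet() {
    local max_load=$1 max_tries=$2 sleep_secs=$3 try
    for (( try = 1; try <= max_tries; try++ )); do
        if [ "$(current_load)" -le "$max_load" ]; then
            return 0
        fi
        echo "Too busy to build (attempt $try of $max_tries). Going to sleep." >&2
        sleep "$sleep_secs"
    done
    return 1
}
```

For example, "wait_for_quiet 8 30 60 || exit 1" at the top of a build script would check once a minute for up to half an hour, then skip the build rather than pile onto a loaded server.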
If you have any questions, or if you'd like more tips on avoiding server
demolition, please don't hesitate to ask. We're here to help.
--
Denis Roy
Manager, IT Infrastructure
Eclipse Foundation, Inc. -- http://www.eclipse.org/