Bug 323977 - job extremely slow on hudson.eclipse.org
Summary: job extremely slow on hudson.eclipse.org
Status: RESOLVED FIXED
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: CI-Jenkins (show other bugs)
Version: unspecified   Edit
Hardware: All All
: P3 normal (vote)
Target Milestone: ---   Edit
Assignee: Eclipse Webmaster CLA
QA Contact:
URL:
Whiteboard:
Keywords: greatbug
Depends on:
Blocks:
 
Reported: 2010-08-30 10:51 EDT by Nicolas Bros CLA
Modified: 2010-09-21 10:11 EDT (History)
0 users

See Also:


Attachments
stackdump (29.87 KB, text/plain)
2010-09-17 10:27 EDT, Nicolas Bros CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Nicolas Bros CLA 2010-08-30 10:51:37 EDT
The modisco-nightly job seems to freeze, and doesn't terminate even after a few hours on hudson.eclipse.org (I waited 3h30 before terminating the job). The same job works correctly on build.eclipse.org.
Comment 1 Eclipse Webmaster CLA 2010-08-31 13:40:08 EDT
I suspect this may be related to bug 322828.

Is there a buckminster command I can use to test this?

-M.
Comment 2 Nicolas Bros CLA 2010-09-01 03:55:27 EDT
> Is there a buckminster command I can use to test this?

You should be able to reproduce by doing:

svn checkout svn://dev.eclipse.org/svnroot/modeling/org.eclipse.mdt.modisco/releng/trunk/org.eclipse.gmt.modisco.releng.buckminster

and then:

buckminster import -Declipse.download="file:/home/data/httpd/download.eclipse.org" "./org.eclipse.gmt.modisco.releng.buckminster/modisco.mspec"

I say "should be able" because I couldn't reproduce it myself, since I don't seem to have ssh access to hudson.eclipse.org.
Comment 3 Eclipse Webmaster CLA 2010-09-02 11:38:30 EDT
So I think the issues were(are) 2 fold:

1) buckminster seems to ignore the proxy settings some times (I believe this is fixed now via some iptables magic)
2) there is a 'bad' buckminster interaction occurring when this job is run on the slaves.

At this time it looks like your jobs needs to be updated to get the console text from hudson.eclipse.org instead of build.eclipse.org .

-M.
Comment 4 Nicolas Bros CLA 2010-09-09 07:10:34 EDT
Even on master where it works, the job seems unexpectedly slow. For example, this build took 1 hr 56 min:
https://hudson.eclipse.org/hudson/job/modisco-nightly/151/

The same job on build.eclipse.org runs in under one hour.
I thought hudson.eclipse.org was using the new -- faster -- servers?
Comment 5 Eclipse Webmaster CLA 2010-09-09 09:32:36 EDT
We are using the new build hardware, but the hudson instances are virtualized.  The master instance has 'fewer' resources (as the same machine also hosts the sandbox and windows slave), but the slaves split the host resources pretty much down the middle.

-M.
Comment 6 Denis Roy CLA 2010-09-09 09:48:40 EDT
If a job is taking 2x longer on the new hardware than on the build server, I suspect it is spending time waiting for something else.

If I look at your build log, I see plenty of messages that don't inspire me:

!MESSAGE An internal error occurred during: "Materializing".
!STACK 0
java.lang.NullPointerException
[snip]
WARN:  Job found still running after platform shutdown. 
[snip]
!MESSAGE The workspace exited with unsaved changes in the previous session; refreshing workspace to recover changes.
[snip]
[ant] /opt/public/jobs/modisco-nightly/workspace/buildroot/buckminster.workspace/plugins/org.eclipse.gmt.modisco.infra.query.editor/src/org/eclipse/gmt/modisco/infra/query/editor/Messages.java:9: warning: unmappable character for encoding UTF8
[ant]  *    Gr�goire Dup� (Mia-Software)

Then there's a bunch of cannot find symbol messages, and @model is an unknown tag.

Then at this time, the build almost appears to restart:

!ENTRY org.eclipse.buckminster.core 1 293 2010-09-07 09:43:20.014
!MESSAGE Overall status: Error
INFO:  Overall status: Error





I don't know much about builds, but if you want some help, you'll need to either provide a log that is easier to read, or tell us what part of the build is taking time.
Comment 7 Nicolas Bros CLA 2010-09-10 05:25:24 EDT
(In reply to comment #6)
> If I look at your build log, I see plenty of messages that don't inspire me

I agree, sometimes it looks like the build is failing even though it's not. And sometimes it fails for mysterious reasons (p2 errors, IOExceptions, SVN not working, ...) and then works again without changing anything.

> warning: unmappable character for encoding UTF8
> Then there's a bunch of cannot find symbol messages, 
> and @model is an unknown tag.

Those are just warnings when generating the javadoc. They can be ignored.

> Then at this time, the build almost appears to restart:
> 
> !ENTRY org.eclipse.buckminster.core 1 293 2010-09-07 09:43:20.014
> !MESSAGE Overall status: Error
> INFO:  Overall status: Error

This is the result of the JUnit tests, printed by Buckminster.
Buckminster seems to report an error status even when all tests passed successfully. Anyway, test failures don't make the build itself fail.

> I don't know much about builds, but if you want some help, you'll need to
> either provide a log that is easier to read, or tell us what part of the build
> is taking time.

I have added "echo" statements to mark the steps, with a time calculation at the end of each step.

The part of the build that I first notice taking an abnormally long time is the "import" done by Buckminster, near the beginning of the job.
During this import, Buckminster fetches the plug-ins for the target platform (an Eclipse installation with all the required plug-ins), and the source code to compile.
So, this is an IO and network intensive operation.

If you want to see for yourself, you can run the nightly job on both build.eclipse.org and hudson.eclipse.org, and compare the console logs:
https://build.eclipse.org/hudson/job/modisco-nightly/
https://hudson.eclipse.org/hudson/job/modisco-nightly/
Comment 8 Denis Roy CLA 2010-09-10 10:44:52 EDT
> The part of the build that I first notice taking an abnormally long time is the
> "import" done by Buckminster, near the beginning of the job.
> During this import, Buckminster fetches the plug-ins for the target platform
> (an Eclipse installation with all the required plug-ins), and the source code
> to compile.

What URLs is it using to download?  Strangely, I've seen many cases where Eclipse plugins are being downloaded from mirrors (!!) instead of from download.eclipse.org, which is on the same Gigabit local network as the build servers.
Comment 9 Nicolas Bros CLA 2010-09-10 10:52:29 EDT
(In reply to comment #8)
> What URLs is it using to download?
I have specified URLs only download.eclipse.org in the Buckminster rmap, but it keeps trying mirrors anyway. This often results in failures (see Bug 324065).
I don't know how to avoid this.
Comment 10 Nicolas Bros CLA 2010-09-10 10:53:29 EDT
I got results from the time measurements I added:

modisco-nightly took 5 hr 35 min on hudson-slave1:
BUCKMINSTER SETPREF took : 6s
BUCKMINSTER IMPORT took : 19302s (5 hours and 22 minutes!)
BUCKMINSTER RESOLVE took : 76s
BUCKMINSTER BUILD took : 74s
BUCKMINSTER JUNIT took : 591s
BUCKMINSTER PERFORM 'site.p2' took : 85s

Whereas the same job took 48 min on master on build.eclipse.org:
BUCKMINSTER SETPREF took : 25s
BUCKMINSTER IMPORT took : 1213s
BUCKMINSTER RESOLVE took : 90s
BUCKMINSTER BUILD took : 164s
BUCKMINSTER JUNIT took : 1097s
BUCKMINSTER PERFORM 'site.p2' took : 234s

So, this is definitely the import phase that is slow, as it runs 16 times more
slowly on hudson.eclipse.org than on build.eclipse.org.
The other major steps in the build are actually *faster* on hudson.eclipse.org.
Comment 11 Denis Roy CLA 2010-09-10 11:27:36 EDT
That is extremely helpful. Nice to see the new servers are so much faster than build.eclipse.org.

We'll investigate why general internet access is slower on the Hudson machines, but I'll also raise the topic of downloading Eclipse bits from external sites to a broader audience.

Would it be possible for you to document (on the Wiki, perhaps adding to this page: http://wiki.eclipse.org/Hudson) how you've added these time benchmarks to your build?
Comment 12 Nicolas Bros CLA 2010-09-10 12:33:13 EDT
(In reply to comment #11)
> Would it be possible for you to document (on the Wiki, perhaps adding to this
> page: http://wiki.eclipse.org/Hudson) how you've added these time benchmarks to
> your build?

I have added a section on the Hudson "how to":
http://wiki.eclipse.org/Hudson/HowTo#benchmark_a_build
Comment 13 Denis Roy CLA 2010-09-13 14:24:11 EDT
From the Hudson master, I've ran a few wget commands, and it seems throughput (4MB/sec downloading from Amazon AWS) and connection speed (0.12 second to fetch the google home page) are not issues.

So at this point, I'd like to know what exactly happens in a Buckminster import.  Specific URLs, download files, etc.
Comment 14 Nicolas Bros CLA 2010-09-14 10:31:14 EDT
(In reply to comment #13)
> From the Hudson master, I've ran a few wget commands, and it seems throughput
> (4MB/sec downloading from Amazon AWS) and connection speed (0.12 second to
> fetch the google home page) are not issues.
I am running the job on the "build2" group (hudson-slave1 + hudson-slave2).
Could you test the same thing on the slaves?

> So at this point, I'd like to know what exactly happens in a Buckminster
> import.  Specific URLs, download files, etc.

I've activated Buckminster debug log ("--loglevel DEBUG").
From the log, I found that the SVN imports seem to proceed at a normal speed, but get stuck in the middle.

I tried doing the SVN import from outside Buckminster, and then asking Buckminster to use the locally imported sources.

It works, but now it stays stuck on CVS imports.

For 3 builds in a row, it stayed stuck on:

!ENTRY org.eclipse.buckminster.core 0 293 2010-09-14 09:07:09.372
!MESSAGE Provider cvs(:pserver:anonymous@dev.eclipse.org:/cvsroot/eclipse,org.eclipse.test): materializing to /opt/users/hudsonbuild/workspace/modisco-nightly/buildroot/buckminster.workspace/plugins/org.eclipse.test/

(I waited 30 minutes before killing it)
Comment 15 Nicolas Bros CLA 2010-09-14 10:53:34 EDT
This doesn't happen for all CVS imports, since the step right before the one that gets stuck is also a CVS import (for :pserver:anonymous@dev.eclipse.org:/cvsroot/eclipse,org.eclipse.ant.optional.junit), and judging from the log's time stamps, it happened in 3ms!
Comment 16 Denis Roy CLA 2010-09-14 11:18:39 EDT
droy@hudson-slave1:~> time wget http://www.google.com/
2010-09-14 11:12:05 (1.45 MB/s) - `index.html' saved [8734]
real    0m0.185s

droy@hudson-slave2:~> time wget http://www.cbc.ca
2010-09-14 11:14:47 (5.05 MB/s) - `index.html.1' saved [38929]
real    0m0.037s


droy@hudson-slave1:~> wget http://d2u376ub0heus3.cloudfront.net/eclipse/downloads/drops/R-3.5.2-201002111343/eclipse-SDK-3.5.2-linux-gtk.tar.gz
2010-09-14 11:12:37 (17.1 MB/s) - `eclipse-SDK-3.5.2-linux-gtk.tar.gz.1' saved [170395548/170395548]

droy@hudson-slave2:~> wget -S http://download.eclipse.org/eclipse/downloads/drops/R-3.5.2-201002111343/eclipse-SDK-3.5.2-linux-gtk.tar.gz
2010-09-14 11:17:03 (64.6 MB/s) - `eclipse-SDK-3.5.2-linux-gtk.tar.gz' saved [170395548/170395548]

I've confirmed that these accesses do come from the site (ie, not cached).  The download speed from AWS was quite impressive, but not nearly as impressive as downloading the 170M SDK in 2.5 seconds.
Comment 17 Denis Roy CLA 2010-09-14 11:31:35 EDT
(In reply to comment #14)
> !ENTRY org.eclipse.buckminster.core 0 293 2010-09-14 09:07:09.372
> !MESSAGE Provider
> cvs(:pserver:anonymous@dev.eclipse.org:/cvsroot/eclipse,org.eclipse.test):
> materializing to
> /opt/users/hudsonbuild/workspace/modisco-nightly/buildroot/buckminster.workspace/plugins/org.eclipse.test/

I haven't seen anything in the logs which would indicate a failure.  I don't suppose there is any way you can user :local: right?

cvs -d :local:/cvsroot/eclipse co org.eclipse.test
Comment 18 Nicolas Bros CLA 2010-09-14 11:45:50 EDT
(In reply to comment #17)
> I don't suppose there is any way you can user :local: right?
> cvs -d :local:/cvsroot/eclipse co org.eclipse.test

I tried to specify :local: in Buckminster's rmap, but it doesn't seem to be supported:

ERROR   Invalid CVS repository location format: :local:/cvsroot/eclipse
ERROR   Location must have form ':methodname:[user[:password]@]host:[port]/path/to/cvsroot'
ERROR   Only the following methods are supported: ext, extssh, pserver, pserverssh2
Comment 19 Nicolas Bros CLA 2010-09-15 04:01:49 EDT
I noticed something else. There are 2 projects to check out from CVS : org.eclipse.ant.optional.junit and org.eclipse.test.

In builds #168 and #172, org.eclipse.test was imported without any problem, in a few milliseconds, but the build stalled on org.eclipse.ant.optional.junit.

In builds #169 and #170, this was the contrary: org.eclipse.ant.optional.junit was imported, but org.eclipse.test wasn't even after 1 hour.


Maybe you could try to check out several projects consecutively from CVS to try and reproduce this issue?
Comment 20 Nicolas Bros CLA 2010-09-15 04:04:08 EDT
And here is the explanation as to why :local: is not working with CVS:
http://wiki.eclipse.org/index.php/CVS_FAQ#How_do_I_use_a_local_connection_with_Eclipse.3F
Comment 21 Denis Roy CLA 2010-09-15 11:34:17 EDT
> Maybe you could try to check out several projects consecutively from CVS to try
> and reproduce this issue?

I've been trying this command for the past hour and I keep getting successful checkouts:

droy@hudson-slave2:~/test> time cvs -Qd :pserver:anonymous@dev.eclipse.org:/cvsroot/eclipse co org.eclipse.ant.optional.junit; rm -rf org.eclipse.ant.optional.junit; time cvs -Qd :pserver:anonymous@dev.eclipse.org:/cvsroot/eclipse co org.eclipse.test; rm -rf org.eclipse.test/

real    0m0.547s
user    0m0.000s
sys     0m0.004s

real    0m0.983s
user    0m0.000s
sys     0m0.000s
Comment 22 Nicolas Bros CLA 2010-09-15 11:59:40 EDT
I believe the next step to try to understand this problem would be to run Buckminster under the debugger.
Is there a way I could get ssh access to hudson.eclipse.org?
And do you know if it would be possible to debug using JDWP through an ssh tunnel?
Comment 23 Nicolas Bros CLA 2010-09-17 10:27:07 EDT
Created attachment 179118 [details]
stackdump

Here is a stackdump of Buckminster while it appears frozen.
Looking at this stackdump, I think Buckminster is waiting for p2, and p2 is waiting for a socket to connect.
Comment 24 Denis Roy CLA 2010-09-17 10:47:45 EDT
Waiting for a socket to connect to where?
Comment 25 Nicolas Bros CLA 2010-09-17 12:02:22 EDT
(In reply to comment #24)
> Waiting for a socket to connect to where?
I injected debug code in "org.eclipse.equinox.p2.repository" in my Buckminster install to find out.

The connections made by p2 are now printed to the job's log.
p2 seems to be hitting many mirrors for each file, with a long timeout between each try. There seem to be 2 timeouts set to 15 and 120 seconds.
I believe that's why it eventually finishes, but after 5 hours of timeouts.

You can try running the job : 
https://hudson.eclipse.org/hudson/job/modisco-nightly/
Comment 26 Nicolas Bros CLA 2010-09-21 06:32:35 EDT
The problem is solved.

The imports were much faster this morning, and by hacking "org.eclipse.equinox.p2.repository" in my Buckminster install to not contact mirrors outside of eclipse.org, the nightly build finished in just over 20 minutes!
Comment 27 Nicolas Bros CLA 2010-09-21 07:01:17 EDT
For comparison, here are the times recorded by the latest build:
BUCKMINSTER SETPREF took : 4s
BUCKMINSTER IMPORT took : 261s
BUCKMINSTER RESOLVE took : 72s
BUCKMINSTER BUILD took : 152s
BUCKMINSTER JUNIT took : 692s
BUCKMINSTER PERFORM 'site.p2' took : 139s
Comment 28 Denis Roy CLA 2010-09-21 10:11:39 EDT
Good stuff.  Nice troubleshooting in here, thanks.