Bug 492104 - Bad gateway, read timeout errors for download.eclipse.org
Status: CLOSED DUPLICATE of bug 487915
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: Servers
Version: unspecified
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Eclipse Webmaster CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 492097
Depends on:
Blocks:
 
Reported: 2016-04-20 11:35 EDT by Marc-André Laperle CLA
Modified: 2016-05-03 10:32 EDT
Description Marc-André Laperle CLA 2016-04-20 11:35:34 EDT
We have been seeing errors when builds contact download.eclipse.org. They used to be rare but are now much more frequent, and they seem to occur mostly in the 4AM-10AM time frame.

For example, a Trace Compass build:
10:09:12 [ERROR] Failed to resolve target definition tracecompass-e4.5.target: Failed to load p2 metadata repository from location http://download.eclipse.org/tools/cdt/releases/8.8.1/: Communication with repository at http://download.eclipse.org/tools/cdt/releases/8.8.1 failed. Read timed out -> [Help 1]

Another Trace Compass build:
04:03:51  [ERROR] Failed to resolve target definition /jobs/genie.tracecompass/tracecompass-master-nightly/workspace/releng/org.eclipse.tracecompass.target/tracecompass-e4.6.target: Failed to load p2 metadata repository from location http://download.eclipse.org/tools/cdt/builds/neon/milestones/: HTTP Server 'Bad Gateway' : http://download.eclipse.org/tools/cdt/builds/neon/milestones/content.xml: HttpComponents connection error response code 502. -> [Help 1]

A CDT build:
[ERROR] Failed to resolve target definition /jobs/genie.cdt/cdt-verify/workspace/releng/org.eclipse.cdt.target/cdt.target: Failed to load p2 metadata repository from location http://download.eclipse.org/tm/updates/4.0/: HTTP Server 'Bad Gateway' : http://download.eclipse.org/tm/updates/4.0/content.xml: HttpComponents connection error response code 502. -> [Help 1]
[ERROR]


We strive to achieve stable and reliable builds, and this is one of the causes of instability. It would be great if the situation could be improved.
Comment 1 Denis Roy CLA 2016-04-20 13:24:07 EDT
*** Bug 492097 has been marked as a duplicate of this bug. ***
Comment 2 Marc-André Laperle CLA 2016-04-26 10:37:41 EDT
Two of our builds failed similarly at 4AM this morning. Just adding some data points to the bug.
Comment 3 Pierre-Charles David CLA 2016-05-03 05:31:02 EDT
We're again seeing the same kind of failures in Sirius (as initially reported in bug 492097), for example: https://hudson.eclipse.org/sirius/view/active/job/sirius-master/PLATFORM=neon,jdk=JDK-1.8.0/1583/console

[ERROR] Failed to resolve target definition /jobs/genie.sirius/sirius-master/workspace/PLATFORM/neon/jdk/JDK-1.8.0/packaging/org.eclipse.sirius.parent/../../releng/org.eclipse.sirius.targets/./sirius_neon.target: Failed to load p2 metadata repository from location http://download.eclipse.org/sirius/updates/nightly/latest/neon/incubation/: HTTP Server 'Bad Gateway' : http://download.eclipse.org/sirius/updates/nightly/latest/neon/incubation/content.xml: HttpComponents connection error response code 502. -> [Help 1]

Note that the content.xml file mentioned here does not exist, as the http://download.eclipse.org/sirius/updates/nightly/latest/neon/incubation/ repo is a composite p2 repo. My understanding is that the HTTP server should fail quickly with a 404 in such a case, letting p2 try again with the correct compositeContent.xml. The 502 error returned instead aborts the whole build.
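
To make the failure mode above concrete, here is a minimal sketch of the fallback behaviour as I understand it. It is a simplified model, not p2's actual implementation: the function name resolve_metadata, the fetch_status callback, and the two-entry candidate list are all hypothetical and chosen only for illustration.

```python
def resolve_metadata(fetch_status, repo_url,
                     candidates=("content.xml", "compositeContent.xml")):
    """Return the URL of the first candidate metadata file that exists.

    fetch_status is a caller-supplied function mapping a URL to an HTTP
    status code (hypothetical; it stands in for the real HTTP client).
    """
    for name in candidates:
        url = repo_url.rstrip("/") + "/" + name
        status = fetch_status(url)
        if status == 200:
            return url
        if status == 404:
            # File is simply absent: move on to the next candidate.
            continue
        # Anything else (for example 502) is treated as a hard failure,
        # so the remaining candidates are never tried.
        raise RuntimeError(f"{url}: unexpected HTTP status {status}")
    raise FileNotFoundError(f"no metadata file found under {repo_url}")
```

With a server that answers 404 for content.xml and 200 for compositeContent.xml, the lookup succeeds on the second probe. With a server that answers 502 for the first probe, the lookup fails before compositeContent.xml is ever requested, which matches what the build log shows.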
Comment 4 Denis Roy CLA 2016-05-03 09:37:38 EDT
> My understanding is that the HTTP server should
> fail quickly with a 404 in such a case, letting p2 try again with the
> correct compositeContent.xml. The 502 error returned instead aborts the
> whole build.


That should be the case. However, many years ago we changed the way our 404 is handled, as we are serving in excess of 14M 404's per day. That's an average of 164 404's per second, and peak times exceed 200 404's per second.

We don't want to return a beautiful 13K web page, since most of the clients are Java clients, so they get a 13-byte "404 Not Found" response.

The 50x codes we're seeing today are a manifestation of the 404 handler being overwhelmed. We do have new hardware on the way to help fix the problem.
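
For anyone who wants to check the numbers above, here is a rough back-of-the-envelope sketch. It assumes "13K" means 13 * 1024 bytes and counts response bodies only (headers are ignored); the helper names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(requests_per_day):
    """Average request rate for a given daily total."""
    return requests_per_day / SECONDS_PER_DAY


def per_day(rate_per_second):
    """Daily total for a given average request rate."""
    return rate_per_second * SECONDS_PER_DAY


def body_bytes_per_second(rate_per_second, body_bytes):
    """Response-body bandwidth, ignoring headers (an assumption)."""
    return rate_per_second * body_bytes


if __name__ == "__main__":
    print(per_second(14_000_000))                 # about 162 per second
    print(per_day(164))                           # 14,169,600 per day
    print(body_bytes_per_second(164, 13))         # 13-byte plain response
    print(body_bytes_per_second(164, 13 * 1024))  # 13K HTML page
```

Exactly 14M per day works out to about 162 per second, and 164 per second corresponds to roughly 14.17M per day, which is consistent with "in excess of 14M". At that rate the 13-byte body costs about 2 KB/s of body traffic, versus roughly 2 MB/s for the full page.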
Comment 5 Denis Roy CLA 2016-05-03 09:38:44 EDT

*** This bug has been marked as a duplicate of bug 487915 ***
Comment 6 Pierre-Charles David CLA 2016-05-03 09:58:40 EDT
(In reply to Denis Roy from comment #4)
> > My understanding is that the HTTP server should
> > fail quickly with a 404 in such a case, letting p2 try again with the
> > correct compositeContent.xml. The 502 error returned instead aborts the
> > whole build.
> 
> 
> That should be the case. However, many years ago we changed the way our 404
> is handled, as we are serving in excess of 14M 404's per day. That's an
> average of 164 404's per second, and peak times exceed 200 404's per second.

OK, thanks for the explanation. Does this mean adding explicit p2.index files in our repos would help reduce the load? The Sirius repos represent a tiny drop in the global load issue, but maybe it would keep us out of the problematic path?
Comment 7 Denis Roy CLA 2016-05-03 10:32:06 EDT
Reducing the amount of 404s we serve would definitely help.
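
For reference, a minimal sketch of generating the p2.index file discussed above for a composite repository. The file contents follow the p2 index convention as I understand it (please verify against the p2 documentation before relying on it); the helper name write_p2_index and the idea of writing into a local copy of the repository directory are assumptions made for this example.

```python
from pathlib import Path

# p2.index contents for a composite repository. The trailing "!" entry is
# meant to tell p2 to stop probing after the listed file names.
# Assumption: the repository publishes compositeContent.xml and
# compositeArtifacts.xml (not the .jar variants).
P2_INDEX_COMPOSITE = (
    "version=1\n"
    "metadata.repository.factory.order=compositeContent.xml,\\!\n"
    "artifact.repository.factory.order=compositeArtifacts.xml,\\!\n"
)


def write_p2_index(repo_dir):
    """Write p2.index into repo_dir and return the path of the new file."""
    target = Path(repo_dir) / "p2.index"
    target.write_text(P2_INDEX_COMPOSITE, encoding="ascii")
    return target
```

With such a file in place, a client that honours p2.index should request compositeContent.xml directly instead of first probing for content.xml, which removes one 404 per repository load.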