We (the Platform) have been getting fairly frequent "connection timed out" errors during our builds. This happens when getting something from the 'downloads' server to the 'build' server, about once every one or two weeks.
[ERROR] Failed to execute goal org.eclipse.tycho.extras:tycho-p2-extras-plugin:0.23.1:mirror (mirror-build-emf) on project eclipse.platform.repository: Error during mirroring: Mirroring failed: Messages while mirroring artifact descriptors.: [Unable to read repository at http://download.eclipse.org/modeling/emf/emf/updates/2.12milestones/base/S201601280808/plugins/org.eclipse.emf.ecore_2.12.0.v20160128-0808.jar.]: connect timed out
I am not sure whether I can collect or provide more data to get at the root problem. If nothing else, I thought I should "keep track" of them ... in this bug. An exception was printed, which I have pasted below.
= = = = = = = = =
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.eclipse.ecf.internal.provider.filetransfer.httpclient4.ECFHttpClientProtocolSocketFactory.connectSocket(ECFHttpClientProtocolSocketFactory.java:84)
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
    at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:144)
    at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:131)
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
Perhaps related to bug 487945?
This happened again, on 2/23, in our Nightly build: http://download.eclipse.org/eclipse/downloads/drops4/N20160223-2000/
[ERROR] Failed to execute goal org.eclipse.tycho.extras:tycho-p2-extras-plugin:0.23.1:mirror (mirror-build-ecf) on project eclipse.platform.repository: Error during mirroring: Mirroring failed: Messages while mirroring artifact descriptors.: [Unable to read repository at http://download.eclipse.org/rt/ecf/3.12.0/site.p2/features/org.eclipse.ecf.filetransfer.httpclient4.feature_3.12.0.v20151130-0157.jar.]: connect timed out
The jar in question does exist, and I could "fetch" it a few hours later (when I noticed the build failure).
*** Bug 488383 has been marked as a duplicate of this bug. ***
We might need to increase the worker limit on download.e.o or advance its replacement machine. At some periods during the day it's reaching its 20,000 connection limit.
Connection limit was 15000, increased to 24000. I've recently upped bandwidth limit from 210 Mbps to 250 Mbps. That should carry us a while. Reopen if you still see timeouts.
Fixed means fixed.
*** Bug 487945 has been marked as a duplicate of this bug. ***
Would this same problem also cause DNS errors? I ask for two reasons:
1. I just noticed an error in one of our logs (from Tuesday morning) that said:
[WARNING] [Tue Feb 23 08:24:51 EST 2016] HTTP request failed. HTTP Error 500 (reason: Codesign tool(running on: build,456) exit status: 1.) updating: META-INF/MANIFEST.MF jarsigner: unable to sign jar: java.net.UnknownHostException: timestamp.geotrust.com Error 500: Codesign tool(running on: build,456) exit status: 1. Server response has been saved to '/opt/public/eclipse/builds/4I/gitCache/eclipse.platform.releng.aggregator/eclipse.platform.swt.binaries/bundles/org.eclipse.swt.gtk.linux.s390/target/org.eclipse.swt.gtk.linux.s390-3.105.0-SNAPSHOT.jar-1797467009475557455-RemoteJarSigner.log'
2. bug 488350. A case of our debugger not being able to "connect to the VM", which it does through "localhost". That may not sound as if "localhost" needs DNS, but I have a vague memory of hearing, long ago, that on eclipse.org infrastructure even "localhost" is "looked up" on the intranet's DNS. Is that still true? (Was it ever? :)
Last night we saw this again: builds could not "find" things on downloads (or archives, in one case), and there were "Socket timed out" messages in the log.
I suspect part of the problem is with the way we're serving the Oomph setup files. To allocate more bandwidth we're serving them via www.eclipse.org but using a ProxyPass to download.eclipse.org. We need to change that. It's a waste of connections.
This happened twice more yesterday, once in the evening during our build, and once in the afternoon during some tests. The build error "stack trace" (part of it) is below, showing the "socket timed out" error.
= = = = = =
[exec] at org.eclipse.equinox.internal.p2.artifact.repository.ArtifactRepositoryManager.loadRepository(ArtifactRepositoryManager.java:100)
[exec] at org.eclipse.equinox.internal.p2.director.app.DirectorApplication.initializeRepositories(DirectorApplication.java:532)
[exec] at org.eclipse.equinox.internal.p2.director.app.DirectorApplication.run(DirectorApplication.java:1106)
[exec] at org.eclipse.equinox.internal.p2.director.app.DirectorApplication.start(DirectorApplication.java:1293)
[exec] at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:196)
[exec] at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:134)
[exec] at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:104)
[exec] at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:388)
[exec] at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:243)
[exec] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[exec] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[exec] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[exec] at java.lang.reflect.Method.invoke(Method.java:497)
[exec] at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:670)
[exec] at org.eclipse.equinox.launcher.Main.basicRun(Main.java:609)
[exec] at org.eclipse.equinox.launcher.Main.run(Main.java:1516)
[exec] at org.eclipse.equinox.launcher.Main.main(Main.java:1489)
[exec] Caused by: java.net.SocketTimeoutException: Read timed out
[exec] at java.net.SocketInputStream.socketRead0(Native Method)
[exec] at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
[exec] at java.net.SocketInputStream.read(SocketInputStream.java:170)
[exec] at java.net.SocketInputStream.read(SocketInputStream.java:141)
[exec
Tonight, 3/21, we had a build fail due to a more specific sort of "connection time out" (from the little I have looked at it). See http://download.eclipse.org/eclipse/downloads/drops4/N20160321-2000/buildFailed-pom-version-updater
A small sample:
[ERROR] [ERROR] Some problems were encountered while processing the POMs:
[ERROR] Unresolveable build extension: Plugin org.eclipse.tycho:tycho-maven-plugin:0.23.1 or one of its dependencies could not be resolved: Could not transfer artifact org.apache.maven:maven-core:jar:3.0 from/to central (https://repo.maven.apache.org/maven2): Connect to repo.maven.apache.org:443 [repo.maven.apache.org/23.235.46.215] failed: Connection timed out @
[ERROR] Unresolveable build extension: Plugin org.eclipse.tycho:tycho-maven-plugin:0.23.1 or one of its dependencies could not be resolved: Could not transfer artifact org.apache.maven:maven-core:jar:3.0 from/to central (https://repo.maven.apache.org/maven2): Connect to repo.maven.apache.org:443 [repo.maven.apache.org/23.235.46.215] failed: Connection timed out @
[ERROR] Unknown packaging: eclipse-target-definition @ line 11, column 14
[ERROR] Unresolveable build extension: Plugin org.eclipse.tycho:tycho-maven-plugin:0.23.1 or one of its dependencies could not be resolved: Could not transfer artifact org.apache.maven:maven-core:jar:3.0 from/to central (https://repo.maven.apache.org/maven2): Connect to repo.maven.apache.org:443 [repo.maven.apache.org/23.235.46.215] failed: Connection timed out @
[ERROR] Unresolveable build extension: Plugin org.eclipse.tycho:tycho-maven-plugin:0.23.1 or one of its dependencies could not be resolved: Could not transfer artifact org.apache.maven:maven-core:jar:3.0 from/to central (https://repo.maven.apache.org/maven2): Connect to repo.maven.apache.org:443 [repo.maven.apache.org/23.235.46.215] failed: Connection timed out @
[ERROR] Unresolveable build extension: Plugin org.eclipse.tycho:tycho-maven-plugin:0.23.1 or one of its dependencies could not be resolved: Could not transfer artifact org.apache.maven:maven-core:jar:3.0 from/to central (https://repo.maven.apache.org/maven2): Connect to repo.maven.apache.org:443 [repo.maven.apache.org/23.235.46.215] failed: Connection timed out @
Tonight, 4/24, we apparently had network issues getting stuff from https://repo.maven.apache.org/maven2 from build.eclipse.org. Note: I had no trouble at all getting this dependency on my own local test machine. Also note that we do not specify that URL, https://repo.maven.apache.org/maven2. I am assuming this is some automatic thing, set up by Eclipse Foundation, or Maven itself? This problem was worse than usual since it wasn't a "temporary" glitch, but lasted about two or three hours, at least -- I tried just to repeat the build, but got the exact same error.
First failure log: http://download.eclipse.org/eclipse/downloads/drops4/I20160424-2000/buildlogs/mb060_run-maven-build_output.txt
Second failure log: http://download.eclipse.org/eclipse/downloads/drops4/I20160424-2245/buildlogs/mb060_run-maven-build_output.txt
= = = = = =
[ERROR] Failed to execute goal on project org.eclipse.equinox.p2.metadata.repository: Could not resolve dependencies for project org.eclipse.equinox:org.eclipse.equinox.p2.metadata.repository:eclipse-plugin:1.2.300-SNAPSHOT: Failed to collect dependencies at org.apache.ant:ant:jar:1.7.1: Failed to read artifact descriptor for org.apache.ant:ant:jar:1.7.1: Could not transfer artifact org.apache.ant:ant:pom:1.7.1 from/to central (https://repo.maven.apache.org/maven2): Connection reset
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal on project org.eclipse.equinox.p2.metadata.repository: Could not resolve dependencies for project org.eclipse.equinox:org.eclipse.equinox.p2.metadata.repository:eclipse-plugin:1.2.300-SNAPSHOT: Failed to collect dependencies at org.apache.ant:ant:jar:1.7.1
    at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.getDependencies(LifecycleDependencyResolver.java:221)
    at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.resolveProjectDependencies(LifecycleDependencyResolver.java:127)
    at org.apache.maven.lifecycle.internal.MojoExecutor.ensureDependenciesAreResolved(MojoExecutor.java:257)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:200)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
(In reply to David Williams from comment #13) > Tonight, 4/24, had network issues, apparently, getting stuff from > https://repo.maven.apache.org/maven2 > from build.eclipse.org. ... Looks like an illegal (and unnecessary) external dependency in p2. Filed bug 492367.
Received a similar failure from this morning's build: [ERROR] Failed to execute goal org.eclipse.cbi.maven.plugins:eclipse-cbi-plugin:1.1.3:generate-api-build-xml (default) on project eclipse-platform-parent: Execution default of goal org.eclipse.cbi.maven.plugins:eclipse-cbi-plugin:1.1.3:generate-api-build-xml failed: Plugin org.eclipse.cbi.maven.plugins:eclipse-cbi-plugin:1.1.3 or one of its dependencies could not be resolved: Failed to collect dependencies at org.eclipse.cbi.maven.plugins:eclipse-cbi-plugin:jar:1.1.3 -> org.apache.maven:maven-plugin-api:jar:3.1.1 -> org.eclipse.sisu:org.eclipse.sisu.plexus:jar:0.0.0.M5 -> org.sonatype.sisu:sisu-guice:jar:no_aop:3.1.0 -> aopalliance:aopalliance:jar:1.0: Failed to read artifact descriptor for aopalliance:aopalliance:jar:1.0: Could not transfer artifact aopalliance:aopalliance:pom:1.0 from/to central (https://repo.maven.apache.org/maven2): Connection reset
(In reply to David Williams from comment #15) > Received a similar failure from this morning's build: > Earlier in that same build was another, git-related error message, that says "fatal", but seemed to continue: To file:///gitroot/platform/eclipse.platform.releng.aggregator.git 0ad5f09..bcea47e HEAD -> master fatal: read error: Connection reset by peer Cloning into '/shared/eclipse/builds/4I/siteDir/eclipse/downloads/drops4/I20160425-0800/eclipse.platform.releng.aggregator'... remote: warning: ignoring extra bitmap file: objects/pack/pack-6602b1a9f54a0512dfbcf26e1fa11a8c6460649e.pack Note: checking out '0ad5f09cd6ccc8217c3a6c6e5425b59bcf1b3fec'. You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by performing another checkout. If you want to create a new branch to retain commits you create, you may do so (now or later) by using -b with the checkout command again. Example: git checkout -b new_branch_name HEAD is now at 0ad5f09... Build input for build I20160424-2245
There seems to be some issue with the infrastructure. I had many timeouts when accessing Git today via EGit and our Hudson jobs also run "forever", see e.g. https://hudson.eclipse.org/platform/job/eclipse.platform.ui-Gerrit/ Some are running for more than 5 hours now!
There seems to be a mix of issues in this bug, many of which are vague and/or have no reproducible use case. Plowing through the massive build log is not easy for us sysadmins. I'll reaffect this bug for performance to download.e.o specifically, as I've seen it run out of connections at peak times. The number one culprit is the "smart" 404 system since, at peak times, we must deliver in excess of 200 404 Not Found errors per second.
(In reply to Denis Roy from comment #18) > There seems to be a mix of issues in this bug, many of which are vague > and/or have no reproducible use case. Plowing through the massive build log > is not easy for us sysadmins. > > I'll reaffect this bug for performance to download.e.o specifically, as I've > seen it run out of connections at peak times. The number one culprit is the > "smart" 404 system since, at peak times, we must deliver in excess of 200 > 404 Not Found errors per second. Thanks, Denis. I have opened bug 492412 because it describes a failure case we see fairly often, but is admittedly intermittent. (And was perhaps unfairly mixed in here since to us a "network problem" is just a network problem :). The git problem we've seen today was pretty specific to today, and (from our build's point of view) is no longer a problem. Not sure if those in other countries are still seeing issues or not.
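To make it easier to separate the "download.e.o" failures from the "external connections" ones without plowing through a whole build log, something like the following could pre-sort the failing lines by remote host. This is only a rough sketch: the function name, the URL pattern, and the list of signatures are assumptions of mine, taken from the messages pasted in this bug, and not part of any existing build tooling.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Error signatures copied from messages seen in this bug (assumed, not exhaustive).
SIGNATURES = (
    "connect timed out",
    "Connection timed out",
    "Read timed out",
    "Connection reset",
    "response code 503",
)

# Loose URL pattern; stops at whitespace and common closing punctuation.
URL_RE = re.compile(r"https?://[^\s\]\)>,\"']+")


def failures_by_host(log_lines):
    """Count (host, signature) pairs for log lines that contain a known signature."""
    counts = Counter()
    for line in log_lines:
        signature = next((s for s in SIGNATURES if s in line), None)
        if signature is None:
            continue
        # A line may mention several URLs; count each distinct host once per line.
        hosts = {urlparse(url).hostname or "unknown" for url in URL_RE.findall(line)}
        for host in hosts or {"unknown"}:
            counts[(host, signature)] += 1
    return counts
```

Lines that show a signature but no URL (for example a bare "Read timed out" stack frame) land under "unknown", so they are still counted but not attributed to a server.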
> The git problem we've seen today Which was odd, since you are using the 'file://' protocol. There are no connections to be reset by any peer. > To file:///gitroot/platform/eclipse.platform.releng.aggregator.git > 0ad5f09..bcea47e HEAD -> master > fatal: read error: Connection reset by peer
Created attachment 261275 [details] stack trace at end of log When it rains it pours! The latest Platform "BUILD FAILED" seems to fall into this bug, instead of the "external connections" one. [ERROR] Failed to execute goal org.eclipse.tycho.extras:tycho-p2-extras-plugin:0.25.0:mirror (mirror-build-ecf) on project eclipse.platform.repository: Error during mirroring: Mirroring failed: HTTP Server 'Service Unavailable': http://download.eclipse.org/rt/ecf/3.13.1/site.p2/artifacts.xml: HttpComponents connection error response code 503. -> [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.eclipse.tycho.extras:tycho-p2-extras-plugin:0.25.0:mirror (mirror-build-ecf) on project eclipse.platform.repository: Error during mirroring Full stack trace is attached.
(In reply to David Williams from comment #21) > mirroring: Mirroring failed: HTTP Server 'Service Unavailable': > http://download.eclipse.org/rt/ecf/3.13.1/site.p2/artifacts.xml: > HttpComponents connection error response code 503. -> [Help 1] Thanks, that was helpful. As I suspected in comment 18, our "smart" 404 handler is problematic, since the above URL is a 404 Not Found but caused a 503 Service Unavailable. I'll see if I can tweak the handler some more, but new hardware will be incoming soon enough.
Created attachment 261308 [details] 404/50x error distribution
Here's a chart that maps the 404 vs. 50x error distribution from Tuesday (yesterday) and some random Tuesday in November. Some observations:
- We definitely have very pronounced 404 spikes throughout the day
- A 404 spike often leads to a 50x error spike
- A 404 won't break a build, but a 50x will
- We have a steady stream of 50x errors all the time, which is odd
- The number of 404s we serve today is much greater than that of November
This doesn't really change our conclusion to redesign download.e.o but I thought it would be an interesting data point.
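For anyone who wants to reproduce this kind of 404 vs. 50x comparison from an access log, a minimal sketch follows. It assumes an Apache-style log line with a bracketed timestamp and the status code right after the quoted request; the actual log format on download.e.o may differ, and the function name and regular expression are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Assumed line shape (Apache common/combined log format), for example:
# 192.0.2.1 - - [03/May/2016:21:15:02 -0400] "GET /x HTTP/1.1" 404 512
LINE_RE = re.compile(
    r'\[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\d{2} [^\]]*\] "[^"]*" (\d{3})(?:\s|$)'
)


def hourly_error_counts(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (404 counts, 50x counts), each keyed by hour of day (0-23)."""
    not_found = Counter()
    server_errors = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        hour, status = int(match.group(1)), match.group(2)
        if status == "404":
            not_found[hour] += 1
        elif status.startswith("50"):
            server_errors[hour] += 1
    return not_found, server_errors
```

Comparing the two counters hour by hour is one way to check whether a 404 spike is followed by a 50x spike, as the chart suggests.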
*** Bug 492104 has been marked as a duplicate of this bug. ***
Yet another failed build tonight, this time due to p2 going to http://archive.eclipse.org/webtools/downloads/drops/T3.2.0/I-3.2.0-20100521140232/repository/ from build.eclipse.org. Received a stack trace like this (a partial one):
[exec] Caused by: java.net.SocketTimeoutException: Read timed out
[exec] at java.net.SocketInputStream.socketRead0(Native Method)
[exec] at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
[exec] at java.net.SocketInputStream.read(SocketInputStream.java:170)
[exec] at java.net.SocketInputStream.read(SocketInputStream.java:141)
[exec] at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
David, do you know at what time that happened?
Created attachment 261464 [details] Interruption download.e.o did suffer an interruption (gray line). Not much we can do about it right now until the hardware I've ordered comes in.
(In reply to Denis Roy from comment #26) > David, do you know at what time that happened? It was around 9 or 10 PM, Tuesday, 5/3. I can not quite tell if that is "in the grey line" of your graph, but is close enough to say "that was it". (In reply to Denis Roy from comment #27) > Created attachment 261464 [details] > Interruption > > download.e.o did suffer an interruption (gray line). > > Not much we can do about it right now until the hardware I've ordered comes > in. I'm not looking for a commitment but am curious as to when that is expected. Just ballpark ... a few weeks? a few months? The next fiscal year? :) [I am partially just curious, but am also loosely considering some work to improve working around the limitation, such as auto-restart (once or twice) if we get a "BUILD FAILED" and there is a "SocketException" in the log.]
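The auto-restart idea could look roughly like the sketch below. It is not the Platform build's actual script: `run_build` is a hypothetical callable that runs one build and returns its log text, and treating both "SocketException" and "SocketTimeoutException" as restart triggers is my assumption based on the traces in this bug.

```python
from typing import Callable, Tuple

# Markers that suggest a network glitch rather than a real build break (assumed list).
NETWORK_MARKERS = ("SocketException", "SocketTimeoutException")


def is_network_failure(log_text: str) -> bool:
    """True if the log shows a failed build together with a socket-related exception."""
    return "BUILD FAILED" in log_text and any(m in log_text for m in NETWORK_MARKERS)


def run_with_restarts(run_build: Callable[[], str], max_restarts: int = 2) -> Tuple[bool, int]:
    """Run the build, restarting at most max_restarts times on network-looking failures.

    Returns (succeeded, number_of_attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        log_text = run_build()
        if "BUILD FAILED" not in log_text:
            return True, attempts
        # Give up on non-network failures, or once the restart budget is used up.
        if not is_network_failure(log_text) or attempts > max_restarts:
            return False, attempts
```

Limiting the restarts to one or two keeps a genuine outage (like the two-to-three-hour one on 4/24) from turning into an endless loop.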
About 4 weeks to get a few servers and set them up. But I think we have some old hardware in the lab that we can cobble together to help us gain some short-term reliability in the meanwhile. I'll also see if I can tweak our 404 handler.
I've set up some connection count logging on the proxy servers to allow us to see if that's a factor. -M.
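How the logging is actually done on the proxy servers is not stated here. Purely as an illustration, connection counts could be derived from `netstat -tan` style output along these lines; the function name and the choice of port 80 are assumptions.

```python
from collections import Counter


def count_tcp_states(netstat_output: str, port: int = 80) -> Counter:
    """Count TCP connection states for sockets whose local port matches `port`.

    Expects text in the layout printed by `netstat -tan` on Linux:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a tcp/tcp6 row with a state column.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_port = fields[3].rsplit(":", 1)[-1]
        if local_port == str(port):
            states[fields[5]] += 1
    return states
```

Logging `states["ESTABLISHED"]` with a timestamp every minute or so would give a series that can be compared against the configured connection limit.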
In an attempt to help by providing data: I saw several "socket timeout" exceptions again in our builds "tonight". It started about 10:30 PM 5/17. A typical exception was
[exec] !ENTRY org.eclipse.equinox.p2.repository 4 1002 2016-05-17 22:35:01.230
[exec] !MESSAGE Communication with repository at http://download.eclipse.org/webtools/releng/repository/ failed.
[exec] !STACK 0
[exec] java.net.SocketTimeoutException: Read timed out
[exec] at java.net.SocketInputStream.socketRead0(Native Method)
In addition, "build.eclipse.org" seemed overloaded, with "load" numbers being reported from 30 to 50 from about 10:30 to 11:30. It appears to be settling down to the usual 4 and 5, now.
= = = = = = = =
BUT, surprisingly, it appears p2 was able to retry and eventually was satisfied! That is actually hard to see in my logs, but the build did finally succeed. I wonder if the response code, or something, changed that allowed p2 to retry? I also did not see a "connection reset" in the log. So, good news wrapped in a mystery.
= = = = = = = =
If there is more good news about "tonight's failure", I also witnessed similar socket exceptions on my home system during the same time, which I normally never see. (It was attempting to connect to a different p2 repository). The best news of that is that my home system was supposed to be using a "local mirror", but was not, due to a "typo" in my settings, which is now fixed thanks to my seeing the socket timeout exceptions. So, maybe that is sort of half good news -- good news for me, but the same server problem. (My home build did fail, not sure why it did, but the build.eclipse.org build did not -- perhaps because my system was not overloaded so the "retries" happened very quickly :)
Thanks for the info, David.
Another thing that is impacting download connections is that, at peak times, the Oomph setup files take up to 30% of our high-priority bandwidth, leaving download.eclipse.org with very little room. I'll look to add some bandwidth temporarily so that download.e.o does not always have a queue 10 miles long.
We've since added 100Mb of bandwidth, which has brought down our connection counts dramatically.
We've added four new download servers to the pool. Each new server has the ability to handle 2x the connections the old download.e.o could handle. Closing as fixed.