Bug 367433 - WebSocket ends up in CLOSE_WAIT when behind ELB
Summary: WebSocket ends up in CLOSE_WAIT when behind ELB
Status: RESOLVED FIXED
Alias: None
Product: Jetty
Classification: RT
Component: documentation
Version: 7.5.4
Hardware: PC Mac OS X - Carbon (unsup.)
Importance: P3 normal
Target Milestone: 7.5.x
Assignee: Joakim Erdfelt CLA
QA Contact:
URL:
Whiteboard:
Keywords: Documentation
Depends on:
Blocks:
 
Reported: 2011-12-22 10:52 EST by jfarcand CLA
Modified: 2014-07-17 16:00 EDT
CC: 3 users

See Also:


Attachments

Description jfarcand CLA 2011-12-22 10:52:37 EST
Build Identifier: 

We are using Jetty 7/8 with WebSocket in production, via the Atmosphere Framework. I see a lot of sockets stuck in CLOSE_WAIT when I load balance Jetty with Amazon ELB:

tcp        1      0 10.168.175.224:8000         10.160.41.231:55149         CLOSE_WAIT  
tcp        1      0 10.168.175.224:8000         10.160.41.231:24941         CLOSE_WAIT  
tcp        1      0 10.168.175.224:8000         10.160.41.231:24429         CLOSE_WAIT  
tcp        1      0 10.168.175.224:8000         10.160.41.231:24682         CLOSE_WAIT  
tcp        1      0 10.168.175.224:8000         10.160.41.231:55403         CLOSE_WAIT  
tcp        1      0 10.168.175.224:8000         10.160.41.231:24939         CLOSE_WAIT  
tcp        1      0 10.168.175.224:8000         10.160.41.231:24427         CLOSE_WAIT  
tcp        1      0 10.168.175.224:8000         10.160.41.231:24937         CLOSE_WAIT

I see the CLOSE_WAIT count increasing under a load of approximately 400 requests/second. I tried 8.1.0.RC1 and the latest SNAPSHOT and the issue is still there. Not all of the CLOSE_WAIT sockets are reclaimed by the OS.

If I don't front Jetty with ELB, the number of CLOSE_WAIT sockets is close to 5 times lower, but I still see persistent ones.

I will update this issue with network/traffic information between ELB and Jetty soon.


Reproducible: Always
Comment 1 Greg Wilkins CLA 2011-12-23 03:21:21 EST
I've added some unit tests to 
jetty-websocket/src/test/java/org/eclipse/jetty/websocket/WebSocketMessageRFC6455Test.java

to try to reproduce, but no joy.  Neither testTCPClose nor testTCPHalfClose leaves a connection in CLOSE_WAIT.

Could you try to reproduce in a test harness?  or capture a TCP/IP trace?
Comment 2 jfarcand CLA 2011-12-23 08:47:33 EST
OK, I will try, but it needs to run behind ELB. As you suspected, I've set the websocket/continuation timeout lower than the ELB timeout, and almost all of the CLOSE_WAIT sockets are gone. So it's clearly ELB that causes this. More information coming today or next year :-)
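[Editorial note: the workaround above — an idle timeout shorter than the ELB idle timeout, which defaults to 60 seconds — can be sketched as a jetty.xml connector configuration. This is a sketch, not the reporter's actual config; the connector class matches Jetty 7/8, and the 30000 ms value and port 8000 are illustrative choices.]

```xml
<Configure id="Server" class="org.eclipse.jetty.server.Server">
  <Call name="addConnector">
    <Arg>
      <New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
        <Set name="port">8000</Set>
        <!-- Idle timeout in milliseconds. Keeping it below the ELB idle
             timeout (60 s by default) lets Jetty close idle connections
             first, instead of waiting in CLOSE_WAIT after ELB drops them. -->
        <Set name="maxIdleTime">30000</Set>
      </New>
    </Arg>
  </Call>
</Configure>
```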
Comment 3 Jan Bartel CLA 2012-08-30 01:17:13 EDT
Joakim,

maybe you can take over looking at this issue, given your focus on websocket?

thanks
Jan
Comment 4 Joakim Erdfelt CLA 2013-01-23 11:28:05 EST
Considering the websocket side of this fixed.

Moving to documentation component for addition to documentation about this situation.

Leaving open and assigned to me.
Comment 5 Ron Gonzalez CLA 2013-09-05 09:29:56 EDT
We have also experienced the same issue, since Jetty 6.1.26.
We patched the problem by adding a call to _socket.close() in SocketEndPoint.java's shutdownOutput() method.

We only figured out what the problem was because we kept getting "method not allowed" exceptions for SSL sockets. We ran the following diff to identify the changes:

http://grepcode.com/file_/repo1.maven.org/maven2/org.mortbay.jetty/jetty/6.1.26/org/mortbay/io/bio/SocketEndPoint.java/?v=diff&id2=6.1.25

This issue causes leaked file descriptors, which eventually crash applications or degrade performance significantly.

Originally, we thought the issue was related to HTTP 1.0's lack of connection keep-alive when sending requests to the SSL listener. We were able to reproduce it on Jetty 6.1.26 using the following Python script (but we cannot reproduce the same behavior on 7). We do know that it happens when using load balancers, proxies, or other intermediate devices.

import urllib
import time

while 1:
    f = urllib.urlopen("http://<host>:port/resource")
    print f.read()
    #time.sleep(1)

What we do know is that this issue affects NIO as well as BIO, and that we might patch it by adding a call to close the socket in nio/ChannelEndPoint.java's shutdownOutput() method.
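[Editorial note: the patch described above can be illustrated with plain JDK sockets — a sketch, not Jetty's actual SocketEndPoint code. shutdownOutput() performs only the TCP half-close (sends FIN); the peer then sits in CLOSE_WAIT until the application calls close(), which is what releases the file descriptor.]

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class HalfCloseDemo {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {
            Socket client = new Socket("127.0.0.1", server.getLocalPort());
            Socket accepted = server.accept();

            // TCP half-close: our FIN is sent, but the descriptor stays open.
            client.shutdownOutput();

            // The accepted side reads EOF; it is now in CLOSE_WAIT until the
            // application actually closes it.
            int eof = accepted.getInputStream().read();
            System.out.println("read after half-close: " + eof);

            // The patch described above amounts to guaranteeing this close()
            // happens, so the leaked descriptor is released.
            accepted.close();
            client.close();
            System.out.println("accepted closed: " + accepted.isClosed());
        }
    }
}
```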
Comment 6 Ron Gonzalez CLA 2013-09-05 10:30:23 EDT
I found the following comment in the documentation here:
http://www.eclipse.org/jetty/documentation/current/configuring-connectors.html

soLingerTime: A value >= 0 sets the socket SO_LINGER value in milliseconds. Jetty attempts to gently close all TCP/IP connections with proper half-close semantics, so a linger timeout should not be required, and thus the default is -1.

with a link to the following discussion:
http://stackoverflow.com/questions/3757289/tcp-option-so-linger-zero-when-its-required


I believe, however, that load balancers/proxies are not allowing clean connection closes, resulting in hundreds of connections in TIME_WAIT and causing the Jetty process to eventually hit its file descriptor limit.

Are there any other prescribed workarounds for this other than setting soLingerTime?
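[Editorial note: for completeness, setting soLingerTime is a one-line connector change. The jetty.xml fragment below is a sketch — the connector class and port are illustrative — and the documentation quoted above recommends leaving the default of -1 (disabled).]

```xml
<New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
  <Set name="port">8000</Set>
  <!-- SO_LINGER value; the documentation quoted above advises against
       setting this and defaults it to -1 (linger disabled). -->
  <Set name="soLingerTime">0</Set>
</New>
```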
Comment 7 Jan Bartel CLA 2013-11-07 22:23:19 EST
Joakim,

Assigning back to you to do the doco.

Jan
Comment 8 Joakim Erdfelt CLA 2014-07-17 16:00:30 EDT
Closing as documentation is updated about timeouts.