Bug 367433 - WebSocket ends up in CLOSE_WAIT when behind ELB
Summary: WebSocket ends up in CLOSE_WAIT when behind ELB
Status: RESOLVED FIXED
Alias: None
Product: Jetty
Classification: RT
Component: documentation
Version: 7.5.4
Hardware: PC Mac OS X - Carbon (unsup.)
Importance: P3 normal
Target Milestone: 7.5.x
Assignee: Joakim Erdfelt CLA
QA Contact:
URL:
Whiteboard:
Keywords: Documentation
Depends on:
Blocks:
 
Reported: 2011-12-22 10:52 EST by jfarcand CLA
Modified: 2014-07-17 16:00 EDT
CC: 3 users

See Also:


Attachments

Description jfarcand CLA 2011-12-22 10:52:37 EST
Build Identifier: 

We are using Jetty 7/8 with WebSocket in production, via the Atmosphere Framework. I see a lot of sockets stuck in CLOSE_WAIT when I load balance Jetty with Amazon ELB:

tcp        1      0 10.168.175.224:8000         10.160.41.231:55149         CLOSE_WAIT  
tcp        1      0 10.168.175.224:8000         10.160.41.231:24941         CLOSE_WAIT  
tcp        1      0 10.168.175.224:8000         10.160.41.231:24429         CLOSE_WAIT  
tcp        1      0 10.168.175.224:8000         10.160.41.231:24682         CLOSE_WAIT  
tcp        1      0 10.168.175.224:8000         10.160.41.231:55403         CLOSE_WAIT  
tcp        1      0 10.168.175.224:8000         10.160.41.231:24939         CLOSE_WAIT  
tcp        1      0 10.168.175.224:8000         10.160.41.231:24427         CLOSE_WAIT  
tcp        1      0 10.168.175.224:8000         10.160.41.231:24937         CLOSE_WAIT

I see the CLOSE_WAIT count increasing under a load of approximately 400 requests/second. I tried 8.1.0.RC1 and the latest SNAPSHOT and the issue is still there. Not all of the CLOSE_WAIT sockets are reclaimed by the OS.

If I don't front Jetty with ELB, the number of CLOSE_WAIT sockets is close to 5 times lower, but I still see persistent ones.

I will update this issue with network/traffic information between ELB and Jetty soon.


Reproducible: Always
Comment 1 Greg Wilkins CLA 2011-12-23 03:21:21 EST
I've added some unit tests to 
jetty-websocket/src/test/java/org/eclipse/jetty/websocket/WebSocketMessageRFC6455Test.java

to try to reproduce, but no joy.  Neither testTCPClose nor testTCPHalfClose leaves a connection in CLOSE_WAIT.

Could you try to reproduce in a test harness?  or capture a TCP/IP trace?
Comment 2 jfarcand CLA 2011-12-23 08:47:33 EST
OK, I will try, but it needs to run behind ELB. As you suspected, I've set the websocket/continuation timeout lower than the ELB timeout, and almost all of the CLOSE_WAIT sockets are gone. So it's clearly ELB that causes this. More information coming today or next year :-)
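[Editorial note: the workaround above — an idle timeout shorter than the ELB idle timeout, which defaults to 60 seconds — can be sketched as a jetty.xml connector configuration. This is a sketch, not the reporter's actual config; the connector class matches Jetty 7/8, and the 30000 ms value and port 8000 are illustrative choices.]

```xml
<Configure id="Server" class="org.eclipse.jetty.server.Server">
  <Call name="addConnector">
    <Arg>
      <New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
        <Set name="port">8000</Set>
        <!-- Idle timeout in milliseconds. Keeping it below the ELB idle
             timeout (60 s by default) lets Jetty close idle connections
             first, instead of waiting in CLOSE_WAIT after ELB drops them. -->
        <Set name="maxIdleTime">30000</Set>
      </New>
    </Arg>
  </Call>
</Configure>
```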
Comment 3 Jan Bartel CLA 2012-08-30 01:17:13 EDT
Joakim,

maybe you can take over looking at this issue, given your focus on websocket?

thanks
Jan
Comment 4 Joakim Erdfelt CLA 2013-01-23 11:28:05 EST
Considering the websocket side of this fixed.

Moving to documentation component for addition to documentation about this situation.

Leaving open and assigned to me.
Comment 5 Ron Gonzalez CLA 2013-09-05 09:29:56 EDT
We have also experienced the same issue, since Jetty 6.1.26.
We patched the problem by adding a call to _socket.close() in SocketEndPoint.java's shutdownOutput() method.

We only figured out what the problem was because we kept getting "method not allowed" exceptions for SSL sockets. We ran the following diff to identify the changes:

http://grepcode.com/file_/repo1.maven.org/maven2/org.mortbay.jetty/jetty/6.1.26/org/mortbay/io/bio/SocketEndPoint.java/?v=diff&id2=6.1.25

This issue causes leaked file descriptors, which eventually crash applications or degrade performance significantly.

Originally, we thought the issue was related to HTTP 1.0's lack of connection keep-alive when sending requests to the SSL listener. We were able to reproduce it on Jetty 6.1.26 using the following Python script (but we cannot reproduce the same behavior on 7). We do know that it happens when using load balancers, proxies, or other intermediate devices.

import urllib
import time

while 1:
    f = urllib.urlopen("http://<host>:port/resource")
    print f.read()
    #time.sleep(1)

What we do know is that this issue affects NIO as well as BIO, and that we might patch it by adding a call to close the socket in nio/ChannelEndPoint.java's shutdownOutput() method.
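[Editorial note: the patch described above can be illustrated with plain JDK sockets — a sketch, not Jetty's actual SocketEndPoint code. shutdownOutput() performs only the TCP half-close (sends FIN); the peer then sits in CLOSE_WAIT until the application calls close(), which is what releases the file descriptor.]

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class HalfCloseDemo {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {
            Socket client = new Socket("127.0.0.1", server.getLocalPort());
            Socket accepted = server.accept();

            // TCP half-close: our FIN is sent, but the descriptor stays open.
            client.shutdownOutput();

            // The accepted side reads EOF; it is now in CLOSE_WAIT until the
            // application actually closes it.
            int eof = accepted.getInputStream().read();
            System.out.println("read after half-close: " + eof);

            // The patch described above amounts to guaranteeing this close()
            // happens, so the leaked descriptor is released.
            accepted.close();
            client.close();
            System.out.println("accepted closed: " + accepted.isClosed());
        }
    }
}
```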
Comment 6 Ron Gonzalez CLA 2013-09-05 10:30:23 EDT
I found the following comment in the documentation here:
http://www.eclipse.org/jetty/documentation/current/configuring-connectors.html

soLingerTime: A value >= 0 sets the socket SO_LINGER value in milliseconds. Jetty attempts to gently close all TCP/IP connections with proper half-close semantics, so a linger timeout should not be required, and thus the default is -1.

with a link to the following discussion:
http://stackoverflow.com/questions/3757289/tcp-option-so-linger-zero-when-its-required


I believe, however, that load balancers/proxies are not allowing clean connection closes, resulting in hundreds of connections in TIME_WAIT and causing the Jetty process to eventually hit its file descriptor limit.

Are there any other prescribed workarounds for this other than setting soLingerTime?
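[Editorial note: for completeness, setting soLingerTime is a one-line connector change. The jetty.xml fragment below is a sketch — the connector class and port are illustrative — and the documentation quoted above recommends leaving the default of -1 (disabled).]

```xml
<New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
  <Set name="port">8000</Set>
  <!-- SO_LINGER value; the documentation quoted above advises against
       setting this and defaults it to -1 (linger disabled). -->
  <Set name="soLingerTime">0</Set>
</New>
```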
Comment 7 Jan Bartel CLA 2013-11-07 22:23:19 EST
Joakim,

Assigning back to you to do the doco.

Jan
Comment 8 Joakim Erdfelt CLA 2014-07-17 16:00:30 EDT
Closing as documentation is updated about timeouts.