Bug 300394 - [ftp] Deadlock due to NOOP command not receiving response on the main thread
Summary: [ftp] Deadlock due to NOOP command not receiving response on the main thread
Status: RESOLVED FIXED
Alias: None
Product: Target Management
Classification: Tools
Component: RSE (show other bugs)
Version: 3.1.1   Edit
Hardware: PC Linux
: P1 critical (vote)
Target Milestone: 3.2 M6   Edit
Assignee: Martin Oberhuber CLA
QA Contact: Martin Oberhuber CLA
URL:
Whiteboard:
Keywords: helpwanted
: 232915 255117 260796 280515 281731 300455 (view as bug list)
Depends on: 243921 269171
Blocks: 300455 305986
  Show dependency tree
 
Reported: 2010-01-21 11:17 EST by Joost Timmerman CLA
Modified: 2010-03-16 09:36 EDT (History)
6 users (show)

See Also:


Attachments
In this session RSE crashed (5.82 KB, text/x-log)
2010-01-21 11:17 EST, Joost Timmerman CLA
no flags Details
a JStack trace of a deadlocked eclipse (18.95 KB, text/plain)
2010-01-25 10:11 EST, Joost Timmerman CLA
no flags Details
Simple patch fixing the issue (2.66 KB, patch)
2010-01-26 04:06 EST, Martin Oberhuber CLA
no flags Details | Diff
Stacktrace of a patched deadlocked RSE (27.61 KB, text/plain)
2010-01-26 14:21 EST, Joost Timmerman CLA
no flags Details
Console output of Eclipse launching a patched RSE (854 bytes, text/plain)
2010-01-26 14:21 EST, Joost Timmerman CLA
no flags Details
Patched RSE/Eclipse crashing (18.57 KB, text/plain)
2010-01-26 14:21 EST, Joost Timmerman CLA
no flags Details
Additional patch (5.70 KB, patch)
2010-03-15 19:21 EDT, Martin Oberhuber CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Joost Timmerman CLA 2010-01-21 11:17:01 EST
Created attachment 156810 [details]
In this session RSE crashed

I'm not exactly sure what is happening here. I'm editing some files on my server at home, and suddenly my Eclipse crashes. At some point RSE can't sync or list a file or folder contents. The process of doing so is pending in the background. Every action after this (save, refresh etc) gets me to a full crash where Eclipse won't respond to anything i do and i have to manually kill the process.

I can't really get much clearer then this. Sometimes i can work for hours without any crashes, sometimes i have crashes for every 2 files i'm trying to edit.

In the meanwhile my files are not saved or synced so I'm losing work here.

I am using 
* Eclipse 3.5.1 from the ubuntu repositories
* Ubuntu... (2.6.31-17, but had same with 16)
* RSE End-User Runtime	3.1.1.v200907301400
* java-6-openjdk

Today i've been so lucky (...) to have lots of crashes, so it was pretty easy to get a clean logfile. Attached file ([projectfolder]/.metadata/.log) was empty when i started the session, and this is whats left after i killed the process. However it doesn't look very informative for this particular problem. 

The logs in .metadata/.plugins/org.eclipse.rse.* are empty. Are there any more places i can look?
Comment 1 Martin Oberhuber CLA 2010-01-22 04:27:33 EST
From your description, it sounds like you are experiencing a deadlock (product gets unresponsive), and not a crash -- correct?

Fortunately, it is quite easy to get a thread dump out of the deadlocked application, and such a thread dump typically makes analyzing the situation quite straightforward. See
   http://wiki.eclipse.org/How_to_report_a_deadlock
for how to obtain the thread dump, please attach it here on the bug.

What connection type are you using in RSE, I assume SSH?

I increased priority and severity based on your description.
Thanks for reporting the exact version numbers used.
Comment 2 Joost Timmerman CLA 2010-01-25 10:11:31 EST
Created attachment 157127 [details]
a JStack trace of a deadlocked eclipse
Comment 3 Joost Timmerman CLA 2010-01-25 10:15:45 EST
I believe you are right about it being a deadlock rather than a crash.

I had no luck killing the process with kill -3, it did not give me any output nor did it kill the application. I used the jstack application to get the output i posted in a new attachment.

I am connected to this remote machine via (and only via) FTP.
Comment 4 Martin Oberhuber CLA 2010-01-25 11:38:44 EST
Now the thread dump does help!

Quite apparently the problem is that multiple Threads send a NOOP command to the remote and expect a response; but since the channel to the remote is not multi-threaded it is not clear what channel should receive the response, and some senders end up waiting forever. If the main Thread is waiting, Eclipse is dead.

The real problem here is that in the org.eclipse.rse.services.files.ftp plugin,
   FTPService.isConnected()
communicates with the remote via the command channel (by sending a NOOP and expecting a response), but this communication is not gated through the _commandMutex semaphore to ensure single-threaded communication.

I think that this must be fixed together with bug 269171. Since the isConnected() method can apparently be called on the main Thread, we must make sure that isConnected() can always return quickly without waiting for a remote response to return. 

One option for doing so is relying on local socket state to see whether we are connected; another is caching the connected state and updating that cached state in a background thread if a timeout has occurred. In both cases, we may need to revisit the current feature of automatically re-connecting when the remote server has timed out on the connection and closed the socket (but we still think that we are connected). Comparing behavior against what we do in SSH should help getting the logic right.

I'll be happy to help any interested community contributor with getting this right and will accept patches. As mentioned, looking at FTPService.java for fixing this should be sufficient.
Comment 5 Martin Oberhuber CLA 2010-01-26 04:06:10 EST
Created attachment 157224 [details]
Simple patch fixing the issue

Attached is a simple patch fixing this issue.

This patch introduces a regression for bug 243921 -- connection timeout will likely lead to an error expanding folders again. But this product hang-up issue seems more severe than bug 243921.

Joost, do you think you have a chance testing this patch?

Ideas for solving bug 243921 differently (without sending NOOP) would be appreciated.
Comment 6 Joost Timmerman CLA 2010-01-26 14:21:38 EST
Created attachment 157328 [details]
Stacktrace of a patched deadlocked RSE
Comment 7 Joost Timmerman CLA 2010-01-26 14:21:44 EST
Created attachment 157329 [details]
Console output of Eclipse launching a patched RSE
Comment 8 Joost Timmerman CLA 2010-01-26 14:21:48 EST
Created attachment 157330 [details]
Patched RSE/Eclipse crashing
Comment 9 Joost Timmerman CLA 2010-01-26 14:22:42 EST
My results so far. I'm at a WinXP sp3 machine now, same RSE version and Sun
Java 1.6.0_14

1. I created a new workspace
2. Imported the cvs workspaces using
http://www.eclipse.org/dsdp/tm/development/rse-anonymous.psf
3. I applied the patch to org.eclise.rse.services.files.ftp
4. Made an Eclipse Application run configuration (detault settings)

5.
On the first launch i only got a firewall (windows) warning about java wanted
to access internet, I choose to block it (yeah, weird right?). I got some
messages in my console (see output1.txt). Nearly all of my features are
installed, including rse.

6.
When i connected to an ftp and opened the root directory, Eclipse deadlocked,
although it still paints the window once in a while (and its 'Running' in
taskmanager). The stacktrace is in stacktrace1.txt.

The launching Eclipse now keeps getting (with one or two minutes interval)
Error while logging event loop exception: java.lang.OutOfMemoryError: PermGen
space
Logging exception: java.lang.OutOfMemoryError: PermGen space

I have 1 Eclipse process which is steady using about 275Mb ram
I have 1 javaw process which is using about 200~220Mb, clicking stuff or moving
makes cpu usage skyrocket, otherwise its calm

7
Killed all and tried again. Lets try and allow Eclipse to reach the FTP in the
first place now... (that is, at launching the app, not on connecting to ftp)

8.
I succesfully connected to the ftp, but opening the file make it crash (crash
this time because the launching eclipse caught and killed it). I shortly got a
(uncopyable) error popup saying the php editor couldn't be opened. See log1.txt

9.
Relauching, opened the file the regular text editor now. It succesfully opened
the file but gave a ParmGen space error on intializing php tooling. 

Is it worth trying to disable or avoid using anything else than RSE or did the
patch break something it shouldn't have?
Comment 10 Joost Timmerman CLA 2010-01-26 14:27:25 EST
I see my file-names are not kept in here

Stacktrace of a patched deadlocked RSE - stacktrace1.txt
Console output of Eclipse launching a patched RSE - output1.txt
Patched RSE/Eclipse crashing - Forgot to mention this, its the log from the launch where i opened a file but the php editor couldn't be opened
Comment 11 Martin Oberhuber CLA 2010-01-27 13:10:44 EST
Comment on attachment 157329 [details]
Console output of Eclipse launching a patched RSE

Hiding irrelevant messages
Comment 12 Martin Oberhuber CLA 2010-01-27 13:16:16 EST
Comment on attachment 157330 [details]
Patched RSE/Eclipse crashing

Hiding irrelevant .log
Comment 13 Martin Oberhuber CLA 2010-01-27 13:21:13 EST
(In reply to comment #9)

I cannot see anything related to RSE in your thread dump. From your description, it looks like the problem is just that your host-eclipse ran out of PermGen memory (not surprising, looking at how much stuff you seem to have with Aptana, DLTK etc). The OutOfMem of your debug host seems to cause the slow response times in your debuggee.

Try launching your debug host with 

-vmargs -XX:MaxPermSize=256m

as described in section "4. Running Eclipse" of the file readme/readme_eclipse.html that comes with your Eclipse SDK.
Comment 14 Martin Oberhuber CLA 2010-03-15 19:21:01 EDT
Created attachment 162117 [details]
Additional patch

Attached additional patch ensures that

* NOOP is not sent more often that every 30 seconds for checking connection
* NOOP is never sent outside command mutex protected region
* Console logging does not print plaintext password after reconnect
* setReadOnly() and setTimeStamp() are capable of reconnect-on-demand if needed

I ran a couple of tests on multiple hosts, allowing connections to run into timeouts. Apparently the original issue from bug 243921 which led to adding the NOOP commands and thus introduced this regression is now fixed differently, so the excessive sending of NOOP in isConnected() is no longer necessary.

I think that this issue is finally fixed now.
Comment 15 Martin Oberhuber CLA 2010-03-15 19:24:22 EDT
Released into 3.2M6.

Thanks Joost for your thread dumps and diligent investigations, which finally made fixing this possible.
Comment 16 Martin Oberhuber CLA 2010-03-15 19:29:16 EDT
*** Bug 280515 has been marked as a duplicate of this bug. ***
Comment 17 Martin Oberhuber CLA 2010-03-15 19:34:18 EDT
*** Bug 281731 has been marked as a duplicate of this bug. ***
Comment 18 Martin Oberhuber CLA 2010-03-15 19:36:03 EDT
*** Bug 260796 has been marked as a duplicate of this bug. ***
Comment 19 Martin Oberhuber CLA 2010-03-15 19:38:45 EDT
*** Bug 300455 has been marked as a duplicate of this bug. ***
Comment 20 Martin Oberhuber CLA 2010-03-15 19:40:26 EDT
*** Bug 255117 has been marked as a duplicate of this bug. ***
Comment 21 Martin Oberhuber CLA 2010-03-15 19:53:45 EDT
*** Bug 232915 has been marked as a duplicate of this bug. ***