Bug 202758 - [ftp] ftp hangs when I try to copy some files to a remote site
Status: RESOLVED FIXED
Alias: None
Product: Target Management
Classification: Tools
Component: RSE
Version: 2.0
Hardware: PC Windows XP
Importance: P2 major
Target Milestone: 2.0.1
Assignee: Martin Oberhuber CLA
QA Contact: Martin Oberhuber CLA
Keywords: contributed
Blocks: 204246
Reported: 2007-09-09 18:43 EDT by Michael Scharf CLA
Modified: 2007-09-21 05:22 EDT (History)
Attachments
the relevant part of the stack trace (4.15 KB, text/plain)
2007-09-09 18:44 EDT, Michael Scharf CLA
the console output (11.74 KB, patch)
2007-09-09 18:46 EDT, Michael Scharf CLA
the complete stack trace (23.69 KB, text/plain)
2007-09-10 07:56 EDT, Michael Scharf CLA
eclipse hangs at shutdown when ftp gets into this mode (17.17 KB, text/plain)
2007-09-10 11:49 EDT, Michael Scharf CLA
the file causing problems (40.59 KB, text/html)
2007-09-11 08:59 EDT, Michael Scharf CLA
A code snippet that makes ftp hang (from time to time) (1.26 KB, text/plain)
2007-09-11 12:40 EDT, Michael Scharf CLA
Patch for NET-3 (hanging_read_fix.patch) (1.70 KB, patch)
2007-09-12 14:27 EDT, Martin Oberhuber CLA
Patch for NET-73 (cmd-seq-hang.patch) (2.66 KB, patch)
2007-09-12 14:28 EDT, Martin Oberhuber CLA

Description Michael Scharf CLA 2007-09-09 18:43:46 EDT
I am using version 2.0.1.v20070816. I tried to copy about 20 files from a local folder to a remote FTP site. It copied 6 files, then it got stuck.
Comment 1 Michael Scharf CLA 2007-09-09 18:44:59 EDT
Created attachment 77956
the relevant part of the stack trace

"Worker-1810" prio=6 tid=0x08fce6b0 nid=0x1538 in Object.wait() [0x09daf000..0x09dafae8]
	at java.lang.Object.wait(Native Method)
	- waiting on <0x1a423550> (a java.util.LinkedList)
	at org.eclipse.rse.services.Mutex.waitForLock(Mutex.java:103)
	- locked <0x1a423550> (a java.util.LinkedList)
	at org.eclipse.rse.internal.services.files.ftp.FTPService.internalFetch(FTPService.java:589)
	at org.eclipse.rse.services.files.AbstractFileService.getFilesAndFolders(AbstractFileService.java:47)
	at org.eclipse.rse.subsystems.files.core.servicesubsystem.FileServiceSubSystem.getFilesAndFolders(FileServiceSubSystem.java:293)
	at org.eclipse.rse.subsystems.files.core.servicesubsystem.FileServiceSubSystem.listFoldersAndFiles(FileServiceSubSystem.java:331)
	at org.eclipse.rse.subsystems.files.core.subsystems.RemoteFileSubSystem.listFoldersAndFiles(RemoteFileSubSystem.java:949)
	at org.eclipse.rse.subsystems.files.core.subsystems.RemoteFileSubSystem.internalResolveOneFilterString(RemoteFileSubSystem.java:842)
	at org.eclipse.rse.subsystems.files.core.subsystems.RemoteFileSubSystem.internalResolveFilterString(RemoteFileSubSystem.java:813)
	at org.eclipse.rse.core.subsystems.SubSystem.resolveFilterString(SubSystem.java:2115)
	at org.eclipse.rse.internal.files.ui.view.SystemViewRemoteFileAdapter.internalGetChildren(SystemViewRemoteFileAdapter.java:680)
	- locked <0x17e86158> (a org.eclipse.rse.internal.files.ui.view.SystemViewRemoteFileAdapter)
	at org.eclipse.rse.internal.files.ui.view.SystemViewRemoteFileAdapter.getChildren(SystemViewRemoteFileAdapter.java:550)
	at org.eclipse.rse.ui.operations.SystemFetchOperation.execute(SystemFetchOperation.java:265)
	at org.eclipse.rse.ui.operations.SystemFetchOperation.run(SystemFetchOperation.java:128)
	at org.eclipse.rse.ui.view.AbstractSystemViewAdapter.fetchDeferredChildren(AbstractSystemViewAdapter.java:1970)
	at org.eclipse.ui.progress.DeferredTreeContentManager$1.run(DeferredTreeContentManager.java:196)
	at org.eclipse.core.internal.jobs.Worker.run(Worker.java:55)

"Worker-1784" prio=6 tid=0x0d6b0400 nid=0xf08 in Object.wait() [0x0d92f000..0x0d92fbe8]
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:474)
	at org.apache.commons.net.telnet.TelnetInputStream.read(TelnetInputStream.java:339)
	- locked <0x1a553918> (a [I)
	at org.apache.commons.net.telnet.TelnetInputStream.read(TelnetInputStream.java:466)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
	- locked <0x1a558a40> (a java.io.BufferedInputStream)
	at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:411)
	at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:453)
	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:183)
	- locked <0x1a56bf70> (a java.io.InputStreamReader)
	at java.io.InputStreamReader.read(InputStreamReader.java:167)
	at java.io.BufferedReader.fill(BufferedReader.java:136)
	at java.io.BufferedReader.readLine(BufferedReader.java:299)
	- locked <0x1a56bf70> (a java.io.InputStreamReader)
	at java.io.BufferedReader.readLine(BufferedReader.java:362)
	at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:264)
	at org.apache.commons.net.ftp.FTP.getReply(FTP.java:605)
	at org.apache.commons.net.ftp.FTPClient.completePendingCommand(FTPClient.java:1253)
	at org.eclipse.rse.internal.services.files.ftp.FTPService.upload(FTPService.java:727)
	at org.eclipse.rse.subsystems.files.core.servicesubsystem.FileServiceSubSystem.upload(FileServiceSubSystem.java:490)
	at org.eclipse.rse.files.ui.resources.UniversalFileTransferUtility.copyWorkspaceResourcesToRemote(UniversalFileTransferUtility.java:1108)
	at org.eclipse.rse.internal.files.ui.view.SystemViewRemoteFileAdapter.doDrop(SystemViewRemoteFileAdapter.java:1777)
	at org.eclipse.rse.internal.ui.view.SystemDNDTransferRunnable.transferRSEResources(SystemDNDTransferRunnable.java:214)
	at org.eclipse.rse.internal.ui.view.SystemDNDTransferRunnable.runInWorkspace(SystemDNDTransferRunnable.java:589)
	at org.eclipse.core.internal.resources.InternalWorkspaceJob.run(InternalWorkspaceJob.java:38)
	at org.eclipse.core.internal.jobs.Worker.run(Worker.java:55)
Comment 2 Michael Scharf CLA 2007-09-09 18:46:15 EDT
Created attachment 77957
the console output
Comment 3 Martin Oberhuber CLA 2007-09-10 07:39:28 EDT
Did all of Eclipse get stuck, or just the FTP transfer?
Did you find any workaround to get it going again (e.g. disconnect/reconnect
the FTP connection; press cancel on download job in progress view)?

The logs and backtrace seem to indicate that your remote FTP server failed to return a completion notice after uploading the file "10663_csDoc_microsoft-raises-ante-in-price-wars.htm".
If you still have the COMPLETE thread dump, please attach it so we can see if any other thread is blocking the TelnetClient where commons.net is waiting for the remote answer (Should not be, but who knows).

Because all FTP transfers currently run synchronously, the directory retrieval is also blocked until the data transfer completes. This will be improved when fixing bug 198636, which should ensure that all data transfer is separate from directory retrievals.

If it turns out that we'll need to be aware of unreliable remote FTP servers, and the Jakarta Commons Net FTP Client does not handle things like missing answers from the remote (e.g. by throwing an exception), we might have to implement either
  a) some kind of watchdog that automatically closes and re-tries a connection 
     when the server fails to answer within a given time, or
  b) some additional thread checking for "Cancel" or "Disconnect" such that 
     all pending transfers can at least be canceled by the user.

Would one of these options be desirable for you?

Javier: Can you try and find out if/what Commons Net offers in order to deal with unreliable remote servers?
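
A minimal sketch of option (a), assuming a modern JDK (java.util.concurrent) rather than anything that RSE or Commons Net provides: the blocking call runs on a worker thread, and if no answer arrives within the timeout, a caller-supplied cleanup action runs. In RSE that action would presumably close the stalled FTP connection so that it can be reopened and the transfer retried. ReplyWatchdog, callWithTimeout and onTimeout are hypothetical names used only for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper; not part of RSE or Commons Net. */
final class ReplyWatchdog {
    private ReplyWatchdog() {}

    /**
     * Runs blockingCall on a daemon worker thread and waits at most timeoutMillis.
     * On timeout, onTimeout runs (e.g. to close the stalled connection) and the
     * TimeoutException is rethrown so the caller can decide whether to retry.
     */
    static <T> T callWithTimeout(Callable<T> blockingCall, long timeoutMillis,
                                 Runnable onTimeout) throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "reply-watchdog");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        try {
            Future<T> pending = worker.submit(blockingCall);
            try {
                return pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Interrupting does not unblock a classic java.net socket read,
                // which is why the cleanup action is expected to close the connection.
                pending.cancel(true);
                onTimeout.run();
                throw e;
            }
        } finally {
            worker.shutdownNow();
        }
    }
}
```

A caller would wrap the blocking step, for example the wait for the server's completion reply, and treat the TimeoutException as the signal to disconnect, reconnect and retry.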
Comment 4 Michael Scharf CLA 2007-09-10 07:55:55 EDT
> Did all of Eclipse get stuck, or just the FTP transfer?
No, only the transfer.
> Did you find any workaround to get it going again (e.g. disconnect/reconnect
> the FTP connection; press cancel on download job in progress view)?
No. I just used Windows Explorer to do the transfer.
Comment 5 Michael Scharf CLA 2007-09-10 07:56:35 EDT
Created attachment 77976
the complete stack trace
Comment 6 Martin Oberhuber CLA 2007-09-10 08:10:02 EDT
From the full thread dump, it seems clear that only Commons Net is waiting for the remote answer to completePendingCommand(), and Thread-115 is simply still waiting for the answer from the socket:

"Thread-115" daemon prio=6 tid=0x0d6e2790 nid=0x944 runnable [0x0b32f000..0x0b32fae8]
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:129)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
	- locked <0x1a542190> (a java.io.BufferedInputStream)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
	- locked <0x1a5421b0> (a org.apache.commons.net.telnet.TelnetInputStream)
	at org.apache.commons.net.telnet.TelnetInputStream.__read(TelnetInputStream.java:114)
	at org.apache.commons.net.telnet.TelnetInputStream.run(TelnetInputStream.java:535)
	at java.lang.Thread.run(Thread.java:595)


It looks to me like what we're missing is the ability to cancel such a stalled transfer, because it blocks all other FTP operations. Cancellation should be possible either
  a) through the Eclipse Progress View, or
  b) by disconnecting the connection, or
  c) automatically if the answer doesn't come within a specified timeout.
     In that case, RSE should disconnect/reconnect and retry the transfer.

The situation will certainly improve when bug 198636 is fixed to allow simultaneous downloads/uploads, but even then we'll need the ability to cancel stalled transfers. So, Javier, could you please investigate how a stalled transfer can be canceled in Commons Net?
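
Option (c) can also be approached at the socket level. The sketch below uses only java.net and a deliberately silent loopback server to show a read giving up after a timeout instead of blocking forever. It does not use Commons Net, and whether a socket timeout would reach a caller that is waiting inside TelnetInputStream.read() is not established in this report. ReplyTimeoutDemo and readTimesOut are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical demo; shows a read timeout against a peer that never answers. */
final class ReplyTimeoutDemo {
    private ReplyTimeoutDemo() {}

    /** Returns true if the read gave up because nothing arrived within timeoutMillis. */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentServer.getLocalPort());
             // Keep the accepted side open but never write to it.
             Socket accepted = silentServer.accept()) {
            client.setSoTimeout(timeoutMillis); // bound every blocking read on this socket
            InputStream in = client.getInputStream();
            try {
                in.read(); // would block indefinitely without the timeout
                return false;
            } catch (SocketTimeoutException e) {
                // The socket is still usable here; a caller could now disconnect and retry.
                return true;
            }
        }
    }
}
```

The design choice here is to let the transport report the stall as an exception, so that the caller does not need a separate watchdog thread for the simple case of a silent server.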
Comment 7 Martin Oberhuber CLA 2007-09-10 08:13:37 EDT
Michael, could you check whether the issue is reproducible with the file
    10663_csDoc_microsoft-raises-ante-in-price-wars.htm

Perhaps the issue occurs because this file is being transferred in ASCII mode, or because the remote system's disk got full? If the issue is reproducible, we might find a better fix.
Comment 8 Michael Scharf CLA 2007-09-10 11:12:46 EDT
The remote file system was not full. (Windows Explorer was able to deliver the files.)
It is reproducible (I tried it again and it hangs again). There's a timeout after a long time, so it gets out of the call.

The file contains some control characters. Maybe that's the problem. I think the ASCII mode of FTP is a bad idea anyway. Why does a file transfer change my data (by replacing CR LF with LF)? That's the job of a tool that converts DOS files to Unix.

I set the .html filetype to use binary protocol and I get the same problem, so it's not a problem with the ascii mode.....

Comment 9 Michael Scharf CLA 2007-09-10 11:49:34 EDT
Created attachment 77991
eclipse hangs at shutdown when ftp gets into this mode

When the FTP connection is hanging, Eclipse does not shut down and keeps the workspace locked.
Comment 10 Martin Oberhuber CLA 2007-09-11 07:42:18 EDT
(In reply to comment #8)
> It is reproducible (I tried it again and it hangs again). There's a timeout
> after a long time, so it gets out of the call.

Since it's reproducible with that specific file, could you attach that file here?

> I set the .html filetype to use binary protocol and I get the same problem, so
> it's not a problem with the ascii mode.....

Did you verify with the FTP console log that it does actually switch to I mode?
Comment 11 Michael Scharf CLA 2007-09-11 08:59:49 EDT
Created attachment 78057
the file causing problems

I was using binary transfer.
Comment 12 Martin Oberhuber CLA 2007-09-11 09:04:58 EDT
Hm. I can upload that file to a Solaris FTP server without any issues.
I'm using all default settings, which leads to uploading in FTP active mode as ASCII:

STOR 10663_csDoc_microsoft-raises-ante-in-price-wars.htm
150 Opening ASCII mode data connection for 10663_csDoc_microsoft-raises-ante-in-price-wars.htm.

226 Transfer complete.

Comment 13 Michael Scharf CLA 2007-09-11 12:40:14 EDT
Created attachment 78076
A code snippet that makes ftp hang (from time to time)

It hangs from time to time. This looks very much like a threading problem in FTPClient.
Comment 14 Martin Oberhuber CLA 2007-09-11 12:47:26 EDT
I'm quite sure that this is an issue in Jakarta Commons Net.
They have multiple references to FTP and/or TelnetInputStream deadlocking occasionally, and they claim that it will be fixed in Commons Net 2.0 (which requires Java 1.5!!)

Here are some sample references - I didn't find the original fix:

http://issues.apache.org/jira/browse/NET-145
http://issues.apache.org/jira/browse/NET-31
http://issues.apache.org/jira/browse/NET-65
Comment 15 Martin Oberhuber CLA 2007-09-11 12:57:08 EDT
I think I finally found the original issue with Commons Net:
http://issues.apache.org/jira/browse/NET-3

It looks like the patch was applied for a 1.4.x version which was never officially released.
Comment 16 Rob Hasselbaum CLA 2007-09-11 14:36:20 EDT
(In reply to comment #15)
> I think I finally found the original issue with Commons Net:
> http://issues.apache.org/jira/browse/NET-3
> 
> It looks like the patch was applied for a 1.4.x version which was never
> officially released.
> 

There are two Commons Net bugs that result in deadlocks:

https://issues.apache.org/jira/browse/NET-3
https://issues.apache.org/jira/browse/NET-73

I wrote the patches for both. Unfortunately, there has not been a maintenance release in a long time, so the only way to get them on top of 1.4.1 is through a custom build. My company has been running in production this way for over a year without deadlocks. (In our case, though, Telnet was the problem. But I think FTP client uses the same code.)
Comment 17 Martin Oberhuber CLA 2007-09-12 08:36:12 EDT
Thanks a lot for those pointers Rob!

I'd very much like to apply those patches for the version of Commons Net we use at Eclipse! But I still need to check and find out whether I can do this, both from a versioning and an IP cleanliness point of view.

To clarify the second of the two, and for the lack of Copyright Comments in your patches: I take it that
  1.) you are the original Author of the two patches (11 and 16 lines of code
      respectively),
  2.) that you did not reference any other 3rd party code,
  3.) that you license these patches under the Apache Public License 2.0
      (the one that Commons Net is licensed under), and
  4.) That you are authorized by your employer to make that contribution under 
      the APL 2.0.

Correct so far?

Could you also dual-license the code under the Eclipse Public License (EPL)? I don't really think it's necessary since the EPL and the APL are said to be compatible, but if you're OK with licensing under EPL as well it might speed up IP things for us.

Thanks!
Comment 18 Rob Hasselbaum CLA 2007-09-12 13:58:35 EDT
(In reply to comment #17)
> To clarify the second of the two, and for the lack of Copyright Comments in
> your patches: I take it that
>   1.) you are the original Author of the two patches (11 and 16 lines of code
>       respectively),
>   2.) that you did not reference any other 3rd party code,
>   3.) that you license these patches under the Apache Public License 2.0
>       (the one that Commons Net is licensed under), and
>   4.) That you are authorized by your employer to make that contribution under 
>       the APL 2.0.
> 
> Correct so far?
> 
> Could you also dual-license the code under the Eclipse Public License (EPL)? I
> don't really think it's necessary since the EPL and the APL are said to be
> compatible, but if you're OK with licensing under EPL as well it might speed up
> IP things for us.
> 
> Thanks!
> 

Answer to all of your questions is "yes". The Eclipse community is free to use these patches as it wishes under the EPL.
Comment 19 Martin Oberhuber CLA 2007-09-12 14:27:17 EDT
Created attachment 78218
Patch for NET-3 (hanging_read_fix.patch)
Comment 20 Martin Oberhuber CLA 2007-09-12 14:28:41 EDT
Created attachment 78220
Patch for NET-73 (cmd-seq-hang.patch)
Comment 21 Martin Oberhuber CLA 2007-09-12 14:30:58 EDT
(In reply to comment #18)
> Answer to all of your questions is "yes". The Eclipse community is free to use
> these patches as it wishes under the EPL.

Thanks Rob! - I created the attachments. These are to be applied to
   org/apache/commons/net/telnet/TelnetInputStream.java
and are 11 and 16 lines of code, respectively. The patch for NET-3 needs to be applied before the patch for NET-73.

Comment 22 Martin Oberhuber CLA 2007-09-12 15:33:19 EDT
Patches applied. Michael, can you sync HEAD of org.apache.commons.net into your workspace and verify?
Comment 23 Michael Scharf CLA 2007-09-13 00:15:58 EDT
I tried to verify it, but today my test always succeeds even without applying the patch. The problem seems to occur only under certain conditions, which probably depend on the network speed. That's the bad thing about race conditions: they are so hard to reproduce. When I reported the bug I was able to reproduce it quite reliably, but now the conditions have changed.
Comment 24 Rob Hasselbaum CLA 2007-09-13 10:52:47 EDT
Both bugs involve race conditions, so they are hard to reproduce. I could do it reliably only by carefully controlling the traffic going over the wire and the amount of data pulled from the queue with each read operation.
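
One way to control the amount of data pulled with each read when trying to reproduce such a race is to wrap the input stream under test. The following is a hypothetical test helper, not code from either patch and not the setup actually used above: ChunkLimitedInputStream caps how many bytes a single read call may return, and countReads is a small illustration helper that drains a stream and counts the calls.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical test helper: hands out at most maxChunk bytes per read call. */
final class ChunkLimitedInputStream extends FilterInputStream {
    private final int maxChunk;

    ChunkLimitedInputStream(InputStream in, int maxChunk) {
        super(in);
        if (maxChunk < 1) {
            throw new IllegalArgumentException("maxChunk must be >= 1");
        }
        this.maxChunk = maxChunk;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // Shrink the request so the consumer sees the data in small pieces.
        return super.read(buf, off, Math.min(len, maxChunk));
    }

    /** Drains the stream and returns the number of read calls that delivered data. */
    static int countReads(InputStream in, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        int calls = 0;
        while (in.read(buf, 0, buf.length) != -1) {
            calls++;
        }
        return calls;
    }
}
```

Wrapping a 10-byte stream with a limit of 3 and draining it through an 8-byte buffer takes four reads (3, 3, 3 and 1 bytes), which makes the interleaving between producer and consumer repeatable in a test.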