I am using version 2.0.1.v20070816. I tried to copy about 20 files from a local folder to a remote FTP site. It copied 6 files and then got stuck.
Created attachment 77956 [details] the relevant part of the stack trace

"Worker-1810" prio=6 tid=0x08fce6b0 nid=0x1538 in Object.wait() [0x09daf000..0x09dafae8]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x1a423550> (a java.util.LinkedList)
    at org.eclipse.rse.services.Mutex.waitForLock(Mutex.java:103)
    - locked <0x1a423550> (a java.util.LinkedList)
    at org.eclipse.rse.internal.services.files.ftp.FTPService.internalFetch(FTPService.java:589)
    at org.eclipse.rse.services.files.AbstractFileService.getFilesAndFolders(AbstractFileService.java:47)
    at org.eclipse.rse.subsystems.files.core.servicesubsystem.FileServiceSubSystem.getFilesAndFolders(FileServiceSubSystem.java:293)
    at org.eclipse.rse.subsystems.files.core.servicesubsystem.FileServiceSubSystem.listFoldersAndFiles(FileServiceSubSystem.java:331)
    at org.eclipse.rse.subsystems.files.core.subsystems.RemoteFileSubSystem.listFoldersAndFiles(RemoteFileSubSystem.java:949)
    at org.eclipse.rse.subsystems.files.core.subsystems.RemoteFileSubSystem.internalResolveOneFilterString(RemoteFileSubSystem.java:842)
    at org.eclipse.rse.subsystems.files.core.subsystems.RemoteFileSubSystem.internalResolveFilterString(RemoteFileSubSystem.java:813)
    at org.eclipse.rse.core.subsystems.SubSystem.resolveFilterString(SubSystem.java:2115)
    at org.eclipse.rse.internal.files.ui.view.SystemViewRemoteFileAdapter.internalGetChildren(SystemViewRemoteFileAdapter.java:680)
    - locked <0x17e86158> (a org.eclipse.rse.internal.files.ui.view.SystemViewRemoteFileAdapter)
    at org.eclipse.rse.internal.files.ui.view.SystemViewRemoteFileAdapter.getChildren(SystemViewRemoteFileAdapter.java:550)
    at org.eclipse.rse.ui.operations.SystemFetchOperation.execute(SystemFetchOperation.java:265)
    at org.eclipse.rse.ui.operations.SystemFetchOperation.run(SystemFetchOperation.java:128)
    at org.eclipse.rse.ui.view.AbstractSystemViewAdapter.fetchDeferredChildren(AbstractSystemViewAdapter.java:1970)
    at org.eclipse.ui.progress.DeferredTreeContentManager$1.run(DeferredTreeContentManager.java:196)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:55)

"Worker-1784" prio=6 tid=0x0d6b0400 nid=0xf08 in Object.wait() [0x0d92f000..0x0d92fbe8]
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:474)
    at org.apache.commons.net.telnet.TelnetInputStream.read(TelnetInputStream.java:339)
    - locked <0x1a553918> (a [I)
    at org.apache.commons.net.telnet.TelnetInputStream.read(TelnetInputStream.java:466)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
    - locked <0x1a558a40> (a java.io.BufferedInputStream)
    at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:411)
    at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:453)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:183)
    - locked <0x1a56bf70> (a java.io.InputStreamReader)
    at java.io.InputStreamReader.read(InputStreamReader.java:167)
    at java.io.BufferedReader.fill(BufferedReader.java:136)
    at java.io.BufferedReader.readLine(BufferedReader.java:299)
    - locked <0x1a56bf70> (a java.io.InputStreamReader)
    at java.io.BufferedReader.readLine(BufferedReader.java:362)
    at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:264)
    at org.apache.commons.net.ftp.FTP.getReply(FTP.java:605)
    at org.apache.commons.net.ftp.FTPClient.completePendingCommand(FTPClient.java:1253)
    at org.eclipse.rse.internal.services.files.ftp.FTPService.upload(FTPService.java:727)
    at org.eclipse.rse.subsystems.files.core.servicesubsystem.FileServiceSubSystem.upload(FileServiceSubSystem.java:490)
    at org.eclipse.rse.files.ui.resources.UniversalFileTransferUtility.copyWorkspaceResourcesToRemote(UniversalFileTransferUtility.java:1108)
    at org.eclipse.rse.internal.files.ui.view.SystemViewRemoteFileAdapter.doDrop(SystemViewRemoteFileAdapter.java:1777)
    at org.eclipse.rse.internal.ui.view.SystemDNDTransferRunnable.transferRSEResources(SystemDNDTransferRunnable.java:214)
    at org.eclipse.rse.internal.ui.view.SystemDNDTransferRunnable.runInWorkspace(SystemDNDTransferRunnable.java:589)
    at org.eclipse.core.internal.resources.InternalWorkspaceJob.run(InternalWorkspaceJob.java:38)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:55)
Created attachment 77957 [details] the console output
Did all of Eclipse get stuck, or just the FTP transfer? Did you find any workaround to get it going again (e.g. disconnect/reconnect the FTP connection; press cancel on the download job in the Progress view)?

The logs and backtrace seem to indicate that your remote FTP server failed to return a completion notice after uploading the file "10663_csDoc_microsoft-raises-ante-in-price-wars.htm". If you still have the COMPLETE thread dump, please attach it so we can see whether any other thread is blocking the TelnetClient where commons.net is waiting for the remote answer (there should not be one, but who knows).

Because all FTP transfers currently run synchronously, directory retrieval is also blocked until the data transfer completes. This will be improved when fixing bug 198636, which should ensure that all data transfer is separate from directory retrievals.

If it turns out that we need to be aware of unreliable remote FTP servers, and the Jakarta Commons Net FTP client does not handle things like missing answers from the remote (e.g. by throwing an exception), we might have to implement either
a) some kind of watchdog that automatically closes and retries a connection when the server fails to answer within a given time, or
b) some additional thread checking for "Cancel" or "Disconnect" so that all pending transfers can at least be canceled by the user.
Would one of these options be desirable for you?

Javier: Can you try to find out whether, and what, Commons Net offers in order to deal with unreliable remote servers?
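For illustration only, option a) could be sketched roughly as below. This is not RSE or Commons Net code: `ReplyWatchdog`, `runGuarded` and `Outcome` are hypothetical names, and the sketch assumes that the caller can supply an abort action (for example, one that closes the control connection). It uses only `java.util.concurrent`. The blocking call runs on a worker thread; if it does not return within the timeout, the abort action is invoked so the caller can disconnect and retry.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical watchdog sketch; none of these names exist in RSE or Commons Net. */
final class ReplyWatchdog {

    enum Outcome { COMPLETED, TIMED_OUT, FAILED }

    /**
     * Runs blockingCall on a daemon worker thread and waits at most timeoutMillis.
     * On timeout, abortAction is run (assumed to unblock the call, e.g. by
     * closing the underlying socket) and TIMED_OUT is returned.
     */
    static <T> Outcome runGuarded(Callable<T> blockingCall, Runnable abortAction, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread worker = new Thread(r, "reply-watchdog-worker");
            worker.setDaemon(true); // a stalled worker must not keep the JVM alive
            return worker;
        });
        Future<T> pending = executor.submit(blockingCall);
        try {
            pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.COMPLETED;
        } catch (TimeoutException e) {
            abortAction.run();    // e.g. close the control connection
            pending.cancel(true); // interrupt the worker as a second measure
            return Outcome.TIMED_OUT;
        } catch (ExecutionException e) {
            return Outcome.FAILED; // the call itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            abortAction.run();
            pending.cancel(true);
            return Outcome.FAILED;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulates a server that never sends its completion reply.
        Outcome outcome = runGuarded(
                () -> { Thread.sleep(60_000); return null; },
                () -> System.out.println("aborting stalled call"),
                200);
        System.out.println(outcome);
    }
}
```

Note that interrupting a thread does not unblock a classic `java.io` socket read, which is why the sketch takes a separate abort action instead of relying on `cancel(true)` alone.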
> Did all of Eclipse get stuck, or just the FTP transfer?
No, only the transfer.
> Did you find any workaround to get it going again (e.g. disconnect/reconnect
> the FTP connection; press cancel on download job in progress view)?
No. I just used Windows Explorer to do the transfer.
Created attachment 77976 [details] the complete stack trace
From the full thread dump, it seems clear that only Commons Net is waiting for the remote answer to completePendingCommand(), and Thread-115 is simply still waiting for the answer from the socket:

"Thread-115" daemon prio=6 tid=0x0d6e2790 nid=0x944 runnable [0x0b32f000..0x0b32fae8]
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
    - locked <0x1a542190> (a java.io.BufferedInputStream)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
    - locked <0x1a5421b0> (a org.apache.commons.net.telnet.TelnetInputStream)
    at org.apache.commons.net.telnet.TelnetInputStream.__read(TelnetInputStream.java:114)
    at org.apache.commons.net.telnet.TelnetInputStream.run(TelnetInputStream.java:535)
    at java.lang.Thread.run(Thread.java:595)

To me, it looks like what we're missing is the ability to cancel such a stalled transfer, because it blocks all other FTP operations. Cancellation should be possible
a) through the Eclipse Progress View, or
b) by disconnecting the connection, or
c) automatically, if the answer doesn't come within a specified timeout. In that case, RSE should disconnect/reconnect and retry the transfer.

The situation will certainly improve when bug 198636 is fixed to allow simultaneous downloads/uploads, but even then we'll need the ability to cancel stalled transfers. So, Javier, could you please investigate how a stalled transfer can be canceled in Commons Net?
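As a rough illustration of option c) at the plain-socket level, a read timeout can be set with `Socket.setSoTimeout`, after which a blocked read throws `SocketTimeoutException` instead of waiting indefinitely. The sketch below uses only `java.net` and `java.io`, not Commons Net; `StalledReplyDemo` and its methods are hypothetical names, and the "server" is a local socket that accepts the connection and then stays silent. Whether and how such a timeout surfaces through the TelnetInputStream reader thread seen in the dump is not something this sketch establishes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of a bounded wait for a reply line on a control socket. */
final class StalledReplyDemo {

    /** Reads one reply line; throws SocketTimeoutException if none arrives in time. */
    static String readReply(Socket control, int timeoutMillis) throws IOException {
        control.setSoTimeout(timeoutMillis); // bound every blocking read on this socket
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(control.getInputStream(), StandardCharsets.US_ASCII));
        return reader.readLine();
    }

    /**
     * Connects to a local server that accepts the connection but never answers.
     * Returns true if the read was abandoned because of the timeout.
     */
    static boolean silentServerTimesOut(int timeoutMillis) {
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket control = new Socket(silentServer.getInetAddress(), silentServer.getLocalPort());
             Socket accepted = silentServer.accept()) {
            try {
                readReply(control, timeoutMillis);
                return false; // a reply (or end of stream) arrived
            } catch (SocketTimeoutException e) {
                return true;  // this is where a client could disconnect and retry
            }
        } catch (IOException e) {
            throw new IllegalStateException("local socket setup failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + silentServerTimesOut(300));
    }
}
```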
Michael, could you check whether the issue is reproducible with the file 10663_csDoc_microsoft-raises-ante-in-price-wars.htm? Perhaps the issue occurs because this file is being transferred in ASCII mode, or because the remote system's disk got full? If the issue is reproducible, we might find a better fix.
The remote file system was not full (Windows Explorer was able to deliver the files). It is reproducible (I tried it again and it hangs again). There's a timeout after a long time, so it gets out of the call. The file contains some control characters. Maybe that's the problem. I think the ASCII protocol of FTP is a bad idea anyway. Why does a file transfer change my data (by replacing CR LF with LF)? That's the job of a tool that converts DOS files to Unix. I set the .html file type to use the binary protocol and I get the same problem, so it's not a problem with the ASCII mode.
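To make the complaint about ASCII mode concrete: in ASCII mode, line endings travel as CR LF on the wire and a Unix receiver stores them as LF, so any CR LF byte pair inside non-text data is altered as well. The following is a minimal sketch of that receiver-side rewrite, not code from any FTP implementation; `AsciiModeSketch` and `crlfToLf` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of the CR LF to LF rewrite a Unix receiver applies in ASCII mode. */
final class AsciiModeSketch {

    /** Drops every CR that is immediately followed by LF; all other bytes are kept. */
    static byte[] crlfToLf(byte[] wire) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(wire.length);
        for (int i = 0; i < wire.length; i++) {
            boolean crBeforeLf = wire[i] == '\r' && i + 1 < wire.length && wire[i + 1] == '\n';
            if (!crBeforeLf) {
                out.write(wire[i]);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = {0x41, 0x0D, 0x0A, 0x42}; // 'A', CR, LF, 'B'
        byte[] stored = crlfToLf(payload);
        System.out.println(payload.length + " bytes sent, " + stored.length + " bytes stored");
    }
}
```

Binary (TYPE I) mode avoids this by transferring the bytes unchanged.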
Created attachment 77991 [details] eclipse hangs at shutdown when ftp gets into this mode
When the FTP connection is hanging, Eclipse does not shut down and keeps locking the workspace.
(In reply to comment #8)
> It is reproducible (I tried it again and it hangs again). There's a timeout
> after a long time, so it gets out of the call.
Since it's reproducible with that specific file, could you attach that file here?
> I set the .html filetype to use binary protocol and I get the same problem, so
> it's not a problem with the ascii mode.....
Did you verify with the FTP console log that it actually switches to I (binary) mode?
Created attachment 78057 [details] the file causing problems
I was using binary transfer.
Hm. I can upload that file to a Solaris FTP server without any issues. I'm using all default settings, which leads to uploading in FTP active mode as ASCII:

STOR 10663_csDoc_microsoft-raises-ante-in-price-wars.htm
150 Opening ASCII mode data connection for 10663_csDoc_microsoft-raises-ante-in-price-wars.htm.
226 Transfer complete.
Created attachment 78076 [details] A code snippet that makes ftp hang (from time to time) It hangs from time to time. This looks very much like a threading problem in FTPClient.
I'm quite sure that this is an issue in Jakarta Commons Net. They have multiple references to FTP and/or TelnetInputStream deadlocking occasionally, and they claim that it will be fixed in Commons Net 2.0 (which requires Java 1.5!!). Here are some sample references (I didn't find the original fix):
http://issues.apache.org/jira/browse/NET-145
http://issues.apache.org/jira/browse/NET-31
http://issues.apache.org/jira/browse/NET-65
I think I finally found the original issue with Commons Net:
http://issues.apache.org/jira/browse/NET-3
It looks like the patch was applied for a 1.4.x version which was never officially released.
(In reply to comment #15)
> I think I finally found the original issue with Commons Net:
> http://issues.apache.org/jira/browse/NET-3
>
> It looks like the patch was applied for an 1.4.x version which was never
> officially released.

There are two Commons Net bugs that result in deadlocks:
https://issues.apache.org/jira/browse/NET-3
https://issues.apache.org/jira/browse/NET-73

I wrote the patches for both. Unfortunately, there has not been a maintenance release in a long time, so the only way to get them on top of 1.4.1 is through a custom build. My company has been running in production this way for over a year without deadlocks. (In our case, though, Telnet was the problem. But I think the FTP client uses the same code.)
Thanks a lot for those pointers, Rob! I'd very much like to apply those patches to the version of Commons Net we use at Eclipse! But I still need to check whether I can do this, both from a versioning and an IP cleanliness point of view.

To clarify the second of the two, and for the lack of Copyright Comments in your patches: I take it that
1.) you are the original Author of the two patches (11 and 16 lines of code respectively),
2.) that you did not reference any other 3rd party code,
3.) that you license these patches under the Apache Public License 2.0 (the one that Commons Net is licensed under), and
4.) that you are authorized by your employer to make that contribution under the APL 2.0.

Correct so far?

Could you also dual-license the code under the Eclipse Public License (EPL)? I don't really think it's necessary since the EPL and the APL are said to be compatible, but if you're OK with licensing under EPL as well it might speed up IP things for us.

Thanks!
(In reply to comment #17)
> To clarify the second of the two, and for the lack of Copyright Comments in
> your patches: I take it that
> 1.) you are the original Author of the two patches (11 and 16 lines of code
> respectively),
> 2.) that you did not reference any other 3rd party code,
> 3.) that you license these patches under the Apache Public License 2.0
> (the one that Commons Net is licensed under), and
> 4.) That you are authorized by your employer to make that contribution under
> the APL 2.0.
>
> Correct so far?
>
> Could you also dual-license the code under the Eclipse Public License (EPL)? I
> don't really think it's necessary since the EPL and the APL are said to be
> compatible, but if you're OK with licensing under EPL as well it might speed up
> IP things for us.
>
> Thanks!

The answer to all of your questions is "yes". The Eclipse community is free to use these patches as it wishes under the EPL.
Created attachment 78218 [details] Patch for NET-3 (hanging_read_fix.patch)
Created attachment 78220 [details] Patch for NET-73 (cmd-seq-hang.patch)
(In reply to comment #18)
> Answer to all of your questions is "yes". The Eclipse community is free to use
> these patches as it wishes under the EPL.

Thanks Rob! I created the attachments. These are to be applied to org/apache/commons/net/telnet/TelnetInputStream.java and are 11 and 16 lines of code, respectively. The patch for NET-3 needs to be applied before the patch for NET-73.
Patches applied. Michael can you sync HEAD of org.apache.commons.net into your workspace and verify?
I tried to verify it, but today my test always succeeds even without applying the patch. The problem seems only to occur under certain conditions which probably depend on the network speed..... That's the bad thing about those race condition problems, that they are so hard to reproduce. When I reported the bug I was able to reproduce it quite reliably, but now the conditions have changed....
Both bugs involve race conditions, so they are hard to reproduce. I could do it reliably only by carefully controlling the traffic going over the wire and the amount of data pulled from the queue with each read operation.
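One way to control "the amount of data pulled from the queue with each read operation" in a test is to wrap the input stream so that each read returns at most a fixed number of bytes. The class below is only a sketch of that idea, not the harness that was actually used; `ChunkLimitedInputStream` and `countReads` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical test helper: caps how many bytes a single read call may return. */
final class ChunkLimitedInputStream extends FilterInputStream {

    private final int maxChunk;

    ChunkLimitedInputStream(InputStream in, int maxChunk) {
        super(in);
        if (maxChunk < 1) {
            throw new IllegalArgumentException("maxChunk must be at least 1");
        }
        this.maxChunk = maxChunk;
    }

    @Override
    public int read(byte[] buffer, int offset, int length) throws IOException {
        // Forward the request, but never ask for more than maxChunk bytes at once.
        return super.read(buffer, offset, Math.min(length, maxChunk));
    }

    /** Counts how many read calls are needed to drain data through the limiter. */
    static int countReads(byte[] data, int maxChunk, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        int reads = 0;
        try (InputStream in = new ChunkLimitedInputStream(new ByteArrayInputStream(data), maxChunk)) {
            while (in.read(buffer) > 0) {
                reads++;
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return reads;
    }

    public static void main(String[] args) {
        System.out.println(countReads(new byte[10], 3, 8) + " reads needed");
    }
}
```

Combining such a wrapper with deliberate delays between writes on the sending side would be one way to widen the window in which the race can occur.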