Bug 203306

Summary:

[ftp] Deadlock when comparing two files on FTP

Product:

[Tools] Target Management

Reporter:

Martin Oberhuber <mober.at+eclipse>

Component:

RSE

Assignee:

Martin Oberhuber <mober.at+eclipse>

Status:

RESOLVED FIXED

QA Contact:

Martin Oberhuber <mober.at+eclipse>

Severity:

critical

Priority:

CC:

dmcknigh, eclipse, javier.montalvoorus

Version:

2.0

Keywords:

investigate

Target Milestone:

2.0.1

Flags:

dmcknigh: review+
javier.montalvoorus: review+

Hardware:

OS:

Linux-GTK

Whiteboard:

Bug Depends on:

180965, 173995, 196662

Bug Blocks:

Attachments:

Description	Flags
Thread dump of the deadlock	none
Patch fixing the issue for AccumulatingProgressMonitor	none
Updated patch (close FTP stream before completePendingCommand)	none

Description Martin Oberhuber

2007-09-13 12:25:00 EDT

Created attachment 78335 [details]
Thread dump of the deadlock

On a remote FTP server, under My Home, create folers and files
   aatest/test.txt 
   bbtest/test.txt
Connect to remote FTP.

Select aatest/test.txt and bbtest/test.txt; 
Right-click > Compare with > Each other.
Depending on timing and what RSE has in the FTP stat-cache, RSE will deadlock.
It may be necessary to browse a different FTP folder (to clear the stat-cache)
before doing the compare, in order to provoke the deadlock.

The thread dump shows what happens:
* Compare-with action performs downloads in the ModalContextThread.
  Each download locks the FTP Mutex, and updates a ProgressMonitor.
* At the same time, RSE performs a "getFile()" call on the main thread.
  The getFile() cannot proceed because it waits on the FTP Mutex; 
  so the main thread is blocked.
* Because the main thread is blocked, the progress monitor update from the
  ModalContextThread can not continue -> Mutex is never released -> Deadlock.

The deadlock is kind of inevitable because of the way FTP works: There is only one channel shared by multiple threads, and the display thread also tries using that channel. So if either bug 180965 (multiple concurrent downloads) or bug 196662 (no queries on dispatch thread) were fixed, the problem would not occur.

Note that for SSH this issue would not occur because there are no actions which drive a progress monitor from inside the Mutex protected region; and for dstore, it should also not occur because there is no Mutex protected region.

I consider the issue critical since it blocks all of Eclipse.

-----------Enter bugs above this line-----------
TM 2.0.1 Testing
installation : eclipse-SDK-3.3-linux-gtk-x86_64, emf-sdo-xsd-SDK-2.3.0
     Download RSE-2.0.1RC1: RSE-SDK,examples,tests,discovery,terminal
java.runtime : Sun 1.5.0_12-b04  64-bit Server VM, mixed mode
os.name:     : OpenSUSE 10.2 64-bit
uname        : Linux osgiliath 2.6.18.8-0.5-default x86_64 GNU/Linux
------------------------------------------------
systemtype   : FTP-Only
targetos3    : Solaris-sparc 5.9, Sun 1.4.2_05
targetuname  : SunOS szg-anar 5.9 Generic_118558-06 sun4u sparc SUNW,Sun-Blade-1500
------------------------------------------------

Comment 1 Martin Oberhuber

2007-09-13 12:51:38 EDT

I'm finding the issue extremely hard to reproduce (can't get to it again - in spite of clearing cache, restarting everything etc).

Still I think that any deadlock we know of should be fixed as soon as possible, because it affects all of Eclipse and can thus lead to loss of data.

Unfortunately, right now I'm not sure how to fix this properly with a fix that does not need bug 180965 fixed. We would need to make sure that any code inside a protected region does not call out to the dispatch thread, such that we can ensure it can always complete. This would mean that we either have no progress reporting from the protected regions at all, or the progress reporting must be made asynchronous.

Another workaround (though not fix) might be to decrease the maximum wait time of the Mutex. Then, in case of a deadlock, it could complete when the maximum wait time is elapsed. Decreasing the wait time, however, could mean that in case of a long-going download the other query would not wait long enough and fail.

A variant of decreasing the wait time could be to decrease it only if the dispatch thread is affected (use e.g. 5 seconds if on the dispatch thread, or 2 minutes [or much longer?] if on a background thread). Returning failure due to elapsed Mutex timeout could happen frequently as long as downloads are also hogging the Mutex; it would happen much less frequently when downloads were asynchronous as requested by bug #196662.

Anybody have any other ideas?

Comment 2 Martin Oberhuber

2007-09-13 13:04:25 EDT

One option might be to use something like the EventLoopProgressMonitor (an internal class from Editor) inside the Mutex.waitFor() method in case the thread that's waiting is the dispatch thread. Read the Class Javadoc for EventLoopProgressMonitor and you'll see they did this in order to work around similar issues with the Editor.

The problems with this approach are that 
1.) Nested event loops are bad and can lead to other bad effects
2.) In order to run the nested event loop we get a UI dependency, but the
    Mutex and FTPService are all in non-UI...

Comment 3 Martin Oberhuber

2007-09-13 13:36:36 EDT

It turns out that another reason for this deadlock is bug 173995: Some of the Platform's progress monitors use Display.syncExec() so they can block on the UI thread. If they would use only asyncExec the problem would go away.

Therefore, another fix for the concrete situation could be an AsyncProgressMonitor, which wraps all calls to the passed-in progressMonitor with separate Threads that decouple us from the outside:

class AsyncProgressMonitor implements IProgressMonitor {
   private IProgressMonitor wrapped;
   private Thread prev = null;
   public AsyncProgressMonitor(IProgressMonitor wrapped) {
       this.wrapped = wrapped;
   }
   /* ... */
   public void beginTask(final String name, final int totalWork) {
       final Thread prevThread = prev;
       Thread t = new Thread() {
           public void run() {
               if(prevThread!=null) prevThread.join();
               wrapped.beginTask(name, totalWork);
           }
       };
       prev = t;
       t.start();
   }
}

Not being able to asyncExec() in our non-UI code, this approach creates lots of Threads. Joining the Threads is necessary to ensure proper order of the calls. [An alternative around these two issues could be a Queue of Runnables, with a single Thread that picks the Runnables in the order they were posted - very much like the Display event queue, but local to our code]. 

Another issue is that we cannot easily check isCanceled() asynchronously.

At any rate, such decoupling would be quite a lot of non-trivial code and thus risky that short before our 2.0.1 release.

Comment 4 Martin Oberhuber

2007-09-13 15:06:07 EDT

Doing more investigations, as per bug #173995 comment 6, it seems that only AccumulatingProgressMonitor uses the syncExec that leads to the deadlock, whereas the Job framework reports always asynchronously.

In AccumulatingProgressMonitor, the problematic calls using syncExec are:
   beginTask()
   done()
   setTaskName()
wherease the others - worked(), subTask(), isCanceled() in particular - 
run asynchronously.

Further speculations on this are not a real solution but a workaround at its best, because we cannot make assumptions on the kinds of progress monitors being passed into us; but still, it might alleviate the worst problem by considering it. So, assuming AccumulatingProgressMonitor:

  * Mutex only uses worked(), isCanceled()
  * listFiles() and the other methods just use worked(), isCanceled()

So the trick would be to call beginTask(), and done() outside the protected regions. I'm going to write a patch for this which should fix the imminent issue for AccumulatingProgressMonitor but would not be safe for other kinds of monitors. A safe fix would be through bug #180965 and bug #196662.

Comment 5 Martin Oberhuber

2007-09-13 15:53:45 EDT

Created attachment 78377 [details]
Patch fixing the issue for AccumulatingProgressMonitor 

Attached patch fixes the issue for AccumulatingProgressMonitor, by making sure that the critical methods init(), done() are called outside the critical sections. It also ensures that the streams being opened for upload and download are always closed (through ...finally... sections), and that upload does correct error reporting in case the stream can not be opened.

Methods worked(), subTask() and isCanceled() are still called from inside the critical sections, so this is not a perfect fix yet for arbitrary unknown progress monitors. A better fix would be introduced by making sure that downloads and uploads (which do update the progress) do not hog the single critical section, i.e. there should be multiple concurrent uploads and downloads as per bug 180965. SftpService is going that route.

In general, calling outside (to progress) from the critical section is not a good idea but if we want to do progress reporting it looks like we have to do it. So if the patch is accepted, I think we could mark this bug fixed for now, and wait for futher improvements through the other bugs blocking this.

Comment 6 Martin Oberhuber

2007-09-13 15:54:55 EDT

Dave can you review this? - You may find it helpful to disable whitespace diffs in the diff tool (Preferences : General : Compare/Patch : Ignore Whitespace).

Comment 7 Martin Oberhuber

2007-09-13 15:55:50 EDT

Javier can you also review this? - You may find it helpful to disable whitespace diffs in the diff tool (Preferences : General : Compare/Patch : Ignore Whitespace).

Comment 8 Martin Oberhuber

2007-09-13 16:16:25 EDT

Created attachment 78382 [details]
Updated patch (close FTP stream before completePendingCommand)

Attached updated patch must be used to have it working - please review this one.

Comment 9 Martin Oberhuber

2007-09-14 08:57:37 EDT

Patch committed and released for 2.0.1RC2 - Thanks