Summary: | [ftp] Deadlock when comparing two files on FTP | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Tools] Target Management | Reporter: | Martin Oberhuber <mober.at+eclipse> | ||||||||
Component: | RSE | Assignee: | Martin Oberhuber <mober.at+eclipse> | ||||||||
Status: | RESOLVED FIXED | QA Contact: | Martin Oberhuber <mober.at+eclipse> | ||||||||
Severity: | critical | ||||||||||
Priority: | P1 | CC: | dmcknigh, eclipse, javier.montalvoorus | ||||||||
Version: | 2.0 | Keywords: | investigate | ||||||||
Target Milestone: | 2.0.1 | Flags: | dmcknigh:
review+
javier.montalvoorus: review+ |
||||||||
Hardware: | PC | ||||||||||
OS: | Linux-GTK | ||||||||||
Whiteboard: | |||||||||||
Bug Depends on: | 180965, 173995, 196662 | ||||||||||
Bug Blocks: | |||||||||||
Attachments: |
|
Description
Martin Oberhuber
2007-09-13 12:25:00 EDT
I'm finding the issue extremely hard to reproduce (can't get to it again - in spite of clearing cache, restarting everything etc). Still I think that any deadlock we know of should be fixed as soon as possible, because it affects all of Eclipse and can thus lead to loss of data. Unfortunately, right now I'm not sure how to fix this properly with a fix that does not need bug 180965 fixed. We would need to make sure that any code inside a protected region does not call out to the dispatch thread, such that we can ensure it can always complete. This would mean that we either have no progress reporting from the protected regions at all, or the progress reporting must be made asynchronous. Another workaround (though not fix) might be to decrease the maximum wait time of the Mutex. Then, in case of a deadlock, it could complete when the maximum wait time is elapsed. Decreasing the wait time, however, could mean that in case of a long-going download the other query would not wait long enough and fail. A variant of decreasing the wait time could be to decrease it only if the dispatch thread is affected (use e.g. 5 seconds if on the dispatch thread, or 2 minutes [or much longer?] if on a background thread). Returning failure due to elapsed Mutex timeout could happen frequently as long as downloads are also hogging the Mutex; it would happen much less frequently when downloads were asynchronous as requested by bug #196662. Anybody have any other ideas? One option might be to use something like the EventLoopProgressMonitor (an internal class from Editor) inside the Mutex.waitFor() method in case the thread that's waiting is the dispatch thread. Read the Class Javadoc for EventLoopProgressMonitor and you'll see they did this in order to work around similar issues with the Editor. The problems with this approach are that 1.) Nested event loops are bad and can lead to other bad effects 2.) In order to run the nested event loop we get a UI dependency, but the Mutex and FTPService are all in non-UI... It turns out that another reason for this deadlock is bug 173995: Some of the Platform's progress monitors use Display.syncExec() so they can block on the UI thread. If they would use only asyncExec the problem would go away. Therefore, another fix for the concrete situation could be an AsyncProgressMonitor, which wraps all calls to the passed-in progressMonitor with separate Threads that decouple us from the outside: class AsyncProgressMonitor implements IProgressMonitor { private IProgressMonitor wrapped; private Thread prev = null; public AsyncProgressMonitor(IProgressMonitor wrapped) { this.wrapped = wrapped; } /* ... */ public void beginTask(final String name, final int totalWork) { final Thread prevThread = prev; Thread t = new Thread() { public void run() { if(prevThread!=null) prevThread.join(); wrapped.beginTask(name, totalWork); } }; prev = t; t.start(); } } Not being able to asyncExec() in our non-UI code, this approach creates lots of Threads. Joining the Threads is necessary to ensure proper order of the calls. [An alternative around these two issues could be a Queue of Runnables, with a single Thread that picks the Runnables in the order they were posted - very much like the Display event queue, but local to our code]. Another issue is that we cannot easily check isCanceled() asynchronously. At any rate, such decoupling would be quite a lot of non-trivial code and thus risky that short before our 2.0.1 release. Doing more investigations, as per bug #173995 comment 6, it seems that only AccumulatingProgressMonitor uses the syncExec that leads to the deadlock, whereas the Job framework reports always asynchronously. In AccumulatingProgressMonitor, the problematic calls using syncExec are: beginTask() done() setTaskName() wherease the others - worked(), subTask(), isCanceled() in particular - run asynchronously. Further speculations on this are not a real solution but a workaround at its best, because we cannot make assumptions on the kinds of progress monitors being passed into us; but still, it might alleviate the worst problem by considering it. So, assuming AccumulatingProgressMonitor: * Mutex only uses worked(), isCanceled() * listFiles() and the other methods just use worked(), isCanceled() So the trick would be to call beginTask(), and done() outside the protected regions. I'm going to write a patch for this which should fix the imminent issue for AccumulatingProgressMonitor but would not be safe for other kinds of monitors. A safe fix would be through bug #180965 and bug #196662. Created attachment 78377 [details] Patch fixing the issue for AccumulatingProgressMonitor Attached patch fixes the issue for AccumulatingProgressMonitor, by making sure that the critical methods init(), done() are called outside the critical sections. It also ensures that the streams being opened for upload and download are always closed (through ...finally... sections), and that upload does correct error reporting in case the stream can not be opened. Methods worked(), subTask() and isCanceled() are still called from inside the critical sections, so this is not a perfect fix yet for arbitrary unknown progress monitors. A better fix would be introduced by making sure that downloads and uploads (which do update the progress) do not hog the single critical section, i.e. there should be multiple concurrent uploads and downloads as per bug 180965. SftpService is going that route. In general, calling outside (to progress) from the critical section is not a good idea but if we want to do progress reporting it looks like we have to do it. So if the patch is accepted, I think we could mark this bug fixed for now, and wait for futher improvements through the other bugs blocking this. Dave can you review this? - You may find it helpful to disable whitespace diffs in the diff tool (Preferences : General : Compare/Patch : Ignore Whitespace). Javier can you also review this? - You may find it helpful to disable whitespace diffs in the diff tool (Preferences : General : Compare/Patch : Ignore Whitespace). Created attachment 78382 [details]
Updated patch (close FTP stream before completePendingCommand)
Attached updated patch must be used to have it working - please review this one.
Patch committed and released for 2.0.1RC2 - Thanks |