Bug 321766 - Connector service connect and disconnect can interfere with each other across threads
Summary: Connector service connect and disconnect can interfere with each other across...
Status: RESOLVED FIXED
Alias: None
Product: Target Management
Classification: Tools
Component: RSE (show other bugs)
Version: 3.0.3   Edit
Hardware: All All
: P1 major (vote)
Target Milestone: 3.2.2+   Edit
Assignee: David Dykstal CLA
QA Contact: Martin Oberhuber CLA
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-04 14:06 EDT by David Dykstal CLA
Modified: 2011-04-06 11:51 EDT (History)
1 user (show)

See Also:
mober.at+eclipse: pmc_approved+
ddykstal.eclipse: review? (dmcknigh)


Attachments
patch for connect/disconnect interference bug (5.55 KB, patch)
2010-09-07 17:36 EDT, David Dykstal CLA
no flags Details | Diff
mylyn/context/zip (868 bytes, application/octet-stream)
2010-09-07 17:36 EDT, David Dykstal CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description David Dykstal CLA 2010-08-04 14:06:23 EDT
We have run into situations in products where a connect and disconnect can be processing simultaneously in the same connector service. This can result in the connector service throwing an NPE due to its connection reference disappearing.

Connect and disconnect operations are long running and call ISV provided routines. Connect, in particular, can wait indefinitely for user interaction. Thus, connect and disconnect in the connector service must be serialized using safe techniques employing timeouts. Simply making these operations "synchronized" courts deadlock.
Comment 1 David Dykstal CLA 2010-08-04 14:07:50 EDT
Adding to the 3.2.1 schedule. A backport will also be needed to 3.0.x for IBM products.
Comment 2 David Dykstal CLA 2010-08-09 12:13:42 EDT
Fix involves creating an internal semaphore class to serialize the entrance into the connect and disconnect methods.

Verified fix by halting in disconnect after processing and initiating a connect operation. Connect operation times out correctly.
Comment 3 Martin Oberhuber CLA 2010-08-19 08:45:05 EDT
Dave - I have some serious concerns about this implementation, so I'm reopening for discussion:

1. I see some risk of deadlock with your implementation, for instance:
   - internalConnect() etc are hook methods. Since your Semaphore is not
     reentrant, when any plugged-in code that's called from below 
     connect() or disconnect() happens to call connect() or disconnect()
     again you are deadlocked since the Semaphore would never be released.
     This is not entirely unlikely, I seem to remember seeing some code
     which called disconnect() from inside internalConnect() when a failure
     was detected. 

2. Why do you define a new Semaphore class, when the 
      org.eclipse.rse.services.Mutex 
   class could be used? The way you use your Semaphore, I cannot see what
   the Mutex couldn't do for you. Usage of the Mutex is very simple, see
   the SftpFileService implementation.

3. Multi-threading is tricky, and shouldn't be committed to a Maintenance
   Stream without posting a patch on Bugzilla for review (which would have
   helped me catch this much sooner).

I don't have a better solution for your "simultaneous connect / disconnect" problem out of the box, but I remember that Dave M came up with a nice solution for a similar problem (related to connecting subsystems, if I'm not mistaken).

Unless you can argue that your change is safe, I would like to ask you back out the change until something better is found.
Comment 4 David Dykstal CLA 2010-08-19 11:10:29 EDT
Martin --

You are correct with your concerns. I had not considered the case that connect might call disconnect within the same thread. I tested the multithread case which was causing the original problem and that did, in fact, work correctly. Likewise, I should have used Mutex rather than invent my own class. If Mutex is reentrant then both of these cases can be addressed easily.

I'm assuming we can address these and backport a corrected fix. The original problem is serious enough to warrant some sort of fix at the 3.0.x level.
Comment 5 Martin Oberhuber CLA 2010-09-07 06:38:00 EDT
ping, this is still open and I'm not going to release the current code to the maintenance Stream. I did notice that any potential deadlock would be resolved after 2 minutes due to the timeout, but it still doesn't quite seem right.
Comment 6 David Dykstal CLA 2010-09-07 09:03:28 EDT
I am working on this today. Should have something for review this evening.
Comment 7 David Dykstal CLA 2010-09-07 15:50:28 EDT
It appears to me that both Semaphore and Mutex accomplish the same thing in this context. Mutex has a couple of improvements: It ensures that only the thread that acquired the mutex releases it. The Semaphore implementation does not do that explicitly, but practically, since it is not publicly visible, it is impossible to for another thread to invoke a release unless it invokes acquire. Mutex allows cancellation using its progress monitor, Semaphore does not. Mutex handles InterruptedExceptions better by terminating a wait loop immediately, Semaphore does not, but waits for the whole timeout. Mutex also tests for interrupted threads on entry to acquire.

In general, Mutex is a better implementation, but Semaphore does not introduce any bugs in its usage that I can tell from inspection. I will use and test Mutex.

The more difficult aspect is accounting for reentrancy. Neither Mutex nor Semaphore are reentrant and therefore the situation you describe where one thread invokes either connect or disconnect on the same connector service from internalConnect or internalDisconnect can cause problems. This is unlikely, but certainly possible. I cannot find the code internally where we might do this, but it would make sense to guard against it. In this case, we can either allow the inner disconnect or connect to proceed by bypassing the already acquired mutex or to fail immediately. I would argue that failure is somehow "correct", but that proceeding is the current behavior and that current behavior should be preserved in this case.

Comments?
Comment 8 David Dykstal CLA 2010-09-07 17:36:11 EDT
Created attachment 178363 [details]
patch for connect/disconnect interference bug

The patch is against the 1.19 version of the code. Load the patch and compare to 1.19 to see the changes since then. Compare to 1.17 to see the whole fix.
Comment 9 David Dykstal CLA 2010-09-07 17:36:14 EDT
Created attachment 178364 [details]
mylyn/context/zip
Comment 10 David Dykstal CLA 2010-09-07 17:44:15 EDT
Dave and Martin --

Please review the patch. The commented out code is left there for reference but will be removed when committed.
Comment 11 Martin Oberhuber CLA 2010-09-08 05:54:27 EDT
The new code is definitely better than what's in HEAD now, so I suggest you commit it to HEAD for easier review:
  - Code reuse by Mutex instead of Semaphore
  - waiting for Mutex is now cancelable
  - reentrancy is solved

I still have a couple of concerns that I'd like to discuss before I release 
it to 3.2.1:

- TimeoutException. When this happens, the exception is caught and logged,
  so for a caller of connect() it looks like it should be connected... but
  it is not. It feels like an exception should be thrown to the outside too
  in case of a Timeout.

- Semantics. Are we sure that Semantics are as desired? For instance, when
  "Re-open Remote Systems view to previous state" is on, it will try to 
  reconnect on startup. This can take long (indefinitely), and it looks like
  with this change, the user has no chance to interrupt this long-running
  connect (because it's a first-come / first-serve semantics serializing
  all requests). Trying to "disconnect" the connecting connection will 
  wait for connect to finish and then run into timeout. 

I'm wondering whether it wouldn't be simpler and more desirable if disconnect always had priority over connect, i.e. even allow it to "interrupt" any Threads that are currently waiting or trying to connect. Mutex#interruptAll() supports this to some extent. But I didn't think this through to the end and I'm not even sure whether it would solve the original problem.

Could you be more specific about the original problem - is simultaneous connect and disconnect kind of an error situation or normal, which one should have priority, and in what respect is the NPE being thrown problematic i.e. how much of an itch is it? Could there be a solution / workaround on the client side? Having methods fail fast in case of an unsupported state of the connection is another option, and safer than waiting.
Comment 12 David Dykstal CLA 2010-09-08 08:12:40 EDT
> - TimeoutException. When this happens, the exception is caught and logged,
>   so for a caller of connect() it looks like it should be connected... but
>   it is not. It feels like an exception should be thrown to the outside too
>   in case of a Timeout.

We cannot just throw a TimeoutException for several reasons:
1) It is defined as an internal class, no one has access to it. This is by design. TimeoutException is introduced in 1.5 as part of java.util.concurrent. To make this one visible would be confusing. I used it internally to clearly document in the code what was actually going on.
2) throwing a TimeoutException changes the contract

In the case of connect the timeout is logged and an OperationCanceledException is thrown. This seems like a reasonable compromise since OperationCanceledException is part of the contract and one could argue that the expiration of a timeout is equivalent to a cancel.

In the case of disconnect, OperationCanceledException is not thrown. It turns out that the handling for OperationCanceledException on disconnect is not quite correct in RSE and I was trying to minimize impact in a maintenance stream. This is something we should fix in our 3.3 release.

In both cases, a timeout is not really an error, but a normal condition that can be expected occasionally.
Comment 13 David Dykstal CLA 2010-09-08 08:29:13 EDT
> - Semantics. Are we sure that Semantics are as desired? For instance, when
>   "Re-open Remote Systems view to previous state" is on, it will try to 
>   reconnect on startup. This can take long (indefinitely), and it looks like
>   with this change, the user has no chance to interrupt this long-running
>   connect (because it's a first-come / first-serve semantics serializing
>   all requests). Trying to "disconnect" the connecting connection will 
>   wait for connect to finish and then run into timeout. 
> 
> I'm wondering whether it wouldn't be simpler and more desirable if disconnect
> always had priority over connect, i.e. even allow it to "interrupt" any Threads
> that are currently waiting or trying to connect. Mutex#interruptAll() supports
> this to some extent. But I didn't think this through to the end and I'm not
> even sure whether it would solve the original problem.
> 
> Could you be more specific about the original problem - is simultaneous connect
> and disconnect kind of an error situation or normal, which one should have
> priority, and in what respect is the NPE being thrown problematic i.e. how much
> of an itch is it? Could there be a solution / workaround on the client side?
> Having methods fail fast in case of an unsupported state of the connection is
> another option, and safer than waiting.

It should be the case that, during startup, the user can manually cancel the connect operations using the progress monitor. That SHOULD work in all cases. If it does not it is probably a different bug. However, giving disconnect priority is a possibility. I think this is something that should be considered for our 3.3 release.

I do not have the actual details of the error, it occurs in a situation where EFS is being used which is probably why we have not seen it in our testing. It appears to be a classic race condition. Our service team has verified that the fix works for them. It apparently happened often enough that it was given a high severity.
Comment 14 David Dykstal CLA 2010-09-08 21:58:39 EDT
Committed. Please review. Thanks! I really want to get this closed before RC4 freeze.
Comment 15 Martin Oberhuber CLA 2010-09-09 09:53:22 EDT
As per comment 10 I thought you wanted to remove the commented out code?

To me it still feels like a disconnect that comes in while a connect is ongoing should likely cancel that connect. Waiting until connected seems useless when it will disconnect immediately after.

The more I think about this, the more it seems to me that concurrent connect and disconnect requests are maybe an issue with the client code. Has this ever been analyzed? It might be worthwile adding some tracing code which prints a stacktrace when disconnect detects that a connect is still ongoing. Perhaps another cause of this problem is that RSE doesn't have a "connecting" state (we only have connected or disconnected, but not the grey area in between).

Anyways, from this point of view (treating concurrent connect / disconnect as exceptional), I am OK with the change. We cannot 100% guarantee that bad client code cannot deadlock. On the other hand, any code that would deadlock would likely have thrown the NPE before the change.
Comment 16 David Dykstal CLA 2010-09-09 10:29:04 EDT
I agree that the correct final solution is to cancel outstanding connects if a disconnect is received. This is substantially more work and potentially disruptive since we must test implementers of connect to ensure that they clean up correctly after being interrupted.

I've removed the commented out code. My mistake. Simply forgot to go back and do it.
Comment 17 Martin Oberhuber CLA 2010-09-09 13:34:43 EDT
Released, please test with the next M-Build (should be M20100909-1340).
Comment 18 David Dykstal CLA 2010-09-13 16:04:30 EDT
(In reply to comment #17)
> Released, please test with the next M-Build (should be M20100909-1340).

Done. Looks good.
Comment 19 Martin Oberhuber CLA 2010-11-09 08:26:41 EST
Verified that this has been released to 3.3m3 as well.