Re: [ecf-dev] How to deal with recovering an ecf remote service connection

Hi Christoph,

Sorry for jumping in the middle of your discussion with Wim.   I'm just starting my day here though and I hope to contribute to this.

On 2/1/2016 7:42 AM, Keimel, Christoph wrote:

Hi Wim,


Let’s see if I get this correctly:


Let’s say that A wants to know when a touch sensor on B is getting pressed (true/false). What I am doing right now is: A puts up a whiteboard service (TouchSensorSniffer). This service is picked up by B using a ServiceTracker [1]. B then holds on to the service and calls TouchSensorSniffer#onStateChanged whenever the touch sensor state changes. (Of course I also clear my internal cache when the service gets removed.)


I’ll use this simple setup to describe my situation: After both A and B are started everything is fine and B has discovered the TouchSensorSniffer from A. Now I disconnect B from the network by pulling the LAN cable. Both A and B continue to run.

Yes, they continue to run, but one question is: on B (the service consumer), does the remote service proxy get unregistered after the keepAlive timeout (30s by default)? If you are using a ServiceTracker, this should result in the removedService method being called. It won't happen immediately (since the default keepAlive is 30s), but it should happen, because the generic provider has failure detection.

If the state of the touch sensor changes at this moment B would try to send this information over the TouchSensorSniffer to A. But since B is disconnected from the network, this request fails after the timeout. B thinks this is a temporary error and just logs it.

B should probably do something other than just log this as a temporary error.


If I reconnect the LAN cable after a couple of seconds and then press my touch sensor again, B will again use the TouchSensorSniffer service to send the state change. This time everything works out because the network is back up: Cool. But let’s assume I don’t reconnect right away but I wait until the keepalive period (default 30 seconds) is over. What happens now is that the TouchSensorSniffer is unregistered in B which is ok, since we assume that the connection is gone for good.

Right...this is referred to as 'fail stop'.  One has to assume that the connection is gone for good, because it may actually be gone for good :).
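
To make the 'fail stop' idea concrete, here is a minimal sketch of a timeout-based failure detector. It is illustrative only and is not the generic provider's actual keepAlive code; the class name KeepAliveDetector and its methods are hypothetical. Once the timeout has elapsed without a heartbeat, the peer is treated as failed, and later heartbeats do not revive it.

```java
// Illustrative only: ECF's generic provider has its own keepAlive logic.
// KeepAliveDetector and its method names are hypothetical.
final class KeepAliveDetector {
    private final long timeoutMillis;
    private long lastSeenMillis;
    private boolean failed;

    KeepAliveDetector(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastSeenMillis = nowMillis;
    }

    // Record a heartbeat. A heartbeat that arrives after the timeout has
    // already elapsed is ignored: the peer stays failed (fail stop).
    void heartbeat(long nowMillis) {
        if (!check(nowMillis)) {
            lastSeenMillis = nowMillis;
        }
    }

    // Returns true if the peer is considered failed at nowMillis.
    boolean check(long nowMillis) {
        if (!failed && nowMillis - lastSeenMillis > timeoutMillis) {
            failed = true;
        }
        return failed;
    }
}
```

With a 30s timeout, a heartbeat at 20s keeps the peer alive until just after 50s; after that the detector reports failure no matter what arrives later, which matches what you observe after reconnecting the cable.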

If I touch the sensor now B sees that no TouchSensorSniffer services are registered and therefore doesn’t send this information anywhere. Also good. Now, after 60 seconds, I reconnect the LAN cable. Both A and B are still running but B doesn’t pick up on the TouchSensorSniffer from A. They stay disconnected.



This last part is based on my observations, so I’m not sure I understand this completely. Does my description come close to the truth and is this the result that is to be expected?

Yes, I think so.

Or would you expect the discovery on B to find the TouchSensorSniffer from A again after the network connection has been reestablished?

This is where the specifics of the discovery provider interact with the specifics of the distribution provider. Wim is the expert on ZooKeeper, but I don't believe that reestablishing the network connection will, by itself, trigger a rediscovery of a previously discovered service.


Or is the problem that I am holding on to an instance of TouchSensorSniffer on B?

I think that holding onto the instance of TouchSensorSniffer on B is essentially assuming that this existing connection will be reestablished *within 30s*, and I think that this is probably not a reasonable assumption for your problematic network.

I could stop using a ServiceTracker and look into the OSGi service registry directly to search for all implementations of TouchSensorSniffer anytime the state changes via BundleContext#getServiceReferences. I see that this would change the situation slightly, because I would use BundleContext#ungetService right after sending the information and then get the service again for the next event. But I am not sure that this would change the basic situation, since the registry itself is already caching the available remote services. Or am I wrong about this?

The service registry is holding onto the remote service proxy's ServiceReference, but this proxy will be unregistered when/if the remote service is unregistered via the failure detection/keepAlive/timeout (30s by default). This unregistration of the proxy should result in removedService being called (ServiceTracker) or unbind (DS). Basically you need something to notify your code when the proxy becomes unregistered so that you can give up/stop using the TouchSensorSniffer on B.
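
As a rough sketch of that notification path, the class below holds only the sniffers that are currently registered. It has no OSGi dependency; the assumption is that your ServiceTracker callbacks (addingService/removedService) or DS bind/unbind methods would delegate to bind and unbind. SnifferHub is a hypothetical name, the TouchSensorSniffer interface here is a stand-in for yours, and I am assuming that a failed remote call surfaces as a RuntimeException.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Stand-in for the remote service interface discussed in this thread.
interface TouchSensorSniffer {
    void onStateChanged(boolean pressed);
}

// Hypothetical holder for the currently registered sniffers.
// bind/unbind are meant to be called from the tracker or DS callbacks.
class SnifferHub {
    private final List<TouchSensorSniffer> sniffers = new CopyOnWriteArrayList<>();

    void bind(TouchSensorSniffer sniffer) {
        sniffers.add(sniffer);
    }

    // Called when the proxy is unregistered: stop using it.
    void unbind(TouchSensorSniffer sniffer) {
        sniffers.remove(sniffer);
    }

    // Sends the state change to every bound sniffer and returns how many
    // calls succeeded. Failures are counted as undelivered, not rethrown.
    int publish(boolean pressed) {
        int delivered = 0;
        for (TouchSensorSniffer sniffer : sniffers) {
            try {
                sniffer.onStateChanged(pressed);
                delivered++;
            } catch (RuntimeException e) {
                // Assumption: a remote call failure (e.g. timeout) lands here.
            }
        }
        return delivered;
    }
}
```

The caller can compare the return value of publish with what it expected and decide what to do about undelivered events, rather than only logging them.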

Now, one question is:  once detected, what should B do to recover from a network failure?  This can be a difficult question to answer in general, because the failure could be permanent (so no use retrying), or it could be very short and would/will heal very quickly.   Predicting the future is difficult :).

There are mechanisms to deal with these problems. One is extending/customizing the OSGi Topology Manager, which would allow implementing some recovery strategy for a service that has gone away (e.g. import retry). There is also some tuning of the ECF generic provider failure detection that can be done. Finally, the ECF generic provider (and others, like the JMS provider) has some notion of communication groups and group membership, and so this can be used to associate remote services with each other.
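
To illustrate what an import retry strategy might look like, here is a minimal sketch of a capped exponential backoff policy. It is not part of ECF or of the Topology Manager API; ImportRetryPolicy and its parameters are hypothetical, and a customized topology manager would still have to schedule the actual re-import itself.

```java
import java.util.OptionalLong;

// Hypothetical retry policy; not an ECF or OSGi Remote Service Admin class.
final class ImportRetryPolicy {
    private final long initialDelayMillis;
    private final long maxDelayMillis;
    private final int maxAttempts;

    ImportRetryPolicy(long initialDelayMillis, long maxDelayMillis, int maxAttempts) {
        this.initialDelayMillis = initialDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
        this.maxAttempts = maxAttempts;
    }

    // Delay to wait before retry number `attempt` (1-based). Returns empty
    // once maxAttempts is exceeded, i.e. the failure is treated as permanent.
    OptionalLong delayBefore(int attempt) {
        if (attempt < 1 || attempt > maxAttempts) {
            return OptionalLong.empty();
        }
        long delay = Math.min(initialDelayMillis, maxDelayMillis);
        for (int i = 1; i < attempt; i++) {
            // Double the delay, capping at maxDelayMillis without overflow.
            delay = (delay > maxDelayMillis / 2) ? maxDelayMillis : delay * 2;
        }
        return OptionalLong.of(delay);
    }
}
```

Since the failure could be very short or permanent, the policy retries quickly at first, backs off, and eventually gives up; the right values depend on your network.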