Re: [ecf-dev] How to deal with recovering an ecf remote service connection
Sorry for jumping in the middle of your discussion with Wim. I'm just starting my day here though and I hope to contribute to this.
On 2/1/2016 7:42 AM, Keimel, Christoph wrote:
Yes, they continue to run, but one question is: on B (the service consumer), does the remote service proxy get unregistered after the 30s keepAlive timeout? If you are using a ServiceTracker, this should result in the removedService method being called. It won't happen immediately (since the default keepAlive is 30s), but it should happen. This is because the generic provider has failure detection.
B should probably do something other than just log this as a temporary error.
Right...this is referred to as 'fail stop'. One has to assume that the connection is gone for good, because it may actually be gone for good :).
Yes, I think so.
This is where the specifics of the discovery provider interact with the specifics of the distribution provider. Wim is the expert on ZooKeeper, but I don't believe that reestablishing the network connection will, by itself, trigger a rediscovery of a previously discovered service.
I think that holding onto the instance of TouchSensorSniffer on B is essentially assuming that this existing connection will be reestablished *within 30s*, and I think that this is probably not a reasonable assumption for your problematic network.
The service registry is holding onto the remote service proxy's ServiceReference, but this proxy will be unregistered when/if the remote service is unregistered via the failure detection/keepAlive timeout (30s by default). This unregistration of the proxy should result in removedService being called (for ServiceTracker) and unbind being called (for DS). Basically, you need something to notify your code when the proxy becomes unregistered so that you can give up/stop using the TouchSensorSniffer on B.
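To make that concrete, here is a minimal sketch of the consumer-side pattern in plain Java, with no OSGi imports so it stands alone. The `readSensor` method, the `SnifferConsumer` class and its `bind`/`unbind`/`poll` methods are all hypothetical names for illustration; in a real bundle, `bind` and `unbind` would be driven by the DS bind/unbind callbacks or by a ServiceTracker's addingService/removedService.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the remote service interface; the real
// TouchSensorSniffer methods are not shown in this thread.
interface TouchSensorSniffer {
    String readSensor();
}

// Consumer on B that only uses the proxy while it is registered.
class SnifferConsumer {
    private final AtomicReference<TouchSensorSniffer> sniffer = new AtomicReference<>();

    // Assumed to be called when the proxy is registered
    // (DS bind, or ServiceTracker addingService).
    void bind(TouchSensorSniffer proxy) {
        sniffer.set(proxy);
    }

    // Assumed to be called when the proxy is unregistered
    // (DS unbind, or ServiceTracker removedService).
    // Clears the reference only if it still points at this proxy.
    void unbind(TouchSensorSniffer proxy) {
        sniffer.compareAndSet(proxy, null);
    }

    // Uses the remote service only if a proxy is currently bound.
    Optional<String> poll() {
        TouchSensorSniffer current = sniffer.get();
        if (current == null) {
            return Optional.empty();
        }
        return Optional.of(current.readSensor());
    }
}
```

The point of the sketch is only that B never caches the proxy beyond its registration: once unbind fires, later calls see no service instead of a dead connection.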
Now, one question is: once the failure is detected, what should B do to recover from it? This can be a difficult question to answer in general, because the failure could be permanent (so there is no use retrying), or it could be very short and heal quickly. Predicting the future is difficult :).
There are mechanisms to deal with these problems. One is extending/customizing the OSGi Topology Manager, which would allow implementing some recovery strategy for a service that has gone away (e.g. import retry). The ECF generic provider's failure detection can also be tuned. Finally, the ECF generic provider (and others, like the JMS provider) also has some notion of communication groups and group membership, and so this can be used to associate remote services with each other.
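As an illustration of what an "import retry" strategy might look like, here is a rough sketch of a bounded retry loop with exponential backoff in plain Java. It is not the Topology Manager API: `ImportRetry`, `backoffMillis`, `retry` and the `importAttempt` supplier are hypothetical names, and the supplier stands in for whatever code would actually re-import the remote service. It assumes a small `baseMillis` (on the order of seconds or less).

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper; not part of ECF or the OSGi Remote Service Admin API.
class ImportRetry {

    // Delay before retrying after the given 0-based attempt:
    // baseMillis * 2^attempt, capped at maxMillis.
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20);
        return Math.min(delay, maxMillis);
    }

    // Runs importAttempt up to maxAttempts times, sleeping between failures.
    // Returns the first non-empty result, or empty if every attempt failed
    // or the thread was interrupted while waiting.
    static <T> Optional<T> retry(Supplier<Optional<T>> importAttempt,
                                 int maxAttempts, long baseMillis, long maxMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Optional<T> result = importAttempt.get();
            if (result.isPresent()) {
                return result;
            }
            if (attempt < maxAttempts - 1) {
                try {
                    Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

Capping the number of attempts reflects the point above: the failure may be permanent, so at some point the consumer has to give up rather than retry forever.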