[
Date Prev][
Date Next][
Thread Prev][
Thread Next][
Date Index][
Thread Index]
[
List Home]
[mosquitto-dev] lwt interaction with new connections?
|
I'm running mosquitto 1.6.7 on NetBSD-8/amd64. Aside from the following
problem all is stable. [This verges into rubber ducking, but it's also
long to explain why I think something might be wrong.]
I have a broker in one place, and clients in various other places,
fairly far away on the Internet (all in the US, but a number of ISPs).
The clients are of multiple types, but just now I had an issue with two
types at once: ESP8266 running a program I wrote in NodeMCU, and
mosquitto_pub on NetBSD-8/earmv7hf-el (RPI3).
All my messages are qos 2 with retain flag.
The ESP8266 client sets a lwt to publish OFF to sensor/a/b/online:
m = mqtt.Client(mqtt_clientid, 60, mqtt_user, mqtt_password)
m:lwt(mqtt_topic .. "/online", "OFF", 1, 1)
m:on("offline", mqtt_offline)
It then connects, when wifi/dhcp succeeds
m:connect(mqtt_host, 8883, 1, mqtt_connect_succeed, mqtt_connect_fail)
and then when that succeeds publishes ON to /online and starts the
temperature reporting loop (until "sensor_start", the timer that causes
temp to be reported every minute has not even been created).
client:publish(mqtt_topic .. "/online", "ON", 1, 1, mqtt_puback)
sensor_start(report_TPHV)
It then publishes temperature and humidity to sensor/a/b/temperature and
sensor/a/b/humidity every minute. If there are any errors (including
wifi or mqtt disconnect), it reboots and starts over. It also reboots
if more than 900s have passed since getting a PUBACK from posting
temperature.
I then have monitoring (via Home Assistant) to display the temperature
and send alerts on too-low temps and loss of connectivity. This is
perhaps slightly wonky (I'm debugging an alert oddness), but I have
direct evidence of mqtt state for my first example.
This morning, I got a loss of connectivity alarm for a particular sensor
foo/main, and sure enough sensor/foo/main/online was OFF. However, the
temperature was still updating -- so the publish call to online=ON must
have happened. Subscribing to sensor/foo/main/#, I saw OFF, and sane
values that match temperature and humdity. I then saw updates to temp
and humidity.
In the logs, I did see that multiple other clients including two bridges
had disconnected and reconnected. So there were definitely some network
issues.
Looking at the logs and my notifications, I see
Oct 21 06:45:32 s1 mosquitto[854]: Client esp8266-foo has exceeded timeout, disconnecting.
06:46:00 alert from one monitor about loss of foo, but that monitor is
remote and its bridge was also having trouble
Oct 21 06:46:04 s1 mosquitto[854]: New connection from [foo's.numeric.ipv4.addr] on port 8883.
Oct 21 06:47:35 s1 mosquitto[854]: Client <unknown> has exceeded timeout, disconnecting.
Oct 21 06:48:08 s1 mosquitto[854]: New connection from [foo's.numeric.ipv4.addr] on port 8883.
Oct 21 06:48:11 s1 mosquitto[854]: New client connected from [foo's.numeric.ipv4.addr] as esp8266-foo (p2, c1, k60, u'iot.foo').
06:53:24 alert from another monitor (that has a 5m delay before
sending alerts)
Aside from the 06:46:04 connection attempt that didn't work, this seems
like a fairly normal pattern for loss of connectivity followed by it
working again.
So I wonder if there is some sort of race condition with the lwt from
the old connection and the new connection. Having gone through the
above, it seems obvious that when a client connects with a client-id,
and either does or doesn't set a lwt as part of that connection, then
any lwt that is pending as part of that client-id from a previous
connection should be discarded. The point is that an old lwt firing
that sets online=OFF *after* the new connection has set ON is bad.
At the same time, another host had similar issues. This one uses
mosquitto_pub instead of nodemcu/lua.
Mon Oct 21 06:47:29 EDT 2019 on baz, mosquitto_pub exited, and thus
should have been restarted 5s later. probably it got an error on
writing the keepalive.
Oct 21 06:47:31 s1 mosquitto[854]: Client host.baz has exceeded timeout, disconnecting.
probably the keepalive did not arrive on time.
Oct 21 06:48:17 s1 mosquitto[854]: New connection from [baz's.numeric.ipv6.address] on port 8883.
probably this started at 47:34 and this is just TCP timeouts and
when a SYN/SYN-ACK/ACK packet trio made it
Oct 21 06:48:18 s1 mosquitto[854]: New client connected from [baz's.numeric.ipv6.address] as host.baz (p2, c1, k60, u'iot.bazprime').
Mon Oct 21 06:50:24 EDT 2019 alert: baz DISCONNECTED
Mon Oct 21 07:24:06 EDT 2019 alert: baz reconnected
In this case, mosquitto_pub has --repeat-delay 86400. I completely do
not understand the reconnect here; mosquitto_pub did not exit, according
to my logs.
I did not check the broker state for baz; it came back before I got to
that.
I can see about turning on more debugging and see if this happens again,
but I thought I'd describe the symptoms.
Thanks for listening!