Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
[mosquitto-dev] lwt interaction with new connections?

I'm running mosquitto 1.6.7 on NetBSD-8/amd64.  Aside from the following
problem all is stable.  [This verges into rubber ducking, but it's also
long to explain why I think something might be wrong.]

I have a broker in one place, and clients in various other places,
fairly far away on the Internet (all in the US, but a number of ISPs).
The clients are of multiple types, but just now I had an issue with two
types at once: ESP8266 running a program I wrote in NodeMCU, and
mosquitto_pub on NetBSD-8/earmv7hf-el (RPI3).

All my messages are qos 2 with retain flag.

The ESP8266 client sets a lwt to publish OFF to sensor/a/b/online:

  m = mqtt.Client(mqtt_clientid, 60, mqtt_user, mqtt_password)
  m:lwt(mqtt_topic .. "/online", "OFF", 1, 1)
  m:on("offline", mqtt_offline)

It then connects, when wifi/dhcp succeeds

  m:connect(mqtt_host, 8883, 1, mqtt_connect_succeed, mqtt_connect_fail)

and then when that succeeds publishes ON to /online  and starts the
temperature reporting loop (until "sensor_start", the timer that causes
temp to be reported every minute has not even been created).

  client:publish(mqtt_topic .. "/online", "ON", 1, 1, mqtt_puback)
  sensor_start(report_TPHV)

It then publishes temperature and humidity to sensor/a/b/temperature and
sensor/a/b/humidity every minute.  If there are any errors (including
wifi or mqtt disconnect), it reboots and starts over.  It also reboots
if more than 900s have passed since getting a PUBACK from posting
temperature.

I then have monitoring (via Home Assistant) to display the temperature
and send alerts on too-low temps and loss of connectivity.  This is
perhaps slightly wonky (I'm debugging an alert oddness), but I have
direct evidence of mqtt state for my first example.

This morning, I got a loss of connectivity alarm for a particular sensor
foo/main, and sure enough sensor/foo/main/online was OFF.  However, the
temperature was still updating -- so the publish call to online=ON must
have happened.  Subscribing to sensor/foo/main/#, I saw OFF, and sane
values that match temperature and humdity.  I then saw updates to temp
and humidity.

In the logs, I did see that multiple other clients including two bridges
had disconnected and reconnected.  So there were definitely some network
issues.

Looking at the logs and my notifications, I see

Oct 21 06:45:32 s1 mosquitto[854]: Client esp8266-foo has exceeded timeout, disconnecting.
  06:46:00 alert from one monitor about loss of foo, but that monitor is
  remote and its bridge was also having trouble
Oct 21 06:46:04 s1 mosquitto[854]: New connection from [foo's.numeric.ipv4.addr] on port 8883.
Oct 21 06:47:35 s1 mosquitto[854]: Client <unknown> has exceeded timeout, disconnecting.
Oct 21 06:48:08 s1 mosquitto[854]: New connection from [foo's.numeric.ipv4.addr] on port 8883.
Oct 21 06:48:11 s1 mosquitto[854]: New client connected from [foo's.numeric.ipv4.addr] as esp8266-foo (p2, c1, k60, u'iot.foo').
  06:53:24 alert from another monitor (that has a 5m delay before
  sending alerts)

Aside from the 06:46:04 connection attempt that didn't work, this seems
like a fairly normal pattern for loss of connectivity followed by it
working again.

So I wonder if there is some sort of race condition with the lwt from
the old connection and the new connection.  Having gone through the
above, it seems obvious that when a client connects with a client-id,
and either does or doesn't set a lwt as part of that connection, then
any lwt that is pending as part of that client-id from a previous
connection should be discarded.   The point is that an old lwt firing
that sets online=OFF *after* the new connection has set ON is bad.

At the same time, another host had similar issues.  This one uses
mosquitto_pub instead of nodemcu/lua.

  Mon Oct 21 06:47:29 EDT 2019 on baz, mosquitto_pub exited, and thus
    should have been restarted 5s later.  probably it got an error on
    writing the keepalive.
Oct 21 06:47:31 s1 mosquitto[854]: Client host.baz has exceeded timeout, disconnecting.
    probably the keepalive did not arrive on time.
Oct 21 06:48:17 s1 mosquitto[854]: New connection from [baz's.numeric.ipv6.address] on port 8883.
    probably this started at 47:34 and this is just TCP timeouts and
    when a SYN/SYN-ACK/ACK packet trio made it 
Oct 21 06:48:18 s1 mosquitto[854]: New client connected from [baz's.numeric.ipv6.address] as host.baz (p2, c1, k60, u'iot.bazprime').
  Mon Oct 21 06:50:24 EDT 2019 alert: baz DISCONNECTED
  Mon Oct 21 07:24:06 EDT 2019 alert: baz reconnected

In this case, mosquitto_pub has --repeat-delay 86400.  I completely do
not understand the reconnect here; mosquitto_pub did not exit, according
to my logs.

I did not check the broker state for baz; it came back before I got to
that.


I can see about turning on more debugging and see if this happens again,
but I thought I'd describe the symptoms.

Thanks for listening!




Back to the top