I did not get a packet capture, as I don't currently have SSH access to that instance, but I am fairly certain the problem is on the broker side. First, I connected directly to the instance rather than through the load balancer, and the problem persisted. Second, my other client application also saw the issue. Here is a breakdown of what happened:
mosquitto and my client start up normally
the client connects successfully, subscribes to a few topics, and then sends a batch of publishes (a batch is about 20 publishes, each to a different topic, with an 11-byte payload each)
mosquitto sees the first batch of publishes from my client
mosquitto sees the second batch of publishes from the client 5 minutes later (batches are sent every 5 minutes)
mosquitto does NOT see MOST of the third batch of publishes from mqttd at the 10-minute mark. The example mosquitto logs below show only 2 of the roughly 20 messages arriving:
05:38:51  1491370731: Received PUBLISH from zenreach (d0, q0, r0, m0, '2.1/LOC/SMMug3jKfsoLKlXy/LS/PAQ', ... (11 bytes))
05:38:51  1491370731: Received PUBLISH from zenreach (d0, q0, r0, m0, '2.1/LOC/SMMug3jKfsoLKlXy/LS/PAQ', ... (11 bytes))
during the third batch, my client starts printing timeout errors on every publish (I'm using WaitTimeout() with a 5-second timeout, as shown in my original post)
about 15 minutes after the errors started, mosquitto disconnects the client for exceeding its timeout. This makes sense: the keepalive is set to 10 minutes, and an MQTT broker disconnects a client that sends no control packets for one and a half times the keepalive (15 minutes here). Since mosquitto isn't receiving any publishes (or even pings), it should disconnect the client. Below is the log from mosquitto:
05:53:51  1491371631: Client zenreach has exceeded timeout, disconnecting.
05:53:51  1491371631: Socket error on client zenreach, disconnecting.
The question here is: why do all publishes (including pings) suddenly stop working on the third batch?