Community
Participate
Working Groups
Created attachment 233137 [details] Lua code to reproduce the issue agent.srvcon : doing successive srvcon operations leads to incomplete M3DA data sent to server While executing racon unittest suite, some server connections end with "connection closed". In m3da-server logs we see: "m3da.codec.DecoderException: this is not a map OPCODE : 97" (More server logs in attachement). I have found a way to reproduce the issue every time: -a specific succession of calls to Lua user API connectToServer, pushData, triggerPolicy make this appear - see reproduce_connection_close.txt in attachement. I don't think we should focus on "OPCODE 97" ., it's more likely an incomplete message (see below). I've checked EMP data coming from asset, it's seems ok to me. Looking into the function used to reproduce the problem : I have remark - need to call internal_code function 2 times quickly to reproduce the problem - when replacing connectToServer(0) by connectToServer(1) or connectToServer(nil) seems to avoid the issue. Those 2 remarks make me think it may be a concurrency issue. Now looking at message sent to the server, especially those containing the asset data (see m3da_msg1.txt and m3da_msg2.txt): As the same variable is pushed, I would have expected the M3DA payload to be identical, there are not: in m3da_msg2, I think the M3DA msg is incomplete: the body is missing! I was running locally m3da server from https://github.com/SierraWireless/m3da-server, rev 51a7521967e5a4fdf6d0e407f190145f3a6159e3. I've also attached the full tcpdump.
Created attachment 233138 [details] Correct M3DA message
Created attachment 233139 [details] Incomplete M3DA message
Created attachment 233140 [details] Server logs
Created attachment 233141 [details] Tcp dump
Created attachment 233435 [details] simpler code sample reproducing the issue It's a race condition between agent.srvcon.connectToServer() and the racon.triggerPolicy() operations. The first connectToServer will be acknowledged while the first triggerPolicy runs, and will purge an SDB table without properly dereferencing it from its policy. Given that connectToServer() isn't an official API (only accessible from within the agent's process), and that I can't reproduce this without calling connectToServer calls in the middle of normal EMP-driven commands, is it really a bug?
Fixed: the list of database-cleaning callbacks, to be executed when srvcon receives the server's "200 OK", was the wrong one. Instead of taking the callbacks present when data were sent, it was taking the callbacks present when the OK was received, thus incorrectly cleaning the data buffered accumulated between the request emission and the response reception.