Bug 412426 - agent.srvcon : doing successive srvcon operations leads to incomplete M3DA data sent to server
Summary: agent.srvcon : doing successive srvcon operations leads to incomplete M3DA da...
Status: RESOLVED FIXED
Alias: None
Product: Mihini
Classification: IoT
Component: Agent (show other bugs)
Version: 0.9   Edit
Hardware: PC Linux
: P1 major (vote)
Target Milestone: 0.9 M1   Edit
Assignee: Project Inbox CLA
QA Contact: Robert Jacolin CLA
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-07-05 13:46 EDT by Laurent Barthelemy CLA
Modified: 2013-08-02 04:50 EDT (History)
1 user (show)

See Also:


Attachments
Lua code to reproduce the issue (633 bytes, text/plain)
2013-07-05 13:46 EDT, Laurent Barthelemy CLA
no flags Details
Correct M3DA message (300 bytes, text/plain)
2013-07-05 13:47 EDT, Laurent Barthelemy CLA
no flags Details
Incomplete M3DA message (225 bytes, text/plain)
2013-07-05 13:47 EDT, Laurent Barthelemy CLA
no flags Details
Server logs (34.51 KB, text/plain)
2013-07-05 13:48 EDT, Laurent Barthelemy CLA
no flags Details
Tcp dump (17.93 KB, application/octet-stream)
2013-07-05 13:48 EDT, Laurent Barthelemy CLA
no flags Details
simpler code sample reproducing the issue (539 bytes, text/plain)
2013-07-12 14:00 EDT, Fabien Fleutot CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Laurent Barthelemy CLA 2013-07-05 13:46:43 EDT
Created attachment 233137 [details]
Lua code to reproduce the issue

agent.srvcon : doing successive srvcon operations leads to incomplete M3DA data sent to server


While executing racon unittest suite, some server connections end with "connection closed".
In m3da-server logs we see: "m3da.codec.DecoderException: this is not a map OPCODE : 97" (More server logs in attachement).

I have found a way to reproduce the issue every time: 
 -a specific succession of calls to Lua user API connectToServer, pushData, triggerPolicy make this appear
 - see reproduce_connection_close.txt in attachement.
 
I don't think we should focus on "OPCODE 97" ., it's more likely an incomplete message (see below).

I've checked EMP data coming from asset, it's seems ok to me.

Looking into the function used to reproduce the problem : I have remark
 - need to call internal_code function 2 times quickly to reproduce the problem
 - when replacing connectToServer(0) by connectToServer(1) or connectToServer(nil) seems to avoid the issue.
Those 2 remarks make me think it may be a concurrency issue.

Now looking at message sent to the server, especially those containing the asset data
(see m3da_msg1.txt and m3da_msg2.txt): As the same variable is pushed, I would have expected the M3DA payload to be identical, there are not:
in m3da_msg2, I think the M3DA msg is incomplete: the body is missing!

I was running locally m3da server from https://github.com/SierraWireless/m3da-server, rev 51a7521967e5a4fdf6d0e407f190145f3a6159e3.

I've also attached the full tcpdump.
Comment 1 Laurent Barthelemy CLA 2013-07-05 13:47:34 EDT
Created attachment 233138 [details]
Correct M3DA message
Comment 2 Laurent Barthelemy CLA 2013-07-05 13:47:52 EDT
Created attachment 233139 [details]
Incomplete M3DA message
Comment 3 Laurent Barthelemy CLA 2013-07-05 13:48:12 EDT
Created attachment 233140 [details]
Server logs
Comment 4 Laurent Barthelemy CLA 2013-07-05 13:48:31 EDT
Created attachment 233141 [details]
Tcp dump
Comment 5 Fabien Fleutot CLA 2013-07-12 14:00:29 EDT
Created attachment 233435 [details]
simpler code sample reproducing the issue

It's a race condition between agent.srvcon.connectToServer() and the racon.triggerPolicy() operations. The first connectToServer will be acknowledged while the first triggerPolicy runs, and will purge an SDB table without properly dereferencing it from its policy.

Given that connectToServer() isn't an official API (only accessible from within the agent's process), and that I can't reproduce this without calling connectToServer calls in the middle of normal EMP-driven commands, is it really a bug?
Comment 6 Fabien Fleutot CLA 2013-07-16 06:22:08 EDT
Fixed: the list of database-cleaning callbacks, to be executed when srvcon receives the server's "200 OK", was the wrong one. Instead of taking the callbacks present when data were sent, it was taking the callbacks present when the OK was received, thus incorrectly cleaning the data buffered accumulated between the request emission and the response reception.