Debugging data corruption issue in a distributed system written for TinyOS
TinyOS is operating system that is used for developing applications for distributed wireless sensor networks. That is what academics used it for. However in these days it is more popular to call it operating system for Internet of things (IoT). TinyOS abstractions are quite nice to use if everything works as expected, however if you have to check under the hood hell breaks loose.
The problem #
The problem with my application is that occasionally it receives a packet where payload length is claimed to be 248 bytes long. However application TOSH_DATA_LENGTH
is just 86 bytes and Atmega128RFA1 radio buffer is 127 bytes long. Therefore it is impossible that this kind of packet can be received from air.
Fortunately there were asserts in the module where problematic packet reception was caught.
event message_t* Radio.receive[uint8_t id](message_t* msg, void * payload, uint8_t len) {
ASSERT(msg != NULL);
ASSERT(payload != NULL);
ASSERT_GREATER_EQUAL(call Packet.maxPayloadLength(), len);
memmove(call Packet.getPayload(msg, len), payload, len);
Notice that there is also memmove
involved which will do unexpected damage in embedded system if not used with care. Actually here it is used recklessly because there is no checks if len
is within bounds. In this case it is not. It is expected that radio layers do not give garbage to upper application layers, therefore developers often do not check explicitly if len
is within bounds, especially if for example payload or whole message_t
is copied to another message_t
location.
If this last line is executed and len
is incorrect, system crashes every time. Fortunately the last log line that I saw showed the assertion.
My first hypothesis:
Someone overwrites
message_t
length byte.
With next round some more logic was added to get more information about this faulty packet. Its length was 248 every single time if error happened. errm
is special debug function that prints whole message_t
structure to serial.
ASSERT_GREATER_EQUAL(call Packet.maxPayloadLength(), len);
if(len == 248) {
errm("msg", msg);
}
2014-01-28 14:43:23.90 : 'E| App: 89|ASSERT [[86]] call Packet.maxPayloadLength() not >= len [[248]]'
2014-01-28 14:43:23.92 : 'E| App: 94|0588 61130000 8FE2121F 3F710301 00281F12 AF830304 BD00BB48 3146DA60 18A4C500 18A31400 628C5020 C5262800 C52E2801 31498A00 31494C05 C5262900 525300A4 A0414500 1494C028 A402514C 028A0028 A401494C 02928105 14005140 09453012'
What caught immediately my eye was first byte in the message. It was 0x05
meaning that received packet should be only 5 bytes long, not 248 as advertised by receive event. So I took another look into radio headers.
typedef nx_struct rfa1_header_t {
nxle_uint8_t length;
} rfa1_header_t;
typedef nx_struct ieee154_simple_header_t {
nxle_uint16_t fcf;
nxle_uint8_t dsn;
nxle_uint16_t destpan;
nxle_uint16_t dest;
nxle_uint16_t src;
} ieee154_simple_header_t;
typedef nx_struct rfa1packet_header_t {
rfa1_header_t rfa1;
ieee154_simple_header_t ieee154;
#ifndef TFRAMES_ENABLED
network_header_t network;
#endif
#ifndef IEEE154FRAMES_ENABLED
activemessage_header_t am;
#endif
} rfa1packet_header_t;
Indeed length is first byte and all other meaningful bytes like source and destination address were what I expected them to be. It looked like perfectly legal packet except its first byte was corrupted. This finding made my hypothesis stronger. It seemed like of by one error. Somewhere in application layer must be bug that involves message_t
coping or there must be something bad happening with pointer arithmetics. But how to find out where it is happening. It is not a trivial application. It has multihop layers that are not yet ready for prime time and other complex layers that are dealing with radio packets. It would be nice to see if there are some other buffers near faulty packet.
Object dump #
avr-objdump -t main.exe > dump.txt
Now I was able examine all the symbols and their address. It turned out that faulty message address was from PoolP__4__pool
.
008008f4 g O .bss 00000001 SendReceiveP__m_tx_busy
0080090c g O .bss 00000001 Atm128UartHdlcPacketP__0__m_rx_receiving
00800fbf g O .bss 00000504 PoolP__4__pool
008014c3 g O .bss 00000002 MoteSearchP__searchAddress
008008af g O .bss 00000001 TraceRouteM__m_sequential_send_error_count
00800233 g O .data 00000008 AMAddrFilterM__filtered
008014c5 g O .bss 0000000d HalChipconControlP__status
008008fb g O .bss 00000001 HalChipconControlP__bytesToSend
But which component uses this pool? Quick lookup from app.c
revealed that it belonged to:
/*SourceRoutingC.Engine.MessagePool.PoolP*/PoolP__4__pool_t
However there was nothing special in SourceRoutingC
. Moreover faulty message_t
was allocated from PoolC and it was one of the middle elements, so I could not do the padding trick, because those messages were allocated as continuous memory.
Padding trick #
If it has been just simple allocation somewhere like this:
message_t msg;
I would have done padding trick:
uint8_t paddingdata1[10];
message_t msg;
uint8_t paddingdata2[10];
memset(paddingdata1, 0xAA, 10);
memset(paddingdata2, 0xAA, 10);
Padding trick might have revealed that indeed someone has written out of bounds if those special 0xAA
symbols were also corrupted in our arrays. Unfortunately I could not do that.
Next try #
TinyOS buffer swaps receive buffer with application and radio layer. What if it is possible to put some markers into every message_t
that is swapped with radio? Maybe this reveals something. I know a bit about TinyOS rfxlayers. So it was quite obvious choice to check if something can be done in MessageBufferLayerP
. memset
was added after msg = signal Receive.receive(msg)
.
atomic
{
if( receiveQueueSize == 0 )
return;
msg = receiveQueue[receiveQueueHead];
}
msg = signal Receive.receive(msg);
memset(msg, 0x11, sizeof(message_t));
Idea was to log all received packets in application layer to see if something interesting shows up.
Application purpose was to relay packets. After memset
trick packet sniffer showed something strange. Correct packet was received (length 248 error did not happen all the time) however scrambled packet was sent out. If memset
was not compiled in, forwarded packet was also correct. This finding lead to second hypothesis:
Someone must hold pointer for message that has been given back to radio layer.
It made a lot of sense. Packet sniffer showed that last 4 payload bytes were corrupted. How this can only happen if memset
trick is compiled in? Some application layer must use this pointer and modify its contents. More strangely those last bytes changed like a clock work. Here they are copied to iPython for analysis.
Division with 20 s is made because those packets were coming with 20 s interval. The result 62205 was something similar to 62.5 kHz which is Atmega128RFA1 MAC counter frequency. TinyOS timers are based on that counter. It was no time to think if this even made sense. Last 40 s interval calculation revealed twice the number. Definitely those last bytes were some sort of timestamps but which layers put them there?
Rush #
After initial adrenaline rush it was time to log also outgoing message_t
contents. At first it did not seemed to have any difference except 2 bytes in metadata were changed. Packet timestamp also resides in message_t
metadata fields, so this must be related to the situation. Here is metadata struct:
typedef struct rfa1packet_metadata_t
{
#ifdef LOW_POWER_LISTENING
lpl_metadata_t lpl;
#endif
#ifdef PACKET_LINK
link_metadata_t link;
#endif
timestamp_metadata_t timestamp;
flags_metadata_t flags;
rfa1_metadata_t rfa1;
} rfa1packet_metadata_t;
Snap, those 2 bytes were lpl
bytes that are modified if LPL is enabled. They were not related to timestamp. Indeed application layer gave right packet to radio layers. It had the same contents as received packet except lpl
field which is okay to differ. This finding led to the next hypothesis:
Packet is corrupted somewhere in rfxlink layers.
Packet logging was added to TinyosNetworkLayerP and SoftwareAckLayerP. TimeStampingLayerC would have been good canditate but it did not have send interface. After testing - nothing - packet was still correct inside device but corrupted in the air. At least that was what sniffer reported. Binary search. I added logging to the very last layer RFA1DriverLayerP:
tasklet_async command error_t RadioSend.send(message_t* msg)
{
uint32_t time;
uint8_t length;
uint8_t* data;
void* timesync;
errm2("", msg);
....
To my surprise it was still correct. It was 21:00 already and there aren’t many ideas left. I started debugging at 16:00 and mean time my colleague joined the session. We were staring this function and felt completely numb. Probably comment //fix the time stamp
would have been very alarming if we had been reached this point 3 hours ago.
// do something useful, just to wait a little
timesync = call PacketTimeSyncOffset.isSet(msg) ? ((void*)msg) + call PacketTimeSyncOffset.get(msg) : 0;
...
// fix the time stamp
if( timesync != 0 )
*(timesync_relative_t*)timesync = (*(timesync_absolute_t*)timesync) - time;
// then upload the whole payload
memcpy((void*)(&TRXFBST+1), data, length);
There we were. It was not yet in the middle of the night, but this hunt had been lasted already way too long. Looking this code revealed a ugly case where indeed payload was corrupted.
async command bool PacketTimeSyncOffset.isSet(message_t* msg) {
return call TimeSyncFlag.get(msg);
}
async command uint8_t PacketTimeSyncOffset.get(message_t* msg) {
return call RadioPacket.headerLength(msg) + call RadioPacket.payloadLength(msg) - sizeof(timesync_absolute_t);
}
If for some reason TimeSyncFlag was set in metadata field, timestamp was written to payload right before sending packet out. It overwrites last payload bytes, not sure if this is bug or feature. The question is why was this flag set only if we did the memset
trick. 0x11 was very unlucky number that set this flag in message_t
metadata. That is why it only happened with memset
trick.
Conclusion #
Initial length 248 bug led to another bug that seemed to be more serious. However, actually this monster was created from thin air.
My colleague gave me a paper about system engineers after this night: http://research.microsoft.com/en-us/people/mickens/thenightwatch.pdf
NB: there have been some progress with length 248 bug, I hope to describe the solution in next post.