Debugging data corruption issue in a distributed system written for TinyOS

TinyOS is operating system that is used for developing applications for distributed wireless sensor networks. That is what academics used it for. However in these days it is more popular to call it operating system for Internet of things (IoT). TinyOS abstractions are quite nice to use if everything works as expected, however if you have to check under the hood hell breaks loose.

The problem #

The problem with my application is that occasionally it receives a packet where payload length is claimed to be 248 bytes long. However application TOSH_DATA_LENGTH is just 86 bytes and Atmega128RFA1 radio buffer is 127 bytes long. Therefore it is impossible that this kind of packet can be received from air.
Fortunately there were asserts in the module where problematic packet reception was caught.

event message_t* Radio.receive[uint8_t id](message_t* msg, void * payload, uint8_t len) {
    ASSERT(msg != NULL);
    ASSERT(payload != NULL);
    ASSERT_GREATER_EQUAL(call Packet.maxPayloadLength(), len);
    memmove(call Packet.getPayload(msg, len), payload, len);

Notice that there is also memmove involved which will do unexpected damage in embedded system if not used with care. Actually here it is used recklessly because there is no checks if len is within bounds. In this case it is not. It is expected that radio layers do not give garbage to upper application layers, therefore developers often do not check explicitly if len is within bounds, especially if for example payload or whole message_t is copied to another message_t location.
If this last line is executed and len is incorrect, system crashes every time. Fortunately the last log line that I saw showed the assertion.

My first hypothesis:

Someone overwrites message_t length byte.

With next round some more logic was added to get more information about this faulty packet. Its length was 248 every single time if error happened. errm is special debug function that prints whole message_t structure to serial.

ASSERT_GREATER_EQUAL(call Packet.maxPayloadLength(), len);
if(len == 248) {
    errm("msg", msg);
}

2014-01-28 14:43:23.90 : 'E| App:  89|ASSERT [[86]] call Packet.maxPayloadLength() not >= len [[248]]'
2014-01-28 14:43:23.92 : 'E| App:  94|0588 61130000 8FE2121F 3F710301 00281F12 AF830304 BD00BB48 3146DA60 18A4C500 18A31400 628C5020 C5262800 C52E2801 31498A00 31494C05 C5262900 525300A4 A0414500 1494C028 A402514C 028A0028 A401494C 02928105 14005140 09453012'

What caught immediately my eye was first byte in the message. It was 0x05 meaning that received packet should be only 5 bytes long, not 248 as advertised by receive event. So I took another look into radio headers.

typedef nx_struct rfa1_header_t {
    nxle_uint8_t length;
} rfa1_header_t;

typedef nx_struct ieee154_simple_header_t {
    nxle_uint16_t fcf;
    nxle_uint8_t dsn;
    nxle_uint16_t destpan;
    nxle_uint16_t dest;
    nxle_uint16_t src;
} ieee154_simple_header_t;

typedef nx_struct rfa1packet_header_t {
    rfa1_header_t rfa1;
    ieee154_simple_header_t ieee154;
#ifndef TFRAMES_ENABLED
    network_header_t network;
#endif
#ifndef IEEE154FRAMES_ENABLED
    activemessage_header_t am;
#endif
} rfa1packet_header_t;

Indeed length is first byte and all other meaningful bytes like source and destination address were what I expected them to be. It looked like perfectly legal packet except its first byte was corrupted. This finding made my hypothesis stronger. It seemed like of by one error. Somewhere in application layer must be bug that involves message_t coping or there must be something bad happening with pointer arithmetics. But how to find out where it is happening. It is not a trivial application. It has multihop layers that are not yet ready for prime time and other complex layers that are dealing with radio packets. It would be nice to see if there are some other buffers near faulty packet.

Object dump #

avr-objdump -t main.exe > dump.txt

Now I was able examine all the symbols and their address. It turned out that faulty message address was from PoolP__4__pool.

008008f4 g     O .bss   00000001 SendReceiveP__m_tx_busy
0080090c g     O .bss   00000001 Atm128UartHdlcPacketP__0__m_rx_receiving
00800fbf g     O .bss   00000504 PoolP__4__pool
008014c3 g     O .bss   00000002 MoteSearchP__searchAddress
008008af g     O .bss   00000001 TraceRouteM__m_sequential_send_error_count
00800233 g     O .data  00000008 AMAddrFilterM__filtered
008014c5 g     O .bss   0000000d HalChipconControlP__status
008008fb g     O .bss   00000001 HalChipconControlP__bytesToSend

But which component uses this pool? Quick lookup from app.crevealed that it belonged to:

/*SourceRoutingC.Engine.MessagePool.PoolP*/PoolP__4__pool_t

However there was nothing special in SourceRoutingC. Moreover faulty message_t was allocated from PoolC and it was one of the middle elements, so I could not do the padding trick, because those messages were allocated as continuous memory.

Padding trick #

If it has been just simple allocation somewhere like this:

message_t msg; 

I would have done padding trick:

 uint8_t paddingdata1[10];
 message_t msg; 
 uint8_t paddingdata2[10];
 memset(paddingdata1, 0xAA, 10);
 memset(paddingdata2, 0xAA, 10);

Padding trick might have revealed that indeed someone has written out of bounds if those special 0xAA symbols were also corrupted in our arrays. Unfortunately I could not do that.

Next try #

TinyOS buffer swaps receive buffer with application and radio layer. What if it is possible to put some markers into every message_t that is swapped with radio? Maybe this reveals something. I know a bit about TinyOS rfxlayers. So it was quite obvious choice to check if something can be done in MessageBufferLayerP. memset was added after msg = signal Receive.receive(msg).

atomic
{
    if( receiveQueueSize == 0 )
        return;

    msg = receiveQueue[receiveQueueHead];
}

msg = signal Receive.receive(msg);
memset(msg, 0x11, sizeof(message_t));

Idea was to log all received packets in application layer to see if something interesting shows up.
Application purpose was to relay packets. After memset trick packet sniffer showed something strange. Correct packet was received (length 248 error did not happen all the time) however scrambled packet was sent out. If memset was not compiled in, forwarded packet was also correct. This finding lead to second hypothesis:

Someone must hold pointer for message that has been given back to radio layer.

It made a lot of sense. Packet sniffer showed that last 4 payload bytes were corrupted. How this can only happen if memset trick is compiled in? Some application layer must use this pointer and modify its contents. More strangely those last bytes changed like a clock work. Here they are copied to iPython for analysis.

timestamps.png

Division with 20 s is made because those packets were coming with 20 s interval. The result 62205 was something similar to 62.5 kHz which is Atmega128RFA1 MAC counter frequency. TinyOS timers are based on that counter. It was no time to think if this even made sense. Last 40 s interval calculation revealed twice the number. Definitely those last bytes were some sort of timestamps but which layers put them there?

Rush #

After initial adrenaline rush it was time to log also outgoing message_t contents. At first it did not seemed to have any difference except 2 bytes in metadata were changed. Packet timestamp also resides in message_t metadata fields, so this must be related to the situation. Here is metadata struct:

typedef struct rfa1packet_metadata_t
{
#ifdef LOW_POWER_LISTENING
    lpl_metadata_t lpl;
#endif
#ifdef PACKET_LINK
    link_metadata_t link;
#endif
    timestamp_metadata_t timestamp;
    flags_metadata_t flags;
    rfa1_metadata_t rfa1;
} rfa1packet_metadata_t;

Snap, those 2 bytes were lpl bytes that are modified if LPL is enabled. They were not related to timestamp. Indeed application layer gave right packet to radio layers. It had the same contents as received packet except lpl field which is okay to differ. This finding led to the next hypothesis:

Packet is corrupted somewhere in rfxlink layers.

Packet logging was added to TinyosNetworkLayerP and SoftwareAckLayerP. TimeStampingLayerC would have been good canditate but it did not have send interface. After testing - nothing - packet was still correct inside device but corrupted in the air. At least that was what sniffer reported. Binary search. I added logging to the very last layer RFA1DriverLayerP:

tasklet_async command error_t RadioSend.send(message_t* msg)
{
    uint32_t time;
    uint8_t length;
    uint8_t* data;
    void* timesync;

    errm2("", msg);
    ....

To my surprise it was still correct. It was 21:00 already and there aren’t many ideas left. I started debugging at 16:00 and mean time my colleague joined the session. We were staring this function and felt completely numb. Probably comment //fix the time stamp would have been very alarming if we had been reached this point 3 hours ago.

// do something useful, just to wait a little
timesync = call PacketTimeSyncOffset.isSet(msg) ? ((void*)msg) + call PacketTimeSyncOffset.get(msg) : 0;

...
// fix the time stamp
if( timesync != 0 )
  *(timesync_relative_t*)timesync = (*(timesync_absolute_t*)timesync) - time;

// then upload the whole payload
memcpy((void*)(&TRXFBST+1), data, length);

There we were. It was not yet in the middle of the night, but this hunt had been lasted already way too long. Looking this code revealed a ugly case where indeed payload was corrupted.

async command bool PacketTimeSyncOffset.isSet(message_t* msg) {
    return call TimeSyncFlag.get(msg);
}

async command uint8_t PacketTimeSyncOffset.get(message_t* msg) {
    return call RadioPacket.headerLength(msg) + call RadioPacket.payloadLength(msg) - sizeof(timesync_absolute_t);
}

If for some reason TimeSyncFlag was set in metadata field, timestamp was written to payload right before sending packet out. It overwrites last payload bytes, not sure if this is bug or feature. The question is why was this flag set only if we did the memset trick. 0x11 was very unlucky number that set this flag in message_t metadata. That is why it only happened with memset trick.

Conclusion #

Initial length 248 bug led to another bug that seemed to be more serious. However, actually this monster was created from thin air.
My colleague gave me a paper about system engineers after this night: http://research.microsoft.com/en-us/people/mickens/thenightwatch.pdf

NB: there have been some progress with length 248 bug, I hope to describe the solution in next post.

Update 05.02.14 #

Here is the fix I promised

 
5
Kudos
 
5
Kudos

Now read this

Convert subprocess stdout stream into non-blocking iterator in Rust

In one of my programs I had to interact with another subprocess. This subprocess took data from stdin and wrote result to stdout. It wasn’t just simple reading and writing - it took constant data stream from stdin and somewhere in the... Continue →