Portable Unaligned Memory Access for ARM Cortex

One of the more interesting issues I recently had to solve came up while I was porting a piece of software, which normally runs on either a PC or an ARM Cortex M4 processor, to an ARM Cortex M0 processor. On the M0 the code unexpectedly ended in a hard fault, even though it worked fine on the other platforms and is covered by unit tests.

After inspecting the call stack following the hard fault, the culprit appeared to be a routine that looked roughly as follows:

void encode(
    std::uint8_t * destination,
    const std::uint8_t * source,
    std::size_t length)
{
    const std::uint8_t * end = source + length;

    while (source < end)
    {
        *reinterpret_cast<std::uint16_t *>(destination) =
                encoding_table[*source];
        destination += sizeof(std::uint16_t);
        source++;
    }
}

I did have a vague idea that the M0 architecture cannot perform unaligned memory accesses, and after a minute of debugging I discovered that the routine above was indeed being called with the destination parameter pointing to an unaligned memory location and performing a halfword (16-bit) store to that address. On most platforms this is not an issue apart from a slight performance penalty (but we are not here to micro-optimize the code); the ARM Cortex M0, however, cannot perform this type of operation at all and raises a hard fault instead.

The problem stems from the fact that the GCC compiler assumes every pointer is correctly aligned for the type it points to. This is normally a reasonable assumption, and one the compiler needs in order to generate efficient code, but in this particular case it causes the issue described above.
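To make the alignment assumption concrete, here is a minimal sketch of how a pointer could be checked before it is used for a 16-bit store. The helper names is_aligned and assert_aligned_for_u16 are hypothetical and do not come from the original code, and the check assumes the usual flat conversion from pointers to integers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the original code): reports whether
// `pointer` is suitably aligned for an object of type T.
// Assumes the usual flat mapping from pointers to integers.
template <typename T>
bool is_aligned(const void * pointer)
{
    return reinterpret_cast<std::uintptr_t>(pointer) % alignof(T) == 0;
}

// Debug-time guard: fails on the host build as soon as a caller passes a
// pointer that a plain 16-bit store could not legally use.
inline void assert_aligned_for_u16(const void * pointer)
{
    assert(is_aligned<std::uint16_t>(pointer));
    (void)pointer; // avoids an unused-parameter warning when NDEBUG is defined
}
```

A guard like this only detects the problem earlier, for example in the PC build; it does not make the unaligned store itself valid.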

The naive solution to this problem would be to provide a separate implementation of the code for the ARM Cortex M0 processor; however, I did not want to fragment the code because of maintainability concerns.

The recommended fix for the code above is to use std::memcpy(), as follows:

void encode(
    std::uint8_t * destination,
    const std::uint8_t * source,
    std::size_t length)
{
    const std::uint8_t * end = source + length;

    while (source < end)
    {
        std::memcpy(destination, encoding_table + (*source), sizeof(std::uint16_t));
        destination += sizeof(std::uint16_t);
        source++;
    }
}

While this works well on ARM Cortex M4, on M0 the compiler generates an actual call to the std::memcpy() function in every iteration of the loop. This did not sit well with me, since this code is quite speed-critical.

After some research I came up with a solution that seemed satisfactory to me. The idea is to tell the compiler explicitly that it may be accessing unaligned memory locations and let it generate the appropriate code for the given architecture. With GCC this can be achieved as follows:

void encode(
    std::uint8_t * destination,
    const std::uint8_t * source,
    std::size_t length)
{
    const std::uint8_t * end = source + length;

    struct packed_uint16_t
    {
        std::uint16_t value;
    } __attribute__((packed)) __attribute__((__may_alias__));

    while (source < end)
    {
        reinterpret_cast<packed_uint16_t *>(destination)->value =
                encoding_table[*source];
        destination += sizeof(std::uint16_t);
        source++;
    }
}

Since the notion of packed structures exists in many compilers, the above code can be made even more portable with some preprocessor magic. This code also addresses the aliasing issues that may arise from the type punning in the original code, while avoiding unnecessary calls to std::memcpy() on ARM Cortex M0.
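The preprocessor magic mentioned above could look roughly like the following sketch. The macro names (UNALIGNED_STRUCT_BEGIN, UNALIGNED_STRUCT_ATTRIBUTES, UNALIGNED_STRUCT_END) and the helper functions are hypothetical and only illustrate the idea. The MSVC branch relies on its pack pragma and on the assumption that no may_alias equivalent is needed there; I have not verified that branch on a real target.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical portability macros; the names are illustrative only.
#if defined(__GNUC__) || defined(__clang__)
    // GCC and Clang: the same attributes as in the routine above.
    #define UNALIGNED_STRUCT_BEGIN
    #define UNALIGNED_STRUCT_ATTRIBUTES \
        __attribute__((packed)) __attribute__((__may_alias__))
    #define UNALIGNED_STRUCT_END
#elif defined(_MSC_VER)
    // MSVC: packing is controlled by a pragma around the declaration.
    #define UNALIGNED_STRUCT_BEGIN __pragma(pack(push, 1))
    #define UNALIGNED_STRUCT_ATTRIBUTES
    #define UNALIGNED_STRUCT_END __pragma(pack(pop))
#else
    #error "No packed-structure support known for this compiler"
#endif

UNALIGNED_STRUCT_BEGIN
struct unaligned_uint16_t
{
    std::uint16_t value;
} UNALIGNED_STRUCT_ATTRIBUTES;
UNALIGNED_STRUCT_END

// The structure must impose no alignment requirement of its own.
static_assert(alignof(unaligned_uint16_t) == 1, "structure is not packed");
static_assert(sizeof(unaligned_uint16_t) == 2, "unexpected padding");

// Store a 16-bit value to an address of any alignment.
inline void store_u16_unaligned(std::uint8_t * destination, std::uint16_t value)
{
    reinterpret_cast<unaligned_uint16_t *>(destination)->value = value;
}

// Load a 16-bit value from an address of any alignment.
inline std::uint16_t load_u16_unaligned(const std::uint8_t * source)
{
    return reinterpret_cast<const unaligned_uint16_t *>(source)->value;
}

// Usage example: write at an offset of one byte and read the value back.
inline void unaligned_store_example()
{
    std::uint8_t buffer[4] = {};
    store_u16_unaligned(buffer + 1, 0xBEEF);
    assert(load_u16_unaligned(buffer + 1) == 0xBEEF);
}
```

With helpers like these, the body of the loop in encode() would reduce to a single store_u16_unaligned() call, and the compiler-specific part stays in one place.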

The solution worked fine on the ARM Cortex M0; however, I still had to investigate whether the modification affected other platforms in some way. I did this by disassembling the code generated for the ARM Cortex M4 (with at least the debug -Og optimization level turned on). Below is the code generated from the original version of the function.

<encode(unsigned char*, unsigned char const*, unsigned int)>:
  add     r2, r1
  cmp     r1, r2
  bcs.n   exit
  push    {r4}
loop:
  ldrb.w  r4, [r1], #1
  ldr     r3, [pc, #16]
  ldrh.w  r3, [r3, r4, lsl #1]
  strh.w  r3, [r0], #2
  cmp     r1, r2
  bcc.n   loop
  pop     {r4}
exit:
  bx      lr

Next is the disassembly of the code generated for the ARM Cortex M4 from the modified version of the function.

<encode(unsigned char*, unsigned char const*, unsigned int)>:
  add     r2, r1
  cmp     r1, r2
  bcs.n   exit
  push    {r4}
loop:
  ldrb.w  r4, [r1], #1
  ldr     r3, [pc, #16]
  ldrh.w  r3, [r3, r4, lsl #1]
  strh.w  r3, [r0], #2
  cmp     r1, r2
  bcc.n   loop
  pop     {r4}
exit:
  bx      lr

As keen-eyed readers may have noticed, the instructions generated by my compiler are identical, even though the source code differs. I found this result very satisfying. When the modified C++ code (the version with the packed structure) is compiled for the ARM Cortex M0, however, the generated code correctly handles the possibility of working with unaligned memory, as shown below.

<encode(unsigned char*, unsigned char const*, unsigned int)>:
  push    {r4, lr}
  adds    r2, r1, r2
loop:
  cmp     r1, r2
  bcs.n   exit
  ldrb    r3, [r1, #0]
  lsls    r3, r3, #1
  ldr     r4, [pc, #16]
  ldrh    r3, [r3, r4]
  strb    r3, [r0, #0]  ; Store lower-byte
  lsrs    r3, r3, #8
  strb    r3, [r0, #1]  ; Store upper-byte
  adds    r0, #2
  adds    r1, #1
  b.n     loop
exit:
  pop     {r4, pc}
  nop
  .word   0x00000000
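Besides reading the disassembly, the behavior can also be checked on the host. The following is a minimal sketch of such a check for GCC or Clang, comparing the packed-structure version against a std::memcpy() reference on a deliberately misaligned destination. The table contents and the names encode_packed, encode_reference and encoders_agree_on_odd_destination are hypothetical, since the real table and tests are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the real table, which is not shown in the text.
static std::uint16_t encoding_table[256];

struct packed_uint16_t
{
    std::uint16_t value;
} __attribute__((packed)) __attribute__((__may_alias__));

// Version under test: store through the packed, may_alias structure.
void encode_packed(
    std::uint8_t * destination,
    const std::uint8_t * source,
    std::size_t length)
{
    for (const std::uint8_t * end = source + length; source < end; source++)
    {
        reinterpret_cast<packed_uint16_t *>(destination)->value =
                encoding_table[*source];
        destination += sizeof(std::uint16_t);
    }
}

// Reference version: std::memcpy() is valid for any alignment.
void encode_reference(
    std::uint8_t * destination,
    const std::uint8_t * source,
    std::size_t length)
{
    for (const std::uint8_t * end = source + length; source < end; source++)
    {
        std::memcpy(destination, &encoding_table[*source], sizeof(std::uint16_t));
        destination += sizeof(std::uint16_t);
    }
}

bool encoders_agree_on_odd_destination()
{
    // Fill the stand-in table: entry i holds byte i in both halves.
    for (unsigned i = 0; i < 256; i++)
    {
        encoding_table[i] = static_cast<std::uint16_t>(i * 0x0101u);
    }

    const std::uint8_t source[] = {0x00, 0x12, 0xFF};
    alignas(2) std::uint8_t expected[1 + 2 * sizeof source] = {};
    alignas(2) std::uint8_t actual[1 + 2 * sizeof source] = {};

    // The buffers are 2-byte aligned, so "+ 1" is guaranteed to be misaligned.
    assert(reinterpret_cast<std::uintptr_t>(actual + 1)
            % alignof(std::uint16_t) != 0);

    encode_reference(expected + 1, source, sizeof source);
    encode_packed(actual + 1, source, sizeof source);
    return std::memcmp(expected, actual, sizeof expected) == 0;
}
```

A check like this passes on a PC either way, so it only guards the output; the hard fault itself can still only be observed on the M0 target.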
