Portable Unaligned Memory Access for ARM Cortex
One of the more interesting issues I recently had to solve came up while I was porting a piece of software to an ARM Cortex M0 processor. The software usually runs either on a PC or on an ARM Cortex M4, but on the M0 it unexpectedly ended up in a hard fault, even though the same code works fine on the other platforms and is covered by unit tests.
Looking at the call stack after the hard fault, the culprit seemed to be a routine that looked something like this:
void encode(
    std::uint8_t * destination,
    const std::uint8_t * source,
    std::size_t length)
{
    const std::uint8_t * end = source + length;
    while (source < end)
    {
        // Halfword store through a reinterpreted pointer: this is the
        // access that faults when destination is not 2-byte aligned.
        *reinterpret_cast<std::uint16_t *>(destination) =
            encoding_table[*source];
        destination += sizeof(std::uint16_t);
        source++;
    }
}
I did have a vague idea that the M0 architecture cannot perform unaligned memory accesses, and after a minute of debugging I discovered that the routine above was indeed being called with the destination parameter pointing to an unaligned memory location and performing a halfword store to that address. On most platforms this is not an issue apart from a slight performance penalty (but we are not here to micro-optimize the code); the Cortex M4, for example, handles unaligned halfword accesses in hardware by default. The ARM Cortex M0, however, cannot perform this type of operation and raises a hard fault instead.
The problem stems from the fact that the compiler (GCC in my case) expects pointer values to be correctly aligned for the type they point to. Strictly speaking, dereferencing a misaligned pointer is undefined behavior in C++, so the compiler is entitled to this assumption. While it is normally a reasonable assumption, necessary for generating reasonably fast code, in this particular case it causes the issue described above.
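The misalignment described above can also be detected on a PC build, where the hardware silently tolerates it. The following is a minimal sketch, not part of the original code: `is_aligned_for` and `assert_halfword_aligned` are hypothetical helper names, and the check assumes that converting a pointer to `std::uintptr_t` yields its numeric address, which holds on mainstream platforms.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if `pointer` is suitably aligned for type T.
// Assumes the integer value of the pointer is its address.
template <typename T>
bool is_aligned_for(const void * pointer)
{
    return reinterpret_cast<std::uintptr_t>(pointer) % alignof(T) == 0;
}

// Hypothetical debug check that could sit at the top of the original
// encode(); it compiles to nothing when NDEBUG is defined.
inline void assert_halfword_aligned(const std::uint8_t * destination)
{
    assert(is_aligned_for<std::uint16_t>(destination));
}
```

With such a check in place, unit tests that pass an odd destination address would fail on the PC as well, instead of only on the Cortex M0.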
The naive solution to this problem would be to provide a separate implementation of the code for the ARM Cortex M0 processor; however, I did not want to fork the code because of maintainability concerns.
The recommended fix for the code above would be to use std::memcpy() as follows:
void encode(
    std::uint8_t * destination,
    const std::uint8_t * source,
    std::size_t length)
{
    const std::uint8_t * end = source + length;
    while (source < end)
    {
        std::memcpy(destination, encoding_table + (*source), sizeof(std::uint16_t));
        destination += sizeof(std::uint16_t);
        source++;
    }
}
While this works well on the ARM Cortex M4, on the M0 the compiler generates an actual call to the std::memcpy() function in every iteration of the loop. This does not feel right to me, since this code is quite speed-critical.
After some research I came up with a solution that seemed satisfactory to me. The idea is to tell the compiler explicitly that it might be accessing unaligned memory locations and let it generate the appropriate code for the given architecture. With GCC this can be achieved as follows:
void encode(
    std::uint8_t * destination,
    const std::uint8_t * source,
    std::size_t length)
{
    const std::uint8_t * end = source + length;
    struct packed_uint16_t
    {
        std::uint16_t value;
    } __attribute__((packed)) __attribute__((__may_alias__));
    while (source < end)
    {
        reinterpret_cast<packed_uint16_t *>(destination)->value =
            encoding_table[*source];
        destination += sizeof(std::uint16_t);
        source++;
    }
}
Since the notion of packed structures exists in many compilers, the above code can be made even more portable with some preprocessor magic. This code also addresses the aliasing issues that may arise from the type punning in the original code, while preventing unnecessary calls to std::memcpy() on the ARM Cortex M0.
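As an illustration of the preprocessor magic mentioned above, here is one possible sketch. The macro name `PACKED_STRUCT` and the helper `store_u16_unaligned` are hypothetical and not part of the original code. The GCC/Clang branch mirrors the attributes used in the function above; the MSVC branch relies on `__pragma(pack(...))`, carries no aliasing attribute, and should be treated as an untested assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical portability macro. Declarations containing top-level
// commas would need a variadic version of this macro.
#if defined(__GNUC__) // GCC and Clang both define __GNUC__
    #define PACKED_STRUCT(declaration) \
        declaration __attribute__((packed)) __attribute__((__may_alias__))
#elif defined(_MSC_VER)
    #define PACKED_STRUCT(declaration) \
        __pragma(pack(push, 1)) declaration __pragma(pack(pop))
#else
    #error "No packed-structure syntax known for this compiler"
#endif

PACKED_STRUCT(struct packed_uint16_t { std::uint16_t value; });

// The packed type has no padding and no alignment requirement.
static_assert(sizeof(packed_uint16_t) == 2, "unexpected padding");
static_assert(alignof(packed_uint16_t) == 1, "structure is not packed");

// Hypothetical helper: stores a 16-bit value at any address, letting the
// compiler pick a halfword store or two byte stores for the target.
inline void store_u16_unaligned(std::uint8_t * destination, std::uint16_t value)
{
    assert(destination != nullptr);
    reinterpret_cast<packed_uint16_t *>(destination)->value = value;
}
```

Inside the loop of `encode()`, the store would then read `store_u16_unaligned(destination, encoding_table[*source]);`, with the compiler-specific syntax kept in one place.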
The solution worked fine on the ARM Cortex M0; however, I still had to investigate whether this modification affected other platforms in some way. I did this by disassembling the code generated for the ARM Cortex M4 (with at least the debug-friendly -Og optimizations turned on). Below is the code generated from the original version of the function.
<encode(unsigned char*, unsigned char const*, unsigned int)>:
add r2, r1
cmp r1, r2
bcs.n exit
push {r4}
loop:
ldrb.w r4, [r1], #1
ldr r3, [pc, #16]
ldrh.w r3, [r3, r4, lsl #1]
strh.w r3, [r0], #2
cmp r1, r2
bcc.n loop
pop {r4}
exit:
bx lr
Next is the disassembly of the code generated for the ARM Cortex M4 from the modified version of the function.
<encode(unsigned char*, unsigned char const*, unsigned int)>:
add r2, r1
cmp r1, r2
bcs.n exit
push {r4}
loop:
ldrb.w r4, [r1], #1
ldr r3, [pc, #16]
ldrh.w r3, [r3, r4, lsl #1]
strh.w r3, [r0], #2
cmp r1, r2
bcc.n loop
pop {r4}
exit:
bx lr
As keen-eyed readers might have noticed, the instructions generated by my compiler are indeed identical, even though the source code differs. I found this result very satisfying. When the modified C++ code (the packed-structure version) is compiled for the ARM Cortex M0, however, the generated code correctly handles the possibility of working with unaligned memory, as shown below.
<encode(unsigned char*, unsigned char const*, unsigned int)>:
push {r4, lr}
adds r2, r1, r2
loop:
cmp r1, r2
bcs.n exit
ldrb r3, [r1, #0]
lsls r3, r3, #1
ldr r4, [pc, #16]
ldrh r3, [r3, r4]
strb r3, [r0, #0] ; Store lower-byte
lsrs r3, r3, #8
strb r3, [r0, #1] ; Store upper-byte
adds r0, #2
adds r1, #1
b.n loop
exit:
pop {r4, pc}
nop
.word 0x00000000
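The two strb instructions in the listing above amount to storing the halfword one byte at a time. Written by hand in C++, that could look like the sketch below. The name `store_le16` is hypothetical, and unlike the packed-structure version this variant fixes the byte order to little-endian regardless of the host. That matches the store order in the listing, but it is an assumption about what the encoded output should look like on other platforms.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: stores a 16-bit value as two separate bytes,
// lower byte first, so no alignment requirement applies.
inline void store_le16(std::uint8_t * destination, std::uint16_t value)
{
    assert(destination != nullptr);
    destination[0] = static_cast<std::uint8_t>(value & 0xFFu);        // lower byte
    destination[1] = static_cast<std::uint8_t>((value >> 8) & 0xFFu); // upper byte
}
```

The trade-off is that the byte stores are spelled out for every target, so a platform that could use a single halfword store depends on the optimizer to merge them.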
Update Summary
- Add notes on type punning using std::memcpy()
- Add disassembly of the ARM Cortex M0 binary