# Portable Unaligned Memory Access for ARM Cortex

- published: 2019-08-26T18:05:45+0000
- updated: 2021-05-07T18:05:45+0000
- tags: cpp, c, gcc, programming, arm, cortex

One of the more interesting issues I recently had to solve showed up while I was porting a piece of software, which usually ran either on a PC or on an ARM Cortex M4 processor, to an ARM Cortex M0 processor. Seemingly out of nowhere, the code ended up in a hard fault, even though it worked fine on the other platforms and was even covered by unit tests. Looking at the call stack after the hard fault, the culprit seemed to be a routine that looked something like this:

```cpp
void encode(
    std::uint8_t * destination,
    const std::uint8_t * source,
    std::size_t length)
{
    const std::uint8_t * end = source + length;

    while (source < end)
    {
        // 16-bit store through a pointer that may not be 2-byte aligned.
        *reinterpret_cast<std::uint16_t *>(destination) = encoding_table[*source];
        destination += sizeof(std::uint16_t);
        source++;
    }
}
```

I had a vague idea that the M0 architecture cannot perform unaligned memory accesses, and after a minute of debugging I discovered that the routine above was indeed being called with the `destination` parameter pointing to an unaligned memory location and was performing a halfword store to that address. On most platforms this is not an issue apart from a slight performance penalty (but we are not here to micro-optimize the code). The ARM Cortex M0, however, cannot perform this type of operation at all.

The problem stems from the fact that the GCC compiler expects the *values* of pointers to be correctly aligned for the type they point to. This is normally a reasonable assumption, and it is needed for the generated code to perform well, but in this particular case it causes the issue described above.

The naive solution would be to provide a separate implementation of the code for the ARM Cortex M0 processor. However, I did not want to fragment the code because of maintainability concerns.
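As an aside, the misalignment described above can be caught on a PC build, before it turns into a hard fault on the target, with a simple run-time check. The following is only a minimal sketch: `is_aligned_for` is a hypothetical helper name that does not appear in the original code, and the check assumes that converting a pointer to `std::uintptr_t` yields its numeric address, which holds on the platforms discussed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of the original code): returns true when
// `pointer` is suitably aligned for an object of type T.
template <typename T>
bool is_aligned_for(const void * pointer)
{
    // Assumption: the integer value of the pointer is its address.
    return reinterpret_cast<std::uintptr_t>(pointer) % alignof(T) == 0;
}

// Possible use at the top of a routine such as encode():
//     assert(is_aligned_for<std::uint16_t>(destination));
```

An assertion like this only documents and enforces the alignment requirement. It does not remove it, so it suits debug builds and unit tests rather than serving as a fix.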
The recommended fix for the code above would be to use `std::memcpy()` as follows.

```cpp
void encode(
    std::uint8_t * destination,
    const std::uint8_t * source,
    std::size_t length)
{
    const std::uint8_t * end = source + length;

    while (source < end)
    {
        // Copy the 16-bit table entry byte-wise; no alignment is assumed.
        std::memcpy(destination, encoding_table + (*source), sizeof(std::uint16_t));
        destination += sizeof(std::uint16_t);
        source++;
    }
}
```

While this works well on the ARM Cortex M4, on the M0 the compiler generates an actual call to the `std::memcpy()` function in every iteration of the loop. This did not feel right to me, since this code is quite speed-critical.

After some research I came up with a solution that seemed satisfactory. The idea is to tell the compiler explicitly that it might be accessing unaligned memory locations and let it generate the appropriate code for the given architecture. On GCC compilers this can be achieved as follows:

```cpp
void encode(
    std::uint8_t * destination,
    const std::uint8_t * source,
    std::size_t length)
{
    const std::uint8_t * end = source + length;

    // packed: the structure (and its member) has an alignment of 1.
    // may_alias: accesses through this type may alias any other type.
    struct packed_uint16_t
    {
        std::uint16_t value;
    } __attribute__((packed)) __attribute__((__may_alias__));

    while (source < end)
    {
        reinterpret_cast<packed_uint16_t *>(destination)->value = encoding_table[*source];
        destination += sizeof(std::uint16_t);
        source++;
    }
}
```

Since the notion of packed structures exists in many compilers, the above code can be made even more portable with some preprocessor magic. This code also addresses the aliasing issues that may arise from the type punning in the original code, while preventing unnecessary calls to `std::memcpy()` on the ARM Cortex M0.

The solution worked fine on the ARM Cortex M0, but I still had to investigate whether the modification affected other platforms in some way. I did this by disassembling the code generated for the ARM Cortex M4 (with at least the debug `-Og` optimizations turned on). Below is the code generated from the original version of the function.
```armasm
encode:
    add     r2, r1
    cmp     r1, r2
    bcs.n   exit
    push    {r4}
loop:
    ldrb.w  r4, [r1], #1
    ldr     r3, [pc, #16]
    ldrh.w  r3, [r3, r4, lsl #1]
    strh.w  r3, [r0], #2
    cmp     r1, r2
    bcc.n   loop
    pop     {r4}
exit:
    bx      lr
```

Now follows the disassembly of the code generated for the ARM Cortex M4 from the modified version of the function.

```armasm
encode:
    add     r2, r1
    cmp     r1, r2
    bcs.n   exit
    push    {r4}
loop:
    ldrb.w  r4, [r1], #1
    ldr     r3, [pc, #16]
    ldrh.w  r3, [r3, r4, lsl #1]
    strh.w  r3, [r0], #2
    cmp     r1, r2
    bcc.n   loop
    pop     {r4}
exit:
    bx      lr
```

As many keen-eyed readers might have noticed, the instructions generated by my compiler are identical, even though the source code differs. I found this result very satisfying.

When the modified (packed structure) version of the code is compiled for the ARM Cortex M0, however, the generated code correctly handles the possibility of working with unaligned memory: the single halfword store is replaced by two byte stores, as shown below.

```armasm
encode:
    push    {r4, lr}
    adds    r2, r1, r2
loop:
    cmp     r1, r2
    bcs.n   exit
    ldrb    r3, [r1, #0]
    lsls    r3, r3, #1
    ldr     r4, [pc, #16]
    ldrh    r3, [r3, r4]
    strb    r3, [r0, #0]    ; Store lower-byte
    lsrs    r3, r3, #8
    strb    r3, [r0, #1]    ; Store upper-byte
    adds    r0, #2
    adds    r1, #1
    b.n     loop
exit:
    pop     {r4, pc}
    nop
    .word   0x00000000
```

### Update Summary

* Add notes on type punning using `std::memcpy()`
* Add disassembly of the ARM Cortex M0 binary