Every so often, when the compiler is slacking, I pop the hood and get my hands dirty with assembly.
My modus operandi is to write the optimized function in an .asm file, assemble it with NASM and add it to the build, along with other .asm functions, as a static library.
The problem with this approach is that, to my knowledge, GCC, Clang and MSVC are not able to apply Link Time Optimization when working with object files that have been compiled from assembly.
This means that even simple procedures or functions are not inlined, and have to pay the overhead associated with a function call.
A simple example
Ignoring for a moment the fact that every major compiler exposes a builtin that provides this exact functionality (e.g. __builtin_ia32_rdtsc()), imagine you needed to add a Time Stamp Counter function to your application.
The following is a dummy application that uses the RDTSC function (the multiplication by argc is there to keep the compiler from optimizing the call away):
In order to add the functionality you would have two approaches:
- compile the function as a standalone assembly file and add the object file to your build
- use inline assembly (if your compiler supports it)
Standalone assembly works on all compilers and, using AT&T syntax, looks like this:
The output produced by GCC and Clang (MSVC produces something similar) when compiled with all optimizations (-O3 -flto) is:
As you can see, the function has not been inlined, despite being just three instructions. I speculate this is because the assembler doesn’t decorate the object file with the metadata the linker requires in order to inline the call during link time optimization.
The inline assembly way
When using inline assembly, on the other hand, things work as expected, and the resulting binary is optimized correctly.
This is the inline equivalent of the previous GetRDTSC function, using the GCC Extended Asm syntax:
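A minimal sketch of such a function, assuming an x86-64 target and a compiler that accepts GCC Extended Asm (GCC or Clang), might look like the following. The local names lo and hi are illustrative choices, not taken from any particular codebase.

```c
#include <stdint.h>

// Sketch only: assumes x86-64 and GCC Extended Asm support (GCC or Clang).
static inline uint64_t GetRDTSC(void)
{
    uint32_t lo, hi; // illustrative names for the two halves of the counter

    // RDTSC writes the low 32 bits of the time stamp counter to EAX and the
    // high 32 bits to EDX. The "=a" and "=d" output constraints bind those
    // registers to lo and hi. "volatile" keeps the compiler from removing
    // or merging repeated reads.
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));

    // Combine the two halves into a single 64-bit value.
    return ((uint64_t)hi << 32) | lo;
}
```

Because the body is visible to the compiler as part of an ordinary C function, a caller that multiplies the result by argc, as in the dummy application above, gives the optimizer everything it needs to inline the call.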
And this is the disassembled output:
Perfect! The function call and associated stack manipulation are gone, and only the essential instructions remain.
This is the superior approach for compact procedures that are invoked frequently, but can quickly become impossible to maintain as the complexity of the assembly function rises.
In Rival Fortress I use a mixture of both approaches: I initially write and debug assembly functions as standalone files, and during profiling sessions I decide whether it is worth inlining them.
NOTE: Inline assembly doesn’t work on MSVC when compiling for x64 targets, so you are out of luck if that is your compiler of choice.
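For this particular example, one possible fallback on MSVC x64 is the __rdtsc() intrinsic declared in <intrin.h>. GCC and Clang expose the same name through <x86intrin.h>. The sketch below assumes an x86 or x86-64 target, and the wrapper name GetRDTSCIntrinsic is made up for illustration. It only covers instructions that happen to have an intrinsic, so it is not a general replacement for inline assembly.

```c
#include <stdint.h>

#if defined(_MSC_VER)
#include <intrin.h>     // MSVC: declares __rdtsc()
#else
#include <x86intrin.h>  // GCC/Clang: declares __rdtsc()
#endif

// Hypothetical wrapper name; assumes an x86 or x86-64 target.
static inline uint64_t GetRDTSCIntrinsic(void)
{
    // The compiler typically expands the intrinsic to an RDTSC instruction
    // at the call site, so no inline assembly is needed.
    return __rdtsc();
}
```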