Standalone Assembly Files And Inlining
Every so often, when the compiler is slacking, I pop the hood and get my hands dirty with assembly.
My modus operandi is to write the optimized function in an .asm file, assemble it with NASM and add it to the build, along with other .asm functions, as a static library.
The problem with this approach is that, to my knowledge, GCC, Clang and MSVC are not able to apply Link Time Optimization when working with object files that have been compiled from assembly.
This means that even simple procedures or functions are not inlined, and have to pay the overhead associated with a function call.
A simple example
Ignoring for a moment the fact that the major compilers each expose a builtin that provides this exact functionality (e.g. __builtin_ia32_rdtsc()), imagine you needed to add a Time Stamp Counter function to your application.
The following is a dummy application that uses the RDTSC function (the result of GetRDTSC is multiplied by argc to keep the compiler from optimizing the call away):
#ifdef __cplusplus
extern "C"
#endif
unsigned long GetRDTSC();
int
main(int argc, char** argv)
{
    int Result = (int)(GetRDTSC() * argc);
    return Result;
}
In order to add the functionality you would have two approaches:
- compile the function as a standalone assembly file and add the object file to your build
- use inline assembly (if your compiler supports it)
Standalone assembly
Standalone assembly works with all compilers and, written here in AT&T syntax for the GNU assembler, looks like this:
.global GetRDTSC
GetRDTSC:
    rdtsc              // read time-stamp counter into EDX:EAX
    shl $32, %rdx      // combine into RAX register
    or %rdx, %rax
    ret
The output produced by GCC and Clang (MSVC produces something similar) when compiled with all optimizations (-O3 -flto) is:
Dump of assembler code for function main:
0x00000000004004a0 <+0>: push %rbx
0x00000000004004a1 <+1>: mov %edi,%ebx
0x00000000004004a3 <+3>: callq 0x400488 <GetRDTSC>
0x00000000004004a8 <+8>: imul %ebx,%eax
0x00000000004004ab <+11>: pop %rbx
0x00000000004004ac <+12>: retq
As you can see, the function has not been inlined, despite being just three instructions, because (I speculate) the assembler doesn’t decorate the object file with the metadata the linker requires in order to optimize the call away during link time optimization.
The inline assembly way
When using inline assembly, on the other hand, things work as expected, and the resulting binary is optimized correctly.
This is the inline equivalent of the previous GetRDTSC function, using the GCC Extended Asm syntax:
static unsigned long
GetRDTSC()
{
    // read time-stamp counter into EDX:EAX
    // then combine into RAX register
    unsigned Low, High;
    // volatile keeps the compiler from merging or dropping repeated reads
    __asm__ volatile("rdtsc" : "=a"(Low), "=d"(High));
    return ((unsigned long)Low) | (((unsigned long)High) << 32);
}
And this is the disassembled main function:
Dump of assembler code for function main:
0x00000000004003b0 <+0>: rdtsc
0x00000000004003b2 <+2>: shl $0x20,%rdx
0x00000000004003b6 <+6>: or %rdx,%rax
0x00000000004003b9 <+9>: imul %edi,%eax
0x00000000004003bc <+12>: retq
Perfect! The function call and associated stack manipulation are gone, and only the essential instructions remain.
This is the superior approach for compact procedures that are invoked frequently, but can quickly become impossible to maintain as the complexity of the assembly function rises.
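To give a feel for how quickly the operand lists grow, here is a slightly larger sketch in the same GCC Extended Asm syntax. It is not taken from Rival Fortress: the function name MulHigh64 is hypothetical, and the code assumes an x86-64 target built with GCC or Clang. It returns the upper 64 bits of an unsigned 64x64-bit multiply:

```c
#include <stdint.h>

// Hypothetical example, x86-64 with GCC or Clang only.
static inline uint64_t
MulHigh64(uint64_t A, uint64_t B)
{
    uint64_t Low, High;
    // MUL multiplies RAX by its operand and writes the product to RDX:RAX
    __asm__("mulq %3"
            : "=a"(Low), "=d"(High)   // outputs: RAX = low half, RDX = high half
            : "a"(A), "rm"(B)         // inputs: A in RAX, B in a register or memory
            : "cc");                  // MUL modifies the flags
    (void)Low;                        // the low half is not needed here
    return High;
}
```

Even at this size there are two outputs, two inputs and a clobber that all have to stay in sync with a single instruction, which is where the maintenance cost starts to show.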
In Rival Fortress I use a mixture of both approaches: I initially write and debug assembly functions as standalone files, and during profiling sessions I decide whether it is worth inlining them.
NOTE: Inline assembly doesn’t work on MSVC when compiling for x64 targets, so you are out of luck if that is your compiler of choice.
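For this particular function, one way around the limitation is the compiler-provided intrinsic mentioned at the start of the post. The following is a minimal sketch, assuming an x86 or x64 target and the __rdtsc() intrinsic, which MSVC declares in intrin.h and GCC and Clang declare in x86intrin.h. The wrapper name ReadTimestampCounter is made up for illustration:

```c
// Sketch only: assumes an x86/x64 target and a compiler that ships __rdtsc().
#if defined(_MSC_VER)
#include <intrin.h>      // MSVC declares __rdtsc() here
#else
#include <x86intrin.h>   // GCC and Clang declare __rdtsc() here
#endif

// Hypothetical wrapper name, not taken from the original code.
static inline unsigned long long
ReadTimestampCounter(void)
{
    // The compiler expands the intrinsic itself, so the optimizer can
    // treat it like any other inlinable expression.
    return __rdtsc();
}
```

This only helps when an intrinsic exists for the instruction you need. Anything larger still has to live in a standalone assembly file on MSVC x64.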