Metric Panda Games

One pixel at a time.

Standalone Assembly Files And Inlining

Rival Fortress Update #43

Every so often, when the compiler is slacking, I pop the hood and get my hands dirty with assembly.

My modus operandi is to write the optimized function in an .asm file, assemble it with NASM and add it to the build, along with other .asm functions, as a static library.

The problem with this approach is that, to my knowledge, GCC, Clang and MSVC are not able to apply Link Time Optimization to object files that have been assembled from standalone assembly sources.

This means that even simple procedures or functions are not inlined, and have to pay the overhead associated with a function call.

A simple example

Ignoring for a moment the fact that every major compiler exposes a builtin that provides this exact functionality (e.g. __builtin_ia32_rdtsc() on GCC and Clang), imagine you need to add a Time Stamp Counter function to your application.

The following is a dummy application that uses the GetRDTSC function (the result is multiplied by argc to keep the compiler from optimizing the call away):

#ifdef __cplusplus
extern "C"
#endif
unsigned long GetRDTSC(void);

int
main(int argc, char** argv)
{
  int Result = (int)(GetRDTSC() * argc);
  return Result;
}

In order to add the functionality you would have two approaches:

  • compile the function as a standalone assembly file and add the object file to your build
  • use inline assembly (if your compiler supports it)

Standalone assembly

Standalone assembly works with all compilers and, in AT&T syntax, looks like this:

.global GetRDTSC

GetRDTSC:
  rdtsc         // read time-stamp counter into EDX:EAX
  shl $32, %rdx // combine into RAX register
  or %rdx, %rax
  ret

The output produced by GCC and Clang (MSVC produces something similar) when compiled with all optimizations (-O3 -flto) is:

 Dump of assembler code for function main:
   0x00000000004004a0 <+0>:     push   %rbx
   0x00000000004004a1 <+1>:     mov    %edi,%ebx
   0x00000000004004a3 <+3>:     callq  0x400488 <GetRDTSC>
   0x00000000004004a8 <+8>:     imul   %ebx,%eax
   0x00000000004004ab <+11>:    pop    %rbx
   0x00000000004004ac <+12>:    retq

As you can see, the function has not been inlined, despite being just three instructions plus a return, because (I speculate) the assembler doesn’t decorate the object file with the metadata the linker needs: link time optimization works on the compiler's intermediate representation stored in each object file, and an object produced by an assembler contains only machine code.

The inline assembly way

When using inline assembly, on the other hand, things work as expected, and the resulting binary is optimized correctly.

This is the inline equivalent of the previous GetRDTSC function, using the GCC Extended Asm syntax:

static unsigned long
GetRDTSC()
{
  // read time-stamp counter into EDX:EAX
  // then combine into RAX register
  unsigned Low, High;
  // volatile: every call must perform a fresh read of the counter
  __asm__ __volatile__("rdtsc" : "=a"(Low), "=d"(High));
  return ((unsigned long)Low) | (((unsigned long)High) << 32);
}

And this is the disassembled main function:

 Dump of assembler code for function main:
   0x00000000004003b0 <+0>:     rdtsc
   0x00000000004003b2 <+2>:     shl    $0x20,%rdx
   0x00000000004003b6 <+6>:     or     %rdx,%rax
   0x00000000004003b9 <+9>:     imul   %edi,%eax
   0x00000000004003bc <+12>:    retq

Perfect! The function call and associated stack manipulation are gone, and only the essential instructions remain.

This is the superior approach for compact procedures that are invoked frequently, but it can quickly become impossible to maintain as the assembly function grows more complex.
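As a usage sketch, the inlined counter can bracket a region of code to estimate its cost in TSC ticks. This assumes an x86-64 target with GCC or Clang. The helpers SumTo and MeasureSumTicks are hypothetical names made up for this example, and the fixed-width types and the volatile qualifier are choices of this sketch rather than part of the original function.

```c
#include <stdint.h>

// Same inline-assembly counter as above. `volatile` (an addition in this
// sketch) stops the compiler from merging or hoisting repeated reads.
static inline uint64_t
GetRDTSC(void)
{
  uint32_t Low, High;
  __asm__ __volatile__("rdtsc" : "=a"(Low), "=d"(High));
  return ((uint64_t)Low) | (((uint64_t)High) << 32);
}

// Hypothetical workload: sums the integers in [0, Count).
static uint64_t
SumTo(uint64_t Count)
{
  volatile uint64_t Sum = 0; // volatile so the loop is not folded away
  for (uint64_t Index = 0; Index < Count; ++Index)
  {
    Sum += Index;
  }
  return Sum;
}

// Hypothetical helper: runs SumTo and reports the elapsed TSC ticks.
static uint64_t
MeasureSumTicks(uint64_t Count, uint64_t* SumOut)
{
  uint64_t Start = GetRDTSC();
  *SumOut = SumTo(Count);
  uint64_t End = GetRDTSC();
  return End - Start;
}
```

Keep in mind that RDTSC is not a serializing instruction, so for very short regions out-of-order execution can skew the measurement. Treat the tick counts as a rough guide rather than an exact cost.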

In Rival Fortress I use a mixture of both approaches: I initially write and debug assembly functions as standalone files, and during profiling sessions I decide whether inlining them is worth it.

NOTE: Inline assembly doesn’t work on MSVC when compiling for x64 targets, so you are out of luck if that is your compiler of choice.
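For a case like the Time Stamp Counter, one way around this limitation is the compiler intrinsic mentioned at the start of the post. The following is a minimal sketch, assuming an x86 or x86-64 target: __rdtsc() is declared in <intrin.h> on MSVC and in <x86intrin.h> on GCC and Clang. The wrapper name GetRDTSCPortable is invented for this example.

```c
#if defined(_MSC_VER)
#include <intrin.h>    // MSVC: declares __rdtsc()
#else
#include <x86intrin.h> // GCC and Clang: declares __rdtsc()
#endif

// Hypothetical wrapper (name invented for this sketch). The intrinsic is
// expanded by the compiler itself, so there is no call into a separately
// assembled object file.
static inline unsigned long long
GetRDTSCPortable(void)
{
  return __rdtsc();
}
```

This only helps when an intrinsic exists for the instruction you need. Anything more elaborate still has to live in a standalone assembly file on MSVC x64.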