
Chronicle of Optimization with perf and C

Rival Fortress Update #36

Note: You can find the source code for the optimized version of poly2tri on Github.

I needed a fast and robust mesh triangulation solution for Metric Panda Engine and after some research I found the excellent Poly2Tri library, an implementation in C++ of the Sweep-line algorithm for constrained Delaunay triangulation.

Metric Panda Engine is written in C-style C++, so I wanted to port the code for poly2tri and customize it to fit the engine’s needs.

While porting the code I noticed opportunities for optimization, and in this post I’ll talk about what I did.

Preparing the library for profiling

Before I went ahead and made changes I wanted a baseline profile measurement.

Fortunately Poly2tri comes with a testbed with sample data that can be used to test the output of the library. The test application accepts a data file containing a list of points as input, runs the triangulation routine, and displays the result using OpenGL.

The part I was interested in optimizing was the triangulation routine, so everything else that the test application did was noise that I had to minimize in order to get better readings while profiling.

To do so I modified the main.cc and did the following:

  • Removed the OpenGL visualization part
  • Isolated all library-related function calls and placed them in a loop to reduce the noise from the startup of the testbed application

The relevant part of the refactored testbed looks like this:

  for (int Iteration = 0; Iteration < MAX_ITERATIONS; ++Iteration)
  {
    CDT cdt = CDT(polyline);
    cdt.Triangulate();
    ClearIteration(polyline);
  }

Note: The snippet above contained a call to new and delete in the initial version of the post. Creating the CDT on the stack saves the call to malloc and shaves a little off the benchmarked times.

Bird’s eye view with perf stat

With Poly2tri ready to be profiled I jumped to my trusty perf, the profiling Swiss Army knife every Linux dev should have in their toolbox. On Windows, the best solution is probably Intel’s VTune, but I don’t have any experience with it.

The first perf subcommand I always start with is stat:

taskset 2 perf stat --detailed --repeat 10 ./main testbed/data/debug2.dat 0 0 1

Several things are happening in this one-liner:

  • The taskset 2 forces execution to happen on the second CPU core (the argument is a CPU affinity mask, so 2 selects core 1) and avoids any random core migration that may muddy up the results. This can be safely done when dealing with single-threaded code, as is the case for Poly2tri.

  • The perf stat command runs the Poly2tri executable (./main) passing it debug2.dat, a test data file containing 10000 points to triangulate.
  • The --detailed flag will cause perf to output more counters.
  • The --repeat 10 flag reruns the executable ten times and averages the results as well as printing the standard deviation between each run.
  • The 0 0 1 at the end are arguments that Poly2tri needs for the OpenGL visualization, but are not used in the modified testbed.

This is the result with the interesting parts highlighted:

  • The first highlighted row shows the execution time in milliseconds. The rightmost column is the standard deviation between runs; a low standard deviation means performance is stable across runs.

  • The second highlighted row shows the instructions per cycle, an important metric that can give you hints on whether the CPU is stalling during execution.

    The Intel® 64 and IA-32 Architectures Optimization Reference Manual defines this metric as the Clocks per Instructions Retired Ratio (CPI), the inverse of what perf shows (1.29 instructions per cycle corresponds to a CPI of roughly 0.78). The theoretical best on modern superscalar CPUs executing at full speed is 4 instructions per cycle. In the case of Poly2tri, 1.29 instructions per cycle indicates possible front-end or back-end stalls, so a deeper analysis is required. Take a look at Tuning Applications Using a Top-down Microarchitecture Analysis Method for more on the subject.

  • The third highlighted row shows the number of L1 cache misses. Perf highlighted the percentage in yellow to indicate that this may be a potential hotspot and should be investigated further.

Finding the hot spots with perf report

Now that I had a baseline to work with, I used the next perf subcommand: record.

taskset 2 perf record -e cpu-clock,L1-dcache-load-misses ./main testbed/data/debug2.dat 0 0 1

The previous command will sample the execution of the program and read the event counters specified by the -e flag:

  • cpu-clock: measures the time spent in each function. This is useful for gauging which functions we should be focusing on.
  • L1-dcache-load-misses: shows the functions that triggered the most L1 cache misses.

The recorded data can be viewed with perf report and looks like this:

The first set shows the top 5 functions sorted by cpu-clock, with the percentage column on the left indicating the share of overall time spent in each. The second set of functions shows where the L1 cache misses happened.

The functions of interest that appear in both sets are:

  • p2t::Triangle::MarkNeighbor: (see on Github) marks another triangle as a neighbor when the two share two points along an edge (sketched after this list).
  • __ieee754_atan2_avx: (see on Github) is the libm implementation of the atan2 function.
  • p2t::AdvancingFront::LocateNode: (see on Github) traverses the doubly-linked-list of the advancing front looking for the next node to process.
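To see why these functions are memory-bound, here is a simplified sketch of the neighbor-marking idea (illustrative only; names are simplified and this is not the library’s exact code). Every check chases pointers into other heap-allocated triangles, which is exactly the kind of access pattern that misses the cache:

  struct Point { double x, y; };

  struct Triangle
  {
    Point* Points[3];       // vertices, allocated elsewhere on the heap
    Triangle* Neighbors[3]; // neighbor opposite each vertex

    bool Contains(const Point* P) const
    {
      return P == Points[0] || P == Points[1] || P == Points[2];
    }

    // True if this triangle shares the edge P-Q.
    bool ContainsEdge(const Point* P, const Point* Q) const
    {
      return Contains(P) && Contains(Q);
    }

    void MarkNeighbor(Triangle& Other)
    {
      // Each comparison dereferences pointers that live in separate heap
      // allocations, so the CPU frequently stalls waiting on memory.
      if (Other.ContainsEdge(Points[1], Points[2])) { Neighbors[0] = &Other; }
      else if (Other.ContainsEdge(Points[0], Points[2])) { Neighbors[1] = &Other; }
      else if (Other.ContainsEdge(Points[0], Points[1])) { Neighbors[2] = &Other; }
    }
  };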

Digging deeper with perf annotate

Now that we have a few hot spots to look at, we can fire up another perf subcommand: annotate.

Actually, instead of calling annotate directly, I find it easier to jump to it from perf report by selecting the function of interest and pressing a.

Cycling through the hottest instructions in the cpu-clock report for p2t::Triangle::MarkNeighbor (by pressing TAB), most of the hot spots are caused by this:

It corresponds to the function p2t::Triangle::Contains (see on Github) that is inlined by the optimizer.

The cmp %rcx,%rdx takes up so many cycles probably because of a CPU stall, caused either by a cache miss or by a dependency on an earlier instruction.

Looking at the report for L1-dcache-load-misses seems to confirm the suspicion:

The other function on the hot spot list, p2t::AdvancingFront::LocateNode, also suffers from a similar issue: the CPU is having to wait for memory access because of heavy usage of doubly linked lists.

Ways to optimize these sorts of CPU stalls involve improving the cache locality of the data (i.e. keeping the data that needs to be worked on closer together in memory). This can’t always be done directly if the underlying algorithm has certain constraints, but oftentimes there is a better way.
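As a minimal illustration of the idea (not Poly2tri code), the nodes of a linked list can be made much friendlier to the cache simply by allocating them from one contiguous pool instead of as individual heap allocations:

  #include <cstddef>
  #include <vector>

  struct Node
  {
    double X, Y;
    Node* Next;
  };

  // Minimal sketch: hand out nodes from one contiguous block so that a
  // traversal touches adjacent cache lines instead of jumping across the heap.
  struct NodePool
  {
    std::vector<Node> Storage; // one contiguous allocation
    size_t Used;

    explicit NodePool(size_t Capacity) : Storage(Capacity), Used(0) {}

    Node* Allocate()
    {
      // Nodes requested back-to-back end up adjacent in memory.
      return Used < Storage.size() ? &Storage[Used++] : 0;
    }
  };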

An excellent video on the subject is Mike Acton’s great Data-Oriented Design and C++ talk.

Time to Optimize

Before tackling cache locality, I wanted to pick the low-hanging fruit.

Bye bye new and push_back

For the first optimization pass I decided to remove all dynamic allocations caused by calls to std::vector::push_back and new. I never use dynamic allocation in performance-critical code, relying instead on various custom allocators depending on the need.

I replaced these allocations with a simple fixed-size push allocator that partitions a large memory block that is passed in to the library during initialization. The push allocator looks something like this:

#include <cstddef>

struct PolyPushAllocator
{
  unsigned char* Memory; // large block passed to the library during initialization
  size_t Size;           // total capacity of the block in bytes
  size_t Used;           // bytes handed out so far
};

void* Push(PolyPushAllocator* Allocator, size_t Size, size_t AlignmentInBytes)
{
  void* Result = 0;
  size_t AlignmentOffset = 0;
  // Align relative to the current free position, not the base of the block.
  size_t MemoryAddress = (size_t)(Allocator->Memory + Allocator->Used);
  if(MemoryAddress & (AlignmentInBytes-1))
  {
    AlignmentOffset = AlignmentInBytes - (MemoryAddress & (AlignmentInBytes-1));
  }
  Size += AlignmentOffset;
  if (Allocator->Used + Size <= Allocator->Size)
  {
    Result = Allocator->Memory + (Allocator->Used + AlignmentOffset);
    Allocator->Used += Size;
  }
  return Result; // null if the block is exhausted
}

As you can see, it’s very simple and lightning fast compared to malloc/new.
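Usage is a single call per allocation. A hypothetical example (the names mirror the snippet above, not the library’s actual initialization API):

  static unsigned char MemoryBlock[1 << 20]; // 1 MiB arena handed to the library

  int main()
  {
    PolyPushAllocator Allocator = { MemoryBlock, sizeof(MemoryBlock), 0 };

    // Carve out an array of 256 doubles, 16-byte aligned.
    double* Values = (double*)Push(&Allocator, 256 * sizeof(double), 16);

    // "Freeing" everything at the end of an iteration is just resetting the cursor.
    Allocator.Used = 0;
    return Values != 0 ? 0 : 1;
  }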

This optimization resulted in a ~39% performance increase over the baseline on the debug2.dat file, and the gain is proportional to the number of points the library has to process: the more points in the data set, the more costly dynamic allocation becomes.

Doubles to floats

The next optimization I had on my checklist was changing the type for floating point from double to float.

My reasoning was that since 64-bit floating point precision is not needed for this type of library, switching to floats might allow the CPU to increase bandwidth and parallelism.

The result was a tiny improvement in performance, mostly because the algorithm doesn’t do any wide operations on floating point numbers that would benefit from the increased bandwidth.
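One low-friction way to make a switch like this (an assumption about how my port is structured, not something poly2tri itself provides) is to route every coordinate through a single typedef, so the precision can be flipped back in one place:

  // Hypothetical precision switch for the port: change one line to return
  // to doubles if the reduced precision ever becomes a problem.
  typedef float p2t_real; // was: typedef double p2t_real;

  struct Point
  {
    p2t_real x;
    p2t_real y;
  };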

Faster Atan2 function

As seen above, atan2 is one of the hot spots in Poly2tri, so I tried to replace it with an approximate version of the function with an acceptably small error.

The implementation I ended up with is based on this fast atan2 approximation and has a maximum error of 0.005 radians. That is perfectly acceptable for my needs.
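For reference, the self-normalizing technique looks roughly like this (a sketch using the commonly published coefficients; see the linked approximation for the original):

  #include <math.h>

  // Self-normalizing atan2 approximation with a cubic correction term.
  // Maximum error is about 0.005 radians with these coefficients.
  float FastAtan2(float Y, float X)
  {
    const float PiOver4 = 0.785398163f;
    const float ThreePiOver4 = 2.356194490f;
    float AbsY = fabsf(Y) + 1e-10f; // avoid 0/0 when both inputs are zero
    float R, Angle;
    if (X < 0.0f)
    {
      R = (X + AbsY) / (AbsY - X);
      Angle = ThreePiOver4;
    }
    else
    {
      R = (X - AbsY) / (X + AbsY);
      Angle = PiOver4;
    }
    Angle += (0.1963f * R * R - 0.9817f) * R; // cubic correction
    return Y < 0.0f ? -Angle : Angle; // negate in quadrants III and IV
  }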

Optimizing Cache Misses

The last item on my list was cache misses.

While working on the previous optimizations I gained a better understanding of the algorithm Poly2tri implements. The heavy usage of linked lists and double pointers was a sensible decision by the original implementer, and honestly I can’t really think of a more performant way of doing things.

In the end I tried my best to reduce cache pressure with micro-optimizations here and there. The final result is acceptable, but the code in some sections may have become more cryptic for people trying to understand the algorithm.

This is what I did:

  • Replaced the std::vector holding the Edges in p2t::Point with a linked list. The edges are initialized once and not accessed often, so this change allowed for easier memory management.
  • Replaced the bool arrays on p2t::Triangle (see on Github) with an unsigned int field that is used as a bit mask (see the sketch after this list). The advantage is that bit operations are very cheap and avoid many of the branches used in the original implementation.
  • Removed many of the branches that caused cache misses like the one showcased earlier.
  • Used a global lookup table for faster access of triangle array indices when rotating left or right from a point.
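As an illustration of the bit mask change mentioned in the list (flag names and layout are hypothetical; the real field is on Github):

  #include <stdint.h>

  enum TriangleFlags
  {
    Flag_ConstrainedEdge0 = 1 << 0,
    Flag_ConstrainedEdge1 = 1 << 1,
    Flag_ConstrainedEdge2 = 1 << 2,
    Flag_DelaunayEdge0    = 1 << 3,
    Flag_DelaunayEdge1    = 1 << 4,
    Flag_DelaunayEdge2    = 1 << 5,
  };

  struct TriangleState
  {
    uint32_t Flags;

    // Cheap, branch-free bit operations replace the per-edge bool arrays.
    void Set(uint32_t Flag)       { Flags |= Flag; }
    void Clear(uint32_t Flag)     { Flags &= ~Flag; }
    bool Has(uint32_t Flag) const { return (Flags & Flag) != 0; }
  };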

The final, optimized result on the debug2.dat sample data with 100 iterations is a ~100% speed increase: