Chronicle of Optimization with perf and C
Note: You can find the source code for the optimized version of poly2tri on Github.
I needed a fast and robust mesh triangulation solution for Metric Panda Engine and after some research I found the excellent Poly2Tri library, an implementation in C++ of the Sweep-line algorithm for constrained Delaunay triangulation.
Metric Panda Engine is written in C-style C++, so I wanted to port the poly2tri code and customize it to fit the engine's needs.
While porting the code I noticed several opportunities for optimization, and in this post I'll talk about what I did.
Preparing the library for profiling
Before I went ahead and made changes I wanted a baseline profile measurement.
Fortunately Poly2tri comes with a testbed with sample data that can be used to test the output of the library. The test application accepts a data file containing a list of points as input, runs the triangulation routine and displays the result using OpenGL.
The part I was interested in optimizing was the triangulation routine, so everything else that the test application did was noise that I had to minimize in order to get better readings while profiling.
To do so I modified the main.cc and did the following:
- Removed the OpenGL visualization part
- Isolated all library related function calls and placed them in a loop to reduce the noise from the startup of the testbed application
The relevant part of the refactored testbed looks like this:
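(A minimal sketch, assuming the points have already been parsed from the data file; the function name and iteration count are illustrative, while p2t::CDT, Triangulate, and GetTriangles are the actual library API.)

```cpp
#include <vector>
#include "poly2tri.h"

// Run the triangulation in a loop so that one-off process startup costs
// become negligible relative to the work being measured.
void RunBenchmark(std::vector<p2t::Point*>& polyline, int iterations)
{
    for (int i = 0; i < iterations; ++i) {
        p2t::CDT cdt(polyline);  // created on the stack: no new/delete
        cdt.Triangulate();
        std::vector<p2t::Triangle*> triangles = cdt.GetTriangles();
        (void)triangles;         // OpenGL visualization removed
    }
}
```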
Note: The snippet above contained a call to new and delete in the initial version of the post. Creating the CDT on the stack saves the call to malloc and shaves off a little time from the benchmarked times.
Bird’s eye view with perf stat
With Poly2tri ready to be profiled I jumped to my trusty perf, the profiling swiss army knife every Linux dev should have in their toolbox. On Windows, the best solution is probably Intel's VTune, but I don't have any experience with it.
The first perf subcommand I always start with is stat:
taskset 2 perf stat --detailed --repeat 10 ./main testbed/data/debug2.dat 0 0 1
Several things are happening in this one-liner:
- The taskset 2 prefix forces the perf execution to happen on the second CPU core, avoiding any random core migration that may muddy up the results. This can be safely done when dealing with single-threaded code, as is the case for Poly2tri.
- The perf stat command runs the Poly2tri executable (./main), passing it debug2.dat, a test data file containing 10000 points to triangulate.
- The --detailed flag will cause perf to output more counters.
- The --repeat 10 flag reruns the executable ten times, averages the results, and prints the standard deviation between runs.
- The 0 0 1 at the end are arguments that Poly2tri needs for the OpenGL visualization, but they are not used in the modified testbed.
This is the result with the interesting parts highlighted:
The first highlighted row shows the execution time in milliseconds; the rightmost column is the standard deviation between runs. A low standard deviation means the measurements are consistent across runs.
The second highlighted row shows the instructions per cycle, an important metric that can give you hints on whether the CPU is stalling during execution.
The Intel® 64 and IA-32 Architectures Optimization Reference Manual defines this metric as the Clocks per Instructions Retired ratio (CPI), the inverse of what perf shows: an IPC of 1.29 corresponds to a CPI of roughly 0.78. The theoretical best on modern superscalar CPUs executing at full speed is 4 instructions per cycle, so Poly2tri's 1.29 instructions per cycle indicates possible front-end or back-end stalls, and a deeper analysis is required. Take a look at Tuning Applications Using a Top-down Microarchitecture Analysis Method for more on the subject.

The third highlighted row shows the number of L1 cache misses. Perf highlighted the percentage in yellow to indicate that this may be a potential hotspot and should be investigated further.
Finding the hot spots with perf report
Now that I had a baseline to work with, I used the next perf subcommand: record.
taskset 2 perf record -e cpu-clock,L1-dcache-load-misses ./main testbed/data/debug2.dat 0 0 1
The previous command will sample the execution of the program and read the event counters specified by the -e flag:

- cpu-clock: measures the time spent in each function. This is useful for gauging which functions we should focus on.
- L1-dcache-load-misses: shows the functions that triggered the most L1 cache misses.
The recorded data can be viewed with perf report and looks like this:
The first set represents the top 5 functions sorted by cpu-clock, with the percentage column on the left indicating the amount of time spent overall. The second set of functions represents where L1 cache misses happened.
The functions of interest that appear in both sets are:
- p2t::Triangle::MarkNeighbor (see on Github): marks a triangle as a neighbor if the two triangles share two points along an edge.
- __ieee754_atan2_avx (see on Github): the libm implementation of the atan2 function.
- p2t::AdvancingFront::LocateNode (see on Github): traverses the doubly-linked list of the advancing front looking for the next node to process.
Digging deeper with perf annotate
Now that we have a few hot spots to look at, we can fire up another perf subcommand: annotate.
Actually, instead of calling annotate directly, I find it easier to jump to it from perf report by selecting the function of interest and pressing a.
Cycling through the hottest instructions in the cpu-clock report for p2t::Triangle::MarkNeighbor (by pressing TAB), most of the hot spots are caused by this:
It corresponds to the function p2t::Triangle::Contains (see on Github), which is inlined by the optimizer.
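For reference, Contains boils down to three pointer comparisons (paraphrased from the library source):

```cpp
// The triangle stores its three corner points as pointers, so testing
// whether it "contains" a point is three pointer compares.
inline bool Triangle::Contains(Point* p)
{
    return p == points_[0] || p == points_[1] || p == points_[2];
}
```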
The reason the cmp %rcx,%rdx takes up so many cycles is probably a CPU stall caused by either a cache miss or a dependent instruction.
Looking at the report for L1-dcache-load-misses seems to confirm the suspicion:
The other function on the hot spot list, p2t::AdvancingFront::LocateNode, also suffers from a similar issue: the CPU has to wait on memory accesses because of the heavy use of doubly linked lists.
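To illustrate the problem, here is a simplified sketch of the kind of traversal LocateNode performs (not the exact library code): every hop dereferences a pointer loaded by the previous iteration, so the CPU cannot run ahead when a node is not in cache.

```cpp
// Simplified advancing front node; the real p2t::AdvancingFront::LocateNode
// also walks left via 'prev' and caches a search start node.
struct Node {
    p2t::Point* point;
    p2t::Triangle* triangle;
    Node* next;   // neighbor to the right on the advancing front
    Node* prev;   // neighbor to the left
    double value; // cached x coordinate of 'point'
};

Node* LocateNode(Node* start, double x)
{
    Node* node = start;
    // Each iteration's load depends on the previous one: a classic
    // pointer-chasing pattern that serializes on cache misses.
    while (node->next && node->next->value <= x) {
        node = node->next;
    }
    return node;
}
```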
Ways to optimize these sorts of CPU stalls involve improving the cache locality of the data (i.e. keeping the data that needs to be worked on closer together in memory). This can't always be done directly if the underlying algorithm has certain constraints, but often there is a better way.
An excellent video on the subject is Mike Acton’s great Data-Oriented Design and C++ talk.
Time to Optimize
Before focusing on cache locality, I wanted to go for the low-hanging fruit.
Bye bye new and push_back
For the first optimization pass I decided to remove all dynamic allocations caused by calls to std::vector::push_back and new. I never use dynamic allocation in performance-critical code, and instead use several kinds of custom allocators depending on the needs.
I replaced these allocations with a simple fixed-size push allocator that partitions a large memory block that is passed in to the library during initialization. The push allocator looks something like this:
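(A minimal sketch with illustrative names; a production version would also handle alignment.)

```cpp
#include <cassert>
#include <cstddef>

// Fixed-size push (bump) allocator: hands out slices of one large block
// provided by the application up front. Allocating is a pointer bump;
// everything is reclaimed at once by resetting the cursor.
struct PushAllocator {
    unsigned char* base; // start of the pre-allocated block
    size_t capacity;     // total block size in bytes
    size_t used;         // bytes handed out so far

    void Init(void* memory, size_t size)
    {
        base = static_cast<unsigned char*>(memory);
        capacity = size;
        used = 0;
    }

    void* Push(size_t size)
    {
        assert(used + size <= capacity); // fixed size: no fallback path
        void* result = base + used;
        used += size;
        return result;
    }

    void Reset() { used = 0; } // frees everything in one shot
};
```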
As you can see, it's very simple and lightning fast compared to malloc/new.
This optimization resulted in a ~39% performance increase over the baseline on the debug2.dat file, and the gain is proportional to the number of points the library has to process: the more points in the data set, the more the dynamic allocations would have cost.
Doubles to floats
The next optimization I had on my checklist was changing the floating point type from double to float.
My assumption was that since the precision of 64-bit floating point numbers is not needed for this type of library, using floats would let the CPU potentially increase bandwidth and parallelization.
The result was a tiny improvement in performance, mostly because the algorithm doesn't do any wide operations on floating point numbers that would benefit from the increased bandwidth.
Faster Atan2 function
As seen above, atan2 is one of the hot spots of Poly2tri, so I tried to replace it with an approximate version of the function that has a good-enough error.
The implementation I ended up with is based on this fast atan2 approximation and has a maximum error of 0.005 radians. That is perfectly acceptable for my needs.
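For reference, a self-contained version of this family of approximations (a cubic refinement of the classic fast atan2; not necessarily the exact variant linked above) looks like this:

```cpp
#include <cmath>

// Approximate atan2: a cheap per-quadrant cubic polynomial replaces the
// libm call, with a maximum error of roughly 0.005 radians.
float FastAtan2(float y, float x)
{
    const float kPiOver4 = 0.78539816f;
    const float kThreePiOver4 = 2.35619449f;
    float absY = std::fabs(y) + 1e-10f; // avoid 0/0 at the origin
    float r, angle;
    if (x >= 0.0f) {
        r = (x - absY) / (x + absY);
        angle = 0.1963f * r * r * r - 0.9817f * r + kPiOver4;
    } else {
        r = (x + absY) / (absY - x);
        angle = 0.1963f * r * r * r - 0.9817f * r + kThreePiOver4;
    }
    return (y < 0.0f) ? -angle : angle; // mirror for the lower half-plane
}
```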
Optimizing Cache Misses
The last item on my list was cache misses.
While working on the previous optimizations I gained a better understanding of the algorithm Poly2tri implements. The heavy usage of linked lists and double pointers was a sensible decision by the original implementer, and honestly I can't really think of a more performant way of doing things.
In the end I tried my best to reduce cache pressure with micro optimizations here and there. The final result is acceptable, but code in some sections may have become more cryptic for people trying to understand the algorithm.
This is what I did:
- Replaced the std::vector of Edges in p2t::Point with a linked list. The edges are initialized once and not accessed often, so this change allowed for easier memory management.
- Replaced the bool arrays on p2t::Triangle (see on Github) with an unsigned int field used as a bit mask (see the sketch after this list). The advantage is that bit operations are very cheap and avoid many of the branches used in the original implementation.
- Removed many of the branches that caused cache misses, like the one showcased earlier.
- Used a global lookup table for faster access of triangle array indices when rotating left or right from a point.
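A sketch of the last two changes, with illustrative names (the actual field layout in the optimized code may differ):

```cpp
// Bit mask replacing the per-edge bool arrays on p2t::Triangle:
// bits 0-2 mark constrained edges, bits 3-5 mark delaunay edges.
struct TriangleFlags {
    unsigned int bits = 0;

    void SetConstrainedEdge(int i)      { bits |= 1u << i; }
    void ClearConstrainedEdge(int i)    { bits &= ~(1u << i); }
    bool IsConstrainedEdge(int i) const { return (bits >> i) & 1u; }

    void SetDelaunayEdge(int i)         { bits |= 1u << (3 + i); }
    bool IsDelaunayEdge(int i) const    { return (bits >> (3 + i)) & 1u; }
};

// Global lookup tables for rotating around a triangle's point indices
// without branching: kRotateCW[i] is the point index clockwise of i,
// kRotateCCW[i] the index counter-clockwise of i.
static const int kRotateCW[3]  = { 2, 0, 1 };
static const int kRotateCCW[3] = { 1, 2, 0 };
```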
The final, optimized result on the debug2.dat sample data with 100 iterations was a ~100% speed increase over the baseline.